Spring is here. Clocks move forward. The Sakura (cherry blossom) festival in Japan marks the celebration of the new season. In India, the Holi festival of colors ushers in the new harvest season. It’s a time for renewal and new ways of doing things.
This month, we are pleased to debut our newest set of SQL features in BigQuery to help our analysts and data engineers spring forward. It’s time to set aside the old ways of doing things and instead look at these new ways of storing and analyzing all your data using BigQuery SQL.
Higher precision and more flexible functions to manage your ever-expanding data in BigQuery
BIGNUMERIC data type (GA)
We live in an era where intelligent devices and systems, ranging from driverless vehicles to global stock and currency trading systems to high-speed 5G networks, are driving nearly all aspects of modern life. These systems rely on large amounts of precision data to perform real-time analysis. To support these analytics, BigQuery is pleased to announce the general availability of the BIGNUMERIC data type, which supports 76 digits of precision and 38 digits of scale. Like NUMERIC, this new data type is available in all aspects of BigQuery, from clustering to BI Engine, and is also supported in the JDBC/ODBC drivers and client libraries.
Here is an example that demonstrates the additional precision and scale using BIGNUMERIC applied to the various powers of e, Euler’s number and the base of natural logarithms. Documentation
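The published example isn’t reproduced in this excerpt, but a minimal sketch of the precision difference, using a literal approximation of e (the column aliases are illustrative), might look like:

```sql
-- NUMERIC keeps 9 digits of scale; BIGNUMERIC preserves up to 38.
SELECT
  CAST('2.71828182845904523536028747135266249775' AS NUMERIC)    AS e_numeric,
  CAST('2.71828182845904523536028747135266249775' AS BIGNUMERIC) AS e_bignumeric;
```

The NUMERIC column comes back rounded to 2.718281828, while the BIGNUMERIC column retains all 38 fractional digits.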
As an aside, did you know that the world record, as of December 5, 2020, for the maximum number of digits to represent e stands at 10π trillion digits?
JSON extraction functions (GA)
BigQuery has long supported JSON data and JSON functions to query and transform JSON data, even before they became part of the ANSI SQL standard in 2016. JSON extraction functions typically take two parameters: the JSON field, which contains the JSON document, and a JSONPath, which points to the specific element or array of elements to extract. If the JSONPath references elements containing reserved characters, such as dot (.), dollar ($), or star (*), those characters need to be escaped so that they are treated as strings instead of being interpreted as JSONPath expressions. To support escaping, BigQuery offers two types of JSON extraction functions: Standard and Legacy. The Standard (ANSI-compliant and recommended) way of escaping reserved characters is to enclose them in double quotes (" "). The Legacy (pre-ANSI) way is to enclose them in square brackets and single quotes ([' ']).
Here’s a quick summary of existing and the new (highlighted in bold) JSON extraction functions:
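As a hedged sketch of the two escaping styles described above (the JSON document and the dotted key name are invented for illustration):

```sql
WITH doc AS (
  SELECT '{"user.name": "Ada", "tags": ["sql", "json"]}' AS payload
)
SELECT
  -- Standard (ANSI) style: escape the reserved dot by double-quoting the key.
  JSON_VALUE(payload, '$."user.name"') AS standard_style,
  -- Legacy style: square brackets and single quotes.
  JSON_EXTRACT_SCALAR(payload, "$['user.name']") AS legacy_style
FROM doc;
```

Both expressions return the same value; the Standard functions are the recommended path going forward.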
TABLESAMPLE clause (preview)
With the convergence and growth of all types of data within BigQuery, customers want to maintain control over query costs, especially when analysts and data scientists are performing ad hoc analysis of large tables. We are pleased to introduce the TABLESAMPLE clause, which allows users to query a sampled subset of the data, specified as a percentage of a table, instead of scanning the entire table. The clause works on native BigQuery tables as well as external tables stored in Google Cloud Storage buckets: it randomly selects a percentage of data blocks from the table and reads all of the rows in the selected blocks, lowering the cost of ad hoc queries. Documentation
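For instance, a sketch of the syntax (the table name is a placeholder):

```sql
-- Read roughly 1 percent of the table's data blocks instead of doing a full scan.
SELECT *
FROM `my_project.my_dataset.large_events` TABLESAMPLE SYSTEM (1 PERCENT);
```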
More commands and capabilities in SQL to allow you to evolve your data as your analytics needs change.
Dataset (SCHEMA) operations (GA)
In BigQuery, a dataset is the top-level container entity that holds data and program objects such as tables, views, and procedures. Creating, maintaining, and dropping datasets has thus far been supported in BigQuery through the API, the CLI, and the UI. Today, we’re pleased to offer full SQL support (CREATE, ALTER and DROP) for dataset operations using SCHEMA, the ANSI standard keyword for a collection of logical objects in a database or data warehouse. These operations greatly simplify data administrators’ ability to provision and manage schemas across their BigQuery projects. Documentation for CREATE, ALTER and DROP SCHEMA syntax
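A quick sketch of all three operations together (the dataset name and options are illustrative):

```sql
-- Create a dataset (schema) with a location and description.
CREATE SCHEMA IF NOT EXISTS analytics_sandbox
  OPTIONS (location = 'US', description = 'Sandbox datasets for analysts');

-- Adjust an option after creation.
ALTER SCHEMA analytics_sandbox
  SET OPTIONS (default_table_expiration_days = 7);

-- Remove the dataset when it is no longer needed.
DROP SCHEMA IF EXISTS analytics_sandbox;
```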
Object creation DDL from INFORMATION_SCHEMA (preview)
Data administrators provision empty copies of production datasets so that developers can load fictitious data and test out new capabilities before they are added to production, and so that new hires can train on production-like datasets with test data. To help data administrators generate the data definition language (DDL) for objects, the TABLES view in BigQuery’s INFORMATION_SCHEMA now has a new column, DDL, which contains the exact object-creation DDL for every table, view, and materialized view within the dataset. In combination with dynamic SQL, data administrators can generate and execute the creation DDL for a specific object, for all objects of a particular type (e.g. MATERIALIZED VIEW), or for all data objects within a specified dataset with a single SQL statement, without having to manually reconstruct the options and elements associated with each schema object. Documentation
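For example, a sketch of pulling the creation DDL for one object type (the dataset name is a placeholder):

```sql
-- Retrieve the exact creation DDL for every materialized view in a dataset.
SELECT table_name, ddl
FROM my_dataset.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'MATERIALIZED VIEW';
```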
DROP COLUMN support (preview)
In October 2020, BigQuery introduced ADD COLUMN support in SQL to allow users to add columns using SQL to existing tables. As data engineers and analysts expand their tables to support new data, some columns may become obsolete and need to be removed from the tables. BigQuery now supports the DROP COLUMN clause as a part of the ALTER TABLE command to allow users to remove one or more of these columns. During the Preview period, note that there are certain restrictions on DROP COLUMN operations that will remain in effect. See Documentation for more details.
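A brief sketch of the syntax (table and column names are placeholders):

```sql
-- Drop one or more obsolete columns in a single statement.
ALTER TABLE my_dataset.orders
  DROP COLUMN legacy_status,
  DROP COLUMN IF EXISTS temp_flag;
```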
Longer column names (GA)
BigQuery now allows column names of up to 300 characters within tables, views, and materialized views, instead of the previous limit of 128 characters. Documentation
Storage usage analysis for partitioned and unpartitioned tables
INFORMATION_SCHEMA.PARTITIONS view for tables (preview)
Customers store their analytical data in tables within BigQuery and use its flexible partitioning schemes to organize data in large tables for improved query efficiency. To give data engineers better insight into storage and record counts for tables, both partitioned and unpartitioned, we are pleased to introduce the PARTITIONS view as part of BigQuery INFORMATION_SCHEMA. This view provides up-to-date information on tables or partitions of a table, such as the size of the table (logical and billable bytes), the number of rows, the last time the table (or partition) was updated, and whether the table (or partition) is active or has aged out into cheaper long-term storage. Partition entries for tables are identified by their PARTITION_ID, while unpartitioned tables have a single NULL entry for PARTITION_ID.
Querying INFORMATION_SCHEMA views is more cost-efficient compared to querying base tables. Thus, the PARTITIONS view can be used in conjunction with queries to filter the query to specific partitions, e.g. finding data in the most recently updated partition or the maximum value of a partition key, as shown in the example below. Documentation
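The example referenced above isn’t reproduced here; a hedged sketch of finding the most recently updated partition (dataset and table names are placeholders) could look like:

```sql
-- Identify the freshest partition of a table without scanning the base table.
SELECT partition_id, total_rows, total_logical_bytes, last_modified_time
FROM my_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'events'
ORDER BY last_modified_time DESC
LIMIT 1;
```

The resulting partition_id can then be used as a filter in a query against the base table, restricting the scan to that single partition.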
Quick launch summary
We’re making some updates to the home page in the Admin console. You may notice:
- A new card-based interface which includes more information and links to provide quick access to common tasks.
- Expandable (or collapsible) cards for Users, Billing, and Domains, with quick links to common items in those areas.
- Reordered items to make it easier to find the most used sections and complete common tasks.
- Admins: You’ll see the new interface when you log into the Admin console. Visit the Help Center to learn more about using the Admin console.
- End users: No end user impact.
- Rapid Release and Scheduled Release domains: Gradual rollout (up to 15 days for feature visibility) starting on March 31, 2021.
- Available to all Google Workspace customers, as well as G Suite Basic and Business customers
Quick launch summary
- TV Series
- TV Episode
- Admins: There is no admin control for this feature.
- End users: You can find and use the citations tool in the Tools menu. Visit the Help Center to learn more about citations in Google Docs.
- Available to all Google Workspace customers, as well as G Suite Basic and Business customers.
Google Cloud is offering no-cost training opportunities to help you gain the latest AI and machine learning skills. You’ll have a chance to learn more about the new Document AI along with Explainable AI, Looker, BigQuery ML, and Dialogflow CX.
The new Document AI (DocAI) platform, a unified console for document processing, became available for preview in November. Join me on April 22 to learn how to set up the Document AI Platform, process sample documents in an AI Platform notebook, and use the Procurement DocAI solution to intelligently process your unstructured data or “dark data” such as PDFs, images and handwritten forms to reduce the manual overhead of your document workflows. Save your spot to learn about DocAI here. Can’t join on April 22? The training will be available on-demand after April 22.
Join Lak Lakshmanan, Google Cloud’s Director of Data Analytics and AI Solutions, on April 16 to explore Explainable AI, a set of tools and frameworks to help you understand and interpret predictions made by your machine learning models. Lak will go through a hands-on lab to teach you several canonical methods to explain predictions and illustrate how the Google Cloud AI Platform serves predictions and explanations. Reserve your seat here to learn about Explainable AI. Can’t join on April 16? The training will be available on-demand after April 16.
Looker is a modern business intelligence (BI) and analytics platform that is now a part of Google Cloud. If you have no machine learning experience, we recommend you check out our technical deep dive to learn how to use Looker to automate the data pipeline building process and generate deeper data insights. Sign up here to get access.
BigQuery ML lets you create and execute machine learning models in BigQuery using standard SQL queries.
In our “Real-time credit card fraud detection” webinar, you’ll learn how to build an end-to-end solution for real-time fraud detection. You’ll discover how trained models in BigQuery ML, predictions from Google Cloud’s AI Platform, streaming pipelines in Dataflow, notifications on Pub/Sub, and operational management in Data Studio can all come together to identify and fight credit card fraud. Register here to watch the webinar.
To find out how to build demand forecasting models with BigQuery ML, sign up here. In this webinar, a Google Cloud expert will walk through how to train, evaluate and forecast inventory demand on retail sales data. He’ll also demonstrate how to schedule model retraining on a regular basis so that your forecast models can stay up-to-date.
Understand the ins and outs of Dialogflow CX, lifelike conversational AI with virtual agents (chat and voice bots), when you register here. This webinar shows you the newest ways to build intelligent virtual agents. A Google Cloud expert will demonstrate how to get these agents ready for production and improve customer experience using analytics. She’ll also share best practices for deploying prebuilt agents.
We hope these training resources help you grow your AI and machine learning knowledge. Stay tuned for new learning opportunities throughout the year.
When looking at data, business decision makers are often blocked by an intermediate question of “What should I take away from this data?” Beyond putting together the numbers and building the results, data analysts and data scientists play a critical role in helping answer this question.
Organizations big and small depend on data analysts and data scientists to help “translate from words to numbers, and then back to words” as sports analytics pioneer Dean Oliver once said. But beyond creating tables, graphs, and even interactive dashboards for their audience, data scientists and analysts often need to further automate the communication piece, delivering faster, clearer insights for the business.
Data cloud solutions like BigQuery help data practitioners scale and automate their analysis and accelerate their time to insight. With BigQuery’s self-tuning, serverless infrastructure, data analysts are empowered to focus on their analytics without worrying about infrastructure management, as IT operations are automated. Enabling analysts and saving them time in this way allows shifting resources, perhaps to finding the stats or trends that may not be as easily automated, as well as building out more metrics to be made available for the automated insight generation pipeline.
In a previous post, we gave an overview of how Major League Baseball (MLB) partnered with Google Cloud to scale its process of generating “game notes”—statistical insights related to players or teams involved in a particular matchup. With the 2021 MLB season upon us, we’ll go into further detail on how various BigQuery functionality comes together in the process to automate the creation of game notes from Statcast data. We’ll discuss how to go from relatively raw data in BigQuery to hundreds or thousands of insights consisting of a sentence, picture, and contextual table like the ones in the original post.
While baseball serves as a fun playground, parts of the process outlined here can be used across all industries that have big data and want to turn it into timely, contextual, succinct bits of valuable information with vastly improved efficiency relying on automation. Let’s step up to the plate…
Construct a Repeatable Pattern for Certain Types of Insights
Before going into the technical aspects, it’s important to think about which forms of insights can be automated and scaled from the given data, and how. In the MLB case, “stat leaderboard”-type notes like the one above can be thought of as compositions of a few elements:
A statistic of interest, e.g. number of hard-hit balls or average fastball velocity.
A time span that the leaderboard covers, e.g. last 2 regular seasons or this postseason.
An “entity” of interest, usually either teams or players.
A “facet” of the game, representing which “side” of the ball to calculate certain stats for, e.g. hitting or pitching.
A ranking qualification criteria, which represents a minimum # of opportunities for a stat to be reasonable for ranking (mostly used for player “rate” stats where a small denominator can lead to outlier results, like a .700 batting average over 10 at-bats).
Putting together these pieces can give a lot of potential notes, and then we can use some “impressiveness” criteria (e.g. if the player ranks in the Top 5 or 10) and relevance criteria (e.g. is the player’s team involved in an upcoming game?) to surface interesting notes and attach them to games. Without automation, an MLB content expert could generate some of the same game notes with tools like Baseball Savant, but would have to run dozens of manual searches looking for specific players or teams, find ones that actually have that entity in the top 5 or 10, and then manually type out the note text and/or accompanying table. Even if each note only takes a few minutes to generate, that can easily add up to several hours when trying to cover multiple players across all 30 teams.
Set Up Metrics Metadata in Tables
In addition to tables that hold the actual data from which insights are to be derived, it’s helpful to set up BigQuery tables that contain “metadata” about some of the composable elements that form the insights. In the MLB game notes case, one of these tables has information about each statistic (e.g. “Hard-Hit Rate”) that is considered for a game note, like its display names, definition, and whether ascending or descending ranking is “good” for that stat.
Another metadata table has information about each span (e.g. “2020 postseason”) for which notes can be constructed, including relevant dates and game types.
These tables are used in various pieces of the note generation process—filtering data to the correct dates by span, ranking players or teams in each stat the correct way, and having consistent text to use in both the sentences and tables that comprise the final notes.
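The exact schemas aren’t shown in this excerpt; a hedged sketch of what such metadata tables could look like (all table and column names are invented) is:

```sql
-- Metadata about each statistic considered for a game note.
CREATE TABLE IF NOT EXISTS notes.stat_metadata (
  stat_id STRING,
  display_name STRING,      -- e.g. "Hard-Hit Rate"
  definition STRING,
  higher_is_better BOOL     -- controls ascending vs. descending ranking
);

-- Metadata about each span notes can be constructed for.
CREATE TABLE IF NOT EXISTS notes.span_metadata (
  span_id STRING,
  display_name STRING,      -- e.g. "2020 postseason"
  start_date DATE,
  end_date DATE,
  game_types ARRAY<STRING>
);
```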
Create Views to Filter and Prepare Specific Data for Repeated Use
Another way to modularize some aspects of the insight generation process is to use BigQuery views. In some analytics workflows, views can be used to store results of certain queries that are intermediaries for multiple other steps. Since views are virtual, sitting “on top” of tables (or other views), they are always fresh with the latest data in the underlying tables. Compared to a batch-oriented ETL system (e.g. one relying on MapReduce), using BigQuery tables and views together is more efficient in modularizing intermediate data processing steps like these.
In the MLB game notes code, there are views that filter pitch-level event data to specific spans, like the one below for the 2020 postseason.
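The view definition isn’t reproduced in this excerpt; a hedged sketch of its shape (the underlying table, view, and column names are invented) might be:

```sql
-- Filter pitch-level events to the dates and game types of one span.
CREATE OR REPLACE VIEW notes.events_2020_postseason AS
SELECT e.*
FROM notes.pitch_events AS e
JOIN notes.span_metadata AS s
  ON s.span_id = '2020_postseason'
WHERE e.game_date BETWEEN s.start_date AND s.end_date;
```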
We also employ more complex views that get each team’s next scheduled game (so we know which games are available to “attach” notes to), the updated roster for each team (so we know which players are on which teams), and as intermediate steps in the note creation process outlined below. Big picture, views serve as versatile building blocks in various pieces of the insight generation process.
Use Temporary Tables and Dynamic SQL for Intermediate Outputs
Eventually, the insight generation process moves to steps that can be executed with a series of SQL queries—some repetitive, but different enough that they can’t just be 1 big query, others relying on similar logic but applied to parallel tables. BigQuery features like temporary tables and dynamic SQL are useful in executing these steps while reducing the amount of unnecessary data being stored and potentially duplicative code to manage. When going from automating notes off a few stats to a couple dozen to over one hundred, being able to add new stats or spans with minimal code changes—i.e., not one new SQL query per stat—is extremely valuable.
One example incorporating both concepts from the MLB game notes generation process is the following dynamic SQL statement. This creates a temporary table SpanSituationEvents from a variable SPAN_SITUATION_EVENTS_TABLE that is a string representing the name of a specific table (or view) with pitch-level data:
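The statement itself isn’t reproduced in this excerpt; it plausibly follows this shape (the default view name is illustrative):

```sql
DECLARE SPAN_SITUATION_EVENTS_TABLE STRING DEFAULT 'notes.events_2020_postseason';

-- Dynamic SQL: splice the table name into the query text at run time,
-- then materialize the result as a script-scoped temporary table.
EXECUTE IMMEDIATE FORMAT("""
  CREATE TEMP TABLE SpanSituationEvents AS
  SELECT * FROM `%s`
""", SPAN_SITUATION_EVENTS_TABLE);
```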
This piece of code takes data from a particular table of events (think of the 2020 postseason view described above) and instantiates it as a table that can be referred to multiple times within a script, but doesn’t need to exist outside the script (hence, temporary).
Dynamic SQL allows us to use table names as variables, getting around restrictions of BigQuery parameters, which don’t allow this. Multiple SQL queries of the same form referring to different tables (in this case, different sets of MLB events) can be represented in a single pattern. This process of “using code to generate SQL” is a key technique in many data science projects, often done in Python or R. One advantage of using dynamic SQL in BigQuery—i.e. using SQL to write SQL—is that everything can be written “in one place,” without having to rely on other languages, environments, client libraries, etc.
Use Scripting and Stored Procedures to Implement More Complex Business Logic
As we dive deeper into the code to generate insights, the business logic can become more intricate to generalize over different types of metrics (ratios, sums, more complex calculations) and different levels of aggregation. Instead of writing a separate query for each one of the metrics and aggregation levels under consideration, we can use BigQuery scripting and stored procedures to write SQL “templates” that keep as much as possible the same, while using variables and control flow to handle different use cases of queries with the same fundamental structure. Used in this way, stored procedures are a SQL-based way of using functional programming to modularize and string together pieces of a bigger algorithm.
In the case of calculating various stats to be used in MLB game notes, there are “counting” stats that are cumulative, summing over a single field (e.g. hard-hit balls), and there are “rate” stats that involve summing a numerator and denominator (e.g. hard-hit rate) and other further filtering by a qualification criteria (e.g. minimum 75 batted balls). Stats for pitchers should be aggregated over the “pitcherId” field; stats for hitters (sometimes the same stats, just from the opposite perspective) should be aggregated over the “hitterId” field. Generalizing this format of calculation was achieved by using a large stored procedure, the beginning of which is shown below.
The snippet below shows how the ENTITY_TYPE and STAT_GAME_FACET fields are used in a multi-level case statement to allow aggregation of any stat to either the team or player level, as well as from the pitching or hitting side.
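Neither snippet is reproduced in this excerpt; a condensed sketch of that multi-level case statement, as it might appear inside the procedure body (the field names follow the surrounding description, everything else is assumed), is:

```sql
-- Aggregate any stat to the team or player level, from the pitching or hitting side.
SELECT
  CASE ENTITY_TYPE
    WHEN 'TEAM' THEN
      CASE STAT_GAME_FACET
        WHEN 'PITCHING' THEN CAST(pitchingTeamId AS STRING)
        ELSE CAST(hittingTeamId AS STRING)
      END
    ELSE  -- 'PLAYER'
      CASE STAT_GAME_FACET
        WHEN 'PITCHING' THEN CAST(pitcherId AS STRING)
        ELSE CAST(hitterId AS STRING)
      END
  END AS entity_id,
  SUM(stat_numerator) AS stat_value
FROM SpanSituationEvents
GROUP BY entity_id;
```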
The actual procedure takes up a couple hundred lines of code, and makes use of dynamic SQL and temporary tables as mentioned above. It was constructed in such a way to generalize the calculation of dozens of “parallel” stat leaderboards, and enables the calculation and storage of various stat leaderboards in “long” format as opposed to “wide” format. Storing each player or team stat on its own row, as opposed to multiple stats for a given player or team on the same row (as in “traditional” sports statistics displays), enables subsequent steps like ranking and note generation to proceed more seamlessly within BigQuery (essentially, “stat leaderboard” becomes another “GROUP BY” variable).
Like functions in other languages, BigQuery stored procedures can be nested and chained together, enabling the execution of complex data pipelines from directly within the data warehouse. In the MLB game notes case, a procedure called “run_stat_combo_calcs” takes a set of stat leaderboards to calculate, runs the “calc_stat_span_situation” procedure referenced above over 100 times (once per leaderboard), then passes the resulting “long” table to a function that adds all rankings and uploads the resulting table into BigQuery for additional processing.
Create Contextual Ranking Info Using Arrays and Structs
A long table with entities (teams or players), stats, and rankings can be joined with the metadata tables with stat/span information and the team roster view (both referenced above) to create the “sentence” piece of many game notes. An additional piece of context that can be useful—perhaps more often in sports, but not exclusive to that realm—is a table showing the entity of interest along with others around that entity in the rankings. Using a direct MLB example, if a player ranks second in a given stat, it’s interesting to see a table of who ranks first, third, maybe even the entire top five or ten.
From a BigQuery standpoint, generating this part requires going back into the long leaderboard table and getting more than just the info present on the single row for one team or player—we need the other entities “around” them. The implementation to get that information from “surrounding” rows is enabled by using BigQuery’s array functionality, including arrays of structs. To demonstrate with some stripped down “pseudo-code” from the game notes generation process:
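The pseudo-code itself is not reproduced here; a sketch using ARRAY_AGG over STRUCTs (table and column names invented) could look like:

```sql
-- Collapse a long leaderboard into one row per stat/span, with the top
-- entities nested as an ordered array of structs for the contextual table.
SELECT
  stat_id,
  span_id,
  ARRAY_AGG(
    STRUCT(entity_rank, entity_id, entity_name, team_name, stat_display_value)
    ORDER BY entity_rank
    LIMIT 10
  ) AS leaderboard_context
FROM notes.stat_leaderboard_long
WHERE entity_rank <= 10
GROUP BY stat_id, span_id;
```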
This query aggregates to 1 row per stat leaderboard (e.g. hard-hit ball leaders on a given date), nesting the player or teams that rank in the top or bottom 5 or 10 of a given leaderboard in an array. The array is made up of structs, a record format that allows multiple fields to be stored together—in this case, each entity’s rankings, id, name, current team, and formatted stat value comprise a single record in the array.
Organizations are increasingly investing in digital businesses ecosystem strategies to foster innovation and operate efficiently. These ecosystems connect various stakeholders–such as partners, developers, and customers–via application programming interfaces, or APIs. APIs allow various software systems to interface, and are thus the primary mechanism of value exchange within these ecosystems.
For example, Bank BRI, one of the largest banks in Indonesia, drove over $50 million in new revenue by creating an online marketplace with more than 50 monetized open APIs that let over 70 ecosystem partners leverage the bank’s credit scoring, business assessment, and risk management capabilities. Similarly, AccuWeather, the world’s leading weather media and big data company, makes APIs available to more than 70,000 registered developers who’ve used the company’s data and functionality to create over 30,000 apps.
Scaling up digital business ecosystems can unlock new markets, grow regional partnerships, and connect distributed workforces–but all of this starts with scaling up the API program. To help customers globally scale API programs, we are pleased to bring the power of Google’s networking capabilities to API management.
Expand global reach and deliver high performance with Apigee and Cloud CDN
Apigee X, the latest release of Google Cloud’s full lifecycle API management platform, makes it simple and easy to apply Cloud Content Delivery Network (CDN) to APIs. Working in tandem, the two solutions let enterprises not only secure and manage their APIs but also make them available across a global ecosystem of stakeholders.
Specifically, Apigee lets enterprises apply security to APIs, control how and by whom they’re used, publish them for consumption, monitor and analyze their usage, monetize them, and perform other aspects of API product management. Cloud CDN helps these APIs and the services they support to be performant, regardless of how many ecosystem participants are calling the API or where those ecosystem participants are located.
Cloud CDN runs on Google Cloud’s globally-distributed edge network and lets organizations serve content globally. This reduces latency both by leveraging Google’s massive network infrastructure, which supports services such as Gmail and Google Search, and by caching content closer to the users, improving performance and availability for peak traffic seasons. Because digital assets can be served from Google’s global edge instead of an organization’s backend systems, web pages and apps can run even faster and offer a smoother experience.
By caching often-accessed data at the network edge, as close to customers and end users as possible, Cloud CDN also helps organizations seamlessly handle seasonal spikes in traffic, such as those that may occur during the holiday or back-to-school seasons. In addition to improving ecosystem experiences and reliability, this approach to caching can also minimize web server load, compute usage, and ultimately costs.
Better digital experiences lead to bigger digital ecosystems
Whether it’s interacting with customers, partners, or third-party developers, an enterprise’s ability to expand its digital ecosystem is limited by the quality of the digital experiences it creates. If apps load too slowly or services are not consistently available, ecosystem participants will leave. If APIs and the apps they power are not secure, participants will leave. Each link in the digital ecosystem value chain relies on API management and network management to keep interactions flowing–and with the combined power of Apigee X and Cloud CDN, we’re pleased to help our customers meet these challenges and expand their businesses. To try Apigee X for free, click here, and to learn more about Cloud CDN, click here. Check out our latest demo of Apigee and Cloud CDN working together in the below video.
Organizations are realizing that experience management and analysis are important aspects of understanding needs and providing the best level of service to customers, employees, and vendors. Surveys are a powerful vehicle within the experience management space for data collection within organizations of all shapes and sizes. According to Verified Market Research, Experience Management, which includes tools like surveys, is a USD $17.5B market that is expected to grow 16.8% annually (CAGR) from 2020 to 2027 (Source).
Tools like Google Forms, Survey Monkey, Qualtrics, and TypeForm allow companies to get answers fast from groups and organizations with whom they interact. The growth in technology options and the ease and effectiveness of survey platforms means that many companies create a lot of surveys. Oftentimes, these surveys are used once to solve a specific problem, the results are analyzed and shared, and then the survey and resultant data are forgotten. A natural opportunity exists for companies to instead capture and store those results in a place where they can be used for survey-over-survey analysis and comparison against other first and third party data to better understand cause and potential options for improvement.
So, what barriers exist to creating this experience management data warehouse? Surveys by nature are flexible vehicles, and many survey systems provide data in a question-answer, column-row format, or as delimited results. This data shape, while good for human consumption, is not great for wide-scale analytics, and the process for getting it to a better format can be daunting. Over the course of this blog, we’ll demonstrate how Google Cloud and Trifacta have partnered to create a design pattern to easily shape, display, and use this data.
Focusing on scale and flexibility
Survey datasets often require pivoting or parsing of data so that questions can be compared, analyzed, and augmented in a more structured format. This survey analytics pattern walks through the process for transforming the data, putting it into a data warehouse, and using that warehouse to analyze and share findings. This pattern can be extended to a variety of survey types and different iterations of the surveys, providing you with a platform that can be used to do deeper and more consistent analysis.
To illustrate this pattern, we’ll leverage Google Forms as our capture mechanism (Note: although we’re using Google Forms for this example, the concepts are transferable to other survey engines that have similar export schemas). Google Forms allows users to structure questions in a variety of ways, from multiple checkboxes to a ranked list of items to a single, freeform answer, each producing a slightly different output. Based on the question category, the survey analytics pattern provides a specific approach to structure the data and load it in BigQuery tables.
For example, with multiple choice questions, the results may appear as a list of values with a semicolon separator (e.g. “Resp 1; Resp 4; Resp 6”). Using Google Cloud Dataprep by Trifacta, a data preparation service found on Google Cloud Platform, we can parse those values into a cleaner format for analysis, where each response becomes a new row.
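The post uses Cloud Dataprep for this step; for readers who prefer SQL, the same reshaping can also be sketched directly in BigQuery (the respondent id is invented, and the sample values mirror the example above):

```sql
WITH raw AS (
  SELECT 'r-001' AS respondent_id, 'Resp 1; Resp 4; Resp 6' AS answer
)
-- SPLIT breaks the delimited string apart; UNNEST turns each piece into its own row.
SELECT respondent_id, TRIM(response) AS response
FROM raw, UNNEST(SPLIT(answer, ';')) AS response;
```

This yields one row per selected response, ready to load into a normalized results table.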
Transformed Export and Results in the Data Warehouse
On the Google Cloud Platform, your experience management data warehouse will leverage three main technologies: Cloud Dataprep by Trifacta, BigQuery, and Data Studio or Looker. As mentioned above, Cloud Dataprep is the key technology involved in normalizing, cleaning and enriching data for analytic use. BigQuery, Google Cloud’s enterprise data warehouse, will store your survey data alongside other data sources to allow stakeholders to dig deeper and uncover hidden trends. With Data Studio or Looker, users can explore and visualize the data, taking immediate action on the insights they uncover. You can also take your analysis further by combining this data preparation with AI/ML to more easily identify and respond to anomalies or trends in your data (to learn more, please take a look at Google Cloud’s AI Platform and some of our AI/ML reference patterns for analytics).
The benefits of an Experience Management Data Warehouse
Building an experience management data warehouse is one giant step toward a culture of more robust analytics. At a high level, an experience management data warehouse allows companies to:
Compare multiple surveys or responses (without manual combination)
Find new insights by isolating responses from a variety of surveys—for example, identifying a team’s satisfaction correlation to the size of teams or regions
Enrich survey results with other data in your warehouse—for example, combining survey results on product usage with sales data to surface new marketing segments with high conversion probability
Represent and visualize the survey results in new ways using Looker, Data Studio, or another BI tool
Share out survey results to a broader audience in a compelling and personalized visualization format
Build a scalable survey practice that allows for iterating and improving over time
Apply more advanced analytics and decision making to survey data using Artificial Intelligence (AI) and Machine Learning (ML)
Use cases for the survey analytics pattern
So, how does this play out in the real world? Here are a few ways companies can leverage an experience management data warehouse.
Employee Experience: An HR department might normally send out quarterly surveys to get a pulse check on the general well-being of employees. Yet with the onset of the pandemic, HR decides to send out surveys more frequently. With all of these employee satisfaction results in one place, the HR department can now compare satisfaction levels to pre-pandemic times, as well as isolate the surveys taken during the pandemic to see whether satisfaction levels have improved over time.
Product Experience: A product team sends out surveys to track the use of specific products or features. With all of these surveys stored in a data warehouse, the product team may decide to enrich this data with sales data to understand if product usage correlates with sales success. Since sales data is already stored in BigQuery, it’s easy to access and combine this data.
Customer Experience: A market research team would like to better understand customer journeys. These surveys can help analysts understand user sentiment, and they can also be a lead generation source. By joining the results of the survey with other insightful datasets like CRM tools, marketers can use analytics to create targeted campaigns for users with a high propensity to buy.
Ready to take your surveying to the next level and build an experience management data warehouse? Get started today by visiting our Codelab, digging into the technical guide, or checking out our video walkthrough. You can access all of our patterns on our Smart Analytics Reference Patterns page.
Last November, we made relational database migration easier for MySQL users with our public preview of Database Migration Service (DMS). Today, we’ve officially made the product generally available, and bring the same easy-to-use migration functionality to PostgreSQL.
The thing I’ve appreciated the most about diving deep with DMS has been that it just works. Once you get your source instance and database(s) prepared and establish connectivity between source and destination, the migration itself is handled for you. When it’s all finished, cutting over to your Cloud SQL instance as your application’s primary database is as simple as clicking a button in the DMS UI.
Not to downplay the difficulty in database prep, or connectivity. I wrote a thorough blog post walking through the various connectivity options for DMS in great detail. Network topology can be incredibly complicated, and connecting two machines securely through the internet while serving an application with thousands or millions of users is anything but simple.
Today, I want to take a moment with you to cover preparing your source PostgreSQL instance and database(s) for migration using DMS and some gotchas I found so you don’t have to.
I’ll start by saying that the documentation and in-product UI guidance for DMS are both excellent. If you’re familiar with setting up replication infrastructure for PostgreSQL, you’re probably good to jump in and check the documentation as needed. That said, it’s still documentation, so here I’ll add a bit of context so that everything you need to prepare your source PostgreSQL instance and database(s) is in one place.
Step one: be sure your source instance version is supported. The current list of supported versions can be found on the documentation page I linked above.
Next up is a schema piece: DMS doesn’t support migrating data in tables that don’t have a primary key. Starting a migration against a database with such tables will still succeed, and the tables themselves will still be created at the destination, but their data won’t be brought over. So if you want to bring the data over from a table that doesn’t have a primary key, you have a few options:
- Add a primary key to the table prior to starting the migration.
- Bring the data over yourself after the initial migration. Keep in mind that even if you maintain the connection, DMS won’t replicate data for that table going forward.
- Export the table from the source instance and import it into the new instance.
- Finally, you can create a table with the same schema as the one lacking the primary key, give it a primary key (using a sequence generator to autogenerate the key), and copy the source data into it. Then do the migration. DMS will still create the non-PK table as part of the migration; it just won’t copy its data. Afterward, copy the data over from the migrated primary-key table, and delete the primary-key copy once you’ve verified the data. It sounds complicated, but it ensures that table is consistent with the rest of your data at the point of migration, as long as any new rows inserted into the non-primary-key table also go into the primary-key copy. If you’re worried about the data in that table changing during the migration, copy it over right before promoting the destination instance to minimize that window.
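That last workaround is easier to see in code. Here is a minimal sketch of the pattern, using Python’s built-in SQLite as a stand-in for PostgreSQL (in PostgreSQL you’d use a sequence or identity column instead of AUTOINCREMENT); the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A source table with no primary key; DMS would create it but skip its data.
cur.execute("CREATE TABLE events (payload TEXT)")
cur.executemany("INSERT INTO events VALUES (?)", [("a",), ("b",)])

# The workaround: a copy of the table with an autogenerated primary key.
cur.execute("""
    CREATE TABLE events_with_pk (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- a sequence/identity in PostgreSQL
        payload TEXT
    )
""")
# Copy the existing data into the keyed table; DMS will migrate this one fully.
cur.execute("INSERT INTO events_with_pk (payload) SELECT payload FROM events")

cur.execute("SELECT id, payload FROM events_with_pk ORDER BY id")
print(cur.fetchall())  # the copied rows, now with generated keys
```

From here, new writes to the original table would also need to go into the keyed copy until the cutover, as described above.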
DMS relies on pglogical for the migration work, which means the pglogical extension has to be installed on each of the databases you want to migrate. Instructions for installing pglogical on your instance and database(s) can be found here. If you’re running on Linux, the repo’s installation page is helpful. To be sure I took one for the team, I decided to see how bad it might be to migrate a PostgreSQL database installed with Homebrew on macOS to Cloud SQL. Turns out, shockingly, not too bad! Installing pglogical from source went something like this:
1) Clone the GitHub repo.
2) Get a compilation error because postgres.h can’t be found.
3) Find where Homebrew installed Postgres, find its include folder, and add all the include folders to the build; run make again, and it builds!
4) Run sudo make install because the pglogical documentation said I might need it (side note: don’t pre-optimize!).
5) It fails without any helpful messages, but eventually goes through.
6) Great success! Well, almost: can’t quite test success yet, since the instance and database(s) still have to be configured to use pglogical and replication.
The next piece is pretty straightforward if you’ve done replication in PostgreSQL before. There are some configuration variables you need to set on the instance for the replication to succeed. There are two main ways to change these values: you can either adjust them while the instance is running with ALTER SYSTEM SET <variable> TO <value>; calls, or you can change them in the configuration file, postgresql.conf. Either way, you’ll need to restart the instance for the changes to take effect.
If you want to change them in the configuration file but don’t know where it lives, it’s generally in the database’s data directory. If you only have the credentials to log in to your database, you can run SHOW data_directory; once connected, and it will give you the location of the data directory.
The variables you need to set are:
wal_level = logical # Needs to be set to logical
max_replication_slots = n # Number varies, see here for details
max_wal_senders = n # Should be max_replication_slots plus the number of actively connected replicas
max_worker_processes = n # Should be set to the number of databases being replicated
Note that the shared_preload_libraries variable is a comma-delimited list. Be careful when you set it: first check whether other libraries are already being preloaded and include them, otherwise you could unload libraries your setup requires and cause issues with the database.
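To make that concrete, here is a small sketch of the safe way to build the new value: append to the existing list rather than overwrite it. The function name is my own, not part of any tool:

```python
def add_preload_library(current_value, lib="pglogical"):
    """Return a shared_preload_libraries value that keeps the existing
    entries and appends `lib` only if it isn't already present."""
    libs = [entry.strip() for entry in current_value.split(",") if entry.strip()]
    if lib not in libs:
        libs.append(lib)
    return ",".join(libs)

# Keeps what was already being preloaded instead of clobbering it:
print(add_preload_library("pg_stat_statements"))
```

You’d feed the result into ALTER SYSTEM SET shared_preload_libraries (or the config file) rather than setting the variable to just 'pglogical'.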
Once you’ve restarted, you can verify the changes by connecting and running SHOW <variable>; e.g. SHOW wal_level should show logical.
Quick example time:
Note that these numbers are for the DMS load only. If you already have these values set for other reasons, you need to take that into account. For example, if you’ve got max_worker_processes set to 8 to handle higher parallel querying, you may want to add more on top to accommodate the replication and avoid impacting performance.
Case 1: You’re just doing a migration and immediately promoting the Cloud SQL instance. There aren’t any other replicas set up on the source, and you only have a single database you’re migrating over. In that case, you’d want to set the values to:
# Technically we only need 1 for the Cloud SQL subscriber, and the default
# is 10, so you could just leave it alone. This is just illustrating
# that you could set it lower without any issues.
max_replication_slots = 3
# Equal to max_replication_slots + 1, because we’ll only have one
# replica connected to the source instance.
max_wal_senders = 4
# Technically we only need 1 here because we’re only bringing over
# one database, but it’s always good practice to have one extra as a
# buffer in case there’s an issue, so everything doesn’t rely on a
# single worker process.
max_worker_processes = 2
Case 2: Say your on-premises instance is already set up with 5 replication slots to handle other replication you have in place, and there are 4 databases you want to migrate to the cloud. Then you’d want to set the variables like so:
# 5 for existing subscribers + 4 for each of source databases since pglogical
# requires 1 slot for each database
max_replication_slots = 9
# Equal to max_replication_slots + 6 because say we have 5 existing replicas,
# and we’ll be adding one more replica for DMS doing the migration
max_wal_senders = 15
# 4 databases we’re migrating, plus an extra as a buffer just in case
max_worker_processes = 5
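The arithmetic behind both cases can be captured in a small helper. This is a sketch with my own function and parameter names (not part of DMS), and it computes minimum values, to which you’d add whatever buffer you like:

```python
def dms_replication_settings(existing_slots=0, existing_replicas=0, n_databases=1):
    """Minimum PostgreSQL settings for a DMS migration.

    existing_slots:    replication slots already in use on the source
    existing_replicas: replicas already connected to the source
    n_databases:       databases being migrated (pglogical needs 1 slot each)
    """
    slots = existing_slots + n_databases
    return {
        "max_replication_slots": slots,
        # all current replicas plus the one DMS subscriber
        "max_wal_senders": slots + existing_replicas + 1,
        # one worker per migrated database (add a buffer on top if you like)
        "max_worker_processes": n_databases,
    }

# Case 2 from above: 5 existing slots, 5 existing replicas, 4 databases.
print(dms_replication_settings(existing_slots=5, existing_replicas=5, n_databases=4))
```

For Case 2 this yields 9 slots and 15 WAL senders, matching the values above; the worker-process count is the minimum of 4, to which the example adds 1 as a buffer.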
Once you have your variables all set, if you changed them in the config file, now’s the time you need to restart your PostgreSQL instance.
You can verify that it worked by logging in to the instance and running CREATE EXTENSION pglogical on one of the databases you’re planning to replicate. As long as that works, connect to every database you want to replicate and run the same command on each one. While you’re there, you also need to grant certain privileges on each database to the user you specified in the Define a source step when creating the migration. These grants need to happen on each database you’re replicating, as well as on the postgres database:
# On each database to migrate: on all schemas (aside from the information
# schema and schemas starting with “pg_”), including the pglogical schema
GRANT USAGE ON SCHEMA <SCHEMA> TO <USER>;
GRANT SELECT ON ALL TABLES IN SCHEMA <SCHEMA> TO <USER>;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA <SCHEMA> TO <USER>;
# On all databases, to get replication information from the source databases
GRANT SELECT ON ALL TABLES IN SCHEMA pglogical TO <USER>;
# The migration user also needs the REPLICATION attribute
ALTER USER <USER> WITH REPLICATION;
# We’re not handling it in this blog post, but if you happen to be replicating
# from RDS, it would be GRANT rds_replication TO <USER>; instead
If your source database is earlier than version 9.6, there’s an extra step, because PostgreSQL didn’t include replication delay monitoring by default before then. DMS needs this to watch whether replication lag becomes too high. I won’t cover it in detail here, since all versions before 9.6 are now end of life, but if you need it, there’s information on what to do here.
Congratulations! Your PostgreSQL instance and database(s) are fully configured and ready for DMS! Another nicety of DMS: when you’re all configured and ready to go, there’s a connectivity/configuration test in the UI that tells you whether everything is set up correctly before you hit the final “do it” button.
Remember, I mentioned that I cover a lot of the nitty-gritty details around connectivity between your source database and the Cloud SQL instance in the blog post linked at the top of this post. That post covers MySQL, so I’ll add a pitfall I ran into with PostgreSQL before I leave you.
Be sure, if you haven’t already, to enable your database to listen for and accept connections from non-localhost locations. There are two pieces to this. First, you need to change the listen_addresses variable in your postgresql.conf file. It defaults to localhost, which might work depending on how your application connects to the database, but won’t work for the migration. Second, you need to modify the pg_hba.conf file to grant the user you’re using for the migration access to your local database from the cloud. If you don’t do either of these, DMS is really good about surfacing clear error messages from the PostgreSQL instance telling you that you messed up. Ask me how I know.
And there we have it. Everything in one place to get you ready to go bringing your PostgreSQL database into Cloud SQL. If you have any questions, suggestions or complaints, please reach out to me on Twitter, my DMs are open! Thanks for reading.
Editor’s note: We’re hearing today from Freedom Financial Network, provider of technology-based solutions that help consumers establish financial security and overcome debt. To speed their migration to Google Cloud SQL, they turned to our Database Migration Service.
Freedom Financial Network’s products and services have helped hundreds of thousands of consumers reduce and consolidate their debt. During a period of significant growth, we realized we needed to transition from a monolithic to a microservices architecture on Google Cloud, helping us expand our suite of consumer products. As part of that migration, Google’s Database Migration Service (DMS) accelerated a move to fully managed Cloud SQL that took hours, not days. We had initially planned for two to three hours of downtime, but DMS let us migrate a total of 1 TB of data with no more than ten minutes of downtime per application we moved.
Starting from an on-premises infrastructure
Before moving to Google’s data cloud, our system was hosted on Rackspace, with an architecture divided across three business units. Each unit had one highly available MySQL cluster of about 600 GB, backed by a SAN with 1.8 TB of shared disk space split evenly across the three clusters. Each cluster consisted of two servers running MySQL, though only one was active at a time; they were configured for auto-failover, so if the active server failed, it would switch over to the other. The intention of the division was that if a business unit needed a database, they’d set it up in their own cluster. So each unit was essentially a monolith, containing three to four systems.
These were mostly InnoDB databases on MySQL with a variety of sizes and uses of application, including many internal systems to support our call center agents. The largest of these systems had about 500 GB of usage in just one schema. We also had some supplementary systems and public-facing websites.
We’ve had those clusters since 2013, and the three business units were always managed by a small team of just two of us. Part of our drive toward a microservices architecture was to help us manage communications between our various business units.
With Rackspace, even something small like changing the size of the disk could take a while. We needed to submit a support ticket for them to allocate and configure the disk. It’s a manual intervention, after all, and it would take two to three weeks to update.
Considering a cloud migration
When it came to determining whether to migrate to the cloud, and which provider and tools to use, our clusters were a big early consideration. They were vastly over-provisioned, with way more resources than we needed for the reason described above.
With Cloud SQL, we really liked the ability to split the clusters and size them appropriately. We’d also have more flexibility, since we wanted to migrate to a higher MySQL version. Because of the structure of the clusters, upgrading in place would have required significant time and effort. We would have had to update one cluster at a time, with each cluster probably affecting 60-70 percent of our engineering teams. Just trying to coordinate that would have been a nightmare. So Cloud SQL gave us the ability to do it more slowly, team by team or app by app, once we moved them to different database instances.
Choosing Database Migration Service
At first, we didn’t consider an auto-migrating solution because there wasn’t a lot of complexity in the migration, especially for our small databases. We’d just have to bring the application down, export the database, import it to Cloud SQL, and bring it back up. That worked for most of our applications.
But toward the end of the process, two applications remained. The previous year, we’d tried the migration process with one of them, and it had stalled at the database, because we couldn’t make the dump and load process quick enough. It would have required 12–15 hours of downtime, and that just wasn’t an option, as the application was the backbone of our business. That much downtime would have made a tangible impact on our business, and not just financially. We needed a new solution.
Through conversations with our product team and our Google contact, we learned about Google Cloud’s Database Migration Service (then in private preview), which provides a serverless migration experience from MySQL to Cloud SQL for MySQL with continuous data replication, at no additional cost. Before DMS, we’d been looking at other options, including offline migration and external master (the pre-DMS solution on Google Cloud), none of which would have worked for us.
Testing, then migrating
At first with DMS, we performed a test run with staging databases just to validate that it would work. Once we corrected some problems with our usage and got it all configured correctly, it worked exactly as it was supposed to. Then we started the process. For that one backbone application team, they launched their migration again, while in the background I set up the replicas of their production instance so that they could manage the staging.
In Rackspace, our applications were running on virtual machines (VMs) and connecting to our MySQL instances. Part of the move to Google Cloud was also to migrate applications to containers on Google Kubernetes Engine (GKE).
We performed three migrations with DMS, one of which involved four applications. Each time, the application teams deployed the applications to the future production environment on GKE, but they were not marked as live. We would test the applications with a Cloud SQL instance that had staging data that was brought in using DMS to test that the applications were running correctly. Then we would initiate migration with DMS and once the environments were in sync and we scheduled the cutover date, we would just bring down the applications on Rackspace and update the DNS records to point to Google Cloud.
With all three clusters, we migrated five logical databases of varying sizes, between 240–500 GB of data, for a total of around 1 TB of data.
Downtimes of minutes, not hours
Migrations were much faster with DMS than we expected. From the time that we told DMS to dump and load to the Cloud SQL instance, to completion, they were all done and fully synchronized within 12–13 hours. We’d kick one off in the afternoon, and by the time we got back the next morning, it was done. We’d actually been setting aside a few days for this task, so this was a great improvement.
Initially, when planning the migration, we figured that a downtime of 2–3 hours might have been workable—not ideal, but workable. But once we were up to speed with our competence on DMS, the actual downtime for each application from the database side was a maximum of ten minutes. This was a great improvement for every team in our organization.
DMS had step-by-step instructions that helped us perform the migrations successfully, without any loss of data. With DMS and Google Cloud’s help, we transformed our monolithic architecture to a microservices architecture, deployed on GKE and using the Cloud SQL Proxy in a sidecar container pattern, or the Go proxy library, to connect to Cloud SQL.
We now have a more secure, versatile, and powerful data cloud infrastructure, and because it’s fully managed by Google Cloud, our developers have more time to focus on expanding our customer products.
We’re excited to announce that Google Cloud’s Database Migration Service (DMS) is generally available, supporting MySQL and PostgreSQL migrations from on-premises and other clouds to Cloud SQL. Later this year we will introduce support for Microsoft SQL Server. You can get started with DMS today at no additional charge.
Enterprises are modernizing their business infrastructure with managed cloud services. They want to leverage the reliability, security, and cost-effectiveness of fully managed cloud databases like Cloud SQL. In November, we launched the new, serverless DMS as part of our vision for meeting these modern needs in an easy, fast, predictable, and reliable way.
We’ve seen accelerated adoption of DMS, including customers such as Accenture, Comoto, DoiT, Freedom Financial Network, Ryde and Samsung, who are migrating their MySQL and PostgreSQL production workloads to Cloud SQL. DMS provides these customers the power to migrate quickly and with minimal disruption to their services.
Freedom Financial Network quickly migrated their large MySQL databases to Cloud SQL. Christopher Testroet, their principal engineer, said “Initially, when planning the migration, we figured that a planned downtime of 2–3 hours might have been workable—not ideal, but workable. But once we were up to speed with our competence on DMS, the actual downtime for each application from the database side was a maximum of ten minutes. This was a great improvement for every team in our organization.”
We worked closely during the DMS preview period with DoiT, a company that specializes in helping their customers with cloud migrations. “We see many customers that either want to migrate their business from on-premises to the cloud or are already in the cloud and want to migrate to a different provider,” says Mike Royle, Staff Cloud Architect at DoiT International. “One of the key pain points that prevents customers from completing these migrations is downtime. PostgreSQL customers typically have very large databases, meaning they are facing hours of downtime, which for most customers is just not realistic. With DMS, we can support our customers in migrating their databases with close to zero downtime.”
Migrating your databases to Cloud SQL is a critical step in the journey to the cloud, and DMS provides a simple, serverless, and reliable path forward. “We are using Compute Engine for our servers, Google Vision for text recognition, Google Maps for validating addresses and calculating routes for our transfer services,” says Nicolas Candela Alvarez, IT Director at The Excellence Collection. “With DMS we moved our database to Cloud SQL and transitioned to a fully managed database that keeps up with our rapid business growth.”
Getting to know DMS
Customers are choosing DMS to migrate their MySQL and PostgreSQL databases because of its differentiated approach:
Lifting and shifting your database shouldn’t be complicated: database preparation documentation, secure connectivity setup, and migration validation should be built right into the flow. DMS delivered on this experience with MySQL migrations, and has expanded it to include PostgreSQL. “What makes this tool powerful is that it’s an easy gateway to Cloud SQL,” says Valeed Malik, Chief Technology Officer (CTO) at SaRA Health. “Not having a vast replication background was not a barrier since the documentation both inside and outside the product was rich, which you may not expect on other platforms.”
Migrating your database shouldn’t interfere with running your business. DMS migrations allow you to continuously replicate database changes from your source to Cloud SQL to allow for fast cutover and minimal database downtime. “We were tired of babysitting our PostgreSQL instances, maintaining patches, rotating backups, monitoring replication, etc. However, we needed to move to Cloud SQL with minimal downtime,” says Caleb Shay, Database Engineer at Comoto. “DMS allowed us to perform this migration quickly and without any disruption to our business.”
Reliable and complete
DMS’s unique migration method, which utilizes both MySQL and PostgreSQL’s native replication capabilities, maximizes security, fidelity and reliability. These like-to-like migrations to Cloud SQL are high-fidelity, and the destination database is ready to go after cutover, without the hassle of extra steps, and at no additional charge.
Serverless and secure
With DMS’s serverless architecture, you don’t need to worry about provisioning or managing migration-specific resources. Migrations are high performance, minimizing downtime no matter the scale. DMS also keeps your migrated data secure, supporting multiple methods of private connectivity between source and destination databases.
“Establishing connectivity is often viewed as hard. The in-product guidance DMS introduced allowed us to easily create a secure tunnel between the source and the new Cloud SQL instance and ensure our data is safe and secured,” says Andre Susanto, Database Engineer at Family Zone.
Getting started with Database Migration Service
You can start migrating your PostgreSQL and MySQL workloads today using DMS:
Navigate to the Database Migration area of your Google Cloud console, under Databases, and click Create Migration Job.
Choose the database type you want to migrate, and see what actions you need to take to set up your source for successful migration.
Create your source connection profile, which can later be used for additional migrations.
Create a Cloud SQL destination that fits your business needs.
Define how you want to connect your source and destination, with both private and public connectivity methods supported.
Test your migration job and make sure the test was successful, then start it whenever you’re ready.
Once historical data has been migrated to the new destination, DMS will keep up and replicate new changes as they happen. You can then promote the migration job, and your new Cloud SQL instance will be ready to go. You can monitor your migration jobs on the migration jobs list as shown in the image below:
Learn more and start your database journey
DMS is now generally available for MySQL and PostgreSQL migrations from all types of sources, both on-premises and in the cloud. Looking for SQL Server migrations? You can request access to participate in the SQL Server preview.
Customer stories provided through a TechValidate survey conducted February 2021.