Understanding the fundamentals of tagging in Data Catalog
Google Cloud Data Catalog is a fully managed and scalable metadata management service. Data Catalog helps your organization quickly discover, understand, and manage all your data from one simple interface, letting you gain valuable business insights out of your data investments. One of Data Catalog’s core concepts, called tag templates, helps you organize complex metadata while making it searchable under Cloud Identity and Access Management ( Cloud IAM) control. In this post, we’ll offer some best practices and useful tag templates (referred to as templates from here) to help you start your journey.
Understanding Data Catalog templates
tag template is a collection of related fields that represent your vocabulary for classifying data assets. Each field has a name and a type. The type can be a
datetime. When the type is an
enum, the template also stores the possible values for this field. The fields are stored as an unordered set in the template and each field is treated as optional unless marked as required. A required field means that a value must be assigned to this field each time the template is in use. An optional field means it can be left out when an instance of this template is created.
You’ll create instances of templates when tagging data resources, such as BigQuery tables and views. Tagging means associating a tag template with a specific resource and assigning values to the template fields to describe the resource. We refer to these tags as
structured tags because the fields in these tags are typed as instances of the template. Typed fields let you avoid common misspellings and other inconsistencies, a known pitfall with simple key value pairs.
Two common questions we hear about Data Catalog templates are: What kind of fields should go into a template and how should templates be organized? The answer to the first question really depends on what kind of metadata your organization wants to keep track of and how that metadata will be used. There are various metadata use cases, ranging from data discovery to data governance, and the requirements for each one should drive the contents of the templates.
Let’s look at a simple example of how you might organize your templates. Suppose the goal is to make it easier for analysts to discover data assets in a data lake because they spend a lot of time searching for the right assets. In that case, create a Data Discovery template, which would categorize the assets along the dimensions that the analysts want to search. This would include fields such as
creation_date, etc. If the data governance team wants to categorize the assets for data compliance purposes, you can create a separate template with governance-specific fields, such as
storage_location, etc. In other words, we recommend creating templates to represent a single concept, rather than placing multiple concepts into one template. This avoids confusing those who are using the templates and helps the template administrators maintain them over time.
Some clients create their templates in multiple projects, others create them in a central project, and still others use both options. When creating templates that will be used widely across multiple teams, we recommend creating them in a central project so that they are easier to track. For example, a data governance template is typically maintained by a central group. This group might meet monthly to ensure that the fields in each template are clearly defined and decide how to handle requirements for additional fields. Storing their template in a central project makes sense for maintainability. When the scope of the template is restricted to one team, such as a data discovery template that is customized to the needs of one data science team, then creating the template in that team’s project makes more sense. When the scope is even more restricted, say to one individual, then creating the template in their personal project makes more sense. In other words, choose the storage location of a template based on its scope.
Access control for templates
Data Catalog offers a wide range of permissions for managing access to templates and tags. Templates can be completely private, only visible to authorized users (through the
tag template viewer role), as well as visible and used by authorized users for creating tags (through the
tag template user role). When a template is visible, authorized users can not only view the contents of the template, but also search for assets that were tagged using the template (as long as they also have access to view those underlying assets). You can’t search for metadata if you don’t have access to the underlying data. To obtain read access to the cataloged assets, they would need to be granted the
Data Catalog Viewer role; alternately, the
BigQuery Metadata Viewer role can be used if the underlying assets are stored in BigQuery.
In addition to the viewer and user roles, there is also the concept of a template creator (via the tag template creator role) and template owner (via the
tag template owner role). The creator can only create new templates, while the owner has complete control of the template, which includes rights to delete it. Deleting a template has the ripple effect of deleting all the tags created from the template. For creating and modifying tags, use the
tag editor role. This role should be used in conjunction with a tag template role so that users can access the templates from which to tag.
Billing considerations for templates
There are two components to Data Catalog’s billing: metadata storage and API calls. For storage, projects in which templates are created incur the billing charges pertaining to templates and tags. They are billed for their templates’ storage usage even if the tags created from those templates are on resources that reside in different projects. For example, project A owns a Data Discovery template and project B uses this template to tag its own resources in BigQuery. Project A will incur the billing charges for Project B’s tags because the Data Discovery template resides in project A.
From an API calls perspective, the charges are billed to the project selected when the calls are made for searching, reading, and writing. More details on pricing are available from the product documentation page.
Another common question we hear from potential clients is: Do you have prebuilt templates to help us get started with creating our own? Due to the popularity of this request, we created a few examples to illustrate the types of templates being deployed by our users. You can find them in YAML format below and through a GitHub repo. There is also a script in the same repo that reads the YAML-based templates and creates the actual templates in Data Catalog.
Data governance template
The data governance template categorizes data assets based on their domain, environment, sensitivity, ownership, and retention details. It is intended to be used for data discovery and compliance with usage policies such as GDPR and CCPA. The template is expected to grow over time with the addition of new policies and regulations around data usage and privacy.
Derived data template
The derived data template is for categorizing derivative data that originates from one or more data sources. Derivative data is produced through a variety of means, including Dataflow pipelines, Airflow DAGs, BigQuery queries, and many others. The data can be transformed in multiple ways, such as aggregation, anonymization, normalization, etc. From a metadata perspective, we want to broadly categorize those transformations as well as keep track of the data sources that produced it. The
parents field in the template is for storing the uris of the origin data sources and is populated by the process producing the derived data. It is declared as a string because complex types are not supported by Data Catalog as of this writing.
Data quality template
The data quality template is intended to store the results of various quality checks to help in assessing the accuracy of the underlying data. Unlike the previous two templates, which are attached to a whole table, this one is attached to a specific column of a table. This would typically be an important numerical column that is used by critical business reports. As Data Catalog already ingests the schema of BigQuery tables through its technical metadata, this template omits the data type of the column and stores only the results of the quality checks. The quality checks are customizable and can easily be implemented in BigQuery.
Data engineering template
The data engineering template is also attached to individual columns of a table. It is intended for describing how those columns are mapped to the same data in a different storage system. Its goal is to support database replication scenarios such as warehouse migrations to BigQuery, continuous real-time replication to BigQuery, and replication to a data lake on Cloud Storage. In those scenarios, data engineers want to capture the mappings between the source and target columns of tables for two primary reasons: facilitate querying the replicated data, which usually has a different schema in BigQuery than the source; and capture how the data is being replicated so that replication issues can be more easily detected and resolved.
Related Google News:
- Run Data Science at Scale with Dataproc and Apache Spark February 23, 2021
- What’s in a name? Understanding the Google Cloud network “edge” February 22, 2021
- Architect your data lake on Google Cloud with Data Fusion and Composer February 19, 2021
- To the cloud and beyond! Planning a multi-year data center migration February 17, 2021
- Databricks on Google Cloud: an open integrated platform for data, analytics and machine learning February 17, 2021
- NOAA and Google Cloud: A data match made in the cloud February 12, 2021
- 3D Scene Understanding with TensorFlow 3D February 11, 2021
- Architecting a data lineage system for BigQuery February 9, 2021