Doing our part to share open data responsibly
This past weekend marked Open Data Day, an annual celebration of making data freely available to everyone. Communities around the world organized events, and we’re taking a moment here at Google to share our own perspective on the importance of open data. More accessible data can meaningfully help people and organizations, and we’re doing our part by opening datasets, providing access to APIs and aggregated product data, and developing tools to make data more accessible and useful.
Responsibly opening datasets
Sharing datasets is increasingly important as more people adopt machine learning through open frameworks like TensorFlow. We’ve released over 50 open datasets for other developers and researchers to use. These include YouTube-8M, a corpus of annotated videos used externally for video understanding; the HDR+ Burst Photography dataset, which helps others experiment with the technology that powers Pixel features like Portrait Mode; and Open Images, along with the Open Images Extended dataset, which increases photo diversity.
Just because data is open doesn’t mean it will be useful, however. First, a dataset needs to be cleaned so that any insights developed from it are based on well-structured and accurate examples. Cleaning a large dataset is no small feat; before opening up our own, we spend hundreds of hours standardizing data and validating quality. Second, a dataset should be shared in a machine-readable format that’s easy for others to use, such as JSON rather than PDF. Finally, consider whether the dataset is representative of the intended content. Even if data is usable and representative of some situations, it may not be appropriate for every application. For instance, if a dataset contains mostly North American animal images, it may help you classify a deer, but not a giraffe. Tools like Facets can help you analyze the makeup of a dataset and evaluate the best ways to put it to use. We’re also working to build more representative datasets through interfaces like the Crowdsource application. To guide others’ use of your own dataset, consider publishing a data card which denotes authorship, composition and suggested use cases (here’s an example from our Open Images Extended release).
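As a rough illustration of the representativeness check described above, here is a minimal Python sketch that computes each label's share of a dataset and flags labels that fall below a threshold. The function names, the 5% threshold and the toy labels are all assumptions made for this example; this is not how Facets works internally, just a simple first look at a dataset's makeup.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, expected_labels, min_share=0.05):
    """List expected labels whose share is below min_share, including absent ones."""
    shares = label_shares(labels)
    return sorted(
        label for label in expected_labels if shares.get(label, 0.0) < min_share
    )


# Toy labels (made up for illustration) standing in for an image dataset
# that is skewed toward North American animals.
labels = ["deer"] * 60 + ["raccoon"] * 38 + ["giraffe"] * 2

# Labels the intended application needs to recognize (hypothetical).
expected = ["deer", "giraffe", "raccoon", "zebra"]

flagged = underrepresented(labels, expected)
print(flagged)
```

A dataset that flags "giraffe" and "zebra" here may still be a good fit for classifying deer; the check only tells you where it is likely to fall short.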
Making data findable and useful
It’s not enough to just make good data open, though: it also needs to be findable. Kaggle, a community of data scientists and machine learning enthusiasts, helps users store and query large datasets. Nonetheless, researchers, developers, journalists and other curious data-seekers often struggle to locate data scattered across the web’s thousands of repositories. Our Dataset Search tool helps people find data sources wherever they’re hosted, as long as the data is described in a way that search engines can locate. Since the tool launched a few months ago, we’ve seen the number of unique datasets on the platform double to 10 million, including contributions from the U.S. National Oceanic and Atmospheric Administration (NOAA), the National Institutes of Health (NIH), the Federal Reserve, the European Data Portal, the World Bank and government portals from every continent.
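One common way to describe a dataset so that search engines can locate it is schema.org markup embedded in the dataset's web page. The Python sketch below builds a small schema.org Dataset description as JSON-LD. The helper names, the example.org URLs and the sample values are placeholders invented for this illustration, and the properties shown are only a small subset of what schema.org defines.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url, encoding_format):
    """Build a minimal schema.org Dataset description as a Python dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def as_script_tag(metadata):
    """Wrap the metadata in the script tag used to embed JSON-LD in an HTML page."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# All values below are hypothetical placeholders.
metadata = dataset_jsonld(
    name="Example animal image labels",
    description="Labels for a small collection of animal photos.",
    url="https://example.org/datasets/animal-labels",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/animal-labels.json",
    encoding_format="application/json",
)
print(as_script_tag(metadata))
```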
What makes data useful is how easily it can be analyzed. Though there’s more open data today, data scientists spend significant time analyzing it across multiple sources. To help solve that problem, we’ve created Data Commons. It’s a knowledge graph of data sources that lets users treat various datasets of interest—regardless of source and format—as if they are all in a single local database. Anyone can contribute datasets or build applications powered by the infrastructure. For people using the platform, that means less time engineering data and more time generating insights. We’re already seeing exciting use cases of Data Commons. In one UC Berkeley data science course taught by Josh Hug and Fernando Perez, students used Census, CDC and Bureau of Labor Statistics data to correlate obesity levels across U.S. cities with other health and economic factors. Typically, that analysis would take days or weeks; using Data Commons, students were able to build high-fidelity models in less than an hour. We hope to partner with other educators and researchers—if you’re interested, reach out to email@example.com.
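To make the "single local database" idea concrete, here is a small Python sketch of the manual work it replaces: joining two sources on a shared place identifier and then correlating the joined values. It does not call the Data Commons API. The place identifiers, indicator names and numbers are synthetic placeholders, and the helper functions are assumptions made for this example.

```python
from math import sqrt


def join_on_place(*sources):
    """Keep only places present in every source; return {place: tuple_of_values}."""
    shared = set(sources[0]).intersection(*sources[1:])
    return {place: tuple(src[place] for src in sources) for place in sorted(shared)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Synthetic values keyed by made-up place IDs; these are not real statistics.
indicator_x = {"city_a": 1.0, "city_b": 2.0, "city_c": 3.0, "city_d": 4.0}
indicator_y = {"city_a": 2.0, "city_b": 4.0, "city_c": 6.0, "city_e": 1.0}

joined = join_on_place(indicator_x, indicator_y)
xs = [values[0] for values in joined.values()]
ys = [values[1] for values in joined.values()]
r = pearson(xs, ys)
print(sorted(joined), r)
```

With real sources, most of the effort goes into the step this sketch glosses over: reconciling identifiers, units and formats so that the keys actually match.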
There are trade-offs to opening up data, and we aim to balance various sensitivities with the potential benefits of sharing. One consideration is that broad data openness can facilitate uses that don’t align with our AI Principles. For instance, we recently made synthetic speech data available only to researchers participating in the 2019 ASVspoof Challenge, to ensure that the data can be used to develop tools to detect deepfakes, while limiting misuse.
Extreme data openness can also risk exposing user or proprietary information, causing privacy breaches or threatening the security of our platforms. We allow third party developers to build on services like Maps, Gmail and more via APIs, so they can build their own products while user data is kept safe. We also publish aggregated product data like Search Trends to share information of public interest in a privacy-preserving way.
While there can be benefits to using sensitive data in controlled and principled ways, like predicting medical conditions or events, it’s critical that safeguards are in place so that training machine learning models doesn’t compromise individual privacy. Emerging research provides promising new avenues to learn from sensitive data. One is Federated Learning, a technique for training global ML models without data ever leaving a person’s device, which we’ve recently made available as open source through TensorFlow Federated. Another is Differential Privacy, which can offer strong guarantees that training data details aren’t inappropriately exposed in ML models. Additionally, researchers are increasingly experimenting with small training datasets and zero-shot learning, as we demonstrated in our recent prostate cancer detection research and work on Google Translate.
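The two techniques above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the TensorFlow Federated API and not how differentially private model training is done in practice. The first function shows only the server-side aggregation step of federated averaging, where each device sends model weights rather than raw data. The second shows the Laplace mechanism on a simple count query, a textbook form of differential privacy that is much simpler than applying it to model training. All function names and values are made up for this example.

```python
import random


def federated_average(client_updates):
    """Average client weight vectors, weighted by each client's example count.

    client_updates is a list of (weights, num_examples) pairs. Only these
    weights reach the server; the raw examples stay on each device.
    """
    total = sum(num_examples for _, num_examples in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * num_examples for weights, num_examples in client_updates) / total
        for i in range(dim)
    ]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count. A count has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Two hypothetical devices with locally trained weights and example counts.
global_weights = federated_average([([1.0, 2.0], 10), ([3.0, 4.0], 30)])
print(global_weights)

# Noisy count of values above a threshold, with a seeded generator for repeatability.
rng = random.Random(0)
noisy = private_count([3, 8, 12, 15], lambda v: v > 5, epsilon=1.0, rng=rng)
print(noisy)
```

A smaller epsilon adds more noise and gives a stronger privacy guarantee, at the cost of a less accurate answer.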
We hope that our efforts will help people access and learn from clean, useful, relevant and privacy-preserving open data from Google to solve the problems that matter to them. We also encourage other organizations to consider how they can contribute—whether by opening their own datasets, facilitating usability by cleaning them before release, using schema.org metadata standards to increase findability, enhancing transparency through data cards or considering trade-offs like user privacy and misuse. To everyone who has come together over the past week to celebrate open data: we look forward to seeing what you build.