In December, Texas Attorney General Paxton filed a complaint about our ad tech business and hired contingency-fee plaintiff lawyers to handle the case. We look forward to showing in court why AG Paxton’s allegations are wrong. But given some of the misleading claims that have been circulating—in particular, the inaccurate portrayal of our well-publicized “Open Bidding” agreement with Facebook—we wanted to set the record straight.
About our ad services
Ad tech helps websites and apps make money and fund high-quality content. It also helps our advertising partners—most of whom are small merchants—reach customers and grow their businesses.
AG Paxton tries to paint Google’s involvement in this industry as nefarious. The opposite is true. Unlike some B2B companies in this space, a consumer internet company like Google has an incentive to maintain a positive user experience and a sustainable internet that works for all—consumers, advertisers and publishers.
For example, as we’ve built our ad tech products, we have given people granular controls over how their information is used to personalize ads and limited the sharing of personal data to safeguard people’s privacy. We’ve invested in detecting and blocking harmful ads that violate our policies. We also build tools that load content and ads faster; block scammy ad experiences like pop-ups; and reduce the number of intrusive, annoying ads through innovations like skippable ads. Those tools not only help people, but by building trust, promote the sustainability of the free and open internet.
We’ve worked to be open and upfront with the industry about the improvements we make to our technologies. We try to do the right thing as we balance the concerns of publishers, advertisers, and the people who use our services. Our ad tech rivals and large partners may not always like every decision we make—we’re never going to be able to please everybody. But that’s hardly evidence of wrongdoing and certainly not a credible basis for an antitrust lawsuit.
Here are just a few of the things AG Paxton’s complaint gets wrong:
Myth: Google “dominates the online advertising landscape for image-based web display ads.”
Fact: The ad tech industry is incredibly crowded and competitive.
Competition in online advertising has made ads more affordable and relevant, reduced ad tech fees, and expanded options for publishers and advertisers.
The online advertising space is famously crowded. We compete with household names like Adobe, Amazon, AT&T, Comcast, Facebook, Oracle, Twitter and Verizon. Facebook, for example, is the largest seller of display ads and Amazon last month surpassed us as the preferred ad buying platform for advertisers. We compete fiercely with those companies and others such as Mediaocean, Amobee, MediaMath, Centro, Magnite, The Trade Desk, Index Exchange, OpenX, PubMatic and countless more. A growing number of retail brands such as Walmart, Walgreens, Best Buy, Kroger and Target are also offering their own ad tech.
Myth: Google “extracts a very high … percent of the ad dollars otherwise flowing to online publishers.”
Fact: Our fees are actually lower than reported industry averages.
Our ad tech fees are lower than reported industry averages. Publishers keep about 70 percent of the revenue when using our products, and for some types of advertising, publishers keep even more—that’s more money in publishers’ pockets to fund their creation of high-quality content.
Myth: We created an alternative to header bidding that “secretly stacks the deck in Google’s favor.”
Fact: We created Open Bidding to address the drawbacks of header bidding.
Header bidding refers to running an auction among multiple ad exchanges for given ad space. You won’t read this in AG Paxton’s complaint, but the technology has real drawbacks: Header bidding auctions take place within the browser, on your computer or mobile phone, so they require the device to use more data in order to work. This can lead to problems like webpages taking longer to load and device batteries draining faster. And the multilayered complexity of header bidding can lead to fraud and other problems that can artificially increase prices for advertisers, as well as billing discrepancies that can hurt publisher revenue.
So we created an alternative to header bidding, called Open Bidding, which runs within the ad server instead of on your device. This solves many of the problems associated with header bidding. Open Bidding provides publishers access to demand from dozens of networks and exchanges. This helps increase demand for publisher inventory and competition for ad space, which enables publishers to drive more revenue. In fact, our data shows that publishers who decide to use Open Bidding on Ad Manager typically see double-digit revenue increases across our partners and exchange—and they can measure this for themselves.
Additionally, our publisher platform has always integrated with header bidding, so publishers have the choice to use their preferred bidding solution. Publishers can and do bring bids from non-Google header bidding tools into our platform.
Since we launched Open Bidding, traditional header bidding has continued to grow. In fact, a recent survey shows about 90 percent of publishers currently use header bidding for desktop and 60 percent use header bidding for mobile in-app or in-stream video. Amazon also launched an entirely new competitive header bidding solution, which uses the same server-side approach that we do. Header bidding is an evolving and growing space—and now, as a result of our work, there are alternatives to header bidding that improve the user experience.
Myth: Our Open Bidding agreement with Facebook harms publishers.
Fact: Facebook is one of over 25 partners in Open Bidding, and their participation actually helps publishers.
AG Paxton also makes misleading claims about Facebook’s participation in our Open Bidding program. Facebook Audience Network (FAN)’s involvement isn’t a secret. In fact, it was well-publicized and FAN is one of over 25 partners participating in Open Bidding. Our agreement with FAN simply enables them (and the advertisers they represent) to participate in Open Bidding. Of course we want FAN to participate because the whole goal of Open Bidding is to work with a range of ad networks and exchanges to increase demand for publishers’ ad space, which helps those publishers earn more revenue. FAN’s participation helps that. But to be clear, Open Bidding is still an extremely small part of our ad tech business, accounting for less than 4 percent of the display ads we place.
AG Paxton inaccurately claims that we manipulate the Open Bidding auction in FAN’s favor. We absolutely don’t. FAN must make the highest bid to win a given impression. If another eligible network or exchange bids higher, they win the auction. FAN’s participation in Open Bidding doesn’t prevent Facebook from participating in header bidding or any other similar system. In fact, FAN participates in several similar auctions on rival platforms.
And AG Paxton’s claims about how much we charge other Open Bidding partners are mistaken—our standard revenue share for Open Bidding is 5-10 percent.
Myth: AMP was designed to hurt header bidding.
Fact: AMP was designed in partnership with publishers to improve the mobile web.
AG Paxton’s claims about AMP and header bidding are just false. Engineers at Google designed AMP in partnership with publishers and other tech companies to help webpages load faster and improve the user experience on mobile devices—not to harm header bidding.
AMP supports a range of monetization options, including header bidding. Publishers are free to use both AMP and header bidding technologies together if they choose. The use of header bidding doesn’t factor into publisher search rankings.
Myth: We force partners to use Google tools.
Fact: Partners can readily use our tools and other technologies side by side.
This claim isn’t accurate either. Publishers and advertisers often use multiple technologies simultaneously. In fact, surveys show the average large publisher uses six different platforms to sell ads on its site, and plans to use even more this year. And the top 100 advertisers use an average of four or more platforms to buy ads.
All of this is why we build our technologies to be interoperable with more than 700 rival platforms for advertisers and 80 rival platforms for publishers.
AG Paxton’s complaint talks about the idea that we offer tools for both advertisers and publishers as if that’s unusual or problematic. But that reflects a lack of knowledge of the online ads industry, where serving both advertisers and publishers is actually commonplace. Many firms with competing ad tech businesses, such as AT&T, Amazon, Twitter, Verizon, Comcast and others, offer ad platforms and tools like ours that cater to both advertisers and publishers. We don’t require either advertisers or publishers to use our whole “stack,” and many don’t. Ultimately, advertisers and publishers can choose what works best for their needs.
Myth: “Google uses privacy concerns to advantage itself.”
Fact: Consumers expect us to secure their data—and we do.
There are many other things this complaint simply gets wrong. You can read more about our ad tech business by visiting our competition website.
We look forward to defending ourselves in court. In the meantime, we’ll continue our work to help publishers and advertisers grow with digital ads and create a sustainable advertising industry that supports free content for everyone.
I recently talked about orchestration versus choreography in connecting microservices and introduced Workflows for use cases that can benefit from a central orchestrator. I also mentioned Eventarc and Pub/Sub in the choreography camp for more loosely coupled event-driven architectures.
In this blog post, I talk more about the unified eventing experience that Eventarc provides.
What is Eventarc?
We announced Eventarc back in October as a new eventing functionality that enables you to send events to Cloud Run from more than 60 Google Cloud sources. It works by reading Audit Logs from various sources and sending them to Cloud Run services as events in CloudEvents format. It can also read events from Pub/Sub topics for custom applications.
Getting events to Cloud Run
There are already other ways to get events to Cloud Run, so you might wonder what's special about Eventarc. I'll get to that question, but let's first explore one of those ways: Pub/Sub.
As shown in this Using Pub/Sub with Cloud Run tutorial, Cloud Run services can receive messages pushed from a Pub/Sub topic. This works if the event source can directly publish messages to a Pub/Sub topic. It can also work for services that have integration with Pub/Sub and publish their events through that integration. For example, Cloud Storage is one of those services and in this tutorial, I show how to receive updates from a Cloud Storage bucket using a Pub/Sub topic in the middle.
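The envelope that Pub/Sub pushes to a service is simple to work with. Here's a hedged, dependency-free sketch (all names and values are illustrative): Pub/Sub POSTs a JSON body whose payload is base64-encoded in `message.data`.

```javascript
// Decode the JSON envelope that Pub/Sub POSTs to a push endpoint.
// The actual payload arrives base64-encoded in message.data.
function decodePubSubPush(envelope) {
  const message = envelope.message || {};
  return {
    id: message.messageId,
    attributes: message.attributes || {},
    data: Buffer.from(message.data || '', 'base64').toString('utf8'),
  };
}

// Example envelope, shaped like what Pub/Sub would deliver (values invented):
const envelope = {
  message: {
    messageId: '1234',
    attributes: { source: 'example' },
    data: Buffer.from('Hello, Cloud Run!').toString('base64'),
  },
  subscription: 'projects/my-project/subscriptions/my-push-sub',
};
console.log(decodePubSubPush(envelope).data); // → Hello, Cloud Run!
```

A real Cloud Run service would run this inside an HTTP handler and respond with a success status so Pub/Sub doesn't redeliver the message.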
For services with no integration with Pub/Sub, you either have to integrate them with Pub/Sub yourself and configure Pub/Sub to route messages to Cloud Run, or find another way of sourcing those events. It's possible, but definitely not trivial. That's where Eventarc comes into play.
Immediate benefits of Eventarc
Eventarc provides an easier path to receive events not only from Pub/Sub topics but from a number of Google Cloud sources with its Audit Log and Pub/Sub integration. Any service with Audit Log integration or any application that can send a message to a Pub/Sub topic can be event sources for Eventarc. You don’t have to worry about the underlying infrastructure with Eventarc. It is a managed service with no clusters to set up or maintain.
It also has some concrete benefits beyond the easy integration. It provides consistency and structure to how events are generated, routed, and consumed. Let’s explore those benefits next.
Simplified and centralized routing
Eventarc introduces the notion of a trigger. A trigger specifies routing rules from event sources to event sinks. For example, one can listen for new object creation events in Cloud Storage and route them to a Cloud Run service by simply creating an Audit Log trigger as follows:
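For example, a trigger routing Cloud Storage object-creation events (via Audit Logs) to a Cloud Run service might look like this; the trigger and service names are placeholders, and the flags reflect the beta-era syntax:

```shell
gcloud beta eventarc triggers create my-storage-trigger \
  --destination-run-service=my-service \
  --matching-criteria="type=google.cloud.audit.log.v1.written" \
  --matching-criteria="serviceName=storage.googleapis.com" \
  --matching-criteria="methodName=storage.objects.create"
```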
If you want to listen for messages from Pub/Sub instead, that’s another trigger:
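A Pub/Sub trigger only needs the event type; the names here are again placeholders:

```shell
gcloud beta eventarc triggers create my-pubsub-trigger \
  --destination-run-service=my-service \
  --matching-criteria="type=google.cloud.pubsub.topic.v1.messagePublished"
```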
This trigger creates a Pub/Sub topic under the covers. Applications can send messages to that topic and those messages are routed to the specified Cloud Run service by Eventarc.
Users can also create triggers from the Google Cloud Console, under the Triggers section of Cloud Run.
By having event routing defined as triggers, users can list and manage all their triggers in one central place in Eventarc. Here’s the command to see all created triggers:
```shell
gcloud beta eventarc triggers list
```
Consistency with eventing format and libraries
In Eventarc, different events from different sources are converted to CloudEvents compliant events. CloudEvents is a specification for describing event data in a common way with the goal of consistency, accessibility and portability.
A CloudEvent includes context and data about the event:
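An illustrative example (all values invented), with the context attributes alongside the data payload:

```json
{
  "specversion": "1.0",
  "type": "google.cloud.audit.log.v1.written",
  "source": "//cloudaudit.googleapis.com/projects/my-project/logs/data_access",
  "subject": "storage.googleapis.com/projects/_/buckets/my-bucket/objects/photo.jpg",
  "id": "1234567890",
  "time": "2021-01-15T10:30:00Z",
  "datacontenttype": "application/json",
  "data": {
    "protoPayload": {
      "serviceName": "storage.googleapis.com",
      "methodName": "storage.objects.create",
      "resourceName": "projects/_/buckets/my-bucket/objects/photo.jpg"
    }
  }
}
```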
Event consumers can read these events directly. We also try to make it easier in various languages (Node.js, Python, Go, Java, C# and more) with CloudEvents SDKs to read the event and Google Events libraries to parse the data field.
Going back to our Cloud Storage example earlier, in Node.js you would use the CloudEvents SDK to read the incoming event and the Google Events library to parse its Audit Log payload.
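As a stand-in for the SDK calls, this dependency-free sketch shows roughly what they do: in binary mode, the CloudEvents context arrives in `ce-*` HTTP headers and the Audit Log payload in the JSON body (the `resourceName` value below is illustrative):

```javascript
// Sketch (not the CloudEvents SDK itself): assemble a CloudEvent from the
// ce-* headers and JSON body of a binary-mode HTTP request.
function toCloudEvent(headers, rawBody) {
  const event = { data: JSON.parse(rawBody) };
  for (const [name, value] of Object.entries(headers)) {
    if (name.toLowerCase().startsWith('ce-')) {
      // e.g. ce-type -> type, ce-source -> source
      event[name.slice(3).toLowerCase()] = value;
    }
  }
  return event;
}

// An Audit Log event for a Cloud Storage object creation carries the bucket
// and object in protoPayload.resourceName (illustrative value):
const event = toCloudEvent(
  { 'ce-id': '1234', 'ce-type': 'google.cloud.audit.log.v1.written' },
  JSON.stringify({
    protoPayload: { resourceName: 'projects/_/buckets/my-bucket/objects/photo.jpg' },
  })
);
console.log(event.type, event.data.protoPayload.resourceName);
```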
Similarly, the CloudEvents SDK for C#, together with the Google Events library, can be used to read messages delivered by a Pub/Sub trigger.
Long-term vision
The long-term vision for Eventarc is to be the hub of events from more sources and sinks, enabling a unified eventing story in Google Cloud and beyond.
In the future, you can expect to read events directly (without having to go through Audit Logs) from more Google Cloud sources (e.g., Firestore, BigQuery, Cloud Storage), Google sources (e.g., Gmail, Hangouts, Chat) and third-party sources (e.g., Datadog, PagerDuty), and to send those events to more Google Cloud sinks (e.g., Cloud Functions, Compute Engine, Pub/Sub) and custom sinks (any HTTP target).
Now that you have a better overall picture of the current state and future vision for Eventarc:
Check out Trigger Cloud Run with events from Eventarc for a hands-on codelab.
Send us feedback on Eventarc and which sources and sinks you would value the most.
As always, feel free to reach out to me on Twitter @meteatamel for questions.
Posted by Ankur Parikh and Xuezhi Wang, Research Scientists, Google Research
In the last few years, research in natural language generation, used for tasks like text summarization, has made tremendous progress. Yet, despite achieving high levels of fluency, neural systems can still be prone to hallucination (i.e., generating text that is understandable but not faithful to the source), which can prevent these systems from being used in applications that require high degrees of accuracy. Consider an example from the Wikibio dataset, where the neural baseline model tasked with summarizing a Wikipedia infobox entry for Belgian football player Constant Vanden Stock incorrectly reports that he is an American figure skater.
While the process of assessing the faithfulness of generated text to the source content can be challenging, it is often easier when the source content is structured (e.g., in tabular format). Moreover, structured data can also test a model’s ability for reasoning and numerical inference. However, existing large scale structured datasets are often noisy (i.e., the reference sentence cannot be fully inferred from the tabular data), making them unreliable for the measurement of hallucination in model development.
In “ToTTo: A Controlled Table-To-Text Generation Dataset”, we present an open domain table-to-text generation dataset generated using a novel annotation process (via sentence revision) along with a controlled text generation task that can be used to assess model hallucination. ToTTo (shorthand for “Table-To-Text”) consists of 121,000 training examples, along with 7,500 examples each for development and test. Due to the accuracy of annotations, this dataset is suitable as a challenging benchmark for research in high precision text generation. The dataset and code are open-sourced on our GitHub repo.
ToTTo introduces a controlled generation task in which a given Wikipedia table with a set of selected cells is used as the source material for the task of producing a single sentence description that summarizes the cell contents in the context of the table. The example below demonstrates some of the many challenges posed by the task, such as numerical reasoning, a large open-domain vocabulary, and varied table structure.
Designing an annotation process to obtain natural but also clean target sentences from tabular data is a significant challenge. Many datasets like Wikibio and RotoWire pair naturally occurring text heuristically with tables, a noisy process that makes it difficult to disentangle whether hallucination is primarily caused by data noise or model shortcomings. On the other hand, one can ask annotators to write sentence targets from scratch, which are faithful to the table, but the resulting targets often lack variety in terms of structure and style.
In contrast, ToTTo is constructed using a novel data annotation strategy in which annotators revise existing Wikipedia sentences in stages. This results in target sentences that are clean, as well as natural, containing interesting and varied linguistic properties. The data collection and annotation process begins by collecting tables from Wikipedia, where a given table is paired with a summary sentence collected from the supporting page context according to heuristics, such as word overlap between the page text and the table and hyperlinks referencing tabular data. This summary sentence may contain information not supported by the table and may contain pronouns with antecedents found in the table only, not the sentence itself.
The annotator then highlights the cells in the table that support the sentence and deletes phrases in the sentence that are not supported by the table. They also decontextualize the sentence so that it is standalone (e.g., with correct pronoun resolution) and correct grammar, where necessary.
We conducted a topic analysis on the ToTTo dataset over 44 categories and found that the Sports and Countries topics, each of which consists of a range of fine-grained topics (e.g., football and the Olympics for Sports, population and buildings for Countries), together comprise 56.4% of the dataset. The remaining 43.6% is composed of a much broader set of topics, including Performing Arts, Transportation, and Entertainment.
Furthermore, we conducted a manual analysis of the different types of linguistic phenomena in the dataset over 100 randomly chosen examples. The table below summarizes the fraction of examples that require reference to the page and section titles, as well as some of the linguistic phenomena in the dataset that potentially pose new challenges to current systems.
| Linguistic phenomenon | Fraction of examples |
| --- | --- |
| Require reference to page title | 82% |
| Require reference to section title | 19% |
| Require reference to table description | 3% |
| Reasoning (logical, numerical, temporal, etc.) | 21% |
| Comparison across rows/columns/cells | 13% |
| Require background information | 12% |
We present some baseline results of three state-of-the-art models from the literature (BERT-to-BERT, Pointer Generator, and the Puduppully 2019 model) on two evaluation metrics, BLEU and PARENT. In addition to reporting the score on the overall test set, we also evaluate each model on a more challenging subset consisting of out-of-domain examples. As the table below shows, the BERT-to-BERT model performs best in terms of both BLEU and PARENT. Moreover, all models achieve considerably lower performance on the challenge set, indicating the difficulty of out-of-domain generalization.
| Model | BLEU (overall) | PARENT (overall) | BLEU (challenge) | PARENT (challenge) |
| --- | --- | --- | --- | --- |
| Puduppully et al. 2019 | 19.2 | 29.2 | 13.9 | 25.8 |
While automatic metrics can give some indication of performance, they are not currently sufficient for evaluating hallucination in text generation systems. To better understand hallucination, we manually evaluate the top performing baseline, to determine how faithful it is to the content in the source table, under the assumption that discrepancies indicate hallucination. To compute the “Expert” performance, for each example in our multi-reference test set, we held out one reference and asked annotators to compare it with the other references for faithfulness. As the results show, the top performing baseline appears to hallucinate information ~20% of the time.
Model Errors and Challenges
In the table below, we present a selection of the observed model errors to highlight some of the more challenging aspects of the ToTTo dataset. We find that state-of-the-art models struggle with hallucination, numerical reasoning, and rare topics, even when using cleaned references (errors in red). The last example shows that even when the model output is correct it is sometimes not as informative as the original reference which contains more reasoning about the table (shown in blue).
| Reference | Model prediction |
| --- | --- |
| in the 1939 currie cup, western province lost to transvaal by 17–6 in cape town. | the first currie cup was played in 1939 in transvaal at newlands, with western province winning 17–6. |
| a second generation of microdrive was announced by ibm in 2000 with increased capacities at 512 mb and 1 gb. | there were 512 microdrive models in 2000: 1 gigabyte. |
| the 1956 grand prix motorcycle racing season consisted of six grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. | the 1956 grand prix motorcycle racing season consisted of eight grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. |
| in travis kelce's last collegiate season, he set personal career highs in receptions (45), receiving yards (722), yards per receptions (16.0) and receiving touchdowns (8). | travis kelce finished the 2012 season with 45 receptions for 722 yards (16.0 avg.) and eight touchdowns. |
In this work, we presented ToTTo, a large, English table-to-text dataset that presents both a controlled generation task and a data annotation process based on iterative sentence revision. We also provided several state-of-the-art baselines, and demonstrated ToTTo could be a useful dataset for modeling research as well as for developing evaluation metrics that can better detect model improvements.
The authors wish to thank Ming-Wei Chang, Jonathan H. Clark, Kenton Lee, and Jennimaria Palomaki for their insightful discussions and support. Many thanks also to Ashwin Kakarla and his team for help with the annotations.
Posted by Jochen Eisinger, Engineering Director, Google Chrome
As we start the new year, we see ongoing revelations about an attack involving SolarWinds and others that in turn led to the compromise of numerous other organizations. Software supply chain attacks like this pose a serious threat to governments, companies, non-profits, and individuals alike. At Google, we work around the clock to protect our users and customers. Based on what is known about the attack today, we are confident that no Google systems were affected by the SolarWinds event. We make very limited use of the affected software and services, and our approach to mitigating supply chain security risks meant that any incidental use was limited and contained. These controls were bolstered by sophisticated monitoring of our networks and systems.
Beyond this specific attack, we remain focused on defending against all forms of supply chain risk and feel a deep responsibility to collaborate on solutions that benefit our customers and the common good of the industry. That’s why today we want to share some of the security best practices we employ and investments we make in secure software development and supply chain risk management. These key elements of our security and risk programs include our efforts to develop and deploy software safely at Google, design and build a trusted cloud environment to deliver defense-in-depth at scale, advocate for modern security architectures, and advance industry-wide security initiatives.
To protect the software products and solutions we provide our cloud customers, we have to mitigate potential security risks, no matter how small, for our own employees and systems. To do this, we have modernized the technology stack to provide a more defensible environment that we can protect at scale. For example, modern security architectures like BeyondCorp allow our employees to work securely from anywhere, security keys have effectively eliminated password phishing attacks against our employees, and Chrome OS was built by design to be more resilient against malware. By building a strong foundation for our employees to work from, we are well-prepared to address key issues, such as software supply chain security. Many of these topics are covered more extensively in our book Building Secure and Reliable Systems.
How we develop and deploy software and hardware safely at Google
Developing software safely starts with providing secure infrastructure and requires the right tools and processes to help our developers avoid predictable security mistakes. For example, we make use of secure development and continuous testing frameworks to detect and avoid common programming mistakes. Our embedded security-by-default approach also considers a wide variety of attack vectors on the development process itself, including supply chain risks.
A few examples of how we tackle the challenge of developing software safely:
Trusted Cloud Computing: Google Cloud’s infrastructure is designed to deliver defense-in-depth at scale, which means that we don’t rely on any one thing to keep us secure, but instead build layers of checks and controls that include proprietary Google-designed hardware, Google-controlled firmware, Google-curated OS images, a Google-hardened hypervisor, as well as data center physical security and services. We provide assurances in these security layers through roots of trust, such as Titan Chips for Google host machines and Shielded Virtual Machines. Controlling the hardware and security stack allows us to maintain the underpinnings of our security posture in a way that many other providers cannot. We believe that this level of control results in reduced exposure to supply chain risk for us and our customers. More on our measures to mitigate hardware supply chain risk can be found in this blog post.
Binary Authorization: As we describe in our Binary Authorization whitepaper, we verify, for example, that software is built and signed in an approved isolated build environment from properly checked-in code that has been reviewed and tested. These controls are enforced during deployment by policy, depending on the sensitivity of the code. Binaries are only permitted to run if they pass such control checks, and we continuously verify policy compliance for the lifetime of the job. This is a critical control used to limit the ability of a potentially malicious insider, or other threat actor using their account, to insert malicious software into our production environment. Google Cloud customers can use the Binary Authorization service to define and automatically enforce production deployment policy based on the provenance and integrity of their code.
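Concretely, such a deploy-time policy can be expressed declaratively. This is only a sketch of what a Binary Authorization policy might look like (the project and attestor names are invented): it blocks any image that lacks an attestation from a trusted builder and logs the decision.

```yaml
# Sketch of a Binary Authorization policy (illustrative names):
# require an attestation from the trusted build pipeline before deployment.
globalPolicyEvaluationMode: ENABLE
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
    - projects/my-project/attestors/built-by-trusted-builder
```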
Change Verification: Code and configuration changes submitted by our developers are provably reviewed by at least one person other than the author. Sensitive administrative actions typically require additional human approvals. We do this to prevent unexpected changes, whether they’re mistakes or malicious insertions.
Reshaping the ecosystem
We also believe the broader ecosystem will need to reshape its approach to layered defense to address supply chain attacks long-term. For example, software development teams should adopt tamper-evident practices paired with transparency techniques that allow for third-party validation and discoverability. We have published an architectural guide to adding tamper checking to a package manager, and this is implemented for Golang. Developers can make use of our open-source verifiable Trillian log, which powers the world’s largest, most used and respected production crypto ledger-based ecosystem, certificate transparency.
Another area for consideration is limiting the effects of attacks by using modern computing architectures that isolate potentially compromised software components. Examples of such architectures are Android OS’s application sandbox, gVisor (an application sandbox for containers), and Google’s BeyondProd where microservice containerization can limit the effects of malicious software. Should any of the upstream supply-chain components in these environments become compromised, such isolation mechanisms can act as a final layer of defense to deny attackers their goals.
Our industry commitment and responsibility
The software supply chain represents the links across organizations—an individual company can only do so much on their own. We need to work together as an industry to change the way software components are built, distributed and tracked throughout their lifecycle.
One example of collaboration is the Open Source Security Foundation, which Google co-founded last year to help the industry tackle issues like software supply chain security in open source dependencies and promote security awareness and best practices. We also work with industry partners to improve supply chain policies and reduce supply chain risk, and publish information for users and customers on how they can use our technology to manage supply chain risk.
Pushing the software ecosystem forward
Although the history of software supply chain attacks is well-documented, each new attack reveals new challenges. The seriousness of the SolarWinds event is deeply concerning but it also highlights the opportunities for government, industry, and other stakeholders to collaborate on best practices and build effective technology that can fundamentally improve the software ecosystem. We will continue to work with a range of stakeholders to address these issues and help lay the foundation for a more secure future.
In our first blog post in this series, we talked broadly about the democratization of data and insights. Our second blog took a deeper look at insights derived specifically from machine learning, and how Google Cloud has worked to push those capabilities to more users across the data landscape. In our third and final blog in this series, we’ll examine data access, data insights, and machine learning in the context of real-time decision making, and how we’re working to help all users – business and technical – get access to real-time insights.
Getting real about real-time data analysis
Let’s start by taking a look at real-time data analysis (also referred to as stream analytics) and the blend of factors that increasingly make it critical to business success.
First, data is increasingly real-time in nature. IDC predicts that by 2025, more than 25% of all data created will be real-time in nature. Here at Google Cloud, we predict the share of business decisions based on real-time data will be even higher than that. What's driving that growth? A number of factors, all part of an overall trend toward digitization not just in business but in society in general. These include, but aren't limited to, digital devices, IoT-enabled manufacturing and logistics, digital commerce, digital communications, and digital media consumption. Harnessing the real-time data created by these activities gives companies the opportunity to better analyze their market, their competition, and, importantly, their customers.
Next, customers expect more than ever in terms of personalization; they expect to be a “segment of one” across recommendations, offers, experience, and more. Companies know this and compete with each other to deliver the best user and customer experience possible. Google Cloud customers such as AB Tasty are processing billions of real-time events for millions of users each day to deliver just that for their clients—an experience that’s optimized for smaller and smaller segments of users.
“With our new data pipeline and warehouse, we are able to personalize access to large volumes of data that were not previously there. That means new insights and correlations and, therefore, better decisions and increased revenue for customers.” - Jean-Yves Simon, VP Product, AB Tasty
Finally, real-time analysis is most useful when there’s an opportunity to take quick actions based on the insights. The same digitization driving real-time data generation provides an opportunity to drive immediate action in an instant feedback loop. Whether the action involves on-the-spot recommendations for digital retail, rerouting delivery vehicles based on real-time traffic information, changing the difficulty of an online gaming session, digitally recalibrating a manufacturing process, stopping fraud before a transaction is completed, or countless other examples, today’s technology opens up the opportunity to drive a more responsive and efficient business.
Democratizing real-time data analysis
We think of democratization in this space in two different frames. One is the standard frame we’ve taken in this blog series of expanding the capabilities of various data practitioners: “how do we give more users the ability to generate real-time insights?”
The other frame, specifically for stream analytics, is democratization at the company level. Let’s start with how we’re helping more businesses move to real-time, and then we’ll dive into how we’re helping across different users.
Democratizing stream analytics for all businesses
Historically, collecting, processing, and acting upon real-time data was particularly challenging. The nature of real-time data is that its volume and velocity can vary wildly in many use cases, creating multiple layers of complexity for data engineers trying to keep the data flowing through their pipelines. The tradeoffs involved in running a real-time data pipeline led many engineers to implement a lambda architecture, in which they would maintain both a real-time copy of (sometimes partial) results as well as a “correct” copy of results that took a traditional batch route. In addition to the challenge of reconciling data at the end of these pipelines, this architecture multiplied the number of systems, and often the number of ecosystems, those same engineers had to manage. Setting this up, and keeping it all working, took large teams of expert data engineers, which kept the bar for streaming use cases high.
Google and Google Cloud knew there had to be a better way to analyze real-time data… so we built it! Dataflow, together with Pub/Sub, answers the challenges posed by traditional streaming systems by providing a completely serverless experience that handles the variation in event streams with ease. Pub/Sub and Dataflow scale to exactly the resources needed for the job at hand, handling performance, scaling, availability, and security automatically. Dataflow ensures that data is reliably and consistently processed exactly once, so engineers can trust the results their systems produce. Dataflow jobs are written using the Apache Beam SDK, which gives Dataflow users a choice of programming language (in addition to portability). Finally, Dataflow allows data engineers to easily switch back and forth between batch and streaming modes, meaning users can move between real-time results and cost-effective batch processing with no changes to their code.
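The key idea behind that last point is Beam’s unified model: one transform definition that runs over bounded (batch) or unbounded (streaming) data. As a rough illustration of the idea only, here is a toy standard-library sketch, not actual Apache Beam code; the function name, event shapes, and fixed-window logic are invented for the example, and it assumes events arrive in timestamp order (real Beam handles out-of-order data with watermarks):

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (event timestamp in seconds, key)

def windowed_counts(events: Iterable[Event], window_secs: float) -> Iterator[Tuple[float, str, int]]:
    """Count events per key in fixed windows.

    The same function works over a bounded list (batch) or an
    unbounded generator (streaming): one pipeline definition,
    two execution modes.  Assumes time-ordered input.
    """
    current_window = None
    counts = Counter()
    for ts, key in events:
        window = ts - (ts % window_secs)
        if current_window is None:
            current_window = window
        if window != current_window:
            # Window closed: emit its results.  In streaming this happens
            # continuously; in batch, as the bounded input is consumed.
            for k, n in sorted(counts.items()):
                yield (current_window, k, n)
            counts.clear()
            current_window = window
        counts[key] += 1
    if current_window is not None:  # flush the final window at end of input
        for k, n in sorted(counts.items()):
            yield (current_window, k, n)

# "Batch": a bounded, in-memory list of events.
batch = [(0.5, "click"), (1.2, "view"), (11.0, "click")]
print(list(windowed_counts(batch, window_secs=10)))
# [(0.0, 'click', 1), (0.0, 'view', 1), (10.0, 'click', 1)]

# "Streaming": the very same transform over a generator that could,
# in principle, be infinite.
def stream():
    yield from batch

print(list(windowed_counts(stream(), window_secs=10)))
```

In real Dataflow, the same Beam pipeline code is pointed at either a bounded source (such as files) or an unbounded one (such as Pub/Sub), and the runner handles the rest.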
“Google unifies streaming analytics and batch processing the way it should be. No compromises. That must be the goal when software architects create a unified streaming and batch solution that must scale elastically, perform complex operations, and have the resiliency of Rocky Balboa.” - The Forrester Wave™: Streaming Analytics, Q3 2019, by Mike Gualtieri, Forrester Research, Inc.
Altogether, Dataflow and Pub/Sub deliver an integrated, easy-to-operate experience that opens real-time analysis up to companies without large teams of expert data engineers. We’ve seen teams of as few as six engineers processing billions of events per day. They author their pipelines and leave the rest to us.
Democratizing stream analytics for all personas
Having developed a streaming platform that made streaming available to data engineering teams of all sizes and skills, we set about making it easier for more people to access real-time analysis and drive better decisions as a result. Let’s dive into how we’ve expanded access to real-time analytics.
Business and data analysts
Providing access to real-time data for data analysts and business analysts starts with enabling data to be rapidly ingested into the data warehouse. BigQuery is designed to be “always fast, always fresh,” and it enables streaming inserts into the data warehouse at millions of events per second. This gives data warehouse users the ability to work on the very freshest data, making their analysis more timely and accurate.
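As a rough sketch of what streaming inserts look like from the client side, here is a minimal example using the google-cloud-bigquery library’s `insert_rows_json` call. The table ID, row shapes, and the 500-rows-per-request batching are illustrative assumptions, not prescriptions:

```python
from typing import Dict, Iterable, List

def chunk_rows(rows: List[Dict], max_rows: int = 500) -> Iterable[List[Dict]]:
    """Split rows into batches; we assume a modest 500 rows per
    streaming-insert request to keep each request small."""
    for i in range(0, len(rows), max_rows):
        yield rows[i:i + max_rows]

def stream_into_bigquery(table_id: str, rows: List[Dict]) -> None:
    """Stream JSON rows into a BigQuery table.

    Requires the google-cloud-bigquery package and application
    credentials; table_id like "my-project.my_dataset.events" is a
    placeholder.
    """
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    for batch in chunk_rows(rows):
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            raise RuntimeError(f"insert failed: {errors}")

# The batching helper can be exercised locally:
print([len(b) for b in chunk_rows([{"n": i} for i in range(1200)])])  # [500, 500, 200]
```

Rows streamed this way are queryable within seconds, which is what keeps the warehouse “always fresh” for analysts.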
In addition to the insights that data analysts typically drive out of the data warehouse, analysts can also apply machine learning capabilities delivered by BigQuery ML against real-time data being streamed in. If data analysts know there’s a source of data that they need to access but that isn’t currently in the warehouse, Dataflow SQL enables them to connect new streaming sources of data with a few simple lines of SQL.
The real-time capabilities we describe for data analysts have cascading effects for the business analysts who rely on dashboards sourced from the data warehouse. BigQuery’s BI Engine enables sub-second query response and high concurrency for BI use cases, but including real-time data in the data warehouse gives business analysts (and those who rely on them) a fuller picture of what’s happening in the business right now. In addition to BI, Looker’s data-driven workflows and data application capabilities benefit from fast-updating data in BigQuery.
ETL developers
Data Fusion, Google Cloud’s code-free ETL tool, delivers real-time processing capabilities to ETL developers with the simplicity of flipping a switch. Data Fusion users can easily set their pipelines to process data in real time and land it in any number of storage or database services at Google Cloud. Further, Data Fusion’s ability to call upon a number of predefined connectors, transformations, sinks, and more – including machine learning APIs – in real time gives businesses an impressive level of flexibility without the need to write any code at all.
Each blog in this series (catch up on Part 1 and Part 2 if you missed them) has shown how Google Cloud can democratize data and insights. It’s not enough to deliver data access, then simply hope for good things to happen within your business. We’ve observed a clear formula for successfully democratizing the generation of ideas and insights throughout your business:
Start by ensuring you can deliver broad access to data that’s relevant to your business. That means moving towards systems that have elastic storage and compute with the ability to automatically scale both. This will enable you to bring in new data sources and new data workers without the need for labor-intensive operations, increasing the agility of your business.
Ensure that users can generate insights from within the tools they know and are comfortable with. By delivering new capabilities to existing users within their tools, you can help your business put data to work across the organization. Further, this will keep your workforce excited and engaged as they get to explore new areas of analysis like machine learning.
Once you’ve given your employees the ability to access data and the ability to drive insights from the data, give them the ability to analyze real-time data and automate the outcomes of that analysis. This will drive better customer experiences, and help your organization take faster advantage of opportunities in the market.
We hope you’ve enjoyed this series, and we hope you’ll consider working with us to help democratize data and insights within your business. A great way to get started is by starting a free trial or jumping into the BigQuery sandbox, but don’t hesitate to reach out if you want to have a conversation with us.
Posted by Florina Muntenescu, Developer Relations Engineer
We just wrapped up another series of MAD Skills videos and articles, this time on Kotlin and Jetpack. We covered different ways in which Kotlin makes Android code more expressive, concise, and safe, and how it makes asynchronous code easier to run.
Check out the episodes below to level up your Kotlin and Jetpack knowledge! Each episode covers a specific set of APIs, discussing both how to use the APIs and how they work under the hood. All episodes have accompanying blog posts, and most link to a sample or a codelab to make it easier to follow along and dig deeper into the content. We also had a live Q&A featuring Jetpack and Kotlin engineers.
Episode 1 – Using KTX libraries
In this episode we looked at how you can make your Android and Jetpack coding easy, pleasant and Kotlin-idiomatic with Jetpack KTX extensions. Currently, more than 20 libraries have a KTX version. This episode covers some of the most important ones:
core-ktx that provides idiomatic Kotlin functionality for APIs coming from the Android platform, plus a few Jetpack KTX libraries that allow us to have a better user experience when working with APIs like
Episode 2 – Simplifying APIs with coroutines and Flow
Episode 2 covers how to simplify APIs using coroutines and Flow, as well as how to build your own adapter using callbackFlow APIs. To get hands-on with this topic, check out the Building a Kotlin extensions library codelab.
Episode 3 – Using and testing Room Kotlin APIs
This episode opens the door to Room, peeking in to see how to create Room tables and databases in Kotlin and how to implement one-shot suspend operations like insert, and observable queries using Flow. When using coroutines and Flow, Room moves all the database operations onto the background thread for you. Check out the video or blog post to find out how to implement and test Room queries. For more hands-on work – check out the Room with a view codelab.
Episode 4 – Using WorkManager Kotlin APIs
Episode 4 makes your job easier with WorkManager, the library for scheduling asynchronous tasks for immediate or deferred execution that are expected to run even if the app is closed or the device restarts. In this episode we go over the basics of WorkManager and look a bit more in depth at the Kotlin APIs, like
Episode 5 – Community tip
Episode 5 is by Magda Miu – a Google Developer Expert on Android who shared her experience of leveraging foundational Kotlin APIs with CameraX. Check it out here:
Episode 6 – Live Q&A
In the final episode we launched into a live Q&A, hosted by Chet Haase, with guests Yigit Boyar – Architecture Components tech lead, David Winer – Kotlin product manager, and developer relations engineers Manuel Vivo and myself. We answered questions from you on YouTube, Twitter and elsewhere.
Do you have an application that’s a little… sluggish? Cloud Profiler, Google Cloud’s continuous application profiling tool, can quickly find poor performing code that slows your app performance and drives up your compute bill. In fact, by helping you find the source of memory leaks and other errors, Profiler has helped some of Google Cloud’s largest accounts reduce their CPU consumption by double-digit percentage points.
What makes Profiler so useful is that it aggregates production performance data over time from all instances of an application, while placing a negligible performance penalty on the application that you are examining—typically less than 1% CPU and RAM overhead on a single profiled instance, and practically zero when it’s amortized over the full collection duration and all instances of the service!
In this blog post, we look at elements of Profiler’s architecture that help it achieve its light touch. Then, we demonstrate the negligible effect of Profiler on an application in action by using DeathStarBench, a sample hotel reservation application that’s popular for testing loosely coupled microservices-based applications. Equipped with this understanding, you’ll have the knowledge you need to enable Profiler on those applications that could use a little boost.
Profiler vs. other APM tools
Traditionally, application profiling tools have imposed a heavy load on the application, limiting the tools’ usefulness. Profiler, on the other hand, uses several mechanisms to ensure that it doesn’t hurt application performance.
Sampling and analyzing aggregate performance
To set up Profiler, you need to link a provided language-specific library to your application. Profiler uses this library to capture relevant telemetry from your applications that can then be analyzed using the user interface of the tool. Cloud Profiler supports applications written in Java, Go, Node.js and Python.
Cloud Profiler’s libraries sample application performance, meaning that they periodically capture stack traces that represent the CPU and heap consumption of each function. This behavior is different from an event-tracing profiler, which intercepts and briefly halts every single function call to record performance information.
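To make that distinction concrete, here is a toy sampler in Python. This is illustrative only: Cloud Profiler’s real agents are native, per-language implementations, and the thread-snapshot approach below is invented for the demonstration. The point is that it periodically snapshots a worker thread’s stack rather than hooking every call:

```python
import collections
import sys
import threading
import time

def sample_stacks(target_thread_id, interval=0.01, duration=0.5):
    """Toy sampling profiler: periodically capture the target thread's
    stack rather than intercepting every function call the way an
    event-tracing profiler does.  Between samples the workload runs
    untouched, which is what keeps the overhead low."""
    samples = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:  # walk outward to record the stack trace
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            samples[tuple(stack)] += 1
        time.sleep(interval)
    return samples

# A worker thread with a known hot function.
stop_event = threading.Event()

def busy_function():
    x = 0
    while not stop_event.is_set():
        x = (x + 1) % 1000003

worker = threading.Thread(target=busy_function)
worker.start()
samples = sample_stacks(worker.ident)
stop_event.set()
worker.join()

print(sum(samples.values()), "samples collected")
```

Because the workload runs untouched between snapshots, overhead scales with the sampling rate rather than with how many function calls the workload makes.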
To ensure your service’s performance is not impacted, Profiler carefully orchestrates the interval and duration of the profile collection process. By aggregating data across all of the instances of your application over a period of time, Profiler can provide a complete view into production code performance with negligible overhead.
Roaming across instances
The more instances of each service from which you capture profiles, the more accurately Cloud Profiler can analyze your codebase. While each Profiler library / agent uses sampling to reduce the performance impact on a running instance, Profiler also ensures that only one task in a deployment is being profiled at a given time. This ensures that your application is never in a state where all instances are being sampled at the same time.
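A toy model of that coordination is sketched below. The class and instance names are invented, and the real mechanism lives inside the Profiler backend; the sketch only illustrates the invariant that exactly one instance holds the profiling “lease” at a time while the lease roams across the deployment:

```python
import itertools

class ProfilingCoordinator:
    """Toy model of 'roaming': at any moment, at most one instance of a
    deployment holds the profiling lease, and the lease rotates so that
    every instance is eventually profiled."""

    def __init__(self, instance_ids):
        self._cycle = itertools.cycle(instance_ids)
        self._holder = next(self._cycle)

    def holder(self):
        return self._holder

    def rotate(self):
        # Called when the current collection window ends.
        self._holder = next(self._cycle)
        return self._holder

coord = ProfilingCoordinator(["pod-a", "pod-b", "pod-c"])
seen = [coord.holder()] + [coord.rotate() for _ in range(5)]
print(seen)  # the lease visits each instance in turn, one holder at a time
```

Over enough collection windows, every instance contributes samples, which is how Profiler builds an accurate aggregate picture without ever sampling the whole fleet at once.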
Profiler in action
To measure the effect of Profiler on an application, we used it with an application with known performance characteristics, the DeathStarBench hotel reservation sample application. The DeathStarBench services were designed to test the performance characteristics of different kinds of infrastructure, service topologies, RPC mechanisms, and service architecture on overall application performance, making them an ideal candidate for these tests. While this particular benchmark is written in Go and uses the Go profiling agent, we expect results for other languages to be similar, since Profiler’s approach to sampling frequency and profiling is similar for all languages that it supports.
In this example, we ran the eight services that compose the hotel reservation application on a GCE c2-standard-4 (4 vCPUs, 16 GB memory) VM instance running Ubuntu 18.04.4 LTS Linux and configured the load generator for two series of tests: one at 1,000 queries per second, and one at 10,000. We then performed each test 10 times with Profiler attached to each service and 10 times without it, and recorded the service’s throughput and the CPU and memory consumption in Cloud Monitoring. Each iteration ran for about 5 minutes, for a total of about 50 minutes for 10 iterations.
The following data shows the result of the 1,000 QPS run:
In the first test we observe that Profiler introduces a negligible increase in CPU (less than 0.5%) consumption and a minor increase in memory consumption, averaging to roughly 32 MB (3.7%) of additional RAM usage across eight services, or just under 4 MB per service.
The following data shows the result of the 10,000 QPS run:
In the second test, we see that Profiler’s impact is in line with the previous observations: a roughly 23 MB (2.8%) increase in memory consumption, or about 3 MB per service, and a negligible increase in CPU consumption (less than 0.5%).
In both tests, the increase in memory usage can be attributed to the increase in the application’s binary size after linking with the Profiler agent.
In exchange, you gain deep insight into code performance, down to each function call, as shown here for the hotel reservation application:
Here we use Profiler to analyze the memory usage of the benchmark’s “frontend” service. We utilize Profiler’s weight filter and weight comparison features to determine the functions that increased their memory usage while the application scaled from 1,000 QPS to 10,000 QPS, which are highlighted in orange.
In short, Profiler introduces no discernible impact on an application’s performance, and a negligible impact on CPU and memory consumption. And in exchange, it lets you continuously monitor the production performance of your services without affecting their performance or incurring any additional costs! That’s a win-win, in our book. To learn more about Profiler, be sure to read this Introduction to Profiler, and this blog about its advanced features.
Building an ELT pipeline using Google Sheets as an intermediary
BigQuery offers the ability to quickly import a CSV file, both from the web user interface and from the command line:
Limitations of autodetect and import
This works for your plain-vanilla CSV files, but can fail on complex CSV files. As an example of a file it fails on, let’s take a dataset of New York City Airbnb rentals data from Kaggle. This dataset has 16 columns, but one of the columns consists of pretty much free-form text. This means that it can contain emojis, new line characters, …
Indeed, if we try to open this file with BigQuery, we get errors like:
This is because a row is spread across multiple lines, so the starting quote on one line is never closed. This is not an easy problem to solve; many tools struggle with CSV files that have newlines inside cells.
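Python’s standard csv module shows what correct handling looks like: a quoted field may span physical lines, so naive line-splitting miscounts rows while a real CSV parser tracks the open quote across the newline. The sample row below is invented for the demonstration:

```python
import csv
import io

raw = 'id,name,description\n1,Cozy loft,"Great view.\nClose to the subway!"\n'

# Naive line-splitting sees 3 "lines" for what is really a header
# plus one logical row, because the quoted description spans lines.
print(len(raw.strip().split("\n")))  # 3

# A proper CSV parser keeps the quoted field together.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # 2 rows; the newline survives inside the description field
```

This is exactly the failure mode the naive loader hits: it treats each physical line as a record and chokes on the unclosed quote.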
Sheets to the rescue
Google Sheets, on the other hand, has a much better CSV import mechanism. Open up a Google Sheet, import the CSV file and voila …
The cool thing is that by using a Google Sheet, you can do interactive data preparation in the Sheet before loading it into BigQuery.
First, delete the first row (the header) from the sheet. We don’t want that in our data.
ELT from a Google Sheet
Once it is in Google Sheets, we can use a handy little trick — BigQuery can directly query Google Sheets! To do that, we define the Google Sheet as a table in BigQuery:
Steps from the BigQuery UI
- Select a dataset and click on Create Table
- Select Drive as the source, specify the Drive URL to the Google Sheet
- Set Google Sheet as the file format
- Give the table a name. I named it airbnb_raw_googlesheet
- Specify the schema:
This table does not copy the data from the sheet — it queries the sheet live.
So, let’s copy the data as-is into BigQuery (of course, we could do some transformation here as well):
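That copy step is just a query against the live external table. Here is a minimal sketch in Python, assuming the google-cloud-bigquery client; `mydataset` is a placeholder, and `run_elt` needs a real project and credentials to execute:

```python
def build_elt_query(dest_table: str, source_table: str) -> str:
    """Build the statement that snapshots the live Sheets-backed
    external table into a native BigQuery table.  Transformations
    could be added to the SELECT here."""
    return (
        f"CREATE OR REPLACE TABLE `{dest_table}` AS "
        f"SELECT * FROM `{source_table}`"
    )

def run_elt(dest_table: str, source_table: str) -> None:
    """Execute the copy; requires google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    client.query(build_elt_query(dest_table, source_table)).result()

print(build_elt_query("mydataset.airbnb", "mydataset.airbnb_raw_googlesheet"))
```

Scheduling `run_elt` periodically gives you a refreshed native copy without re-importing the CSV.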
How to automate
You can automate these steps:
- Here’s an article on how to read a CSV file into Sheets using Python
- From then on, use dataform.co or BigQuery scripts to define the BigQuery table and do the ELT.
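One possible sketch of the first automation step, assuming the gspread library for the Sheets side; the spreadsheet ID and service-account setup are placeholders, and the linked article may take a different approach:

```python
def strip_header(csv_text: str) -> str:
    """Drop the first physical line (the header), mirroring the manual
    'delete the first row' step.  Safe here because a header row
    contains no embedded newlines."""
    _, _, body = csv_text.partition("\n")
    return body

def load_csv_into_sheet(spreadsheet_id: str, csv_path: str) -> None:
    """Replace the spreadsheet's first worksheet with the CSV contents.

    Assumes the gspread package and a service account with access to
    the spreadsheet; spreadsheet_id is a placeholder.
    """
    import gspread  # pip install gspread
    client = gspread.service_account()
    with open(csv_path, "r", encoding="utf-8") as f:
        client.import_csv(spreadsheet_id, strip_header(f.read()))

print(strip_header("id,name\n1,Cozy loft\n"))  # header line removed
```

With the sheet refreshed this way, the external BigQuery table defined above immediately reflects the new data.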
To import complex CSV files into BigQuery, build an ELT pipeline using Google Sheets as an intermediary. This allows you to handle CSV files with new lines and other special characters in the columns. Enjoy!