Google Cloud: Managing and monitoring a Cloud Dataflow setup


In a Google Cloud blog post, Qubit's Lead Platform Engineer Ravi Upreti and Platform Engineer Saira Hussain share how Qubit keeps its data analytics pipelines healthy.

Establishing and maintaining data pipelines is essential for Qubit: they deliver the data and insights we need to power real-time personalization for the world’s leading brands. In our previous post, we talked about our journey to build high-throughput, low-latency, streaming data collection and processing pipelines on Google Cloud Platform (GCP) using Cloud Dataflow, Cloud Pub/Sub and BigQuery.

Cloud Dataflow, in particular, is a fully managed service that takes away a lot of the pain of managing a pipeline once it’s up and running. Features such as autoscaling and dynamic work rebalancing make Cloud Dataflow pipelines very efficient, self-sustaining systems that need very little external tuning to keep them functioning. This reduces the cost of maintaining the system to the bare minimum. 

However, at our scale, we felt we needed a simpler way to automatically launch and update pipelines. In this post, we’ll discuss how we manage and monitor our Cloud Dataflow pipelines as we work to further automate them, and troubleshoot when problems arise. 

Streamlining pipeline management

While writing and testing a pipeline or new pipeline features, it’s common to launch it multiple times to test out various deployment options. You can either set execution options for the pipeline in code as Cloud Dataflow PipelineOptions, or take them from command-line arguments. This can make it awkward to tweak options when testing or executing the pipeline in different environments, such as development or production, because you have to re-launch it every time. It would be much easier to keep the options in environment-specific configuration files and tweak them as needed. It would also make hooking the deployment of the pipeline up to a CI/CD tool such as Cloud Build seamless.
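As a minimal sketch of the two approaches (not our production code; the project and region values below are placeholders), a Beam pipeline written in Java can take its Dataflow execution options from command-line arguments, set them programmatically, or mix the two:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchExample {
  public static void main(String[] args) {
    // Parse execution options from command-line arguments, e.g.
    //   --project=my-gcp-project --region=europe-west1 --runner=DataflowRunner
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);

    // ...or set (and override) options directly in code; these values are placeholders.
    options.setProject("my-gcp-project");
    options.setRegion("europe-west1");
    options.setStreaming(true);
    options.setRunner(DataflowRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // Build the pipeline's transforms here, then submit the job to Cloud Dataflow.
    pipeline.run();
  }
}

Either way, changing an option for a different environment means editing code or the launch command and re-submitting the job, which is exactly the friction that environment-specific configuration files remove.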

With this in mind, we developed the Dataflow launcher, an open-source CLI tool written in Python, to launch and manage our Cloud Dataflow pipelines. The tool reads the pipeline configuration from a configuration file and launches the packaged pipeline code in Cloud Dataflow. This lets you localize your pipeline configurations in simple, easy-to-read and manageable config files, making it simple to update the pipeline options. It also means that configurations for different execution environments, such as staging or production, can be kept separately in their own configuration files, further simplifying integration with CI/CD tools. In addition, this mitigates the risk of accidental deployments to the wrong environment—for example, to production. 
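To illustrate the idea, a per-environment configuration file might look something like the sketch below. The key names and layout here are assumptions for illustration only, not the launcher's actual schema; the project's README documents the real format.

# production.conf (hypothetical example)
project = "my-gcp-project"                    # GCP project to run the pipeline in
region  = "europe-west1"                      # Cloud Dataflow region
jar     = "build/libs/pipeline-bundled.jar"   # packaged pipeline code to submit
main    = "com.example.MyPipeline"            # pipeline entry point

options {                                     # passed through as PipelineOptions
  streaming     = true
  maxNumWorkers = 10
  tempLocation  = "gs://my-bucket/dataflow/tmp"
}

A matching staging configuration would differ only in values such as the project, bucket, and worker counts, so promoting a pipeline between environments becomes a matter of pointing the launcher at a different file.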

You can keep configurations in individual environment-specific files along with the rest of the pipeline code base in a version control system such as GitHub. You can then launch or update the pipeline by configuring your CI/CD tool, like Cloud Build, to trigger on every commit to the release branch. The Dataflow launcher also allows you to automate the creation of Cloud Pub/Sub resources like topics or subscriptions, if needed, before launching the pipeline. This eliminates the pain of setting up these resources manually before launching.
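For context, here is a hedged sketch of what creating those Cloud Pub/Sub resources by hand looks like with the Java client library (the project, topic, and subscription names are placeholders); this is the kind of one-off setup step per environment that the launcher can take care of before submitting the job:

import com.google.api.gax.rpc.AlreadyExistsException;
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.cloud.pubsub.v1.TopicAdminClient;
import com.google.pubsub.v1.PushConfig;
import com.google.pubsub.v1.SubscriptionName;
import com.google.pubsub.v1.TopicName;

public class CreatePubSubResources {
  public static void main(String[] args) throws Exception {
    // Placeholder names; substitute your own project, topic and subscription.
    TopicName topic = TopicName.of("my-gcp-project", "events-topic");
    SubscriptionName subscription =
        SubscriptionName.of("my-gcp-project", "events-subscription");

    try (TopicAdminClient topicAdmin = TopicAdminClient.create()) {
      topicAdmin.createTopic(topic);
    } catch (AlreadyExistsException e) {
      // The topic already exists; nothing to do.
    }

    try (SubscriptionAdminClient subscriptionAdmin = SubscriptionAdminClient.create()) {
      // Pull subscription with a 60-second ack deadline.
      subscriptionAdmin.createSubscription(
          subscription, topic, PushConfig.getDefaultInstance(), 60);
    } catch (AlreadyExistsException e) {
      // The subscription already exists; nothing to do.
    }
  }
}

The same resources can also be created with the gcloud pubsub topics create and gcloud pubsub subscriptions create commands; either way it is a manual step per environment that declaring the resources alongside the pipeline configuration removes.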

The Dataflow launcher has helped us automate the process of Cloud Dataflow pipeline management by further simplifying deployments. We wrote it to support pipelines written in Java. However, Apache Beam now has Python and Go SDKs as well. The tool does not currently support pipelines written in these languages, but adding that support is on the roadmap. The project is open source and we welcome contributions from the community.

This article was originally posted on the Google Cloud blog.
