Numaflow: Let Golden Signals Work for You!

Rohan Ashar · Published in numaproj · Aug 10, 2023


This article is co-authored by Rohan Ashar and Shilpa Bhat, Senior Software Engineers at Intuit. Editor: Jason Zesheng Chen.

Observability is a critical aspect of modern software development and operations. As platform engineers at Intuit, we are excited to share with you our experience using Numaflow for real-time monitoring and observability of golden signals.

Intuit, as a global financial technology company, provides a wide array of familiar products like TurboTax, QuickBooks, Mint, Credit Karma, and Mailchimp, serving over 100 million customers. TurboTax, for example, handles hundreds of thousands of concurrent users during peak tax season, so it is critical for us to continuously monitor all our services and catch any anomalous behavior. Having the right observability tool is therefore crucial for identifying and diagnosing issues quickly, minimizing downtime, and ensuring that we provide a great customer experience.

What we set out to do

A key step towards the above goal involves building efficient pipelines for computing the Golden Signals: Errors, Saturation, Traffic, and Latency. The ability to monitor them effectively is central to ensuring service reliability and reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR), thereby guaranteeing the optimal customer experience.

To achieve this, we needed a reliable real-time stream processing platform that can compute Golden Signals for alerting and monitoring with low latency, while handling a high throughput. In addition, the system needs to be able to transform the aggregated data into different formats and route it to different sinks, including a Kafka sink and a custom HTTP endpoint for visualization. The solution must be highly scalable, cost-effective, and easy to configure.

Finally, as Intuit’s products are built on a common development platform leveraging Kubernetes, and as we will be serving a diverse range of developers programming in different languages, we sought a Kubernetes-native, language-agnostic tool that has little learning curve. In short, we were looking for something that would allow developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes.

Introducing Numaflow: Kubernetes-native, language-agnostic, and scalable

Numaflow allows developers without any special knowledge of data/stream processing to easily create massively parallel data/stream processing jobs using a programming language of their choice, with just basic knowledge of Kubernetes.

Enter Numaflow, a Kubernetes-native, language-agnostic platform for massively parallel data and stream processing, and a powerhouse with its support for Kafka sources, custom functions for transformations, aggregation capabilities, and the ability to route output to multiple sinks.

Throughout our workflow, Numaflow’s Kubernetes-native design allows us to concentrate on the business logic without worrying much about managing the pipeline. That’s just one of the many reasons we chose it — Numaflow also offers autoscaling functionalities, which enable our pipeline to scale for load without any interruptions or data loss. We can also easily scale each individual vertex up or down based on the load, meaning that we can reliably optimize our costs by scaling only the vertices that need it, without having to pay for unnecessary resources. This important feature is not easy to achieve in other platforms such as Flink, and it was a main factor driving our decision to adopt Numaflow.

Numaflow also gives us the flexibility to mix and match languages. For example, our Aggregation MapReduce vertex is written in Java because it relies on certain libraries that are only available in Java, whereas all other vertices are written in Golang. Despite this polyglot setup, Numaflow’s language-agnostic design empowers us to solve problems on our terms, giving us access to the rich library ecosystems of multiple languages.

How we used Numaflow

To provide some context, our platform architecture roughly involves the following:

  1. Feeding data from all services into Kafka, where a Numaflow source vertex reads the data.
  2. A pre-aggregation vertex then transforms the incoming data before feeding the output into an aggregation vertex, where we aggregate all the data collected over a one-minute window.
  3. The data is then sent to a post-aggregation vertex for further enrichment and post-processing, before being sent to Kafka for anomaly detection and, over HTTP, to other visualization platforms.

An illustrative overview of our pipeline, rendered by the Numaflow UI.

We are thrilled to report that Numaflow impresses at every turn. Let’s dive into this process below and see where Numaflow shines.

Filtering data at the source: Watermarking and Source Data Transformer

At the beginning of our pipeline, data from all services is fed into Kafka. Numaflow’s source vertex, which offers seamless integration with Kafka, ingests from the Kafka topic. However, due to various possible source incidents, such as faulty connections or data corruption, messages at the source are often out of order and can arrive late. Ideally, in stream processing, we want the data to arrive in order so that we don’t lose data.

One useful method to keep track of event-time progress is watermarking, which lets us group unbounded data into discrete chunks and establish a threshold for lateness. Numaflow supports watermarks out of the box: source vertices generate watermarks based on the event time and propagate them to downstream vertices.

Additionally, we leveraged Numaflow’s source transformer capability for two things: validating the input and dropping any invalid or corrupt messages, which reduces the load on downstream vertices, and assigning the event time from which the watermark for the entire pipeline is derived.
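
To make this concrete, here is a minimal Go sketch of the kind of logic a source transformer like ours might run. The JSON payload shape, struct fields, and function names are hypothetical, illustrative only, and not the exact Numaflow SDK interface or our production code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// rawEvent is a hypothetical shape for the service metrics we ingest from Kafka.
type rawEvent struct {
	Service    string  `json:"service"`
	StatusCode int     `json:"status_code"`
	LatencyMs  float64 `json:"latency_ms"`
	EventTime  int64   `json:"event_time"` // epoch milliseconds set by the producer
}

// transformResult mirrors, conceptually, what a source transformer returns:
// either a signal to drop the message, or the payload plus the event time
// that drives watermarking for the rest of the pipeline.
type transformResult struct {
	Drop      bool
	Payload   []byte
	EventTime time.Time
}

// transform validates the incoming record and assigns its event time.
// Invalid or corrupt records are dropped here so downstream vertices never see them.
func transform(value []byte) transformResult {
	var ev rawEvent
	if err := json.Unmarshal(value, &ev); err != nil || ev.Service == "" || ev.EventTime <= 0 {
		return transformResult{Drop: true}
	}
	return transformResult{
		Payload:   value,
		EventTime: time.UnixMilli(ev.EventTime),
	}
}

func main() {
	good := []byte(`{"service":"turbotax-api","status_code":200,"latency_ms":42.5,"event_time":1691625600000}`)
	bad := []byte(`{"service":""}`)

	fmt.Printf("good: %+v\n", transform(good))
	fmt.Printf("bad:  %+v\n", transform(bad))
}
```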

Streamlined aggregation process

Numaflow allows us to transform and enrich the payload, adding the relevant information so that we can calculate our golden signals and easily export our data to different platforms without any issues.

We then use the user-defined Pre-Aggregation vertex to apply transformations and enrichment to the payload before we send it for aggregation. This vertex contains most of the business logic for adding the relevant information to the message payload so that we can calculate our golden signals.
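
As a rough illustration, the sketch below shows what this kind of enrichment might look like in Go. The field names and the error/window rules are hypothetical placeholders, not our actual business logic:

```go
package main

import (
	"fmt"
	"time"
)

// metric is a hypothetical enriched record produced by a pre-aggregation step;
// the real payload and fields differ, this only sketches the idea.
type metric struct {
	Service   string
	IsError   bool      // contributes to the Errors signal
	LatencyMs float64   // contributes to the Latency signal
	Window    time.Time // one-minute bucket the record belongs to
}

// enrich tags a raw observation with everything the aggregation step needs,
// so that the reduce vertex only has to count and sum.
func enrich(service string, statusCode int, latencyMs float64, eventTime time.Time) metric {
	return metric{
		Service:   service,
		IsError:   statusCode >= 500, // illustrative error rule
		LatencyMs: latencyMs,
		Window:    eventTime.Truncate(time.Minute),
	}
}

func main() {
	m := enrich("turbotax-api", 502, 131.7, time.Now())
	fmt.Printf("%+v\n", m)
}
```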

Next, the Aggregation vertex, which is a MapReduce vertex with a fixed window, aggregates the data and sends it to the next vertex for further processing. See the Numaflow documentation for more examples and details of how reduce vertices with fixed windows work.
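
Although our actual aggregation vertex is written in Java, the core idea can be sketched in a few lines of Go: group observations by key and fold them into per-window counters that map directly to the golden signals. The types and field names below are illustrative only:

```go
package main

import "fmt"

// observation is one enriched record arriving within a fixed window (hypothetical shape).
type observation struct {
	Service   string
	IsError   bool
	LatencyMs float64
}

// windowAggregate accumulates golden-signal inputs for one service over one fixed window.
type windowAggregate struct {
	Count      int64   // Traffic: requests seen in the window
	Errors     int64   // Errors: failed requests in the window
	LatencySum float64 // used to derive average Latency
}

// reduce folds observations, grouped by service, into per-window aggregates --
// conceptually similar to what a fixed-window reduce vertex emits once per key per minute.
func reduce(events []observation) map[string]windowAggregate {
	out := map[string]windowAggregate{}
	for _, e := range events {
		agg := out[e.Service]
		agg.Count++
		if e.IsError {
			agg.Errors++
		}
		agg.LatencySum += e.LatencyMs
		out[e.Service] = agg
	}
	return out
}

func main() {
	events := []observation{
		{"turbotax-api", false, 40},
		{"turbotax-api", true, 900},
		{"quickbooks-api", false, 65},
	}
	for svc, agg := range reduce(events) {
		fmt.Printf("%s: traffic=%d errors=%d avgLatencyMs=%.1f\n",
			svc, agg.Count, agg.Errors, agg.LatencySum/float64(agg.Count))
	}
}
```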

After the data is aggregated, the Post-Aggregation vertex transforms the aggregated data into different formats and tags it, and Numaflow takes care of routing the messages to the correct sink based on the tags. The data exits the pipeline into two different sinks: a Kafka sink, which is available out of the box and only needs a bit of configuration, and a custom HTTP sink through which we export the data to our visualization tool. With Numaflow’s support for multiple sinks, we were able to easily export our data to different platforms without any issues.
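
Conceptually, the post-aggregation step looks something like the following Go sketch: the same aggregate is rendered in two formats and tagged, so that tag-based routing can send each copy to the right sink. The tag names and output formats here are made up for illustration and are not our production schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taggedMessage mimics the idea of a UDF output: a payload plus tags that the
// pipeline uses to pick a sink. The tag values below are hypothetical.
type taggedMessage struct {
	Tags    []string
	Payload []byte
}

// goldenSignals is one hypothetical output format for a service's one-minute window.
type goldenSignals struct {
	Service      string  `json:"service"`
	Traffic      int64   `json:"traffic"`
	ErrorRate    float64 `json:"error_rate"`
	AvgLatencyMs float64 `json:"avg_latency_ms"`
}

// postAggregate renders the same aggregate twice -- once for the Kafka sink feeding
// anomaly detection, once for the HTTP sink feeding visualization -- and tags each
// copy so it is routed to the right sink.
func postAggregate(service string, count, errors int64, latencySum float64) []taggedMessage {
	gs := goldenSignals{
		Service:      service,
		Traffic:      count,
		ErrorRate:    float64(errors) / float64(count),
		AvgLatencyMs: latencySum / float64(count),
	}
	kafkaPayload, _ := json.Marshal(gs)
	httpPayload, _ := json.Marshal(map[string]any{"series": gs.Service, "points": gs})
	return []taggedMessage{
		{Tags: []string{"to-kafka"}, Payload: kafkaPayload},
		{Tags: []string{"to-http"}, Payload: httpPayload},
	}
}

func main() {
	for _, m := range postAggregate("turbotax-api", 1200, 12, 54000) {
		fmt.Printf("%v -> %s\n", m.Tags, m.Payload)
	}
}
```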

Easy monitoring and debugging through intuitive user interface

Numaflow comes equipped with an intuitive user interface out of the box. As seen below, it provides a visual representation of all the vertices and edges, along with the number of pods running on each vertex. A quick look tells us the amount of data being processed at each vertex, the watermark progression, and the number of events in the buffer. Any back pressure on a vertex is also highlighted.

Furthermore, clicking on any vertex expands its details, such as CPU and memory usage per pod, and logs from both the platform and the user-defined function containers. We’ve taken advantage of this UI extensively, not only to get a quick overview of the pipeline but also to debug issues.

Conclusion

Overall, Numaflow has been a great choice for us, as we are constantly striving for operational excellence across Intuit through our observability capabilities. Our positive experience with Numaflow has motivated us to continue exploring new ways to reliably monitor system health and performance and detect issues earlier.

If you’re interested in trying out Numaflow, visit the Numaflow website for more information. It’s very easy to get started: check out our Quick Start Guide to set up a few example pipelines in just a few minutes!

Questions? Comments? Eager to learn about how others use Numaflow or share your own experiences? Join the Numaproj Slack channel. In the meantime, feel free to browse the Numaproj blog for the latest features, releases, and use cases.

We’re always looking for feedback and contributions, so if you like what you see, please consider giving our Numaflow project a star on GitHub. Thank you and we look forward to seeing what you’ll build with Numaflow!
