
How Factorial builds real-time data products with Confluent and Tinybird


Founded in 2016, Factorial took aim at the stagnant HR software market with their own unique twist. Factorial is present in over 65 countries, having raised $220 million in venture funding, and employs over 950 people. The company’s goal is to bring modern HR solutions to more than 8,000 businesses worldwide, automating mundane HR tasks so that People leaders can focus on people, not paperwork.

Since its founding, Factorial has worked diligently to understand its customers’ needs and has evolved from simple HR software to an integrated HR operations, finance, and people solution. To build new data-driven features and scale their business to new markets and larger customers, Factorial turned to Confluent Cloud and Tinybird. That decision enabled them to accelerate their time to market and build great new customer experiences like Job Catalog, Audit Log, and Attendance without sacrificing reliability. With help from Confluent and Tinybird, Factorial has improved its data freshness and reduced query latency, leading to significantly faster launches of user-facing features.

Tinybird gives users a simple means of transforming and enriching data streams and publishing them as high-concurrency, low-latency APIs through a native integration with Confluent Cloud.

In the beginning: batch pipelines on a data lake house 

Like many companies, Factorial began with a traditional batch pipeline that was easy to set up in its early days, guided by the team’s previous experience building scalable data architectures. They knew that they would eventually need to adopt a data mesh architecture to achieve future scale, pushing the responsibility of defining data products out to domain area experts. Their applications use MySQL as a transactional database backend, which also served their early small-scale analytics use cases. 

Rather than adopting a data warehouse, Factorial opted to build a data lake house on top of native AWS services. This was fed by capturing scheduled snapshots from MySQL, processing the data into Parquet files, and storing them in Amazon S3. They process data using managed Spark, organized with AWS Glue, with Amazon Athena giving users the ability to query the lake with SQL. 

This was a simple and effective setup, and at the time, their data team consisted of a single engineer. However, as the product evolved, developers increasingly needed to use this data to build user-facing features, and this architecture did not satisfy two non-negotiable requirements: data freshness and low query latency.

Meeting Data Consumer Requirements

Freshness

Although the lake house architecture proved easy to implement and served internal reporting use cases perfectly, its schedule-driven load pipelines made the data difficult for developers to build on. People interacting with user-facing analytical features expect up-to-date data, yet the data available in the lake house was often stale: hours old at best, and in some cases days old. This dramatically reduced the value of the data to an end user and made it unattractive for developers to use it to power new product features.

The developers made their requirements clear: they needed fresh data.

If Factorial’s data team could not deliver fresher data, the development team might have been forced to build and operate its own backend to serve the use case. That could lead to problems the data team hoped to avoid, such as too many disparate systems to support, rising costs, and reduced confidence in overall data quality: the data world’s very own Shadow IT.

Low Latency

While developers could run queries in Athena, the latency of these queries ranged from seconds to minutes. For running a report, that’s not too bad. But Factorial’s developers were aiming to power new user-facing applications like Audit Logs, and a large part of Factorial’s success can be attributed to the high quality of their user experience. Building critical features on queries that could take minutes to return a response was not an option, as users would be left waiting.

This led to the second requirement: faster queries. 

The data team made several changes to the existing stack in an effort to improve the situation, such as helping to optimize queries and building custom Spark jobs that pre-processed some data to reduce the required complexity of queries. But they still could not satisfy the requirement. Many of their use cases required enriching fresh streaming data with sets of historical data, and they quickly ran into the common limitations of stateful stream processing, which could not access large historical data sets.

Meeting Data Team Requirements

To meet these requirements, the Factorial team decided to add a real-time streaming layer to their data architecture. Given that this streaming layer would be used to create business-critical, revenue-generating product features, the stakes were high to deliver a solution that developers could trust. However, the Factorial data team still had only one full-time data engineer. With such constrained resources, delivering a new bulletproof data architecture could prove challenging. Entering the discovery phase, Factorial’s data team had its own requirements: absolute reliability and near-zero operational overhead.

Reliability

When infrastructure is used to directly serve user-facing production features, reliability is critical. Factorial’s success is due, in part, to the quality of its user experience, which is directly impacted by the availability of the systems that the product relies on. Above all else, the components that make up Factorial’s streaming layer must always remain available.

Low Overhead

Adding new components to your architecture never comes for free. Even after expanding their data team to two engineers, Factorial maintained a small, agile, and cost-efficient team that shared the responsibility for keeping costs down. To meet the needs of the business, the data team needed new tooling, and they had to answer the question: build or buy? Factorial compared the total cost of ownership of hosting their own services against the cost of paying for a managed service. They concluded that maintaining a reliable service capable of powering production features would require hiring dedicated engineers on top of the infrastructure costs, and that it would still take time away from delivering use cases. Combined, this would prove more expensive than buying a managed service, which would handle all of the operational overhead for them and free them to focus on delivering business value.

Confluent Cloud and Tinybird brought data streaming and real-time queries 

To improve data freshness, the team decided to switch from a batch process of scheduled snapshots to capturing changes from MySQL in real time. They used the MySQL CDC Source (Debezium) Connector for Confluent Cloud to implement Change Data Capture (CDC) over their production MySQL database. The connector captures changes from the database and writes them to Apache Kafka®.
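To make the idea of change data capture concrete, here is a minimal Python sketch of how a downstream consumer might replay change events to keep a copy of a table current. It assumes events shaped like Debezium's change envelope, with an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) plus `before` and `after` row images. The table, column names, and sample events are hypothetical and are not taken from Factorial's schema.

```python
# Minimal sketch: replaying Debezium-style change events into an in-memory table.
# The rows and column names below are hypothetical illustrations.

def apply_change(table, event):
    """Apply one change event to `table`, a dict keyed by primary key `id`."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row in "after".
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        # A delete carries only the old row, in "before".
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    return table


events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "name": "Ana", "team": "HR"}},
    {"op": "c", "before": None,
     "after": {"id": 2, "name": "Luis", "team": "Ops"}},
    {"op": "u", "before": {"id": 1, "name": "Ana", "team": "HR"},
     "after": {"id": 1, "name": "Ana", "team": "Finance"}},
    {"op": "d", "before": {"id": 2, "name": "Luis", "team": "Ops"},
     "after": None},
]

employees = {}
for event in events:
    apply_change(employees, event)

print(employees)
```

Because every insert, update, and delete arrives as its own event, the copy is only ever as stale as the most recent event that has not yet been applied, rather than as stale as the last scheduled snapshot.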

Additionally, Factorial’s data team saw that Kafka could serve as a reliable buffer for data; if there were any failures downstream, data could buffer in Kafka and be retried. They assessed alternative tooling, such as Amazon Kinesis and Google Pub/Sub, but found that the offset semantics in Kafka were more flexible, allowing them to more easily resume data consumers from a previous message in the stream.
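The buffering and resume behavior described above can be illustrated with a toy model. The sketch below is plain Python, not the Kafka client API: `PartitionLog`, `consume`, and the flaky handler are hypothetical names that imitate a single partition as an append-only list, with a consumer that records the next offset to read and picks up from there after a downstream failure.

```python
# Toy model of offset-based consumption. This is NOT the Kafka client API;
# it only imitates one partition and one consumer to show the resume logic.

class PartitionLog:
    """Append-only log; each record's offset is its position in the list."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset):
        for position in range(offset, len(self._records)):
            yield position, self._records[position]


def consume(log, start_offset, handler):
    """Process records from `start_offset`; return the next offset to read."""
    committed = start_offset
    try:
        for offset, value in log.read_from(start_offset):
            handler(value)
            committed = offset + 1  # advance only after successful handling
    except RuntimeError:
        pass  # downstream failure: stop here, the records stay in the log
    return committed


log = PartitionLog()
for value in ["a", "b", "c", "d"]:
    log.append(value)

processed = []
state = {"healthy": False}

def flaky_handler(value):
    # Simulate a downstream outage that hits while handling record "c".
    if value == "c" and not state["healthy"]:
        raise RuntimeError("downstream unavailable")
    processed.append(value)

first_commit = consume(log, 0, flaky_handler)              # stops before "c"
state["healthy"] = True
final_commit = consume(log, first_commit, flaky_handler)   # resumes at "c"

print(first_commit, final_commit, processed)
```

The log keeps every record regardless of what the consumer does, so a failed consumer loses nothing: it restarts from the last offset it recorded, or from any earlier offset if data needs to be reprocessed.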

Their data team members had used Kafka at previous companies and had firsthand experience with the challenges of scaling and maintaining a self-managed, open-source implementation. They understood that Kafka would play a pivotal role in their data pipeline but decided they did not have the resources to invest in running and managing their own Kafka deployment. Confluent Cloud’s fully managed data streaming platform was the right choice for Factorial. 

Built by the original creators of Kafka, Confluent Cloud offered the strict reliability guarantees Factorial needed out of the box and required the least amount of engineering effort to get into production fast. Through Confluent’s 120+ source and sink connectors, they could take advantage of the rich Kafka ecosystem while dedicating almost zero resources to maintaining the platform.

By sending the MySQL CDC stream to Kafka, Factorial eliminated the batch process that introduced significant latency to their data. To complement their streaming pipeline, they also needed a system that would allow their developers to combine their fresh Kafka streams with historical data to power user-facing applications.

The system needed to handle analytical queries on the order of milliseconds, as it would directly power user-facing experiences. It also had to be just as reliable as Confluent Cloud, since together these systems would be the backbone of many production features. And it had to do all of this while bringing the operational overhead down to near zero.

To make real-time analytical queries over data streams available to developers, Factorial chose Tinybird. Tinybird took away all of the operational overhead of managing real-time analytics data infrastructure at scale in production while fulfilling the strict requirements for latency and freshness. Tinybird allows Factorial to ingest from Confluent Cloud in real time, using Tinybird’s native Confluent connector.

This connector is a fully managed integration provided by Tinybird that requires no external ETL or scheduler tooling. This helps Factorial to reduce tooling costs and avoid adding any additional delay to the data so that the freshest data is available to developers within seconds. 

Data streams from Confluent arrive in Tinybird and are enriched with historical data that lives in the platform, avoiding the limitations of stateful stream processors. This means that nearly all data processing can be performed at the time of ingestion, with the result being materialized and ready for developers to push into production features. With only two data engineers, Factorial reduced the average query time for production queries from minutes to sub-50 milliseconds. Tinybird APIs are integrated directly into Factorial’s user-facing product, greatly simplifying their application architecture and saving on further infrastructure and tooling costs.
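The pattern of enriching at ingestion time and serving a materialized result can be sketched conceptually in a few lines of Python. This is not Tinybird's API or query language; the `employees` lookup, the `ingest` and `hours_for` functions, and the attendance events are hypothetical stand-ins for historical data, an ingestion step, and a published endpoint.

```python
from collections import defaultdict

# Conceptual sketch only: hypothetical names, not Tinybird's API.

# "Historical" data already stored in the platform: employee -> team.
employees = {
    1: {"team": "HR"},
    2: {"team": "Ops"},
    3: {"team": "HR"},
}

# Materialized result, kept up to date as events arrive.
hours_by_team = defaultdict(float)


def ingest(event):
    """Enrich an incoming event with its team, then update the aggregate."""
    team = employees.get(event["employee_id"], {}).get("team", "unknown")
    hours_by_team[team] += event["hours"]


def hours_for(team):
    """Serve a query from the precomputed aggregate (a single lookup)."""
    return hours_by_team.get(team, 0.0)


stream = [
    {"employee_id": 1, "hours": 8.0},
    {"employee_id": 2, "hours": 6.5},
    {"employee_id": 3, "hours": 4.0},
    {"employee_id": 9, "hours": 1.0},  # no matching historical row
]

for event in stream:
    ingest(event)

print(hours_for("HR"), hours_for("Ops"), hours_for("unknown"))
```

The join and the aggregation happen once, when each event arrives, so answering a query is a lookup against a precomputed result rather than a scan over raw data. That is the general reason a design like this can respond in milliseconds.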

Together, Confluent Cloud and Tinybird have allowed Factorial to reduce average data staleness from days or hours to mere seconds. At the same time, the platform has reduced query latency from seconds and minutes to milliseconds over richer data. Since being deployed, the Confluent Cloud and Tinybird platform has proven itself several times over. Factorial completed a POC and launched their first production feature in one month, and over the next six months launched more than 12 user-facing product features powered by this real-time pipeline, with many more to come.

Ready to get started?

Start your free trial of Confluent Cloud today. New signups receive $400 to spend during their first 30 days.

Interested in using Tinybird to build real-time data products over streaming data? Sign up for a Build plan—it’s free and has no time limit—or check out pricing for Pro or Enterprise plans.

  • Alasdair has been a consultant in the big data space for a decade, working with the largest enterprises through to the smallest startups. He built the UK Customer Success team at Tinybird, and is now building out their global Developer Relations team.
