Data Reliability Engineering (DRE) is the work done to keep data pipelines delivering fresh, high-quality data to the users and applications that depend on them. The goal of DRE is to let teams iterate on data infrastructure, the logical data model, and so on as quickly as possible while still guaranteeing (and this is the key part) that the data is usable for the applications that depend on it.
End users (data scientists examining A/B test results, executives looking at dashboards, customers seeing product recommendations, and so on) don’t care about data quality in the abstract. They care about whether the data they’re seeing is useful for the task at hand. DRE focuses on quantifying and meeting those needs without slowing down the organization’s ability to grow and evolve its data architecture.
DRE borrows the core concepts of Site Reliability Engineering (SRE), which is used at companies like Google, Meta, Netflix, and Stripe to iterate quickly while keeping their products reliable 24/7. These concepts bring a methodical and quantified approach to defining quality, gracefully handling problems, and aligning teams to balance speed and reliability.
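To make "quantified" concrete, here is a minimal sketch of how a data team might express a freshness target in SRE terms: a service level indicator (was the table fresh at check time?), a service level objective (what share of checks must pass), and an error budget (how many failures are still tolerable). The names `FreshnessSLO`, `is_fresh`, and `error_budget_remaining` are hypothetical and chosen for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class FreshnessSLO:
    """Hypothetical SLO: data must be no older than `max_staleness`
    in at least `target` of all checks within a reporting window."""
    max_staleness: timedelta
    target: float  # e.g. 0.99 means 99% of checks must pass


def is_fresh(last_loaded_at: datetime, checked_at: datetime, slo: FreshnessSLO) -> bool:
    """SLI for a single check: was the data recent enough at check time?"""
    return checked_at - last_loaded_at <= slo.max_staleness


def error_budget_remaining(results: Sequence[bool], slo: FreshnessSLO) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not results:
        return 1.0  # no checks yet, so nothing has been spent
    allowed_failures = (1.0 - slo.target) * len(results)
    failures = sum(1 for passed in results if not passed)
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 1.0 if failures == 0 else float("-inf")
    return 1.0 - failures / allowed_failures


if __name__ == "__main__":
    slo = FreshnessSLO(max_staleness=timedelta(hours=1), target=0.9)
    # Assumed sample window: 20 hourly checks, one of which found stale data.
    window = [True] * 19 + [False]
    print(f"Error budget remaining: {error_budget_remaining(window, slo):.0%}")
```

A team could use the remaining budget as a shared signal: while budget is left, schema changes and pipeline refactors ship freely; once it is spent, reliability work takes priority. That is one way to balance speed and reliability, not the only one.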
Nobody needs to be told how critical data is becoming to nearly every industry. As more roles, not just data science and engineering professionals, interact with data, whether through self-service analytics or the outputs of machine learning models, there’s more demand for it to “just work” every hour of every day.
But in addition to having more users and more use cases to serve, data teams are simultaneously dealing with larger and more diverse volumes of data. Thanks to Snowflake, Databricks, Airflow, dbt, and other modern data infra tools, it’s never been easier to reach a scale where ad hoc approaches can’t keep up.
While the most obvious big-data companies like Uber, Airbnb, and Netflix felt these pains sooner and led much of the foundational work in this discipline, DRE is rapidly catching on more broadly.
The seven principles from Google’s SRE Handbook provide a great starting point for DRE, which adapts them to data warehouses and pipelines instead of software applications.
While one could argue that data reliability engineering is still an emerging concept, modern companies (Uber, DoorDash, Instacart, etc.) that use data to operate and grow their businesses are leading the charge to establish DRE as a standard practice, and job postings for the role are already becoming more common. Given the pace of business and the need for data to be trusted, expect DRE to someday be as prevalent as SRE is now.
© 2022 LeackStat.com