Cycles of innovation in data management and analytics are driving a reversal of classic ETL (Extract-Transform-Load) into ELT (Extract-Load-Transform). The implications of this reversal go well beyond the order of the steps.
The cornerstone of ETL was a schema, or many schemas, defined in advance; nothing happened on the fly. The source data could be incomprehensible, but ETL always knew its target. ELT does not necessarily know the target up front.
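To make the contrast concrete, here is a minimal sketch (the table and column names are hypothetical, and sqlite3 merely stands in for a real target platform): ETL transforms the data before loading it into a target schema it already knows, while ELT loads the raw data first and performs the transformation inside the target.

```python
# Minimal ETL vs. ELT sketch. sqlite3 stands in for the target warehouse;
# table and column names are invented for illustration.
import sqlite3

raw_orders = [
    {"id": "1", "amount": "19.99", "country": "fr"},
    {"id": "2", "amount": "5.00",  "country": "US"},
]

con = sqlite3.connect(":memory:")

# --- ETL: transform in the pipeline, then load into a known target schema ---
con.execute("CREATE TABLE orders_etl (id INTEGER, amount REAL, country TEXT)")
cleaned = [(int(r["id"]), float(r["amount"]), r["country"].upper()) for r in raw_orders]
con.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)

# --- ELT: load the raw data as-is, then transform inside the target ---
con.execute("CREATE TABLE orders_raw (id TEXT, amount TEXT, country TEXT)")
con.executemany(
    "INSERT INTO orders_raw VALUES (?, ?, ?)",
    [(r["id"], r["amount"], r["country"]) for r in raw_orders],
)
con.execute(
    """CREATE TABLE orders_elt AS
       SELECT CAST(id AS INTEGER) AS id,
              CAST(amount AS REAL) AS amount,
              UPPER(country)       AS country
       FROM orders_raw"""
)

print(con.execute("SELECT * FROM orders_elt").fetchall())
```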
Today, across many variations of information architecture, "prep" (shorthand for turning raw data into useful materializations, metadata, abstractions, processes, pipelines, and models) continues to play a crucial role. With ELT, its role is elevated, because the "T" in ELT is more lightweight than its predecessor. The many ways of handling information today, cloud architectures, data lakes, data lakehouses, and cloud-native data warehouses, each require distinct prep capabilities for data science, AI, analytics, edge intelligence, and data warehousing.
The terms "data prep" (or only "prep") and "wrangling" (or "data wrangling") refer to the preparation of data for analytics, reporting or any downstream applications. Prep refers to directly manipulating after accessing it from a data source or sources. In the past, wrangling was referred separately to preparing data during the interactive data analysis and model building. For the sake of simplicity, we use "prep" for both. In the past, they were different disciplines, but they are nearly the same now.
In the same way, a profusion of data management architectures aligned with the ELT approach now exists, and prep plays an essential role in each of them.
"Après Moi le deluge." After me, the flood," attributed to French King Louis XV or Madame de Pompadour (history is a little fuzzy about the attribution), alludes to signs about the approaching Revolution. The virtually unlimited amount of data, processing capacity, and tools to leverage it are a modern deluge without a doubt. The challenge today is not the volume of data. It's making sense of it, at scale, continuously.
The vision of big data freed organizations to capture far more data sources at lower levels of detail and vastly greater volumes, which exposed a massive semantic dissonance problem. For example, data science always consumes "historical" data, and there is no guarantee that the semantics of older datasets are the same, even if their names are. Pushing data to a data lake and assuming it is ready for use is a mistake.
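A toy example of that semantic dissonance, with invented values: two extracts share a column name, but one records revenue in dollars and the other in cents. Even a crude distribution check can flag the mismatch before it contaminates an analysis.

```python
# Toy illustration of semantic dissonance between "historical" datasets.
# Both extracts call the column "revenue", but the units differ; a simple
# magnitude check flags the inconsistency. Values are invented.
from statistics import median

revenue_2019 = [104.5, 98.0, 120.25, 87.9]   # recorded in dollars
revenue_2021 = [10450, 9800, 12025, 8790]    # same column name, recorded in cents

def looks_consistent(old, new, tolerance=5.0):
    """Return False if the typical magnitude shifts by more than `tolerance`x."""
    ratio = median(new) / median(old)
    return (1 / tolerance) <= ratio <= tolerance

print(looks_consistent(revenue_2019, revenue_2021))  # False -> investigate semantics
```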
Heeding the call to be "data-driven" (we prefer the term "data-aware"), organizations strove to ramp up skills in all manner of predictive modeling, machine learning, AI, and even deep learning. And, of course, existing analytics could not be left behind, so any solution must satisfy those requirements as well. Integrating data from your own ERP and CRM systems may be a chore, but for today's data-aware applications, the fabric of data is multi-colored. So-called secondary data, such as structured interviews, transcripts from focus groups, email, query logs, published texts, literature reviews, and observation records, is challenging to interpret. Records written and kept by individuals (such as diaries and journals) and accessed by other people are also secondary sources.
The primary issue is that enterprise data no longer lives solely in a data center, or even in a single cloud; it may span several clouds, or combinations of cloud and on-premises systems. Edge analytics for IoT, for example, capture, digest, curate, and even pull data from disparate application platforms and live connections to partners, previously a snail-like process relying on aging mechanisms such as EDI or batch ETL. Edge computing can be thought of as decentralized from on-premises networks, cellular networks, data center networks, or the cloud. As a result, data originates in far-flung environments where its structures and semantics are not well understood or documented. Moving data smoothly from place to place, or moving the logic to the data while everything is in motion, is too complex for manual methods. The difficulty is that these sources are semantically incompatible, yet only by drawing on multiple sources can useful analytics be derived.
Today's data warehouses only slightly resemble their predecessors of a decade ago. Gone are the single physical location and the "single version of the truth" schema, and the lines between data warehouses and so-called "data lakes" have blurred. The mixed workloads of a decade ago are now ultra-mixed workloads, serving the usual discovery and analytics plus data science, AI/ML, and private and public data exchanges, while providing a platform for data of all types and formats. Clusters of servers, even distributed ones, are now the norm. As a result, the valuable qualities of ETL are insufficient for this evolved ecosystem, and ELT emerged as the logical choice for data warehousing.
With the onslaught of AI/ML, data volumes, cadence, and complexity have exploded. Cloud providers such as Amazon AWS, Microsoft Azure, Google, and Alibaba not only offer capacity beyond what the data center can provide; their current and emerging capabilities and services are driving the execution of AI/ML away from the data center. To make the transition, or to streamline processes already there, a cloud-ready data prep platform is essential, because it enables all the enhancements a cloud environment offers:
At the start of a project, data scientists and data engineers search for, assemble, and manipulate data sources. At each step, prep tools assist with:
Though not entirely "self-service," prep is designed for people who understand the data and its uses, and it very seldom requires the dev/test/prod coding cycle. In the context of using information in your work, this seems like a pretty good idea. Going back decades, you had only two choices for gathering data for reporting and analysis: putting in a request to IT for a report and waiting six months or more, or getting a microfiche reader. Compared to that, wrangling some data and developing a report, a dashboard, or a visualization was a huge step forward.
Hadoop, which was indifferent to the size and type of files it could process, in contrast to the rigid and far less scalable nature of relational data warehouses, hatched the idea of a single place for everything in the organization: the data lake. Though it did simplify searching for and locating data, it provided no analytical processing tools at all. Moving a JSON file from Paris, France, to a data lake in Paris, Texas, adds no value beyond some economies of scale in storage and per-unit processing.
Something to consider is that a single "organization" is often a fiction. While most organizations may have one "strategy" for data architecture, the result of acquisitions, legacy systems, geography, and the usual punctuated progress is often a collection of architectures, distributed both physically and architecturally. The best advice is to:
According to Databricks, "A data lakehouse is a new, open data management paradigm that combines the capabilities of data lakes and data warehouses, enabling BI and ML on all data. Merging them into a single system means that data teams can move faster to use data without accessing multiple systems." This statement is more aspirational than fact. Data warehouses represent sixty years of continuous (though not consistently smooth) progress and provide all the services that are needed, such as:
There are principally four cloud-native data warehouses (CDWs): AWS Redshift, Azure Synapse, Snowflake, and Google BigQuery. Many other relational data warehouse technologies offer acceptable cloud versions, but the cloud-natives claim the high ground for now. They provide all of the functions listed above at a reasonable level of maturity, rather than as bolt-on capabilities atop generic cloud features. The picture does get a little blurry, though, because the CDWs provide more than a traditional data warehouse; one, for example, provides a public data exchange market.
ELT and prep work perfectly with CDW because:
It is worth noting that despite the relentless expansion of data management and the technologies that enable it, not everything is solved with technology. Forty years ago, a major stumbling block in data warehousing and Business Intelligence was the inability to understand data at its source. That problem still exists. Data can never be taken at face value, and context is always critical.
A semi-cryptic attribute name may seem like a match, but it will take considerable progress in deep learning and NLP, looking across millions of instances, to confirm whether that assumption is correct. This undercuts the premise of the data lake, which overlooked the problem entirely. Semantic reconciliation may be our most challenging problem going forward (other than the lyrics of "Louie, Louie"), exacerbated by algorithmic opacity in AI. Prep, however, is the key to progress.
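To illustrate the idea in its simplest possible form (column names and values are invented, and real systems would apply NLP/ML over far larger samples), the sketch below compares the values two similarly named attributes actually hold rather than trusting the names themselves.

```python
# A deliberately simple sketch of instance-based matching: instead of trusting
# that "cust_st" and "customer_state" mean the same thing, compare the values
# they actually hold. Names and values are invented for illustration.
def value_overlap(col_a, col_b):
    """Jaccard similarity of the distinct values in two columns."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

cust_st        = ["TX", "CA", "NY", "TX", "WA"]
customer_state = ["CA", "NY", "TX", "FL"]
order_status   = ["shipped", "pending", "shipped"]

print(value_overlap(cust_st, customer_state))  # high overlap -> likely the same concept
print(value_overlap(cust_st, order_status))    # no overlap -> the name is misleading
```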
A high-level list of the prep functions AI could enable:
In other words, plenty of work ahead.