Cycles of innovation in data management and analytics are driving a reversal of classic ETL (Extract-Transform-Load) into ELT (Extract-Load-Transform). The implications of this reversal go well beyond the order of the steps.
The cornerstone of ETL was a schema, or many schemas, defined in advance; nothing happened on the fly. The source data could be incomprehensible, but ETL always knew its target. ELT does not necessarily know the target up front.
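To make the contrast concrete, here is a minimal sketch (the table and column names are hypothetical, and sqlite3 merely stands in for a real target platform): ETL transforms the data before loading it into a target schema it already knows, while ELT loads the raw data first and performs the transformation inside the target.

```python
# Minimal ETL vs. ELT sketch. sqlite3 stands in for the target warehouse;
# table and column names are invented for illustration.
import sqlite3

raw_orders = [
    {"id": "1", "amount": "19.99", "country": "fr"},
    {"id": "2", "amount": "5.00",  "country": "US"},
]

con = sqlite3.connect(":memory:")

# --- ETL: transform in the pipeline, then load into a known target schema ---
con.execute("CREATE TABLE orders_etl (id INTEGER, amount REAL, country TEXT)")
cleaned = [(int(r["id"]), float(r["amount"]), r["country"].upper()) for r in raw_orders]
con.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)

# --- ELT: load the raw data as-is, then transform inside the target ---
con.execute("CREATE TABLE orders_raw (id TEXT, amount TEXT, country TEXT)")
con.executemany(
    "INSERT INTO orders_raw VALUES (?, ?, ?)",
    [(r["id"], r["amount"], r["country"]) for r in raw_orders],
)
con.execute(
    """CREATE TABLE orders_elt AS
       SELECT CAST(id AS INTEGER) AS id,
              CAST(amount AS REAL) AS amount,
              UPPER(country)       AS country
       FROM orders_raw"""
)

print(con.execute("SELECT * FROM orders_elt").fetchall())
```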
Today, across many variations of information architecture, "prep" (shorthand for turning raw data into useful materializations, metadata, abstractions, processes, pipelines, and models) continues to play a crucial role. With ELT, its role is elevated, because the "T" in ELT is more lightweight than its predecessor. The many ways of handling information today, cloud architectures, data lakes, data lakehouses, and cloud-native data warehouses, each require distinct prep capabilities for data science, AI, analytics, edge intelligence, and data warehousing.
The terms "data prep" (or only "prep") and "wrangling" (or "data wrangling") refer to the preparation of data for analytics, reporting or any downstream applications. Prep refers to directly manipulating after accessing it from a data source or sources. In the past, wrangling was referred separately to preparing data during the interactive data analysis and model building. For the sake of simplicity, we use "prep" for both. In the past, they were different disciplines, but they are nearly the same now.
In the same way, a profusion of data management architectures aligned with the ELT approach now exists, and prep plays an essential role in each of them.
"Après Moi le deluge." After me, the flood," attributed to French King Louis XV or Madame de Pompadour (history is a little fuzzy about the attribution), alludes to signs about the approaching Revolution. The virtually unlimited amount of data, processing capacity, and tools to leverage it are a modern deluge without a doubt. The challenge today is not the volume of data. It's making sense of it, at scale, continuously.
The vision of big data freed organizations to capture far more data sources at lower levels of detail and vastly greater volumes, which exposed a massive semantic dissonance problem. For example, data science always consumes "historical" data, and there is no guarantee that the semantics of older datasets are the same, even if their names are. Pushing data to a data lake and assuming it is ready for use is a mistake.
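A toy example of that semantic dissonance, with invented values: two extracts share a column name, but one records revenue in dollars and the other in cents. Even a crude distribution check can flag the mismatch before it contaminates an analysis.

```python
# Toy illustration of semantic dissonance between "historical" datasets.
# Both extracts call the column "revenue", but the units differ; a simple
# magnitude check flags the inconsistency. Values are invented.
from statistics import median

revenue_2019 = [104.5, 98.0, 120.25, 87.9]   # recorded in dollars
revenue_2021 = [10450, 9800, 12025, 8790]    # same column name, recorded in cents

def looks_consistent(old, new, tolerance=5.0):
    """Return False if the typical magnitude shifts by more than `tolerance`x."""
    ratio = median(new) / median(old)
    return (1 / tolerance) <= ratio <= tolerance

print(looks_consistent(revenue_2019, revenue_2021))  # False -> investigate semantics
```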
Heeding the call to be "data-driven" (we prefer the term "data-aware"), organizations strove to ramp up skills in all manner of predictive modeling, machine learning, AI, and even deep learning. And, of course, existing analytics could not be left behind, so any solution must satisfy those requirements as well. Integrating data from your own ERP and CRM systems may be a chore, but for today's data-aware applications, the fabric of data is multi-colored. So-called secondary data, such as structured interviews, transcripts from focus groups, email, query logs, published texts, literature reviews, and observation records, is challenging to interpret. Records written and kept by individuals (such as diaries and journals) and accessed by other people are also secondary sources.
The primary issue is that enterprise data no longer lives solely in a data center, or even in a single cloud; it may span several clouds, or combinations of cloud and on-premises systems. Edge analytics for IoT, for example, capture, digest, curate, and even pull data from disparate application platforms and live connections to partners, previously a snail-like process relying on aging mechanisms such as EDI or batch ETL. Edge computing can be thought of as decentralized from on-premises networks, cellular networks, data center networks, or the cloud. As a result, data originates in far-flung environments where its structures and semantics are not well understood or documented. Moving data smoothly from place to place, or moving the logic to the data while everything is in motion, is too complex for manual methods. The difficulty is that these sources are semantically incompatible, yet only by drawing on multiple sources can useful analytics be derived.
Today's data warehouses only slightly resemble their predecessors of a decade ago. Gone are the single physical location and the "single version of the truth" schema, and the lines between data warehouses and so-called "data lakes" have blurred. The mixed workloads of a decade ago are now ultra-mixed workloads, serving the usual discovery and analytics plus data science, AI/ML, and private and public data exchanges, while providing a platform for data of all types and formats. Clusters of servers, even distributed ones, are now the norm. As a result, the valuable qualities of ETL are insufficient for this evolved ecosystem, and ELT emerged as the logical choice for data warehousing.
With the onslaught of AI/ML, data volumes, cadence, and complexity have exploded. Cloud providers such as Amazon AWS, Microsoft Azure, Google, and Alibaba not only offer capacity beyond what the data center can provide; their current and emerging capabilities and services are driving the execution of AI/ML away from the data center. To make the transition, or to streamline processes already there, a cloud-ready data prep platform is essential, because it enables all the enhancements a cloud environment offers:
At the start of a project, data scientists and data engineers search for, assemble, and manipulate data sources. At each step, prep tools assist with:
Though not entirely "self-service," prep is designed for people who understand the data and its uses, and it very seldom requires the dev/test/prod coding cycle. In the context of using information in your work, this seems like a pretty good idea. Going back decades, you had only two choices for gathering data for reporting and analysis: putting in a request to IT for a report and waiting six months or more, or getting a microfiche reader. Compared to that, wrangling some data and developing a report, a dashboard, or a visualization was a huge step forward.
Hadoop, which was indifferent to the size and type of files it could process, in contrast to the rigid and far less scalable nature of relational data warehouses, hatched the idea of a single place for everything in the organization: the data lake. Though it did simplify searching for and locating data, it provided no analytical processing tools at all. Moving a JSON file from Paris, France, to a data lake in Paris, Texas, adds no value beyond some economies of scale in storage and per-unit processing.
Something to consider is that a single "organization" is often a fiction. While most organizations may have one "strategy" for data architecture, the result of acquisitions, legacy systems, geography, and the usual punctuated progress is often a collection of architectures, distributed both physically and architecturally. The best advice is to:
According to Databricks, "A data lakehouse is a new, open data management paradigm that combines the capabilities of data lakes and data warehouses, enabling BI and ML on all data. Merging them into a single system means that data teams can move faster to use data without accessing multiple systems." This statement is more aspirational than fact. Data warehouses represent sixty years of continuous (though not consistently smooth) progress and provide all the services that are needed, such as:
There are principally four cloud-native data warehouses (CDWs): AWS Redshift, Azure Synapse, Snowflake, and Google BigQuery. Many other relational data warehouse technologies offer acceptable cloud versions, but the cloud-natives claim the high ground for now. They provide all of the functions listed above at a reasonable level of maturity, rather than as bolt-on capabilities atop generic cloud features. The picture does get a little blurry, though, because the CDWs provide more than a traditional data warehouse; one, for example, provides a public data exchange market.
ELT and prep work perfectly with CDW because:
It is worth noting that despite the relentless expansion of data management and the technologies that enable it, not everything is solved with technology. Forty years ago, a major stumbling block in data warehousing and Business Intelligence was the inability to understand data at its source. That problem still exists. Data can never be taken at face value, and context is always critical.
A semi-cryptic attribute name may seem like a match, but it will take considerable progress in deep learning and NLP, looking across millions of instances, to confirm whether that assumption is correct. This undercuts the premise of the data lake, which overlooked the problem entirely. Semantic reconciliation may be our most challenging problem going forward (other than the lyrics of "Louie, Louie"), exacerbated by algorithmic opacity in AI. Prep, however, is the key to progress.
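To illustrate the idea in its simplest possible form (column names and values are invented, and real systems would apply NLP/ML over far larger samples), the sketch below compares the values two similarly named attributes actually hold rather than trusting the names themselves.

```python
# A deliberately simple sketch of instance-based matching: instead of trusting
# that "cust_st" and "customer_state" mean the same thing, compare the values
# they actually hold. Names and values are invented for illustration.
def value_overlap(col_a, col_b):
    """Jaccard similarity of the distinct values in two columns."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

cust_st        = ["TX", "CA", "NY", "TX", "WA"]
customer_state = ["CA", "NY", "TX", "FL"]
order_status   = ["shipped", "pending", "shipped"]

print(value_overlap(cust_st, customer_state))  # high overlap -> likely the same concept
print(value_overlap(cust_st, order_status))    # no overlap -> the name is misleading
```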
A high-level list of the prep functions AI could enable:
In other words, plenty of work ahead.