Towards a semantics of data in a digital world - why is effective data collaboration so elusive?

Traditionally, the job of gathering and integrating data for analytics fell on data warehouses. Data warehouses cleaned and aggregated the "operational exhaust" of business transactions.

Data warehouses created precision, accuracy and consistency out of messy transactional structures, so that business people could look back, and make sense of the past.

As data warehousing evolved, so did our operational applications. With the advent of the Internet, a whole new challenge to precision, accuracy and consistency emerged.

For business analysts and data scientists alike, the work now involves far more searching for and discovering data than navigating stable structures. It is no longer a solitary effort. Sharing results, persisting new views of data and models, and collaboratively governing catalogs is the approach that works.

It is important to distinguish between tools that provide "connectors" operating on a physical level, either generating SQL from structural metadata or using APIs, and tools that have a rich understanding of the data based on its content, its structure and how it is being used. Connectors are not sufficient. Someone, or something, has to understand the meaning of the data to provide a catalog for analysts to do their work.

Keep in mind that all of the data used by data scientists, analysts and other applications is essentially "used." In other words, it was originally created for purposes other than to be analyzed: supporting operations, capturing events, recording transactions. The data at source, even when it is clear, consistent and error-free, which is rare in an integrated context, will still contain semantic errors, missing entries, or inconsistent formatting for the secondary context of analytics. It must be handled before it can be used. This is especially true if it is meant to be integrated with data from other sources or other previously managed data.
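To make "handled before it can be used" concrete, here is a minimal sketch of the kind of normalization that "used" data typically needs. The field names (customer, order_date, amount) and the list of date layouts are hypothetical, chosen only for illustration; a real source would need its own rules.

```python
from datetime import datetime

# Hypothetical date layouts that one operational source might emit over time.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(raw):
    """Return an ISO date string, or None when the value is missing or unparseable."""
    if raw is None or not raw.strip():
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_amount(raw):
    """Return a float, tolerating thousands separators; None when missing or invalid."""
    if raw is None or str(raw).strip() == "":
        return None
    try:
        return float(str(raw).replace(",", ""))
    except ValueError:
        return None


def clean_record(record):
    """Normalize one operational record for analytic use (illustrative fields only)."""
    customer = (record.get("customer") or "").strip().title()
    return {
        "customer": customer or None,
        "order_date": parse_date(record.get("order_date")),
        "amount": parse_amount(record.get("amount")),
    }
```

Note that the sketch turns anything it cannot interpret into an explicit missing value rather than guessing. Whether that is the right choice depends on the analysis, which is exactly why this step needs someone who understands what the data means.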

Even somewhat stable and understandable data like logs, especially weblogs, tend to drift over time. Can you think of a major website that hasn't gone through a complete refresh in the past three years or so?
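One rough way to notice this kind of drift is to compare which fields appear in an older batch of log records against a newer one. The sketch below assumes log lines have already been parsed into dictionaries; the function names and the 0.5 threshold are illustrative assumptions, not an established method.

```python
def field_profile(records):
    """Fraction of records in which each field appears."""
    counts = {}
    for rec in records:
        for key in rec:
            counts[key] = counts.get(key, 0) + 1
    total = len(records)
    return {key: n / total for key, n in counts.items()} if total else {}


def detect_drift(old_records, new_records, threshold=0.5):
    """Report fields that appeared, disappeared, or changed sharply in coverage."""
    old = field_profile(old_records)
    new = field_profile(new_records)
    report = {"added": [], "dropped": [], "shifted": []}
    for field in sorted(set(old) | set(new)):
        before = old.get(field, 0.0)
        after = new.get(field, 0.0)
        if before == 0.0:
            report["added"].append(field)
        elif after == 0.0:
            report["dropped"].append(field)
        elif abs(before - after) >= threshold:
            report["shifted"].append(field)
    return report
```

A check like this only catches structural drift. A field that keeps its name but changes its meaning after a site redesign will pass unnoticed, which is the harder, semantic half of the problem.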

Data warehousing addressed this problem of dealing with "used" data long ago, with processes that executed before data was stored. This is where data warehousing and Hadoop differ. Tools such as ETL, methodologies and best practices ensured that any analyst working with that data accessed a "single source of truth" that was already cleaned and aggregated to produce a pre-defined business metric, often labeled a key performance indicator.

Though in fairness, applications built on data warehouses are often quite creative. These solutions are also partially useful for data scientists, but because the data was pre-processed to fit a specific model or schema, much of its richness is lost for exploratory hypothesis testing. Changes to the warehouse are also typically too slow to implement when experiments and discoveries happen in an unplanned fashion ("Our competitor just released a press release about a new pricing model, how should we respond?").


Cloud data warehouses - mixed results

The innovation of widespread cloud computing and "cloud native" data warehouses introduced a better storage mechanism and best practices for hypothesis testing on rich, raw data - at low cost. These cloud data warehouses, as well as cloud native storage formats such as JSON, evolved out of a system designed to capture "digital exhaust."

Initially the byproduct of online activities, the contents of "clouds" today also include a wealth of machine-generated data: events gathered from sensors in real time. While data warehouses typically stored aggregated information based on application transactions, digital exhaust can be found in a multitude of forms like XML and JSON, typically referred to as "unstructured," though more accurately described as "not highly structured."

Working with data stored in its rawest form faces a large hurdle: any manipulation of that data for hypothesis testing results in a new, unique data structure. While the data may physically be in one place, a collection of logical data silos is created, because each new analysis produces a data structure silo of its own.

The problem of a "single source of truth"

This provides tremendous agility to analysts, who can now ask any question they like of the data. But it has the downside of making it harder to find a "single source of truth." In the big data world, most of this work is done by hand: coding, writing scripts or manually editing data in a spreadsheet, a tedious and time-consuming effort made worse by slow network connections and under-powered platforms. The only upside to data preparation, provided it doesn't consume too much valuable time, is that the process itself often yields new insights about the meaning of the data, what assumptions are safe to make about the data, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply.

The struggle for data context

But that is the work of analysts and data scientists. For an enterprise platform supporting many needs and uses, there are data preparation tools. The Hadoop argument was "Why do it in expensive cycles on your RDBMS data warehouse when you can do it in the cloud?" The answer is that writing code is not quite the same as using an intelligent tool that provides data preparation assistance with versioning, collaboration, reuse, metadata and lots of existing transforms built in. The argument is also a little contradictory. If Hadoop is for completely flexible and novel analysis, who is going to write transformation code for every project? This approach involves mechanical, time-consuming data preparation and filtering that is often one-off, consumes a large percentage of the data scientists' time and provides no reference to the content or meaning of the data.

We hear a lot today about "democratizing" analytics or "pervasive" analytics, but as the producers of "Gorillas in the Mist" learned almost thirty years ago, there is great utility in letting things roll and learning as you go, especially if you have tools to capture those interactions, and to provide all of the features that a big data catalog platform needs to be effective - with both operational data and digital exhaust alike.

The relevant questions today are not "Is this a 'single version of the truth'?" or "Who has access to what parts of the schema?" Instead, it's the (excuse the big word) phenomenology of how analysts actually work that matters:

How do people find and use information in their work?
How do they collaborate with others and how do they share insights?
How do we make use of today's ample resources in hardware and algorithms to create information agents/advisers tailored to people's (changing) needs?
How do we stitch together data discovery (of what's there, not the market segment), modeling and presentation without boiling the ocean for each new analysis?
How do we make the computer understand me and anticipate, in a non-trivial way, what I do and what I need? "Help me."
And: a hundred other things.

Moving the task of data integration and data extraction to more advanced knowledge integration and knowledge extraction - which includes not only input from machine learning but also human collaboration - is unfinished at best. One example: Alation's apps for collaboration help capture knowledge from subject matter experts that is expressed through queries and documents, in addition to encouraging documentation of data and metadata by making it easy for analysts to tag, annotate and share within their existing SQL-based workflows for analytics:

"It takes the guesswork out of what the data means, what it's related to and how it can be dynamically linked together without endless data modeling and remodeling."


My take

The behavior of human analysts, captured in logs of queries and in the reports and analyses of BI tools, provides crucial guidance to the work of machine learning algorithms. Machine learning algorithms, on the other hand, are invaluable for discovering patterns and relationships that business analysts may never perceive. Data unification approaches that tie various schemas together by column names and, more usefully, by the content of the columns themselves are useful up to a point, but these approaches lack the weighting of the data implied by its previous and ongoing use.
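As a rough illustration of that last point, the sketch below scores a candidate match between two columns by name similarity and by content overlap, then boosts the score according to how often analysts have actually joined the pair in logged queries. Everything here is an assumption made for illustration: the 0.3/0.7 weights, the shape of the boost, and the idea that join counts have already been extracted from a query log into a dictionary. It does not describe how any particular catalog product works.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two column names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def content_overlap(values_a, values_b):
    """Jaccard overlap of the distinct values found in two columns."""
    a, b = set(values_a), set(values_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(col_a, col_b, join_counts, name_weight=0.3, content_weight=0.7):
    """Combine name and content evidence, then weight it by observed usage.

    join_counts maps frozenset({name_a, name_b}) to the number of logged
    queries that joined the two columns (a hypothetical, precomputed input).
    """
    base = (name_weight * name_similarity(col_a["name"], col_b["name"])
            + content_weight * content_overlap(col_a["values"], col_b["values"]))
    joins = join_counts.get(frozenset((col_a["name"], col_b["name"])), 0)
    # The boost grows with usage but saturates, so it never exceeds a factor of 2.
    return base * (1 + joins / (joins + 1))
```

Two columns that merely look alike and two columns that analysts join every day start from the same base score; only the usage term separates them. That is the weighting that purely structural matching leaves out.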

A great deal of our difficulty in understanding data - and this was already the case when we had orders of magnitude less of it - is that the semantics of data, how it is captured and how it is modeled, expose the gap between the real phenomena we think we're seeing and what an application actually captures and encodes. Data mining, predictive models and machine learning are just that: models of imperfectly used data. The process demands input from both machine models and people, assisted by software that brings both together.