Debunking Data Myths And Misconceptions With Dun & Bradstreet’s Chief Data Scientist, Anthony Scriffignano

Source: forbes.com

With vast amounts of data available and growing at an astounding rate, the opportunities for modern businesses to glean valuable insights, and the possibilities that result, are constantly evolving. As a result, the conversation surrounding data and its uses must evolve as well. Just a few years ago, the primary topics of interest for those in the data industry were big data, data localization, unstructured data, and predictive analytics. More recently, the focus has shifted toward topics such as data manipulation, privacy of personal data, and aspects of data bias. As data continues to increase in importance, the list of questions and concerns associated with its use grows as well.

At the Data for AI Community’s April event, Dun & Bradstreet’s Chief Data Scientist and SVP Anthony Scriffignano spoke about common myths and misconceptions surrounding data and how organizations must push for more responsible use. Scriffignano is an internationally recognized data scientist with over 35 years of experience spanning multiple industries, and as the Chief Data Scientist at Dun & Bradstreet, he is deeply involved in the worlds of finance, government, and enterprise. At the recent Data for AI event, Mr. Scriffignano shared some interesting insights on the use and application of data at firms like Dun & Bradstreet.

 

Decisions around using data

Data is a unique resource in many ways. Unlike natural resources, which have limited availability and lifespan, data has no such limits: it keeps increasing in amount and availability. Data continues to grow and even compounds as it is being used. In fact, using or generating data creates metadata, which is data about data. As such, there are unique considerations surrounding data’s use, including growth, availability, and the increasingly challenging needs for manipulation.

Scriffignano explains that in any scenario, there are two types of data: the data in hand, readily available for use, and the discoverable data that could be acquired to further inform a decision. The first step when using data should be determining whether there is enough data to make the decision at hand at all. The answer depends on what type of question the data is intended to answer, and whether the available data is representative of reality.

However, there is a third type of data as well: existing but incomplete data. Data often has many unknowns, but despite those unknowns, organizations are called on to make decisions. For example, the advances in science and knowledge required to make the first moon landing possible were incredible, but there was one factor the scientists could not estimate: the “squishiness” of the lunar regolith, or lunar dust. Because the lunar landing module would be touching down on a surface with unknown properties, it was given giant hemispherical feet to keep it from tipping over. By considering the incomplete data and its potential effects on the outcome of a situation, one can account for a wider range of possibilities and make a sound decision even when the available data does not provide a complete picture.

 

Common myths and inconvenient truths

A few short decades ago, devices like fitness monitors, GPS, or AI-powered recommendation engines would have sounded like science fiction to many. However, through the power of data and advanced analysis methods, they are all possible today. Indeed, many of us use these devices daily without giving much thought to how they work or how much data is being collected and used.

We take for granted many of the modern luxuries we have thanks to data. However, when using data, Scriffignano explains, it is the user’s responsibility to stop and consider the context of the information and its true meaning. By thinking about how the data is changing and how this alters the conclusions drawn from it, one can avoid falling victim to the common myths and misconceptions surrounding data and its use.

One of the most common myths is that more data gives a better picture. As the human race generates and accumulates astounding amounts of data, identifying the data of interest becomes a search for a needle in an ever-growing haystack. Important data becomes more difficult to find, and the potential to magnify errors, bias, or noise grows. More often than not, blindly collecting more and more data will turn a “data lake” into a “data swamp.”
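
One way to see how blind accumulation can magnify noise is a small simulation. The Python sketch below is illustrative only and is not taken from Scriffignano's talk: the target, the noise columns, the sample sizes, and every function name are hypothetical. It generates a random target, then checks how strongly the best of 10 purely random columns correlates with it, compared with the best of 1,000.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(target, columns):
    """Largest absolute correlation between the target and any column."""
    return max(abs(pearson(column, target)) for column in columns)


random.seed(0)  # fixed seed so the run is repeatable
n_rows = 20     # assumption: a small sample, where chance patterns are common

# Every value below is random noise; no column truly relates to the target.
target = [random.gauss(0, 1) for _ in range(n_rows)]
noise_columns = [
    [random.gauss(0, 1) for _ in range(n_rows)] for _ in range(1000)
]

best_few = strongest_spurious_correlation(target, noise_columns[:10])
best_many = strongest_spurious_correlation(target, noise_columns)

print(f"strongest correlation among 10 noise columns:   {best_few:.2f}")
print(f"strongest correlation among 1000 noise columns: {best_many:.2f}")
```

Because the 10 columns are a subset of the 1,000, the second number can never be smaller than the first. With this many unrelated columns, at least one will usually look convincingly related to the target by chance alone, which is the "swamp" effect in miniature.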

Another myth is that by using data, AI and machine learning will discover answers or hidden truths. In reality, AI and machine learning algorithms cannot evaluate the veracity of the data they are given. For example, consider a machine learning algorithm trained on images of seagulls landing in a parking lot. If it were given five examples of the birds landing in consecutive parking spaces, the algorithm could conclude that the next seagull to arrive would land in the next available parking space. Of course, common sense dismisses this idea as ridiculous. The seagulls are not intentionally landing in a specific sequence of parking spaces; they are simply landing in a random fashion that happens to look like a pattern. Scriffignano provides examples of algorithms that could lead an AI to make silly assumptions like this, noting that automating the process gives these types of mistakes the potential to scale in severity.
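
The seagull scenario can be sketched in a few lines of Python. This is a hypothetical illustration, not one of the algorithms Scriffignano presented: the observations, the choice of a straight-line model, and the function name `fit_line` are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: the order in which each seagull arrived,
# and the number of the parking space it happened to land in.
arrival_order = [1, 2, 3, 4, 5]
parking_space = [1, 2, 3, 4, 5]

slope, intercept = fit_line(arrival_order, parking_space)

# The model extrapolates the apparent pattern to the sixth arrival.
predicted_space = slope * 6 + intercept

# Sum of squared errors on the training data: a perfect fit here.
training_error = sum(
    (slope * x + intercept - y) ** 2
    for x, y in zip(arrival_order, parking_space)
)

print(f"predicted space for seagull 6: {predicted_space}")
print(f"training error: {training_error}")
```

The fit is perfect and the model predicts space 6, yet nothing in the calculation asks whether arrival order has anything to do with where a bird lands. The algorithm reports a pattern; judging whether the pattern means anything is left to the person using it.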

These examples of common misconceptions highlight the fact that as data becomes more ubiquitous and important, an increasing emphasis must be placed on using data thoughtfully and intelligently. While data can be a powerful tool and provide valuable insights, improper use of data, whether done with malicious intent or accidentally out of ignorance, can have harmful consequences.

The evolving future of data

Discussions around data, such as those happening on a monthly basis in the online Data for AI community, show the complexity of these issues. In prior technological waves, companies mainly needed people proficient in database administration, loading data, and using programming languages like Python and R to manipulate and move data internally. While these are still essential skills, organizations now also need professionals who understand concepts like permissible use, intellectual property, AI ethics, and more.

As new purposes for data continue to develop, the dialogue is shifting toward an increased focus on using data responsibly. Inequality, AI bias, adversarial data manipulation, data rights, and additional threats must be kept at the forefront of the conversation. “Don’t ignore these things; don’t make them inconvenient truths,” warns Scriffignano. Ask the tough questions, including: By what right am I using this data? Where did I get this data? How do I know it’s true? All of these are important questions to consider before using data to inform a decision.

Even with vast amounts of data available and limitless potential contained within, it is always important to remember to stop and think. Data has value, but to get context one must look beyond the data itself. It’s not about how much data you have; it’s not about how much data you’re making; it’s about the sense you’re making out of it.

This article was written in collaboration with David Pu, Johns Hopkins University.

© 2021 LeackStat.com