Despite the obvious influence of the most salient macro-level trends affecting data science (including Artificial Intelligence, cloud computing, and the Internet of Things), the ends of this discipline remain largely unchanged from when it initially emerged nearly 10 years ago.
The goal has always been to equip the enterprise with tailored solutions spanning technological approaches that not only justify, but also maximize the use of data for fulfilling the most meaningful business objectives at hand.
Oftentimes, those involve the upper end of the analytics continuum in the form of predictive and prescriptive measures. Currently, cognitive computing deployments factor substantially into data scientists’ abilities to complete this task.
Ergo, the most profound developments affecting this space in 2022 reduce the traditional impediments to devising the underlying models that support applications of Natural Language Processing, cognitive search, image recognition, and other advanced analytics manifestations.
Relatively new, established, and resurgent data science approaches make it much easier to work with unstructured data, reduce the sheer quantity of training data required to build models, and decrease the manual effort of labeling that data.
Most exciting of all, many of these techniques operate at the nexus between supervised and unsupervised learning, the two conventional methods underpinning most machine learning solutions. The impending collapse of this divide is opening up a new world of opportunities that makes data science more accessible and facile than it’s ever been.
Plus, by relying less on strictly supervised learning approaches, this data science trend is furthering AI’s march towards replicating human intelligence, since human intelligence is primarily “a combination of this supervised and unsupervised learning,” reflected Wayne Thompson, SAS Chief Data Scientist. “Most of us humans learn through an unsupervised type way.”
Unaided, supervised learning requires tremendous quantities of data and time-consuming annotations of business outcomes or the factors influencing them. Unsupervised learning also involves inordinate amounts of training data, yet identifies patterns or features in that data without annotations. Between these two approaches lies a range of techniques that draw on one, the other, or both, or on related methods, to reduce the amount of training data or the number of labels involved. These methods include transfer learning, generative adversarial networks (GANs), and reinforcement learning.
The amount of training data necessary to build credible machine learning models for business applications is inordinately large, and it is the main inhibitor for applications of supervised or unsupervised learning. Certain domains simply don’t have enough of such data, which can derail data science efforts in them. Approaches involving transfer learning, GANs, and reinforcement learning ameliorate this issue by either decreasing the amount of training data required or generating enough data to train models. These methods also help with the labeled data issue discussed below. “With supervised learning the barrier has historically been the supervision,” Wilde observed. “You need tens of thousands of examples before the machine learns what you’re trying to teach it. Transfer learning cuts that down to a few hundred.”
The generative prowess of GANs is ideal for creating data with which reinforcement learners can interact in a simulated environment. GANs are also responsible for the deep fake phenomenon and the creation of lifelike images, an example of generative AI going awry. Within the confines of the enterprise, however, “you’re seeing a combining of GANs with reinforcement learners to put this synthetic tabular data generation into the reinforcement learning process,” Thompson commented. If that process happens to be around a business objective like converting sales prospects into customers, “you can use GANs to simulate new data to train the reinforcement learner,” Thompson added.
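As a rough sketch of the idea Thompson describes, the Python below trains a simple reinforcement learner entirely on simulated data. Several things here are assumptions rather than anything from the source: a hand-written random simulator stands in for a trained GAN, the learner is an epsilon-greedy bandit (one of the simplest reinforcement learning methods), and the three outreach actions and their conversion rates are invented for illustration.

```python
import random

# Hypothetical outreach actions and conversion rates, invented for illustration.
# In the setup described above, a trained GAN would generate the synthetic
# prospect data; this fixed-probability simulator is only a stand-in.
SIMULATED_CONVERSION_RATE = {"email": 0.10, "phone_call": 0.20, "product_demo": 0.60}


def simulate_conversion(action, rng):
    """Return 1 if the simulated prospect converts after this action, else 0."""
    return 1 if rng.random() < SIMULATED_CONVERSION_RATE[action] else 0


def train_bandit(actions, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: explore with probability epsilon, otherwise exploit."""
    rng = random.Random(seed)
    pulls = {a: 0 for a in actions}
    value = {a: 0.0 for a in actions}  # running mean reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)  # explore a random action
        else:
            action = max(actions, key=lambda a: value[a])  # exploit best estimate
        reward = simulate_conversion(action, rng)
        pulls[action] += 1
        # Incremental update of the mean reward observed for this action.
        value[action] += (reward - value[action]) / pulls[action]
    return value


actions = list(SIMULATED_CONVERSION_RATE)
estimates = train_bandit(actions)
best_action = max(estimates, key=estimates.get)
print(best_action, estimates)
```

The learner never touches real prospect data; it only sees what the simulator produces, which is why the quality of the generated data would matter so much in a real deployment.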
The other caveat for data science projects involving supervised learning (which accounts for the majority of such endeavors) is the extreme amount of work and money required to label data, assuming enough data can be found in the first place. Aside from the strategies for transfer learning, GANs, and reinforcement learning identified above, other approaches can also expedite the labeling of data.
Regardless of how it’s invoked, almost any machine learning technique for unstructured content like images or text is able to provide structure to it so “you can leverage your RPA investments and analytics by bringing unstructured data, which is typically opaque and difficult to wrangle, into those existing investments,” Wilde indicated. The digital software agents powering Robotic Process Automation (RPA) are instrumental in this regard when, equipped with machine learning, they transfer unstructured text data into structured data systems. Wilde identified a use case in which a well-known insurer could “ingest annuity documents, classify and extract data, then analyze it to see if annuities are in good order.”
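To make the unstructured-to-structured step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the insurer's or any vendor's pipeline: regular expressions stand in for the machine learning model that would classify and extract fields, and the field names, document text, and "good order" rule are all hypothetical.

```python
import re

# Hypothetical fields and patterns; a production system would rely on a trained
# extraction model rather than hand-written regular expressions.
FIELD_PATTERNS = {
    "policy_number": re.compile(r"policy number\s+([A-Z0-9-]+)", re.IGNORECASE),
    "owner": re.compile(r"owner is ([A-Z][a-z]+ [A-Z][a-z]+)"),
    "signature_date": re.compile(r"signed on (\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}


def extract_record(text):
    """Turn free text into one structured row; fields not found are set to None."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


def in_good_order(record):
    """Assumed rule: a document is in good order when every field was found."""
    return all(value is not None for value in record.values())


# Invented sample documents for illustration only.
documents = [
    "Policy number AN-10442. The owner is Jane Doe. The contract was signed on 2021-03-15.",
    "Policy number AN-20981. The owner is John Roe. Signature page missing.",
]

table = [extract_record(doc) for doc in documents]
status = [in_good_order(row) for row in table]
for row, ok in zip(table, status):
    print(row, "in good order" if ok else "needs review")
```

Each dictionary in `table` is a row that could be loaded into an existing structured system, which is the hand-off to RPA and analytics that the passage above describes.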
There are other techniques for rendering what’s widely considered unstructured text into a conventionally structured tabular format. “Text is structured; people don’t think it is but the way we do that is through counting,” Thompson propounded. With this methodology, each row in a table is a document while each column represents a term that appears in the documents. “You just count how often each term appears across each row of documents and you total those up,” Thompson disclosed. “Suddenly, you’ve basically taken textual data and converted that into a numerical representation that you can then model.”
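The counting approach Thompson describes can be sketched in a few lines of Python. This is a minimal illustration, not his implementation: the sample documents, the simple lowercase tokenizer, and the function name `build_term_matrix` are all assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and pull out word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def build_term_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sample documents, invented for illustration.
docs = [
    "The claim was approved",
    "The claim was denied and the appeal was filed",
]

vocabulary, matrix = build_term_matrix(docs)
# Totals per term across all documents, i.e. the column sums.
column_totals = [sum(column) for column in zip(*matrix)]

print(vocabulary)
for row in matrix:
    print(row)
print(column_totals)
```

Each row of `matrix` corresponds to one document and each column to one vocabulary term, so the text is now a numerical table that a model can consume. Libraries such as scikit-learn offer the same idea through `CountVectorizer`.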
Data science has always been characterized by innovation and opportunity. By hybridizing supervised and unsupervised learning, the developments above overcome the training data and annotation issues that plague supervised learning, making it considerably easier for organizations to leverage advanced analytics models. Consequently, data science barriers (including unstructured data) are systematically falling while machine intelligence is gaining on human intelligence, to the benefit of the enterprise.
© 2022 LeackStat.com