How AI Is Improving Data Management

Artificial intelligence is quietly improving the management of data, including its quality, accessibility, and security.

Thomas H. Davenport and Thomas C. Redman

December 20, 2022

MIT Sloan Management Review

Data management is crucial for creating an environment where data can be useful across the entire organization. Effective data management minimizes the problems that stem from bad data, such as added friction, poor predictions, and even simple inaccessibility, ideally before they occur.

Managing data, though, is a labor-intensive activity: It involves cleaning, extracting, integrating, cataloging, labeling, and organizing data, and defining and performing the many data-related tasks that often lead to frustration among both data scientists and employees without “data” in their titles.

Artificial intelligence has been applied successfully in thousands of ways, but one of the less visible and less dramatic ones is in improving data management. There are five common data management areas where we see AI playing important roles:

Classification: Broadly encompasses obtaining, extracting, and structuring data from documents, photos, handwriting, and other media.
Cataloging: Helping to locate data.
Quality: Reducing errors in the data.
Security: Keeping data safe from bad actors and making sure it’s used in accordance with relevant laws, policies, and customs.
Data integration: Helping to build “master lists” of data, including by merging lists.

Below, we discuss each of these areas in turn. We also describe the vendor landscape and the ways that humans are essential to data management.

AI to the (Partial) Rescue

Technology alone cannot replace good data management processes such as attacking data quality proactively, making sure everyone understands their roles and responsibilities, building organizational structures such as data supply chains, and establishing common definitions of key terms. But AI is a valuable resource that can dramatically improve both productivity and the value companies obtain from their data. Here are the five areas where AI can have the most impact on effective data management in an organization.

Area 1: Classification

Data classification and extraction is a broad area, and it has grown larger still as more media has been digitized and as social media has increasingly centered around images and video. In today’s online settings, moderating content to identify inappropriate postings would not be possible at scale without AI (although many humans are still employed in the field as well). We include in this area classification (Is this hate speech?), identity/entity resolution (Is this a human or a bot, and, if human, which one?), matching (Is the Jane Doe in database A the same human as J.E. Doe in database B?), data extraction (What is the most important data in this judicial filing?), and so forth.

For many years, primitive forms of AI have been used for optical character recognition (OCR) to extract important data from items such as bank checks or addressed envelopes. OCR has become so common that we no longer think of such capabilities as AI. Newer AI systems have expanded on OCR with deep learning models that are now becoming capable of accurately reading human handwriting.

Important data is often stuck in inflexible document formats like faxes, PDFs, and long word-processing documents, and in order to access it, analyze it, or even answer questions about it, it must first be extracted. In health care, for example, information is still communicated in faxes, and accessing it has required substantial human effort. One electronic health records company wrote an AI program to extract data from faxes and input it directly into the EHR system, which saves significant time. AI programs can also identify and extract important provisions from contracts, which is useful to lawyers and auditors, among others.

Area 2: Cataloging

For decades, companies have lacked accurate guidance on where key data resides throughout their systems and records. Fortunately, data cataloging has emerged over the past several years as an important aid to keeping track of that material. Creating such catalogs and keeping them current, however, has been labor intensive.

AI can automate searches through various repositories of data and create catalogs automatically. AI systems can capture any metadata that exists within system documentation. AI can also describe the lineage of data — where it originated, who created it, how it has been modified, and where it currently resides.

But while creating catalogs and data lineage information is easier with AI, companies must still wrestle with the messiness of their existing data environments. Many companies have resisted creating catalogs using traditional labor-intensive methods because they haven’t wanted to reveal the extent of the architectural mess, or because they’ve wanted to wait until data was better organized and of higher quality before devoting the extensive effort involved. The ease of creating and updating catalogs with AI, however, means that companies can combine easier information access with continuous data improvement processes.
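As a rough illustration of the metadata harvesting that automated cataloging builds on, the sketch below walks a single SQLite database and records one catalog entry per column. It is a minimal sketch under stated assumptions, not a description of any vendor's product: the function name `build_catalog`, the entry fields, and the sample `customers` table are all hypothetical, and a real catalog would also span many repositories and capture lineage.

```python
import sqlite3


def build_catalog(conn):
    """Return one catalog entry (a dict) per column of every user table."""
    entries = []
    # sqlite_master lists the objects in the database; skip SQLite's own tables.
    table_rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for (table,) in table_rows:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, column, declared_type, notnull, _default, pk in columns:
            entries.append({
                "table": table,
                "column": column,
                "type": declared_type,
                "nullable": not notnull,
                "primary_key": bool(pk),
            })
    return entries


if __name__ == "__main__":
    # Hypothetical sample schema, used only to show the output shape.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
    )
    for entry in build_catalog(conn):
        print(entry)
```

Because the scan reads only schema metadata, it can be rerun cheaply whenever the schema changes, which is the property that makes keeping a catalog current less labor intensive.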

Area 3: Quality

Data quality tools essentially implement controls, typically using business rules, that define the domains of allowed data values. Consider a date consisting of a day and a month. There are only 366 combinations of allowed values. Thus, “Jebruary” is not an allowed month, “35” is not an allowed day, and “February 31” is not an allowed combination. Defining, coding, and maintaining business rules is especially onerous, and it is an area where we see great benefit in machine learning-based AI.
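The day-and-month rule above can be written down as a small hand-coded control. The sketch below is only an illustration of such a business rule; the names `DAYS_IN_MONTH`, `is_allowed_date`, and `count_allowed_combinations` are hypothetical, and February is given 29 days so that the total matches the 366 combinations mentioned in the text.

```python
# Allowed days per month. February is given 29 days so that
# February 29 counts as an allowed value.
DAYS_IN_MONTH = {
    "January": 31, "February": 29, "March": 31, "April": 30,
    "May": 31, "June": 30, "July": 31, "August": 31,
    "September": 30, "October": 31, "November": 30, "December": 31,
}


def is_allowed_date(month: str, day: int) -> bool:
    """Return True if (month, day) is an allowed combination."""
    if month not in DAYS_IN_MONTH:
        return False  # e.g. "Jebruary" is not an allowed month
    return 1 <= day <= DAYS_IN_MONTH[month]


def count_allowed_combinations() -> int:
    """Count every allowed (month, day) pair."""
    return sum(DAYS_IN_MONTH.values())


if __name__ == "__main__":
    print(is_allowed_date("Jebruary", 1))    # not an allowed month
    print(is_allowed_date("March", 35))      # not an allowed day
    print(is_allowed_date("February", 31))   # not an allowed combination
    print(count_allowed_combinations())
```

Even this tiny rule needs a lookup table and an explicit decision about leap days, which hints at why maintaining thousands of such rules by hand becomes onerous.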

AI tools can scan data to identify values that are not allowed, with some errant values corrected automatically and others assigned to some person or group for correction. Several vendors already boast that their tools employ machine learning for these purposes.

AI can also perform other data quality-related functions, including augmenting data with additional information from other internal or external databases (after a matching process), predicting values to fill gaps in the data, and deleting data that has become duplicated or seldom used.

Importantly, vendors could improve their tools if they supported a more proactive approach to data quality management — one that focused on preventing data errors rather than finding and fixing them. To that end, controls should be applied as close to the points of data creation as possible. Additionally, tools should make data quality measurements closely aligned to business impact and support statistical process control and quality improvement.

Area 4: Security

Preserving data security and privacy is a critical issue for any organization today. Preventing hacks, breaches, and denials of service has been a largely human activity since the birth of the data protection profession.

AI can assist with many of these functions. It is useful, for example, in threat intelligence — observing the external world; synthesizing threat signals, actors, and language; and predicting who might be doing what to whom. AI-based threat intelligence is a response to numerous challenges faced by cybersecurity professionals, including a high volume of threat actors, massive amounts of seemingly meaningless information, and a shortage of skilled professionals.

Leading solutions employ machine learning to automate the collection of security data across multiple internal and external systems, create structured data from unstructured formats, and assess which threats are most credible. AI systems can predict likely attack paths based on previous attack patterns and determine whether new threats are coming from previously known actors or new ones. Given the number of false-positive cybersecurity threats across multiple, unconnected security systems, a combination of decision rules and machine learning models can prioritize or triage threats for human investigation.

Unsupervised learning systems can identify anomalies in an organization’s IT environments, such as unusual patterns of access or rare IP addresses accessing the organization’s systems. These approaches have the advantage of not needing to be trained on past approaches to cybersecurity, which are always subject to change.
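To make the idea of flagging rare IP addresses concrete, here is a deliberately simple sketch that needs no labeled examples of past attacks: it flags any source address that accounts for a very small share of all accesses. This is a frequency heuristic standing in for the learned models the text describes, not a production detector. The function name `rare_sources`, the `max_share` cutoff of 1%, and the sample addresses are all assumptions made for illustration.

```python
from collections import Counter


def rare_sources(access_log, max_share=0.01):
    """Return source IPs whose share of all accesses is at most max_share.

    access_log is a list of source IP strings, one per access.
    No labels are used: rarity is judged only against the log itself.
    """
    total = len(access_log)
    if total == 0:
        return []
    counts = Counter(access_log)
    return sorted(ip for ip, n in counts.items() if n / total <= max_share)


if __name__ == "__main__":
    # Hypothetical log: two busy internal hosts and one address seen once.
    log = ["10.0.0.1"] * 150 + ["10.0.0.2"] * 49 + ["203.0.113.7"]
    print(rare_sources(log))
```

Addresses flagged this way are candidates for human review rather than confirmed threats, since a rare address can also be a new employee or a new office.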

AI can also be used to identify internal threats of fraud or noncompliance with regulations. This capability is of particular interest to highly regulated industries like banking and investing. The AI software monitors digital communications within an organization and identifies suspicious language or patterns of behavior. Of course, human investigation is necessary to confirm malfeasance by employees or customers.

Area 5: Data Integration

Perhaps one of AI’s greatest improvements to data management is in the area of data integration — also known as mastering — which involves creating a master or “golden” data record that is the best possible source of a data element within an organization. Companies can require data integration for a number of reasons: because they proliferated different versions of key data over time, because they want to repurpose transactional data for analytic purposes, or because they acquired or merged with companies that have their own databases. Combining and mastering data across a large organization has historically been an enormous task requiring years of effort.

In the past, the most common approach to data integration was master data management, which used a set of business rules to decide, for example, whether a particular set of customer or supplier records should be combined because they were essentially the same record. Creating and revising an extensive set of rules was so difficult and expensive, however, that many data integration projects were abandoned before completion.

Now, machine learning-based mastering systems from companies like Tamr use probabilistic matching techniques to decide whether records should be combined. Records that have a high probability of being the same entity — say, 90% or higher — are automatically merged. The relatively few records that can’t be resolved by this approach can be reviewed by human subject matter experts.
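The merge-or-review routing described above can be sketched as follows. This is not how Tamr or any other vendor computes match probabilities; as a stand-in, the sketch uses plain string similarity from Python's standard `difflib` module. The 0.90 merge threshold comes from the text, while the 0.60 review threshold, the function names, and the record fields are hypothetical.

```python
from difflib import SequenceMatcher

MERGE_THRESHOLD = 0.90   # the "90% or higher" cutoff from the text
REVIEW_THRESHOLD = 0.60  # hypothetical lower bound for human review


def match_score(a: dict, b: dict) -> float:
    """Average string similarity over the fields both records share.

    A stand-in for a learned match probability, not a real one.
    """
    fields = sorted(set(a) & set(b))
    if not fields:
        return 0.0
    ratios = [
        SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios)


def route(a: dict, b: dict) -> str:
    """Decide what to do with a candidate pair of records."""
    score = match_score(a, b)
    if score >= MERGE_THRESHOLD:
        return "merge"            # confident match: merge automatically
    if score >= REVIEW_THRESHOLD:
        return "review"           # uncertain: send to a subject matter expert
    return "keep separate"        # confident non-match


if __name__ == "__main__":
    # Hypothetical records echoing the Jane Doe example earlier in the article.
    record_a = {"name": "Jane Doe", "city": "Boston"}
    record_b = {"name": "J.E. Doe", "city": "Boston"}
    print(round(match_score(record_a, record_b), 2), route(record_a, record_b))
```

The design choice worth noting is the middle band: only pairs that fall between the two thresholds reach a human, which is what keeps expert review limited to the relatively few unresolved records.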

The Vendor Environment for AI and Data

Companies seeking to employ AI to broadly improve their data management situations have two primary choices among vendors of these tools: They can opt for a comprehensive, expensive, and at best, translucent solution, or cobble together a set of single-purpose AI systems.

Companies such as Palantir, which initially focused on the defense and intelligence market but has broadened to commercial applications as well, represent the former option. Other vendors that are approaching the breadth of Palantir’s data management offerings include Collibra, Informatica, IBM, and Talend. Others focus on particular data types, such as Splunk for machine data.

Most vendors offering single-purpose products are small and not well known. Some large cloud providers offer AI-for-data tools, but having multiple options from which to choose is often confusing to potential customers. The vendor environment for these tools is changing rapidly: One vendor told us, “There is a startup every day in this space, and most offer a tool that is ridiculously narrow.”

Large professional services firms may represent a third possibility for companies that want to use AI for data management. Several have formed partnerships with smaller businesses to integrate their options, and with larger ones to provide configuration and customization services. One large services firm is exploring new business models with clients based not on the usual time and materials arrangements but rather on the provision of clean, integrated data records and a specified cost per record. In such a complex environment, that level of simplicity is likely to appeal to many organizations.

What AI Can’t Do — and Where Humans Matter Most

While AI is making headway at improving data management, there are still many things that it can’t do. Overall, good data still requires good managers who care about data, view it as an important asset, and establish a management system that treats it as such.

Specific tasks for which AI isn’t much help yet include the following:

Creating a data strategy and deciding which data is most important to a business.
Creating a data-driven culture.
Calibrating sensors or equipment.
Developing data governance policies and structures.
Defining key business terms or putting a common language in place.
Establishing whether an organization is using the right data or the wrong data to solve a problem.
Recommending where an organization should store or process its data.
Punishing anyone for cybersecurity violations or data-related fraud.

All organizations, then, will continue to need humans to manage data: both regular employees who create data and use it, and data management professionals whose job it is to architect, protect, and curate it. It is inevitable that highly structured and frequently performed data management tasks will be automated with the help of AI, either now or in the near future. This is good news overall for data management and its users and practitioners, although some low-level data management professionals’ jobs may change dramatically or even disappear. At organizations that believe that good data is important to their present and future operations, it’s important to plan for what tasks they want to use AI for, what activities will still belong to humans, and how the two will work together.

ABOUT THE AUTHORS

Thomas H. Davenport (@tdav) is the President’s Distinguished Professor of Information Technology and Management at Babson College, a visiting professor at Oxford’s Saïd Business School, and a fellow of the MIT Initiative on the Digital Economy. He is coauthor of Working With AI: Real Stories of Human-Machine Collaboration (MIT Press, 2022). Thomas C. Redman (@thedatadoc1) is president of New Jersey-based consultancy Data Quality Solutions and coauthor of The Real Work of Data Science: Turning Data Into Information, Better Decisions, and Stronger Organizations (Wiley, 2019).
