Prompts for Cleaning Data with AI

Introduction

Rigorously cleaning and preprocessing data can be a thankless job. You've undoubtedly burned hours wrestling with CSVs or cranking out scripts in your Jupyter Notebooks (dev-test-v4.ipynb). Maybe you're among the fortunate few working at a company with a robust data pipeline where everything flows smoothly. More likely, you have something that looks solid on the surface, but once you dig a bit deeper, the makeshift nature of the thing reveals itself. Despite being 'automated', systems like these demand constant human intervention to keep them running.

I'm not here to convince you that Large Language Models (LLMs) will solve all of your problems (they won't, as no technology will), but LLMs are a genuine technological paradigm shift. They enable fluid and intuitive ways to communicate with your data. I've replaced thousands of lines of code with LLM calls, improving performance and cutting costs by slashing our reliance on external systems and third-party APIs.

This guide draws on some of my experience, along with techniques from two papers.

Done right, these techniques can save you countless hours.

Using LLMs for Data Cleaning & Processing

Traditional data cleaning and processing requires a combination of manual labor and scripts written in languages such as Python, R, SQL, and Julia.

LLMs can solve a broader range of data problems with less code by leveraging the model's contextual understanding and expansive knowledge base. This often eliminates the need for external tools. Specifically, they've proven effective in:

  • Error Detection (ED): Unearthing inconsistencies within specified attributes (see the sketch after this list).
  • Data Imputation (DI): Deducing missing attribute values from the surrounding context.
  • Schema Matching (SM): Determining whether two differently named attributes convey the same meaning.
  • Entity Matching (EM): Verifying whether two different records refer to the same entity.
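
Error detection doesn't get its own section below, so here's a minimal sketch of what it can look like, using the same legacy openai Python SDK (pre-1.0) as the rest of this guide. The prompt wording and helper name are my own illustration:

import openai

openai.api_key = ""  # add your API key

def LLM_detect_errors(record, model="gpt-4"):
    # Ask the model to flag attribute values that look erroneous or inconsistent.
    prompt = (
        "Identify any attribute values in this record that look erroneous "
        "or inconsistent, and briefly explain why: " + str(record)
    )
    chat_completion = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return chat_completion.choices[0].message.content

print(LLM_detect_errors({"name": "John Doe", "age": -5, "state": "new yrok"}))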

Common Data Quality Issues

Analyzing data often uncovers typical inconsistencies:

  • Missing Values: These might result from human error or system glitches.

  • Inconsistencies: Representations like "New York" vs "NY" might seem different but refer to the same value. In a traditional programming approach, addressing these discrepancies requires planning for every possible variation and mistake, from misspellings to format inconsistencies.

  • Duplication: Redundant entries can distort the insights derived from the data.

Rather than having to explicitly account for each variation or edge case, LLMs can be given a general instruction and their knowledge allows them to dynamically uncover and fix a host of problems. In the example below, a single prompt resolves all of the issues:


prompt = '''
For each of the data points, identify and resolve the data anomalies. Specifically, follow these guidelines:

1. **Missing Values**:
   - If a field is absent or lacks a value in the text, make reasonable inferences whenever possible.
   - Ensure that each missing field is addressed in the JSON.

2. **Inconsistencies**:
   - Ensure that all similar values are consistently formatted and spelled. For example, for the "state" field, "New Mexico", "NM", and "nm" should all be represented as "New Mexico".

3. **Duplication**:
   - Identify duplicate records and remove all but one.
   - Address duplication only after resolving missing values and inconsistencies.

4. **Final Output**:
   - Your response should be purely in JSON format. No additional text or commentary is desired.
   - Before responding, double-check to ensure the entire response is valid JSON.

**Data**:

'''

import json 
import openai

OPENAI_API_KEY = ""  # add your API key
openai.api_key = OPENAI_API_KEY

data = [
    {"name": "John Doe", "city": "new york", "state": "new york"}, # Duplication, Inconsistencies
    {"name": "Jane Smith", "city": "albuquerque"}, # Missing Values
    {"name": "Sam Brown", "city": "ny", "state": "NY"}, # Inconsistencies
    {"name": "John Doe", "city": "new york", "state": "new york"}, # Duplication, Inconsistencies
]

# Append the serialized records to the prompt and send one request
content = prompt + json.dumps(data)

chat_completion = openai.ChatCompletion.create(
    model="gpt-4", 
    messages=[{"role": "user", "content": content }]
)

print(chat_completion.choices[0].message.content)


And the results:

[
  {"name": "John Doe", "city": "New York", "state": "New York"},
  {"name": "Jane Smith", "city": "Albuquerque", "state": "New Mexico"},
  {"name": "Sam Brown", "city": "New York", "state": "New York"}
]
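
Even with the "purely JSON" instruction, models occasionally wrap the output in extra text or emit malformed JSON, so it's worth validating before you use the result. A minimal sketch, reusing the content variable from the block above (the retry count is arbitrary):

def parse_json_response(text, retries=2):
    # Try to parse the model's output; re-request on failure.
    for _ in range(retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            retry = openai.ChatCompletion.create(
                model="gpt-4",
                messages=[{"role": "user", "content": content}]
            )
            text = retry.choices[0].message.content
    raise ValueError("Model did not return valid JSON")

cleaned = parse_json_response(chat_completion.choices[0].message.content)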

Enriching Data with LLMs

LLMs can enrich data by inferring appropriate categories from context. For example, in the following list of companies, LLMs can accurately categorize their industry based on the descriptions.

companies = [
    {
        "id": 1,
        "name": "TechGuard",
        "description": "A leading cybersecurity firm offering state-of-the-art protection solutions for businesses globally.",
        "industry": ""
    },
    {
        "id": 2,
        "name": "FreshFarms",
        "description": "An organic farm-to-table supplier specializing in fresh, sustainable produce and dairy products.",
        "industry": ""
    },
    {
        "id": 3,
        "name": "MobiTech",
        "description": "A pioneer in mobile technology and telecommunications services, connecting millions worldwide.",
        "industry": ""
    }
]

def LLM_infer(company, model="gpt-4"):
    prompt = "Infer the industry based on the company description: "
    content = prompt + company['description']
    chat_completion = openai.ChatCompletion.create(
        model = model, 
        messages=[{"role": "user", "content": content }]
    )
    return chat_completion.choices[0].message.content

for company in companies:
    industry = LLM_infer(company)
    print(f"Suggested industry for {company['name']}: {industry}")

> Suggested industry for TechGuard: The industry is Cybersecurity.
> Suggested industry for FreshFarms: Agriculture and Food Industry
> Suggested industry for MobiTech: Telecommunications Industry

When using LLMs, you might get additional output that's not required ("Suggested industry for..."). To fix this, you can instruct the model to return the data in a specific format (as I did in the first example). It's also worth noting that certain models, such as those in the "instruct" class, are designed to be more direct and less "chatty". Incorporating function calls can streamline results further.

Here I rewrite LLM_infer to use the gpt-3.5-turbo-instruct model (note that I switched from the ChatCompletion endpoint to Completion and set the temperature manually):

def LLM_infer(company, model="gpt-3.5-turbo-instruct"):
    prompt = "Infer the industry based on the company description: "
    response = openai.Completion.create(
        model=model,
        prompt=prompt + company['description'],
        temperature=0.6,
    )
    return response.choices[0].text
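
For fully structured output, the function-calling mechanism mentioned above constrains the reply to a JSON schema. A sketch using the legacy SDK's functions parameter; the set_industry schema is my own illustration, not an OpenAI built-in:

import json

def LLM_infer_structured(company, model="gpt-4"):
    # Force the reply into JSON arguments via function calling.
    functions = [{
        "name": "set_industry",
        "description": "Record the inferred industry for a company.",
        "parameters": {
            "type": "object",
            "properties": {"industry": {"type": "string"}},
            "required": ["industry"],
        },
    }]
    chat_completion = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": "Infer the industry: " + company["description"]}],
        functions=functions,
        function_call={"name": "set_industry"},  # require this "function"
    )
    arguments = chat_completion.choices[0].message.function_call.arguments
    return json.loads(arguments)["industry"]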

Matching Data Labels

Schema Matching checks whether two attributes with different names actually mean the same thing. As databases evolve and schemas multiply, doing this by hand quickly becomes a hassle.

Consider two datasets from the sales systems of two recently merged companies. Both track product sales but use different attribute names.

Dataset A:

| ProductID | Name        | UnitsSold |
|-----------|-------------|-----------|
| 001       | Laptop      | 50        |
| 002       | Mouse       | 100       |
| 003       | Keyboard    | 75        |
| 004       | Monitor     | 40        |

Dataset B:

| ItemCode | ProductName | QuantitySold |
|----------|-------------|--------------|
| A1       | Notebook    | 48           |
| B2       | PC Mouse    | 105          |
| C3       | Key Board   | 78           |
| D4       | Screen      | 42           |

In this example:

  • "ProductID" in Dataset A could potentially match with "ItemCode" in Dataset B.
  • "Name" in Dataset A might match with "ProductName" in Dataset B.
  • "UnitsSold" in Dataset A seems to correspond to "QuantitySold" in Dataset B.

The products themselves (like "Laptop" and "Notebook") also need to be matched, but since those are record values rather than attribute names, that falls under entity matching, covered in the next section.

import openai

OPENAI_API_KEY = ""
openai.api_key = OPENAI_API_KEY

def LLM_schema_match(attrA, attrB, model="gpt-4"):
    instructions = "Respond only with 'Yes' or 'No'; do not include any other text."
    prompt = f"Compare the attributes '{attrA}' and '{attrB}'. Do they refer to the same concept? " + instructions

    chat_completion = openai.ChatCompletion.create(
        model = model, 
        messages = [{"role": "user", "content": prompt }]
    )

    return chat_completion.choices[0].message.content

datasetA = [
    {"ProductID": "001", "Name": "Laptop", "UnitsSold": 50},
    {"ProductID": "002", "Name": "Mouse", "UnitsSold": 100},
    {"ProductID": "003", "Name": "Keyboard", "UnitsSold": 75},
    {"ProductID": "004", "Name": "Monitor", "UnitsSold": 40}
]

datasetB = [
    {"ItemCode": "A1", "ProductName": "Notebook", "QuantitySold": 48},
    {"ItemCode": "B2", "ProductName": "PC Mouse", "QuantitySold": 105},
    {"ItemCode": "C3", "ProductName": "Key Board", "QuantitySold": 78},
    {"ItemCode": "D4", "ProductName": "Screen", "QuantitySold": 42}
]

# Extract attribute names from datasetA and datasetB
attributesA = list(datasetA[0].keys())
attributesB = list(datasetB[0].keys())

# Compare each attribute from datasetA with each attribute from datasetB
for attrA in attributesA:
    for attrB in attributesB:
        match_result = LLM_schema_match(attrA, attrB)
        print(f"Comparison of '{attrA}' and '{attrB}': {match_result}")

Remarkably, GPT-4 got all nine comparisons correct:

Comparison of 'ProductID' and 'ItemCode': Yes
Comparison of 'ProductID' and 'ProductName': No
Comparison of 'ProductID' and 'QuantitySold': No
Comparison of 'Name' and 'ItemCode': No
Comparison of 'Name' and 'ProductName': Yes
Comparison of 'Name' and 'QuantitySold': No
Comparison of 'UnitsSold' and 'ItemCode': No
Comparison of 'UnitsSold' and 'ProductName': No
Comparison of 'UnitsSold' and 'QuantitySold': Yes
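
One caveat: the nested loop issues one API call per attribute pair (nine here, n × m in general). For larger schemas, you can batch every attribute into a single prompt and ask for a mapping instead. A rough sketch of that idea, not part of the original example; it reuses attributesA and attributesB from above and assumes json is imported:

def LLM_schema_match_batch(attrsA, attrsB, model="gpt-4"):
    # One call: map each attribute in A to its best match in B (or null).
    prompt = (
        f"Given attributes {attrsA} from dataset A and {attrsB} from dataset B, "
        "return a JSON object mapping each attribute in A to its matching "
        "attribute in B, or null if there is no match. Respond with JSON only."
    )
    chat_completion = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(chat_completion.choices[0].message.content)

print(LLM_schema_match_batch(attributesA, attributesB))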

Identifying Matching Records

Entity Matching involves determining whether two different data records actually represent the same real-world entity. Given the vastness of data and the slight variations in how records are represented, manually sifting through them to find matches can be time-consuming.

Suppose you had two different systems and needed to consolidate them.

Database A:

| ClientNumber | FullName       | Contact       |
|--------------|----------------|---------------|
| 101          | John D. Smith  | 555-1234      |
| 102          | Jane Doe       | 5555678       |
| 103          | Alice Johnson  | 555-9012      |
| 104          | Robert Brown   | 555-3456      |

Database B:

| CustomerID   | Name           | Phone         |
|--------------|----------------|---------------|
| A201         | John Smith     | 555-1234      |
| B202         | Jane A. Doe    | 555-5678      |
| C203         | Alice J.       | 555-9012      |
| D204         | Bob Brown      | 555-3456      |

In this example:

  • "John D. Smith" in Database A seems to correspond to "John Smith" in Database B.
  • "Jane Doe" in Database A might be "Jane A. Doe" in Database B.
  • The phone numbers provide additional clues to confirm these matches.

import openai

OPENAI_API_KEY = ""
openai.api_key = OPENAI_API_KEY

def LLM_entity_match(recordA, recordB, model="gpt-4"):
    instructions = "Respond only with 'Yes' or 'No'; do not include any other text."
    prompt = f"Compare the records '{recordA}' and '{recordB}'. Do they refer to the same real-world entity? " + instructions

    chat_completion = openai.ChatCompletion.create(
        model = model, 
        messages = [{"role": "user", "content": prompt }]
    )

    return chat_completion.choices[0].message.content

databaseA = [
    {"ClientNumber": "101", "FullName": "John D. Smith", "Contact": "555-1234"},
    {"ClientNumber": "102", "FullName": "Jane Doe", "Contact": "5555678"},
    {"ClientNumber": "103", "FullName": "Alice Johnson", "Contact": "555-9012"},
    {"ClientNumber": "104", "FullName": "Robert Brown", "Contact": "555-3456"}
]

databaseB = [
    {"CustomerID": "A201", "Name": "John Smith", "Phone": "555-1234"},
    {"CustomerID": "B202", "Name": "Jane A. Doe", "Phone": "555-5678"},
    {"CustomerID": "C203", "Name": "Alice J.", "Phone": "555-9012"},
    {"CustomerID": "D204", "Name": "Bob Brown", "Phone": "555-3456"}
]

# Compare each record from databaseA with each record from databaseB
for recordA in databaseA:
    for recordB in databaseB:
        match_result = LLM_entity_match(recordA, recordB)
        if match_result.strip() == 'Yes':  # guard against stray whitespace
            print(f"Comparison of '{recordA['FullName']}' and '{recordB['Name']}': {match_result}")

GPT-4 does an impressive job, correctly identifying Jane Doe even when her contact details are formatted differently across the databases (5555678 vs. 555-5678). This echoes the data-consistency challenges discussed earlier.

Comparison of 'John D. Smith' and 'John Smith': Yes
Comparison of 'Jane Doe' and 'Jane A. Doe': Yes
Comparison of 'Alice Johnson' and 'Alice J.': Yes
Comparison of 'Robert Brown' and 'Bob Brown': Yes
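
That said, cheap deterministic preprocessing still has its place. Stripping punctuation from phone numbers before comparing records lets you pre-filter candidate pairs and reserve LLM calls for the ambiguous ones. A small sketch of that idea (my own addition, not part of the workflow above):

import re

def normalize_phone(phone):
    # Keep digits only, so "555-5678" and "5555678" compare equal.
    return re.sub(r"\D", "", phone)

# Only send records to the LLM when their normalized phones agree.
for recordA in databaseA:
    for recordB in databaseB:
        if normalize_phone(recordA["Contact"]) == normalize_phone(recordB["Phone"]):
            match_result = LLM_entity_match(recordA, recordB)
            print(f"{recordA['FullName']} ~ {recordB['Name']}: {match_result}")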

Conclusion

LLMs won't replace your entire pipeline, but for error detection, imputation, schema matching, and entity matching, a well-crafted prompt can replace a surprising amount of bespoke code. Stay tuned for Part II, which will apply these techniques to in-depth, real-world datasets.
