Data cleaning is a mammoth topic, as so many methods, strategies, tools, and technologies are available to clean and process data.
Data cleaning ranges in complexity from making tiny ad-hoc alterations to full-blown feature engineering and mass imputation. It’s possible to turn small datasets into much larger, high-quality datasets with high dimensionality. In fact, many machine learning projects rely somewhat on feature engineering to turn sporadic, sparse, or poor-quality data into something usable.
As mentioned, data cleaning takes many forms, and the process varies depending on what you’re trying to do. Either way, dirty data is the downfall of countless projects, with HBR estimating that bad data costs US businesses an astonishing $3 trillion a year! In the domain of machine learning, bad data leads to failed models and wasted time and investment.
The concept of data cleaning is related to numerous other concepts within data science and engineering. For example, in marketing, data cleaning and feature engineering are used to enhance the performance and accuracy of marketing mix models (MMM) and numerous other marketing analytics.
While feature engineering somewhat falls within the remit of data cleaning, data cleaning is more broadly associated with processing the data you have.
What is good data?
Isn’t that the million-dollar question? While there are many ways to objectively assess a dataset’s quality, the definition of good data varies with the application. For example, in machine learning, a good dataset may be whatever a team can gather in a short space of time to train a well-functioning model.
Data cleaning and creating “good data” should occur right at the start of the data value chain, during the collection phase.
In marketing mix modeling, good data plays a similar role; it should be good enough to construct an accurate model that can be used to predict and analyze different marketing factors over a given period, while accounting for seasonality, diminishing returns, etc. Herein lies an example; if your marketing data doesn’t contain any evidence of seasonality or diminishing returns, then how do you expect to predict that in your model? In this situation, feature engineering or simply waiting for more data might be your only route toward creating “good data”.
Overall, there are at least four widely-touted qualities of good data, which are worth being aware of:
Accuracy pertains to the accuracy of the data itself. So, for example, if you’re collecting weather data from IoT sensors, those sensors would have to be well-calibrated and verified with ground truth data.
Data accuracy also depends on proper data validation, ensuring data ingested into a database, model, or other target conforms to expectations. For example, if you’re collecting data in 24-hour format, it’d be wise to reject data outside what a 24-hour clock can actually read.
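The 24-hour clock check above can be sketched as a simple validation rule. This is a minimal, hypothetical validator, not any particular library’s API:

```python
import re

# Accept only times a 24-hour clock can actually display (HH:MM).
TIME_24H = re.compile(r"^([01]\d|2[0-3]):[0-5]\d$")

def is_valid_24h(value: str) -> bool:
    """Return True if value is a well-formed 24-hour HH:MM time."""
    return bool(TIME_24H.match(value))

readings = ["09:30", "23:59", "24:00", "7:15", "12:60"]
# Reject anything out of range before it reaches the database or model.
valid = [r for r in readings if is_valid_24h(r)]
```

Here `24:00`, `7:15`, and `12:60` are rejected at ingestion rather than cleaned up later.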
Completeness refers to the wholeness of a dataset. For example, a table with 1,000 columns, of which only 500 are filled, is only 50% complete.
It might be that those other columns are obsolete and can be removed, but if not, then you’ll need more data or feature engineering. Finally, completeness also relates to data validation, as incomplete data can be rejected during the collection phase (e.g., when someone fails to enter their name with their address and the form is rejected).
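Completeness is easy to measure directly. A quick sketch with pandas, using a toy table assumed for illustration:

```python
import pandas as pd

# Toy table: both fields are only partially filled.
df = pd.DataFrame({
    "name":  ["Ada", "Ben", None, "Dee"],
    "email": ["a@x.io", None, None, "d@x.io"],
})

# Per-column completeness: share of non-null cells in each column.
completeness = df.notna().mean()

# Whole-table completeness: share of non-null cells overall.
overall = df.notna().to_numpy().mean()
```

Here `name` is 75% complete, `email` is 50% complete, and the table overall is 62.5% complete — a quick way to decide which columns need more data or can be dropped.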
Consistency refers to how data should be recorded, formatted, and structured consistently, especially when changes to that data are made.
When backups of a system are made in portions, those portions should reference each other correctly; if they fail to do so, the data can become inconsistent and unusable. Additionally, when collecting data, it’s important to collect the same data for the same variables as defined at the start of the project, rather than mixing and matching.
Validity relates to all the above concepts. For data to be valid, it must meet the system or model’s expectations.
If the system or model expects a date, incoming data needs to be validated as a date. For example, if it expects a US-formatted date, dates need to be in MM/DD/YY format rather than the UK’s DD/MM/YY.
Defining dirty or bad data
Bad and dirty data can take many forms. Here are the seven most common types of dirty data:
Duplicate data
Duplicates often occur when data is moved due to migrations, imports, transfers, or ingestion from third-party sources. Therefore, deduplication is one of the first things that should be done, including removing irrelevant observations from the set. For example, suppose you download a healthcare dataset for respiratory disease in the under-65s but only want to look at the under-40s - you’ll need to delete irrelevant records before doing anything else.
- Removing duplicates is often simple when the dataset contains primarily unique records.
- Duplicates quickly skew data and lower model accuracy.
- Data filled with duplicates will also place unnecessary strain on resources.
Irrelevant or outdated data
Some data might be outdated or irrelevant today. For example, outdated CRM data might mean the employees/organizations have changed. Email lists from 10 years ago likely contain tons of outdated and redundant data.
Web data scraped even months ago may be inaccurate or irrelevant today. Scrutinize your dataset for outdated data and remove it at the start of the project. It’s probably best to feature engineer data from new data rather than using aged data.
- Topical, current data is often more accurate and useful than outdated data
- Outdated data can create model inaccuracy and bias
- It’s best to perform feature engineering on up-to-date data, as that better represents the ground truth
Inconsistent or invalid data
Data might be poorly validated (or not validated at all). Data may also be collected using inconsistent code or outdated or disused modules, and APIs. Small errors mount up and become a pressing issue when aggregated.
- Small data integrity issues add up to bigger problems in aggregate datasets
- Ensure you’re validating data where possible
- Update APIs and modules to ensure data is clean and error-free
Non-compliant data
Data that fails to comply with privacy regulations or internal business protocol should be considered ‘bad data’, and can even be costly if it’s sufficiently non-compliant to attract civil or even criminal proceedings. Some key pieces of legislation here are GDPR, California’s Consumer Privacy Act (CCPA), and Maine’s Act to Protect the Privacy of Online Consumer Information.
- Non-compliant data often contains non-compliant personally identifiable information (PII)
- This is particularly important in machine learning projects that use PII records, faces, images, videos, etc, without explicit permission
- Holding or using such data might be a criminal or civil offense
Incomplete, insufficient, or missing data
Incomplete, insufficient, or missing data are probably the biggest issues in marketing and machine learning.
- Incomplete data might be missing critical variables, often necessitating feature engineering.
- Insufficient data means there’s not enough substantial data to garner any real insight from or train a decent model.
- Missing data means values might be missing completely at random (MCAR), making it time-consuming to impute the data at scale.
Inaccurate data
Inaccurate data may be poorly validated or riddled with typos and other errors of commission and omission.
Dates written in a manually-input spreadsheet are a classic example, as these lack data validation. For example, someone might write a date in US format, someone else might write the year in full (e.g. 2022 instead of 22), and so on. This causes inaccuracies that need to be removed prior to using the data in any meaningful way.
- Inaccurate data is often caused by human error and no data validation
- Another possibility is collecting data from poorly calibrated or inaccurate instruments (e.g. an IoT sensor)
- Inaccurate data needs to be located and conformed to the same format/range/etc
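The mixed date formats described above can be conformed with a small parser. This is a hedged sketch: the priority order of formats is an assumption you’d set per project, since a date like 03/04/2022 is genuinely ambiguous between US and UK styles:

```python
from datetime import datetime

# Hand-typed spreadsheet column mixing US short, US long, and UK styles.
RAW = ["03/14/22", "03/14/2022", "14/03/2022"]

# Known formats, tried in priority order (an assumption of this sketch).
FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%d/%m/%Y"]

def parse_messy_date(text: str):
    """Try each known format in order; return an ISO date string or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing

parsed = [parse_messy_date(t) for t in RAW]
```

All three inputs above resolve to the same ISO date, and anything unparseable is returned as `None` for manual review instead of being silently coerced.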
Too much data
It’s certainly possible to have too much data. In ML, too much data (or technically too many features) can skew the bias-variance trade-off, potentially leading to overfitting. Otherwise, too much data might make modeling extremely complex and time-consuming.
It’s usually necessary to strike a balance between having enough data to show the model the ‘target’ without inundating it with features and variables.
- Excessive data and features therein can lead to overfitting in ML projects
- At some point, adding more data will yield no discernible advantage (aka. diminishing returns)
- Adding more poor quality data will never fix fundamental issues
Data cleaning techniques
Data cleaning techniques vary depending on the data type(s), volume, and end goals. For example, cleaning data for a marketing mix model might involve imputing missing data using other data, outlier removal, discretization, binning, and normalization. Some of these tasks fall within the remit of feature engineering.
You can clean data by simply editing it in the database. Another option is using a data manipulation tool like Pandas, which is a Python library designed for data cleaning and data manipulation. CRMs, CDPs, and other data products have simple data cleaning operations built-in to the tool, e.g. Segment guarantees that 99% of your data will be free from duplicates within a 24-hour look-back window.
Here are some fundamental data cleaning techniques:
Deduplication
Deduplication is probably the first thing anyone should do to clean data. Duplicates cause havoc but are easy to remove. As mentioned, some tools include deduplication as standard.
Otherwise, you can filter and sort to group data in various ways, making removing duplicates pretty simple in smaller datasets.
Pandas is excellent for a range of deduplication procedures, including deduplicating keywords.
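A deduplication pass in Pandas can be as short as the following sketch, assuming a simple contacts table. Note the normalization step first, so near-duplicates that differ only in letter case collapse too:

```python
import pandas as pd

# Toy contacts table with exact and case-variant duplicates.
df = pd.DataFrame({
    "email": ["a@x.io", "a@x.io", "b@x.io", "A@X.IO"],
    "name":  ["Ada",    "Ada",    "Ben",    "Ada"],
})

# Normalize before deduplicating so "A@X.IO" matches "a@x.io".
df["email"] = df["email"].str.lower()

# Keep the first occurrence of each email address.
deduped = df.drop_duplicates(subset="email", keep="first").reset_index(drop=True)
```

The four input rows collapse to two unique contacts. In larger datasets you’d typically deduplicate on a combination of key columns rather than a single field.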
Grammar, spaces, and syntax errors
Grammar, spaces, and syntax errors are common issues when dealing with strings, particularly long pieces of text. For NLP, it’s often necessary to tokenize text, separating longer strings into token words that are easier to label and process.
When dealing with complex strings, it’s often necessary to remove URLs, HTML, boilerplate text, tracking codes, and various personally identifiable information (PII), which you don’t want to store (as doing so may be a privacy breach). This is often necessary when parsing data scraped from websites and web pages.
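A rough string-cleaning pass before tokenization might look like the sketch below. The regex patterns are illustrative, not exhaustive — real scraped text usually needs a proper HTML parser and more careful PII handling:

```python
import re

def clean_text(raw: str) -> list[str]:
    """Strip HTML tags and URLs, then naively tokenize into lowercase words."""
    text = re.sub(r"<[^>]+>", " ", raw)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # keep letters only
    return text.lower().split()                # whitespace tokenization

tokens = clean_text("<p>Visit https://example.com for MORE info!</p>")
```

The boilerplate markup and URL are removed, leaving only the token words that are useful for labeling and downstream NLP.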
Reformatting, normalization, and standardization
While data validation should ensure that data ingested is compliant with the format the system expects, it’s often necessary to reformat and standardize data. An example is converting a Boolean value of Yes to TRUE, or reformatting string dates to date or time data types.
Standardization involves unifying data that can be expressed in different ways, e.g., converting various measurements (mm, cm, m, etc.) to the same measurement. You can also normalize and standardize scales to bring out-of-scale values into the same ranges.
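Both steps can be sketched in a few lines: first conform mixed units to a single measurement, then min-max normalize the result into the 0–1 range. The unit table and readings are assumed for illustration:

```python
# Assumed conversion table: everything standardized to millimetres.
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

# Raw readings recorded in inconsistent units.
readings = [("5", "cm"), ("50", "mm"), ("2", "m")]

# Standardization: unify all values into the same unit.
mm = [float(value) * TO_MM[unit] for value, unit in readings]

# Normalization: min-max scale the standardized values into [0, 1].
lo, hi = min(mm), max(mm)
normalized = [(x - lo) / (hi - lo) for x in mm]
```

After standardization, 5 cm and 50 mm correctly compare as equal values, and normalization brings the otherwise out-of-scale 2 m reading into the same range as the rest.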
Imputing missing values
Missing values present a different challenge to data cleaning. Data cleaning ensures pre-existing data is ready to use, but what if you encounter missing values or realize your remaining clean dataset is sparse and incomplete?
The approach depends on the type of missing value.
MCAR occurs when there’s no systematic reason why the data is missing. It just is, with no means to connect the missingness of the value to anything else, observed or unobserved. For example, if a temperature sensor fails as it runs out of battery and you get no temperature reading, that value is likely missing completely at random (MCAR). Missing at random (MAR), by contrast, means the missingness can be explained by other observed variables in the dataset, e.g., if readings from one sensor model drop out more often than others and the model is recorded alongside each reading.
Then, missing not at random (MNAR) occurs when the missingness depends on the unobserved value itself, e.g., men with significant hair loss may be less likely to answer questions about their baldness, so the missing baldness values differ systematically from the observed ones.
Negotiating the differences between missing values is complex, but it really matters when dealing with interlinked values. Imputing some data might result in bias, which could be a critical issue in sensitive industries such as healthcare (where the type of missing value is often discussed in great detail).
Data cleaning tools
Data cleaning is built into many data-intensive applications such as CRMs, CDPs, and BI tools.
For example, Segment, mParticle, and other customer and event data studios contain their own built-in and automated tools for deduplicating and cleaning data. For those with Python experience, Pandas and Numpy are excellent tools for executing data cleaning processes and calculations for preparing data.
Here’s a list of other data cleaning tools:
- OpenRefine: An open-source application for cleaning data and transforming it between formats. Works similarly to a spreadsheet and can handle spreadsheet file formats such as CSV.
- Trifacta: Data cleaning and data wrangling software designed for data exploration and data preparation.
- Tibco Clarity: A SaaS data preparation and data cleaning tool.
- Cloudingo: A Salesforce-specific tool for cleaning data, removing duplicates, etc.
- IBM QualityStage: An enterprise tool for auditing data quality and supporting data governance initiatives.
- Melissa Clean Suite
- WinPure Clean & Match
- Informatica Cloud Data Quality
- Oracle Enterprise Data Quality
- SAS Data Quality
Data imputation and removal
As well as imputing missing data, it’s also often necessary to remove data. Both need to be done correctly to retain overall data integrity and quality. The last thing you want to do is add or remove data via pure guesswork.
Discretization and binning
These are both techniques for reducing data, which may make overly complex or noisy data more manageable. Discretization involves placing continuous variables into categories. For example, course progress over 36 months could be categorized into year 1, year 2, and year 3.
Binning similarly involves grouping variables into smaller ‘bins’ to reduce dimensionality and noise. Binning can also be applied to integers.
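Both techniques are one-liners with `pd.cut`. The course-progress example from above, plus an equal-width binning of an assumed integer feature:

```python
import pandas as pd

# Discretization: continuous course progress (months) into year buckets.
months = pd.Series([2, 11, 14, 23, 30, 35])
years = pd.cut(months, bins=[0, 12, 24, 36],
               labels=["year 1", "year 2", "year 3"])

# Binning: group a noisy integer feature into 3 equal-width bins.
ages = pd.Series([18, 22, 35, 41, 58, 63])
age_bins = pd.cut(ages, bins=3)
```

Each month value lands in its year bucket, and the six ages collapse into three coarse interval bins, reducing noise at the cost of some precision.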
Hot decking and cold decking
These are two imputation techniques that replace missing values with new values. Cold decking involves replacing a value with a similar value from a similar dataset, e.g., replacing a missing figure from a single day in 2022 with a similar value from that same day in 2021.
Hot decking involves taking the replacement value from a similar record within the same dataset. For example, you could replace a non-response to a survey question with the answer from a similar respondent in the same survey.
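One common form of hot-deck imputation draws the donor value from a similar record in the same dataset. In the sketch below, “similar” is assumed to mean sharing the same age group — a simplification; real hot decking usually matches on several covariates:

```python
import pandas as pd

# Survey table with one non-response in the "answer" column.
df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35"],
    "answer":    ["yes",   None,    "no",    "no"],
})

# Donor = first observed answer within the respondent's own age group.
# (Assumes each group has at least one observed answer.)
df["answer"] = df.groupby("age_group")["answer"].transform(
    lambda s: s.fillna(s.dropna().iloc[0])
)
```

The missing 18-25 response is filled with the answer from the other 18-25 respondent, keeping the imputed value plausible for that group.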
Median, mean, or mode
Missing data can be filled using the mean, median, or mode of the existing data. These methods are most reliable in simple datasets where the missing values are MCAR.
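All three statistical fills are one-liners in pandas with `fillna`, shown here on a toy series of readings:

```python
import pandas as pd

# Toy series with two gaps, assumed MCAR.
s = pd.Series([10.0, 12.0, None, 14.0, None, 12.0])

mean_filled   = s.fillna(s.mean())     # fill gaps with the average
median_filled = s.fillna(s.median())   # fill gaps with the middle value
mode_filled   = s.fillna(s.mode()[0])  # fill gaps with the most frequent value
```

For this series all three statistics happen to equal 12.0, but on skewed data they diverge, and the median is usually the safest default in the presence of outliers.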
Using models and algorithms
Expectation-maximization, regression, classification, and other models are frequently used to compute missing values from existing values.
It’s also possible to cluster data to remove outliers.
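A minimal regression-imputation sketch: fit a line on the observed pairs, then predict the missing value from the related variable. The data is illustrative and assumed to be roughly linear:

```python
import numpy as np

# Related variables: y is roughly 2x, with one missing observation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, np.nan, 8.0, 10.0])

# Fit a simple linear regression on the observed pairs only.
mask = ~np.isnan(y)
slope, intercept = np.polyfit(x[mask], y[mask], 1)

# Replace each missing y with its predicted value from the fitted line.
y_imputed = np.where(np.isnan(y), slope * x + intercept, y)
```

The gap at x = 3 is filled with the fitted prediction of 6.0. More sophisticated approaches (expectation-maximization, iterative or k-nearest-neighbour imputers) follow the same idea: predict the missing values from the structure of the observed ones.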
Summary: Data cleaning
Data cleaning need not be a mammoth task, but when a dataset is truly dirty and full of missing values, it’s worth being as systematic as possible.
Of course, utilizing data validation and pre-processing techniques to ensure that ingested data is clean is the best way to mitigate dirty data issues.
Start with the low-hanging fruit: deduplication and removal of irrelevant data. It’s probably better to trim your data harshly and perform feature engineering on perfectly clean data than to press on with poor data.
When it comes to imputing missing values, it’s essential to consider why those values are missing. This may be obvious when they’re missing completely at random, but if missing values are linked somehow, then replacing values without due consideration to why they’re missing might create bias and other issues.
With that said, in most applications, replacing missing values and performing feature engineering is not particularly risky.