Why data cleaning matters
Data cleansing can sometimes seem like overkill, especially for vast datasets. Sure, you may have a few incorrect values, but does it really matter?
The answer is yes. Even a small amount of dirty data can have a huge impact on machine learning models. For example, algorithms may learn incorrect patterns from unclean data, leading to inaccurate predictions or outputs. Even worse, errors can unintentionally introduce biases into the model.
When businesses give poor-quality data to ML models, they’re more likely to make bad — or at least misinformed — decisions.
Take AI scheduling tools, for instance. Researchers studied 99 million shifts for retail employees that had been planned using popular AI software. They discovered that managers had manually corrected 84% of the shifts, and about 10% of these adjustments were caused (directly or indirectly) by faulty input data.
As doctoral student Caleb Kwon explains, “if you put in garbage, the AI tool — no matter how sophisticated it is or how complex it is or how much data you feed it — will produce something that’s suboptimal.” For the retailers, the faulty schedules hurt productivity and left stores understaffed.
Maintaining data integrity can prevent costly mistakes and help businesses make more informed decisions. It also ensures that companies are analysing truly reliable data — instead of getting led astray by insights that seem trustworthy but aren’t.
Common data issues
Data Analysts can encounter many types of problems when working with raw data, including:
- Missing values, such as customer phone numbers without all the digits
- Outdated information
- Duplicate data
- Inconsistent data, like “Street” vs. “St.” for addresses
- Inaccurate data, such as product numbers entered incorrectly
Many data errors are syntactic, meaning they contain simple formatting mistakes. For example, you might accidentally misspell a product name while manually updating your inventory.
By contrast, semantic errors look correct but don’t make sense logically. A medical chart, for instance, might say a patient has a heart rate of 600 beats per minute instead of 60. These errors can be harder to spot without close scrutiny.
Structural errors affect the layout of the entire dataset, not just a handful of data points. They can involve anything from mislabelled rows (like “Addresses” instead of “Application numbers”) to unnecessary line breaks.
It’s normal for humans to make mistakes, no matter how diligent they are. You may forget to update an appointment time or mishear a vendor’s name. But machines aren’t infallible, either. For instance, sensors can malfunction and give faulty readings, or Microsoft Excel could glitch and delete data. With so many potential errors, data cleansing is absolutely essential.
Step-by-step data cleaning workflow
When you hear the phrase “data cleaning,” you might picture yourself poring over thousands of data points with bleary eyes. But today, you can mostly automate it using the right methods and tools.
Data auditing
You can’t start cleaning and organising your closet until you know what’s inside. The same goes for your datasets. Without a proper audit, you’re just blindly guessing about what needs to be fixed.
An audit involves profiling your dataset and identifying all the problems. Sometimes, you can do this manually, especially if you only have a few data points. For example, anyone can spot missing data in an Excel spreadsheet — just look for the blank cells.
For larger datasets, you’ll need automated tools to get a handle on your information. One popular platform for this is OpenRefine, which lets you filter and sort data. It also flags errors, such as null values and duplicate records. Using data auditing tools like this helps you quickly understand how clean (or dirty) your information actually is.
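If you prefer working in Python, a quick profiling pass with pandas gives a similar overview. This is only a minimal sketch; the tiny orders table below is invented purely for illustration.

```python
import pandas as pd

# A tiny, made-up dataset standing in for your real raw data.
df = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "customer": ["Alice", "Bob", "Bob", None],
    "amount":   [25.0, 40.0, 40.0, 19.5],
})

print(df.shape)               # rows and columns
print(df.dtypes)              # data type of each column
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # fully duplicated rows
print(df.describe())          # summary statistics to surface obvious outliers
```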
Cleaning
Data cleaning focuses on standardising data and correcting any errors. This step usually involves these tasks:
- Correcting spelling and grammar errors
- Removing duplicate information
- Rearranging columns and rows
- Fixing inconsistent or incorrect data
Data cleaning tools can significantly speed up this process. Many professionals use pandas, a Python library, to tidy data. Other options include Alteryx One and SQL.
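As a rough illustration of what those tasks look like in pandas, here is a minimal sketch. The inventory columns and the misspelled city are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "product_name": ["  blue t-shirt ", "Blue T-Shirt", "red hoodie"],
    "city":         ["London", "Lodnon", "Leeds"],
})

# Standardise text: trim whitespace and fix capitalisation.
df["product_name"] = df["product_name"].str.strip().str.title()

# Fix a known misspelling with an explicit mapping.
df["city"] = df["city"].replace({"Lodnon": "London"})

# Remove the exact duplicate rows revealed by the standardisation.
df = df.drop_duplicates()

print(df)
```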
Always back up your data before and during cleaning. That way, you won’t have to stress if you accidentally delete the wrong information or realise that “Smyth” was the correct spelling of that customer’s name after all.
Validation
While data cleaning software can be incredibly useful, it doesn’t always catch every mistake. Take the time to double-check your dataset and output quality for anomalies or errors. That way, you can feel confident that you’re using genuinely clean data.
The best validation techniques vary depending on the type of data you’re handling. Here are a few options, with a short code sketch after the list:
- Presence check: Verify that every required field actually contains a value.
- Range check: Make sure numerical data falls within an acceptable range. A temperature of 758°F wouldn’t make sense for climate data, so it’s safe to assume it’s a mistake.
- Format validation: Check that all your data follows a standard format, such as military or standard time.
- Uniqueness check: Confirm that data points which should never repeat — such as credit card numbers — appear only once in the dataset.
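Translated into code, those four checks might look something like the minimal pandas sketch below. The customer table and every column in it are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "email":       ["a@example.com", None, "c@example.com", "d@example.com"],
    "age":         [34, 29, 758, 41],
    "signup_date": ["2024-01-05", "2024-02-10", "not a date", "2024-03-02"],
    "card_number": ["1111", "2222", "1111", "3333"],
})

# Presence check: flag rows where a required field is empty.
missing_email = df[df["email"].isna()]

# Range check: ages outside a plausible range are probably errors.
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]

# Format validation: signup dates should parse as real dates.
parsed = pd.to_datetime(df["signup_date"], errors="coerce")
bad_format = df[parsed.isna()]

# Uniqueness check: card numbers should never repeat.
repeated_cards = df[df["card_number"].duplicated(keep=False)]

print(len(missing_email), len(out_of_range), len(bad_format), len(repeated_cards))
```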
Documentation and versioning
Always document each step in your data cleaning process. This could be as simple as noting which rows you removed or how you fixed an anomaly. By keeping transparent records, you’ll help your team understand your decisions and troubleshoot any problems.
Creating multiple versions is another best practice. Save different copies of your data at each stage, and clearly label them. (“Instagram Captions – Pre-Cleaning,” “Instagram Captions – Minus Duplicates,” and so on.) That way, you can easily go back to an earlier version to recover lost data or start over with a new approach.
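In Python, keeping those versions can be as simple as writing out a clearly labelled copy at each stage. The file names below just mirror the example labels above and are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({"caption": ["Great day!", "Great day!", "New arrivals in store"]})

# Stage 1: save a labelled snapshot before touching anything.
df.to_csv("instagram_captions_pre_cleaning.csv", index=False)

# Stage 2: save another labelled copy after removing duplicates.
df = df.drop_duplicates()
df.to_csv("instagram_captions_minus_duplicates.csv", index=False)
```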
Techniques for cleaning data
You don’t have to memorise advanced maths formulas to clean data, but you should understand some basic techniques.
String cleaning focuses on improving the quality of textual data. It often involves removing extra whitespace, deleting stray punctuation, and fixing capitalisation errors. This process is key for preparing data for natural language processing and textual analysis.
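In pandas, string cleaning usually comes down to a few vectorised operations. A minimal sketch, assuming a hypothetical review_text column:

```python
import pandas as pd

df = pd.DataFrame({"review_text": ["  GREAT product!!! ", "not   bad...  ", None]})

# Drop rows with no text at all.
df = df.dropna(subset=["review_text"])

# Trim surrounding whitespace and collapse repeated spaces.
df["review_text"] = df["review_text"].str.strip().str.replace(r"\s+", " ", regex=True)

# Remove stray punctuation and normalise case.
df["review_text"] = df["review_text"].str.replace(r"[^\w\s]", "", regex=True).str.lower()

print(df["review_text"].tolist())  # ['great product', 'not bad']
```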
Many datasets also contain “noise”: random outliers or irrelevant data points. Like static on a radio, noise makes it hard to make sense of your information. Data visualisations, such as heat maps and scatter plots, can help you identify these anomalies. Depending on the context, you may decide to remove or correct them.
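Visual checks are often enough, but you can also flag suspect values numerically. A minimal sketch using the interquartile range, with a made-up daily sales column:

```python
import pandas as pd

df = pd.DataFrame({"daily_sales": [120, 135, 128, 142, 9800, 131, 125]})

# Flag values far outside the interquartile range as potential noise.
q1 = df["daily_sales"].quantile(0.25)
q3 = df["daily_sales"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["daily_sales"] < q1 - 1.5 * iqr) | (df["daily_sales"] > q3 + 1.5 * iqr)]

print(outliers)  # only the 9800 reading is flagged for review
```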
Normalising formats is another key technique. You may discover that some currency values include the pound sign while others are just numbers. Or some dates might start with the year while others end with it. ETL (Extract, Transform, Load) tools can automatically standardise these formats.
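A minimal sketch of the same idea in pandas, assuming hypothetical price and order_date columns:

```python
import pandas as pd

df = pd.DataFrame({
    "price":      ["£19.99", "25", "£7.50"],
    "order_date": ["2024-03-01", "15/03/2024", "2024-03-22"],
})

# Strip currency symbols so every price becomes a plain number.
df["price"] = df["price"].str.replace("£", "", regex=False).astype(float)

# Parse mixed date formats into a single datetime type (format="mixed" needs pandas 2.0+).
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

print(df.dtypes)
```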
In some situations, you may also need to convert categorical data into numerical codes. This process involves giving each category a distinct number. For example, a clothing retailer might label small shirts as 1, medium shirts as 2, and so on. That can make it easier for machine learning models to interpret your data and make predictions.
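Here is a minimal sketch of that encoding step using the shirt-size example; the mapping itself is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"shirt_size": ["small", "medium", "large", "medium"]})

# Map each category to a distinct number, as in the retailer example.
size_codes = {"small": 1, "medium": 2, "large": 3}
df["shirt_size_code"] = df["shirt_size"].map(size_codes)

# For categories with no natural order, one-hot encoding is often a safer choice.
df = pd.get_dummies(df, columns=["shirt_size"])

print(df)
```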
Not sure which techniques to use? Consider the data analysis methods you plan to apply. For example, you’ll need clean textual data to analyse the sentiment of your customers’ reviews. On the other hand, normalised dates are essential for time series analysis.
Handling missing data
It’s normal to have incomplete data, especially when you’re gathering information from several sources. But these missing values can throw off your data analysis, so you can’t just ignore them.
This might seem a bit paradoxical, but start by looking for patterns in what’s not there. There are three kinds of missing data:
- Missing completely at random (MCAR): There’s no rhyme or reason to this incomplete data. For example, a glitchy form may not collect data from 20 out of 1,000 survey respondents, with no obvious pattern.
- Missing at random (MAR): The missing data is related to other variables in the dataset, but not to the value itself. A Business Data Analyst, for instance, may notice that first-time shoppers are 25% less likely to leave a review than repeat customers. That’s not necessarily because they’re less satisfied, but because they have a shorter purchase history.
- Missing not at random (MNAR): The reason for the incomplete data is directly related to what’s missing. Poor test-takers, for instance, may skip a question about their grades out of embarrassment.
Once you understand the reason for the missing data, you can take steps to fix it. You might just remove the incomplete rows, especially if it’s MCAR. Or you may need to rethink your entire data collection method to get more accurate information.
If you choose to simply delete information, you can take one of two approaches. Listwise deletion erases all the information for a subject with one or more missing values. It’s the simplest method, but it can drastically shrink your sample size. By contrast, pairwise deletion keeps all the available data, excluding a subject only from the specific calculations that involve their missing values.
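The difference is easier to see in code. A minimal sketch with a made-up table:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [34, None, 29, 41],
    "income": [52000, 48000, None, 61000],
})

# Listwise deletion: drop any row with at least one missing value.
listwise = df.dropna()
print(len(listwise))  # 2 rows survive

# Pairwise deletion: each calculation uses whatever data is available for it.
# pandas behaves this way by default, skipping NaN column by column.
print(df["age"].mean())     # uses the 3 observed ages
print(df["income"].mean())  # uses the 3 observed incomes
```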
Data imputation is another useful option. It uses mathematical formulas to fill in gaps — basically, making an educated guess about what the unknown data could be. Here are a few imputation methods, with a short code sketch after the list:
- Mean: Replace missing values with the average of the observed data.
- Median: Use the middle value of the data points to fill in the gaps.
- Regression: Predict missing values by analysing the relationship between values. For example, you might use age and job title to predict salary.
- K-Nearest Neighbours (KNN): Use similar data points (“neighbours”) to estimate the missing values.
- ML-based: Use complex algorithms to spot patterns in the dataset and approximate missing values.
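As a rough sketch of two of these in practice, here is mean imputation with pandas and KNN imputation with scikit-learn; the columns are invented for illustration.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, 32, None, 47, 51],
    "salary": [31000, 40000, 45000, None, 66000],
})

# Mean imputation: replace each gap with the column average.
mean_filled = df.fillna(df.mean())

# KNN imputation: estimate gaps from the most similar rows ("neighbours").
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(mean_filled)
print(knn_filled)
```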
No matter which approach you choose, avoid projecting your own biases onto the gaps. And be cautious about deleting data — the last thing you want is to end up with a tiny sample size because you cleaned it too aggressively.
Tools of the trade
You don’t need to master every data science tool, especially as a beginner. Set yourself up for success by upskilling with these essential platforms:
- Microsoft Excel: Helpful for learning basic data cleaning skills, such as removing duplicates and correcting formatting errors.
- Python libraries: Offer pre-written code that you can use to automatically clean and manipulate data. Popular libraries include NumPy (best for numerical data), pandas (for structured data), and scikit-learn (designed for machine learning tasks).
- OpenRefine: A free tool for data transformation and cleaning.
- Jupyter Notebooks: Great for documenting your data cleaning process and batch processing information.
Automating and scaling data cleaning
Once you understand the principles of data cleaning, take your skills to the next level by learning how to write simple automation scripts in Python. These functions save time by handling basic tasks like removing duplicate entries.
You can also combine these scripts into a data cleaning pipeline. Data Engineers and other professionals often build these workflows for repetitive tasks, such as string cleaning and date normalisation.
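A minimal sketch of that idea: a few small cleaning functions chained into one repeatable pipeline with pandas’ .pipe(). Every column name here is hypothetical.

```python
import pandas as pd

def clean_strings(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Trim whitespace and normalise capitalisation in a text column."""
    df = df.copy()
    df[column] = df[column].str.strip().str.title()
    return df

def normalise_dates(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Parse a date column into a consistent datetime type."""
    df = df.copy()
    df[column] = pd.to_datetime(df[column], errors="coerce")
    return df

def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Remove exact duplicate rows."""
    return df.drop_duplicates()

# Chain the steps into one pipeline and run it on a made-up table.
raw = pd.DataFrame({
    "customer": ["  alice smith ", "BOB JONES", "BOB JONES"],
    "signup":   ["2024-01-05", "2024-02-10", "2024-02-10"],
})
clean = (
    raw.pipe(clean_strings, column="customer")
       .pipe(normalise_dates, column="signup")
       .pipe(drop_duplicates)
)
print(clean)
```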
For more complex or extensive projects, consider using a data cleaning framework or platform. Tools like OpenRefine come with a learning curve, but they can help you scale your projects more efficiently than writing every script yourself — or worse, cleaning vast datasets manually.
Preparing data for modelling
Data cleaning lays the foundation for all the other preprocessing steps.
Feature engineering transforms raw data into new variables (or “features”) that help train ML models. If you have sensor data, for instance, you might create features using temperature readings to detect malfunctions. Clean data supports this process by allowing you to develop more accurate features.
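A minimal sketch of the sensor example: the readings, the rolling window, and the temperature threshold are all made up to show the idea.

```python
import pandas as pd

readings = pd.DataFrame({
    "temperature_c": [61.2, 62.0, 61.8, 74.5, 75.1, 76.0],
})

# New feature: a rolling average smooths out single noisy readings.
readings["temp_rolling_mean"] = (
    readings["temperature_c"].rolling(window=3, min_periods=1).mean()
)

# New feature: flag readings above a (hypothetical) safe operating limit.
readings["over_limit"] = readings["temperature_c"] > 70.0

print(readings)
```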
After you’ve cleaned the data, you can also conduct exploratory data analysis (EDA). This step helps you start detecting patterns and anomalies in the dataset.
Of course, you should always double-check your data before using it for training or inference. Look for missing data, outliers, and other issues. You should also make sure that you’ve correctly formatted and encoded everything. By using the most reliable data available, you’ll improve your model’s performance and make smarter forecasts.
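Those final checks can be folded into a few assertions that run before training. A minimal sketch, assuming a hypothetical features table; the specific checks are illustrative, not exhaustive.

```python
import pandas as pd

def check_model_ready(df: pd.DataFrame) -> None:
    """Raise early if the data is not ready for training (illustrative checks only)."""
    assert not df.isna().any().any(), "Dataset still contains missing values"
    assert not df.duplicated().any(), "Dataset still contains duplicate rows"
    assert all(df.dtypes != "object"), "Some columns are still unencoded text"

features = pd.DataFrame({"age": [25, 32, 47], "size_code": [1, 2, 3]})
check_model_ready(features)
print("Data looks ready for modelling")
```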
Final tips for beginners
Learning how to clean data doesn’t have to be complicated or nerve-wracking. Set yourself up for success with these best practices:
- Document every step of the data cleaning process so you can retrace your steps (especially if something goes wrong).
- Start small with simple datasets, such as sales records from the last week. Once you gain confidence, you can branch out to larger projects.
- Build reusable Python scripts for basic cleaning tasks like capitalising names.
- Use validation methods frequently to maintain data quality.
Learn how to turn messy data into sparkling insights with Multiverse
The Multiverse Skills Intelligence Report found that 41% of employees struggle to source and clean data. By learning data cleaning, you can get more precise results and take on advanced AI and ML projects.
Multiverse’s free Data & Insights for Business Decisions apprenticeship is an excellent way to deepen your data knowledge and gain new technical skills. The 13-month programme combines hands-on projects with group learning and personal coaching. You’ll learn how to use data to drive change, support machine learning projects, and influence decision-making in your organisation.
Get ahead of the competition by taking the next step on your data science journey. Fill out our quick application now.