Impact of Data Cleaning on Machine Learning Model Accuracy

Author(s):

Bhavesh Sheshnath Prasad, Tilak Maharashtra Vidyapeeth University

Keywords:

Data Cleaning; Data Quality; Machine Learning; Model Accuracy; Data-Centric AI; Data Preprocessing; Error Correction

Abstract

Data cleaning is widely acknowledged as a critical step in preparing datasets for machine learning (ML). This review examines how data cleaning influences ML model accuracy by synthesizing recent literature. We survey systematic studies and empirical experiments that address cleaning tasks (e.g., handling missing values, label errors, and duplicates) and their effects on classification, regression, and clustering models. Key papers include the CleanML benchmark study, a broad systematic review of data cleaning for ML, an empirical analysis of data quality dimensions, and the COMET system for prioritizing cleaning efforts. Overall, we find that targeted cleaning generally improves accuracy, but the gains vary by error type, data context, and resource constraints. For example, imputing missing values or correcting label errors often improves performance, whereas removing duplicates or fixing minor inconsistencies may have little or no effect. We highlight limitations such as high cleaning costs and unpredictable benefits in real-world settings, and we discuss strategies such as automated tools and iterative methods (e.g., COMET, ActiveClean) that focus effort on the most impactful data issues. Our synthesis points to a “data-centric” ML paradigm in which effective cleaning is guided by the downstream task. We conclude with practical insights (e.g., prioritizing feature and label accuracy) and future directions, including tighter ML–cleaning integration and automated, cost-aware cleaning processes.
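
As an illustration of two of the cleaning tasks named in the abstract (imputing missing values and removing duplicates), the following is a minimal Python sketch. It is not taken from the paper: the helper names `impute_mean` and `drop_duplicates`, the column names, and the toy rows are all hypothetical, and mean imputation is only one of many possible imputation strategies.

```python
from statistics import mean


def impute_mean(rows, column):
    """Replace None in `column` with the mean of the observed values.

    Hypothetical helper for illustration; assumes at least one observed value.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(sorted(r.items()))  # order-independent, hashable row key
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Toy dataset (made up for this sketch): one missing value, one exact duplicate.
rows = [
    {"age": 30, "income": 50.0, "label": 1},
    {"age": 40, "income": None, "label": 0},
    {"age": 30, "income": 50.0, "label": 1},
    {"age": 50, "income": 70.0, "label": 1},
]

# Impute first, then deduplicate. The order is a design choice: here the
# duplicate row still contributes to the mean used for imputation.
cleaned = drop_duplicates(impute_mean(rows, "income"))
```

Whether either step changes downstream accuracy would still have to be measured by training and evaluating a model on `rows` versus `cleaned`, which is the kind of comparison the surveyed studies report.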

Other Details

Paper ID: IJSRDV13I30078
Published in: Volume : 13, Issue : 3
Publication Date: 01/06/2025
Page(s): 115-117
