Impact of Data Cleaning on Machine Learning Model Accuracy
Author(s):
Bhavesh Sheshnath Prasad, Tilak Maharashtra Vidyapeeth University
Keywords:
Data Cleaning; Data Quality; Machine Learning; Model Accuracy; Data-Centric AI; Data Preprocessing; Error Correction
Abstract
Data cleaning is widely acknowledged as a critical step in preparing datasets for machine learning (ML). This review examines how data cleaning influences ML model accuracy by synthesizing recent literature. We survey systematic studies and empirical experiments addressing cleaning tasks (e.g., handling missing values, label errors, duplicates) and their effects on classification, regression, and clustering models. Key papers include the CleanML benchmark study, a broad systematic review of data cleaning for ML, an empirical analysis of data quality dimensions, and the COMET system for prioritizing cleaning efforts. Overall, we find that targeted cleaning generally improves accuracy, but gains vary by error type, data context, and resource constraints. For example, imputing missing values or correcting label errors often enhances performance, whereas removing duplicates or fixing minor inconsistencies may have little or no effect. We highlight limitations such as high cleaning costs and unpredictable benefits in real-world settings, and discuss strategies like automated tools and iterative methods (e.g., COMET, ActiveClean) to focus effort on the most impactful data issues. Our synthesis points to a “data-centric” ML paradigm: effective cleaning must be guided by downstream tasks. We conclude with practical insights (e.g., prioritize feature/label accuracy) and future directions, including tighter ML–cleaning integration and automated, cost-aware cleaning processes.
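As a rough illustration of two of the cleaning tasks named in the abstract (imputing missing values and removing duplicates), the following is a minimal sketch using only the Python standard library. It assumes records are lists of floats with None marking a missing value; the function names impute_mean and drop_exact_duplicates are hypothetical and are not taken from CleanML, COMET, ActiveClean, or any other surveyed system.

```python
from statistics import mean
from typing import List, Optional

Row = List[Optional[float]]


def drop_exact_duplicates(rows: List[Row]) -> List[Row]:
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows: List[Row]) -> List[List[float]]:
    """Replace each None with the mean of the observed values in its column."""
    if not rows:
        return []
    col_means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Toy data (illustrative only): one duplicate row and two missing values.
raw = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
    [1.0, 10.0],
]

# Deduplicate first so the repeated row does not skew the column means.
cleaned = impute_mean(drop_exact_duplicates(raw))
```

Whether either step actually changes downstream accuracy depends on the model and the error type, which is the point the abstract makes; this sketch only shows the mechanics of the two operations.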
Other Details
Paper ID: IJSRDV13I30078
Published in: Volume 13, Issue 3
Publication Date: 01/06/2025
Page(s): 115-117
