Imputing Missing Data: Techniques and Trade-Offs

Exploring the wide range of data imputation strategies and their hidden costs.

Ulas Can Cengiz
10 min read · May 24, 2023
Photo by Brett Jordan on Unsplash

Welcome to the world of missing data, a world that every data scientist, no matter their industry, will inevitably encounter. It’s a universal challenge, affecting fields as diverse as healthcare, finance, and the social sciences. These seemingly innocent gaps in datasets can pose significant obstacles and distort the story that our data tries to tell. From patient records in a hospital to transaction logs in a bank, absent values are a persistent issue, often undermining the robustness of analyses and leading to misleading conclusions.

The problem with missing data is more than a mere inconvenience: it’s a serious threat to the quality and reliability of insights drawn from data. Here’s why. Missing data shrinks the effective sample size, which reduces statistical power and makes it harder to discern true effects from mere noise. It can also degrade the performance of machine learning models, which are left with fewer complete examples to learn from. The more data is missing, the less our models can learn and the less accurate they tend to become. The overall reliability of data insights also takes a hit; after all, the chain of data analysis is only as strong as its weakest link, and missing…