Navigating the Maze of Missing Data

Samuel N Wekesa
3 min read · Sep 13, 2023


Intro

“In data analysis, missing values pose significant challenges and can lead to skewed results or inaccurate conclusions. Data can be absent for many reasons, from technical glitches to non-responses during data collection.” That was part of a conversation I had a few days ago at an open-day data analysis event.

Photo session after the event

Newcomers to data analysis often struggle with one question in particular: should we drop missing values, or should we take them into account?

Understanding the nature of missing data

MCAR (Missing Completely At Random): The probability that a value is missing is unrelated to any data, observed or unobserved. The missingness is pure chance.

Example: Participants fill out a survey on tablets and, due to a software glitch, some of the responses are not saved.

MAR (Missing At Random): The probability of an observation being missing depends on available information, i.e., other observed data, but not on the missing data itself.

Example: Men are less likely than women to fill out a depression score on a survey, but among respondents of the same gender the missingness isn’t related to their actual depression levels.

MNAR (Missing Not At Random): The missingness is related to the value of the missing data itself.

Example: If people with higher salaries are less likely to share their income on a survey, the missing-data pattern is likely MNAR.
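To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is synthetic and assumed for illustration: the survey size, the missingness probabilities, and the choice to reuse a depression score for the MNAR case as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic survey: a gender flag and a depression score (made-up data).
is_male = rng.random(n) < 0.5
score = rng.normal(loc=50, scale=10, size=n)

# MCAR: every response has the same 20% chance of being lost.
mcar_missing = rng.random(n) < 0.2

# MAR: missingness depends only on an observed variable (gender).
mar_missing = rng.random(n) < np.where(is_male, 0.4, 0.1)

# MNAR: missingness depends on the unobserved value itself
# (here, high scores go missing far more often).
mnar_missing = rng.random(n) < np.where(score > 60, 0.6, 0.1)

true_mean = score.mean()
mcar_mean = score[~mcar_missing].mean()
mar_mean = score[~mar_missing].mean()
mnar_mean = score[~mnar_missing].mean()

print(f"true={true_mean:.2f} MCAR={mcar_mean:.2f} "
      f"MAR={mar_mean:.2f} MNAR={mnar_mean:.2f}")
```

In this setup the mean of the observed scores stays close to the true mean under MCAR, while under MNAR it is pulled downward because high scores are the ones that disappear.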

Strategies for Handling Missing Data:

Deletion:

  • Listwise Deletion (Complete Case Analysis): Excludes every case that has a missing value in any variable. While it’s straightforward, it can considerably reduce sample size and can introduce bias, especially if the data are not MCAR.
  • Pairwise Deletion: Uses all available data by running each analysis on every case in which the variables of interest are present. This retains more data, and hence more statistical power, than listwise deletion, but it complicates analyses because different statistics end up based on different subsets of cases.
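Both deletion strategies can be sketched with pandas. The column names and values below are invented for illustration; the point is only to show how many rows each approach ends up using.

```python
import numpy as np
import pandas as pd

# Made-up survey data with gaps in every column.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38, 29],
    "income": [30_000, np.nan, 45_000, 52_000, np.nan, 34_000],
    "score":  [7.0, 6.5, 8.0, np.nan, 5.5, 7.5],
})

# Listwise deletion: keep only rows with no missing value at all.
complete_cases = df.dropna()

# Pairwise deletion: DataFrame.corr() skips NaNs pair by pair, so each
# coefficient uses every row where both columns are present.
pairwise_corr = df.corr()

# Number of rows actually available for each pair of columns.
present = df.notna().astype(int)
pairwise_n = present.T @ present

print(len(complete_cases))
print(pairwise_n)
```

Here listwise deletion keeps 2 of the 6 rows, while the pairwise counts range from 3 to 4 depending on the pair, which is exactly why pairwise results can be harder to compare with each other.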

Imputation:

Mean/Median/Mode Imputation: Simple statistical measures (mean, median, or mode) of the observed data are used to replace missing values. While this preserves the sample size, it doesn’t account for the uncertainty in the imputed values and it reduces variability (Little & Rubin, 2002).
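A minimal pandas sketch of simple imputation, using invented income and city values, also shows the loss of variability mentioned above:

```python
import numpy as np
import pandas as pd

# Made-up numeric column with two gaps.
income = pd.Series([30_000, np.nan, 45_000, 52_000, np.nan, 34_000], name="income")

mean_filled = income.fillna(income.mean())
median_filled = income.fillna(income.median())

# Mode suits categorical columns; mode() can return ties, so take the first.
city = pd.Series(["Nairobi", None, "Kisumu", "Nairobi"], name="city")
mode_filled = city.fillna(city.mode().iloc[0])

# Imputed values sit exactly at the center, so the spread shrinks.
std_before = income.std()      # computed on observed values only
std_after = mean_filled.std()

print(std_before, std_after)
```

The standard deviation after mean imputation is smaller than before, even though no new information was added.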

Linear Interpolation: In time series, a missing value can be estimated from the observed points on either side of it. This assumes the series changes linearly between those points.
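As a sketch, pandas can do this with `Series.interpolate`. The daily temperature readings are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up daily readings with a two-day gap.
temps = pd.Series(
    [20.0, np.nan, np.nan, 26.0, 27.0],
    index=pd.date_range("2023-09-01", periods=5, freq="D"),
)

# method="linear" treats the points as equally spaced and draws a
# straight line between the nearest observed neighbors.
filled = temps.interpolate(method="linear")

print(filled)
```

The gap between 20.0 and 26.0 is filled with 22.0 and 24.0. If the timestamps are unevenly spaced, `method="time"` weights the estimate by the actual time gaps instead.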

K-Nearest Neighbors (KNN) Imputation: An instance-based learning technique. The missing values of an instance are imputed based on similar instances in the dataset (Troyanskaya et al., 2001).
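One way to try this is scikit-learn’s `KNNImputer`. The tiny two-column array is invented, and `n_neighbors=2` is an arbitrary choice for the sketch rather than a recommendation.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: two clear clusters, one value missing in the first cluster.
X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [0.9, 1.8],
    [8.0, 9.0],
    [8.2, 9.4],
])

# Distances are computed on the features that are present; the missing
# entry becomes the average of that feature over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[1])
```

The missing value is filled from the two rows closest to it (the first and third), not from the distant cluster, so it lands between 1.8 and 2.0.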

Multiple Imputation: Each missing value is replaced several times, producing multiple complete datasets. These datasets are analyzed separately, and the results are pooled so that the final estimate reflects the uncertainty of the imputation (Rubin, 1987).
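The pooling step is usually done with what are known as Rubin’s rules. The sketch below covers only that step, assuming each imputed dataset has already been analyzed; the five estimates and variances are hypothetical numbers, and `pool_rubin` is a name made up for this example.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-dataset estimates and their squared standard errors."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total


# Hypothetical results from m = 5 imputed datasets.
estimates = [10.2, 9.8, 10.5, 10.0, 9.5]
variances = [0.40, 0.38, 0.42, 0.41, 0.39]

pooled_estimate, total_variance = pool_rubin(estimates, variances)
print(pooled_estimate, total_variance)
```

The total variance is larger than the average within-dataset variance because it adds the disagreement between imputations, which single imputation ignores.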

Predictive Models: Algorithms such as regression predict the missing values from other observed variables. However, care must be taken to avoid overfitting.
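A minimal sketch of regression imputation, assuming a single predictor and a straight-line fit. The `years_experience` and `salary` columns are invented, and the data is deliberately perfectly linear so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Made-up data: salary is missing for two people.
df = pd.DataFrame({
    "years_experience": [1, 2, 3, 4, 5, 6],
    "salary": [32_000, 34_000, np.nan, 38_000, np.nan, 42_000],
})

observed = df["salary"].notna()

# Fit salary ~ years_experience on the rows where salary is observed.
slope, intercept = np.polyfit(
    df.loc[observed, "years_experience"],
    df.loc[observed, "salary"],
    deg=1,
)

# Predict only the missing rows; observed values stay untouched.
df.loc[~observed, "salary"] = (
    intercept + slope * df.loc[~observed, "years_experience"]
)

print(df)
```

Because every imputed value falls exactly on the fitted line, this approach tends to understate the real spread of the data, much like mean imputation does.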

Substitution (LOCF & NOCB): In time series, a missing data point can be replaced by the previous observation (Last Observation Carried Forward, LOCF) or by the next one (Next Observation Carried Backward, NOCB). These methods, however, can introduce bias.
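In pandas, LOCF and NOCB correspond to forward fill and backward fill. The readings below are made up.

```python
import numpy as np
import pandas as pd

# Made-up sensor readings with gaps.
readings = pd.Series([5.0, np.nan, np.nan, 8.0, np.nan, 9.0])

locf = readings.ffill()  # carry the last observed value forward
nocb = readings.bfill()  # carry the next observed value backward

print(pd.DataFrame({"raw": readings, "LOCF": locf, "NOCB": nocb}))
```

The two-step gap becomes 5.0, 5.0 under LOCF and 8.0, 8.0 under NOCB, so the same gap can look quite different depending on the direction chosen.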

Algorithms Robust to Missing Data: Some modern algorithms, like XGBoost, are designed to handle missing data internally without any imputation, by learning at each tree split a default direction in which to send rows with missing values.
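The idea of choosing a direction for missing values can be illustrated with a toy example. This is not XGBoost’s actual implementation: it is a simplified sketch that assumes one feature, one fixed threshold, and plain squared error as the loss, and `best_missing_direction` is a made-up helper name.

```python
import numpy as np


def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if len(values) == 0:
        return 0.0
    return float(((values - values.mean()) ** 2).sum())


def best_missing_direction(x, y, threshold):
    """Send NaN rows left, then right; keep the direction with lower loss."""
    missing = np.isnan(x)
    go_left = np.zeros(len(x), dtype=bool)
    go_left[~missing] = x[~missing] < threshold

    losses = {}
    for direction in ("left", "right"):
        left_mask = (go_left | missing) if direction == "left" else go_left
        losses[direction] = sse(y[left_mask]) + sse(y[~left_mask])
    return min(losses, key=losses.get), losses


# Made-up data: the rows with missing x have targets close to the right group.
x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y = np.array([10.0, 11.0, 30.0, 29.0, 31.0, 28.0])

direction, losses = best_missing_direction(x, y, threshold=5.0)
print(direction, losses)
```

In this toy data the rows with missing `x` fit far better with the right-hand group, so "right" is picked as the default direction.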

Dummy Variable Approach: By creating an indicator variable for missingness, one can capture the pattern and potentially control for it in analyses.
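A short pandas sketch of the indicator approach, with an invented `income` column. Filling the original column with the median afterwards is just one possible choice.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing incomes.
df = pd.DataFrame({"income": [30_000, np.nan, 45_000, np.nan]})

# 1 where income was originally missing, 0 otherwise.
df["income_missing"] = df["income"].isna().astype(int)

# Fill the original column afterwards so a model can use both columns.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

The indicator has to be created before the fill, otherwise the information about which rows were missing is lost.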

Conclusion

In some cases, domain knowledge might provide insights into how best to handle missing data.

Remember, the approach you choose should be based on the nature and pattern of the missing data, the amount of data missing, and the specific analytical goals. After addressing missing values, it’s essential to validate the results to ensure the handling method hasn’t introduced any biases or inaccuracies.

References:

  • Little, R. J., & Rubin, D. B. (2002). Statistical Analysis with Missing Data. John Wiley & Sons.
  • Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
  • Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., … & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.
