Handling Missing Data

Data Analysis
Author

Juma Shafara

Published

April 24, 2024


Introduction:

Missing data is a common hurdle in data analysis, and it undermines the reliability of any insight drawn from a dataset. Python offers a range of tools for dealing with it, some of which we discussed in earlier weeks. In this notebook, we look at three widely used imputation methods from scikit-learn: SimpleImputer, KNNImputer, and IterativeImputer. For each one we cover how it works and its practical trade-offs, using the weather dataset as a running example.

# install the libraries for this demonstration
! pip install dataidea==0.2.5
from dataidea.packages import *
from dataidea.datasets import loadDataset

The line from dataidea.packages import * brings in the usual aliases (np, pd, plt, and so on) for us, and loadDataset lets us load the datasets that ship with the dataidea library.

weather = loadDataset('weather')
weather
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 NaN 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 NaN
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN NaN
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny
weather.isna().sum()
day            0
temperature    4
windspead      4
event          2
dtype: int64
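
For readers who do not have dataidea installed, the table above can be rebuilt by hand with pandas. This is only a minimal sketch: the values are copied from the printed output, keeping day as plain strings is an assumption (the dtype returned by loadDataset is not shown here), and missing_summary is a name introduced just for this illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the printed table; 'windspead' keeps the dataset's own spelling.
weather = pd.DataFrame({
    'day': ['01/01/2017', '04/01/2017', '05/01/2017', '06/01/2017', '07/01/2017',
            '08/01/2017', '09/01/2017', '10/01/2017', '11/01/2017'],
    'temperature': [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    'windspead': [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
    'event': ['Rain', 'Sunny', 'Snow', np.nan, 'Rain', 'Sunny', np.nan, 'Cloudy', 'Sunny'],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    'n_missing': weather.isna().sum(),
    'pct_missing': (weather.isna().mean() * 100).round(1),
})
print(missing_summary)
```

Looking at the percentage alongside the raw count makes it easier to judge how much of each column an imputer will have to fill in.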

Let’s demonstrate each of the three imputers (SimpleImputer, KNNImputer, and IterativeImputer) on the two numeric columns of this small weather dataset.

# select the two numeric columns (temperature and windspead) from the data
temp_wind = weather[['temperature', 'windspead']].copy()
temp_wind_imputed = temp_wind.copy()

SimpleImputer from scikit-learn:

  • Usage: SimpleImputer is a straightforward method for imputing missing values by replacing them with a constant, mean, median, or most frequent value along each column.
  • Pros:
    • Easy to use and understand.
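    • Can handle both numerical and categorical data (the mean and median strategies are numeric only; most_frequent and constant also work on categorical columns).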
    • Offers flexibility with different imputation strategies.
  • Cons:
    • It doesn’t consider relationships between features.
    • May not be the best choice for datasets with complex patterns of missingness.
  • Example:
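from sklearn.impute import SimpleImputer

# replace each missing value with the mean of its column
simple_imputer = SimpleImputer(strategy='mean')
temp_wind_simple_imputed = simple_imputer.fit_transform(temp_wind)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
temp_wind_simple_imputed_df = pd.DataFrame(temp_wind_simple_imputed, columns=temp_wind.columns)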

Let’s have a look at the outcome

temp_wind_simple_imputed_df
temperature windspead
0 32.0 6.0
1 33.2 9.0
2 28.0 8.4
3 33.2 7.0
4 32.0 8.4
5 33.2 8.4
6 33.2 8.4
7 34.0 8.0
8 40.0 12.0
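
The other strategies mentioned above can be tried the same way. The sketch below is self-contained: it rebuilds the columns by hand from the printed weather table, then applies median to the numeric columns and most_frequent and constant to the categorical event column. The 'Unknown' fill value and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Columns copied from the printed weather table
temp_wind = pd.DataFrame({
    'temperature': [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    'windspead': [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})
event = pd.DataFrame(
    {'event': ['Rain', 'Sunny', 'Snow', np.nan, 'Rain', 'Sunny', np.nan, 'Cloudy', 'Sunny']},
    dtype=object,
)

# median: less sensitive to extreme values than the mean
median_imputer = SimpleImputer(strategy='median')
temp_wind_median = pd.DataFrame(median_imputer.fit_transform(temp_wind), columns=temp_wind.columns)

# most_frequent: fills with the most common category in the column
mode_imputer = SimpleImputer(strategy='most_frequent')
event_mode = pd.DataFrame(mode_imputer.fit_transform(event), columns=event.columns)

# constant: fills with an explicit placeholder ('Unknown' is an arbitrary choice)
constant_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
event_constant = pd.DataFrame(constant_imputer.fit_transform(event), columns=event.columns)

print(temp_wind_median)
print(event_mode['event'].tolist())
print(event_constant['event'].tolist())
```

A constant placeholder keeps the fact that a value was missing visible in the data, whereas most_frequent hides it behind the dominant category.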

KNNImputer from scikit-learn:

  • Usage: KNNImputer fills each missing value with the mean of that feature across the k nearest rows (n_neighbors), where nearness is measured on the features that both rows have observed.
  • Pros:
    • Considers relationships between features, making it suitable for datasets with complex patterns of missingness.
  • Cons:
    • Works on numerical data only; categorical columns must be encoded as numbers first.
    • Computationally expensive for large datasets.
    • Requires careful selection of the number of neighbors (k).
  • Example:
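from sklearn.impute import KNNImputer

# average the values of the 2 nearest rows for each missing entry
knn_imputer = KNNImputer(n_neighbors=2)
temp_wind_knn_imputed = knn_imputer.fit_transform(temp_wind)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
temp_wind_knn_imputed_df = pd.DataFrame(temp_wind_knn_imputed, columns=temp_wind.columns)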

Let’s take a look at the outcome

temp_wind_knn_imputed_df
temperature windspead
0 32.0 6.0
1 33.0 9.0
2 28.0 7.0
3 33.0 7.0
4 32.0 7.0
5 33.2 8.4
6 33.2 8.4
7 34.0 8.0
8 40.0 12.0
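
To see what "nearest" means here, the sketch below rebuilds the two numeric columns by hand (values copied from the weather table) and prints the NaN-aware Euclidean distances that, as far as we understand scikit-learn's defaults, KNNImputer relies on. It also suggests why rows 5 and 6 end up with the column means 33.2 and 8.4: they have no observed feature at all, so no distance to any other row can be computed. Calling nan_euclidean_distances directly is our own illustration and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Columns copied from the printed weather table
temp_wind = pd.DataFrame({
    'temperature': [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    'windspead': [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

# Pairwise distances that skip missing coordinates and rescale by the
# share of coordinates that were actually compared
distances = nan_euclidean_distances(temp_wind.to_numpy())

# Row 3 has only windspead observed (7.0); rows 0 and 7 are its closest donors
print(distances[3, [0, 7, 8]])

# Row 5 has nothing observed, so every distance from it is NaN
print(distances[5])

knn_imputer = KNNImputer(n_neighbors=2)
temp_wind_knn = pd.DataFrame(knn_imputer.fit_transform(temp_wind), columns=temp_wind.columns)

# Row 3's temperature is the mean of rows 0 and 7: (32 + 34) / 2
print(temp_wind_knn.loc[3, 'temperature'])
```

Because the distance is computed on raw values, features measured on very different scales would usually be standardized before using KNNImputer.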

IterativeImputer from scikit-learn:

  • Usage: IterativeImputer models each feature with missing values as a function of the other features and uses that model's prediction as the imputed value. It repeats this column by column over several rounds, refining the estimates each time.
  • Pros:
    • Takes into account relationships between features, making it suitable for datasets with complex missing patterns.
    • Often gives more plausible values than SimpleImputer when features are correlated, because each estimate is conditioned on the other columns.
  • Cons:
    • Can be computationally intensive and slower than SimpleImputer.
    • Requires careful tuning of model parameters, such as the estimator and the number of iterations.
  • Example:
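# IterativeImputer is still experimental, so this import is required to enable it
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# with the default settings, each column is predicted from the other using BayesianRidge
iterative_imputer = IterativeImputer()
temp_wind_iterative_imputed = iterative_imputer.fit_transform(temp_wind)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
temp_wind_iterative_imputed_df = pd.DataFrame(temp_wind_iterative_imputed, columns=temp_wind.columns)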

Let’s take a look at the outcome

temp_wind_iterative_imputed_df
temperature windspead
0 32.000000 6.000000
1 35.773287 9.000000
2 28.000000 3.321648
3 33.042537 7.000000
4 32.000000 6.238915
5 33.545118 7.365795
6 33.545118 7.365795
7 34.000000 8.000000
8 40.000000 12.000000
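
The parameters mentioned under the cons can be set explicitly. The sketch below is one possible configuration, not a recommendation: max_iter caps the number of rounds, min_value stops the model from producing negative wind speeds or temperatures, and random_state fixes the seed. The specific values are assumptions chosen only for illustration, and the data is rebuilt by hand from the weather table so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Columns copied from the printed weather table
temp_wind = pd.DataFrame({
    'temperature': [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    'windspead': [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

iterative_imputer = IterativeImputer(
    max_iter=25,       # upper bound on the number of imputation rounds
    min_value=0.0,     # imputed values are clipped so they never go below zero
    random_state=0,    # fixed seed for reproducibility
)
temp_wind_iter = pd.DataFrame(
    iterative_imputer.fit_transform(temp_wind), columns=temp_wind.columns
)

# Values that were observed in the input are passed through unchanged
observed = temp_wind.notna().to_numpy()
print(np.allclose(temp_wind_iter.to_numpy()[observed], temp_wind.to_numpy()[observed]))
print(temp_wind_iter)
```

On a dataset this small, scikit-learn may warn that the imputer stopped before converging; with only nine rows the fitted regression is rough at best, so the imputed numbers should be read as a demonstration rather than as reliable estimates.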

Datawig:

Datawig is a library specifically designed for imputing missing values in tabular data using deep learning models.

# import datawig

# Impute missing values (left commented out; requires datawig to be installed)
# df_imputed = datawig.SimpleImputer.complete(weather)

These imputation methods trade off computational cost, the ability to exploit relationships between features, and ease of use. The choice between them depends on the specific characteristics of the dataset and the requirements of the analysis.
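
One quick way to compare the scikit-learn imputers is to put their estimates side by side for the rows that were originally missing. The sketch below does this for the temperature column; the dictionary keys and the comparison name are our own labels, and the data is rebuilt by hand from the weather table.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Columns copied from the printed weather table
temp_wind = pd.DataFrame({
    'temperature': [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    'windspead': [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

imputers = {
    'simple_mean': SimpleImputer(strategy='mean'),
    'knn_2': KNNImputer(n_neighbors=2),
    'iterative': IterativeImputer(random_state=0),
}

# Column 0 of each result is the imputed temperature
comparison = pd.DataFrame({
    name: imputer.fit_transform(temp_wind)[:, 0] for name, imputer in imputers.items()
})

# Keep only the rows where temperature was missing in the original data
was_missing = temp_wind['temperature'].isna()
comparison = comparison.loc[was_missing]
print(comparison)
```

Where the columns disagree strongly, the choice of imputer matters for that row, and it is worth checking which estimate is most plausible for the data at hand.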

Homework

  • Try out these techniques for categorical data

