Ramgopal Prajapat:

Learnings and Views

Missing value treatment for Categorical and Numeric Variables using Python

By: Ram on Nov 06, 2020

Missing values are common in real-life scenarios. All of the data science and analytics professionals need to understand strategies to manage missing values. In this blog, we will discuss missing value identification, treatment, and imputation. After reading the blog you will be able to answer these questions.

How to identify missing values?
Why do we do missing value treatment?
What are the ways to handle missing values for numeric variables?
What are the missing value treatment methods for categorical variables?
How to fill missing values?
How to handle missing categorical data in python?
How to handle missing numeric data in python?
How to handle missing data machine learning?
How to report missing data?
How to do missing value treatment in python?
How to predict or impute missing values in Python?

 

Key steps in Missing Value Management are:

  • Missing Value Identification
  • Analysis to understand reasons for missing values
  • Strategies to Handle Missing Values

 

Now, we will show some of these steps in Python. The first step is to read the data.

 

Scenario 1 and Read Data

We are using a dataset that has some missing values. This is a Pima Indians Diabetes Database. Data Description

We will explore the dataset to understand a bit more. This data is used for Binary Classification. The class variable takes two value and class value 1 is interpreted as tested positive for diabetes.

Information is based on the data description file.

Number of Instances: 768

Number of Attributes: 8 plus class

For Each Attribute: (all numeric-valued)

  1. Number of times pregnant
  2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skinfold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1)

 

# Import Pandas

import pandas as pd

# Read data

diabetes= pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv", header=None, prefix="C_")

diabetes.head()

 

Missing Value Identification

The first step is to find if there are missing values in a data frame. Sometimes, it is easy that missing values are represented as NaN in the Data Frame.

We will try another example where missing values are represented as "?" and will not be captured directly by standard function in Python.

For this example, let's explore first.

All the variables are numeric. And features do not have NaN (Missing values)

If we look at the values analytically, you will notice, there are 0 values of these features. These are nothing but missing values.

So, first, we can make these as NaN.

 

replace0withNaN(diabetes)

diabetes.head()

For finding missing counts and % for each of the data frame attributes, we are using a custom function. It is simple but makes it very easy.

 

# Check if missing values

def missing_values_table(df):

        mis_val = df.isnull().sum()

        mis_val_percent = 100 * df.isnull().sum() / len(df)

        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

        mis_val_table_ren_columns = mis_val_table.rename(

        columns = {0 : 'Missing Values'1 : '% of Total Values'})

        mis_val_table_ren_columns = mis_val_table_ren_columns[

            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(

        '% of Total Values', ascending=False).round(1)

        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      

            "There are " + str(mis_val_table_ren_columns.shape[0]) +

              " columns that have missing values.")

        return mis_val_table_ren_columns

# source: https://stackoverflow.com/questions/26266362/how-to-count-the-nan-values-in-a-column-in-pandas-dataframe

missing_values_table(diabetes)

 

Out of 8 features in the data frame, 6 features have missing values. % missing observations vary for each of the features. Sometimes % missing value governs the treatment strategies.

Practically, we spend time identifying potential reasons for missing values, and based on that treatment strategy is adopted. Some of the examples or scenarios are here.

Treatment Strategies

Typically, we have 4 strategies to manage the missing values for the features

  • Ignoring Feature or No action on missing value treatment
  • Creating a dummy or indicator variable
  • Replacement - with 0, mean, mode, or median values
  • Imputation with a Machine Learning Method
  • Missing value for International Spend in Credit Card Spend scenario. The missing value is justifiably missing. Potential Treatment: Can be considered or imputed as 0
  • The order value is missing in the retail context. If an order is completed, the order value can't not be missing. Probably some technical issues. We can investigate if there is any reason or pattern in the missing values. Potential Treatment: Imputed with Average Order Value (we can also try modeling)
  • Missing income values - There can be various reasons for missing values. High or low-income people may not fill the details if not mandatory. Potential Treatment: Imputed with Models (Regression Model)
  • Due to technical reasons, some of the non-completed orders are created entries with multiple missing values. Potential Treatment: Delete the entries

 

Missing Value Treatment: Categorical Variable.

Creating a separate level for missing values

When a customer applies for a credit card. She or he fills up an application form. Some of the details required to be filled in are income, occupation type, and age.

Sample Data

More detailed steps in data preparation and modeling are discussed here - Credit card Approval Model using XGBoost

In the application data - the OCCUPATION_TYPE feature has almost 32% missing values.

OCCUPATION_TYPE is a categorical feature. We can treat this as a separate level for this variable and code as "Non - Available"

 

Ignoring Feature or No action on missing value treatment

Insurance Agency extract leads information of leads providing a platform by paying a certain amount per lead. A lot of variables are extracted as part of the process and this information can be used for Leads Scoring and Prioritization.

One of the features - medicare_or_medicaid - had over 99% of the values as missing. This is an example where we can directly exclude the features from the next set of analyses be it bivariate analysis or modeling. Though the variable is binary, the same logic can be applied for a numeric feature.

Creating a dummy or indicator variable

In the same insurance leads scoring scenario, the age variable has around 15% missing values. we have created a separate variable to indicating whether age was missing and also relaced the age with the mean value in the original variable.

When we are creating a new indicator variable, we need to take some action or do the treatment on the original variables and both of these features will be considered for the modeling.

Replacing with Mode (Most Frequent Category)

If a small % of observations are missing for a categorical variable, we can replace it with the most frequent value as well.

Missing Value Treatment for Numeric Features

Replacement - with 0, mean or median values

For the numeric features, post investigating the reasons for missing values, the appropriate treatment method can be selected.

Some of the commonly used missing values replacement method used is:

  • Replacing with 0: In some scenarios, it logically makes sense to replace it with 0. For example, in a credit card, we have columns for each month spend. If the spend for each some users for a particular month is missing, we can replace it with 0. As these customers did not use the credit card to spend in that month. Replacement with 0 causes means of the series is impacted.
  • Replacing with Mean or Median Values: In some of the scenarios, the replacement with 0 does not make sense, and replacement with mean or median is a better approach. For example, we have Athletes data and for some of the athletes the date of birth was missing, as a result, the age was coming as null. So, we had to replace it with the average age. This step may not impact the mean of the distribution of replaced with mean. If a distribution is skewed, the missing value replacement with the median may be a better choice.

Check missing values

Both height and weight features have missing values. And we decided to replace them with their respective mean values.

Here is the Python code to replace missing values with the mean value.

Now, there is no missing value.

Missing Value Treatment for Numeric Values: Imputation

We have discussed steps to replace with zero and mean/median value.

Now, we will discuss a scenario where we will using a regression-based model to impute missing values.

We will read a data file that has missing values and one of the columns has missing values.

We will read the data file and explore the data before imputation.

# read data 

import pandas as pd

breast_cancer_wisconsin = pd.read_csv("http://archive.ics.uci.edu//ml//machine-learning-databases//breast-cancer-wisconsin//breast-cancer-wisconsin.data",

                                       sep=",",

                                       header=None,

                                      prefix="V_")

breast_cancer_wisconsin.head()

 

The feature description for this data frame is as follows.

Attribute Domain


  1. Sample code number id number
  2. Clump Thickness 1 - 10
  3. Uniformity of Cell Size 1 - 10
  4. Uniformity of Cell Shape 1 - 10
  5. Marginal Adhesion 1 - 10
  6. Single Epithelial Cell Size 1 - 10
  7. Bare Nuclei 1 - 10
  8. Bland Chromatin 1 - 10
  9. Normal Nucleoli 1 - 10
  10. Mitoses 1 - 10
  11. Class: (2 for benign, 4 for malignant)

Feature V_6 has missing values. We will explore this.

16 observations are missing and missing values are represented as "?"

We can replace "?" with NA or missing values.

from numpy import nan as NA

breast_cancer_wisconsin['V_6'] = breast_cancer_wisconsin['V_6'].replace('?',NA).astype('float')

 

breast_cancer_wisconsin.describe()

 

We can check the data for the rows which has missing values for the column V_6

breast_cancer_wisconsin.loc[(breast_cancer_wisconsin['V_6'].isnull())]

 

We can build a regression model to estimate the value of attribute V_6 based on the available information about other columns.

This data sample is from the classification model scenario - predict whether a cancer tumor is benign or malignant. The class variable is the last one "V_10". So, we will not use this in the regression model for imputation.

Regression Model for Imputation

We will create a data sample by ignoring the missing rows. We can use dropna() method to do this.

reg_df = breast_cancer_wisconsin.dropna()

len(reg_df)

16 rows with missing values for column V_6 are removed. Now, we will fit a regression model.

V_0 is an ID type variable and V_10 is a Class/Label variable, so we removed these variables from the regression model development.

 

#  Linear Regression

import statsmodels.api as sm

from statsmodels.api import add_constant

# Multiple Regression using statsmodels

simpleOLS = sm.OLS(reg_df['V_6'], add_constant(reg_df[['V_1''V_2''V_3''V_4''V_5''V_7''V_8''V_9']])).fit()

simpleOLS.summary()

 

Now, we can impute the missing values based on the regression model.

# Imput based on regression model

import numpy as np 

breast_cancer_wisconsin_imputed =  simpleOLS.predict(add_constant(breast_cancer_wisconsin[['V_1''V_2''V_3''V_4''V_5''V_7''V_8''V_9']]))

breast_cancer_wisconsin['V_6_reg'] = np.where(breast_cancer_wisconsin['V_6'].isnull(),breast_cancer_wisconsin_imputed, breast_cancer_wisconsin['V_6'])

 

We can check the imputed values.

breast_cancer_wisconsin[breast_cancer_wisconsin.V_6.isnull()]

 

combined_v6['V_6_y'].value_counts()

 

breast_cancer_wisconsin.describe()

Summary and Concluding thoughts

If values are missing for a categorical variable, you can select one of these strategies

  • Replace with the mode (most frequent category level): If missing % lower than 5%
  • Create a separate level for the variable: if missing % is at least 5%
  • Ignore the variable and create an indicator variable - if a variable has a very high % missing observation
  • Remove rows with missing values - this is probably the last option.

If values are missing for a numeric variable

  • Replace with 0
  • Replace with Mean/Median
  • Impute with a model (e.g. regression model for the continuous or random forest for ordinal). This is probably the best approach if you have time.
  • Create a dummy/indicator variable and ignore original: if a variable has a very high % missing observation

If you are working on a Binary Classification Scenario, the Weight of Evidence (WOE) based approach can be used to create a new variable and missing values will be managed in the process.

Also, typically tree-based methods manage the missing values as a part of the algorithm itself.

Please write your comments or questions.

Leave a comment