Monday, October 23, 2017

How to impute missing values in data with Python

During data pre-processing, we frequently need to deal with missing values. There are several ways to handle them, and one is imputation: filling in each missing value with a representative value such as the column mean.

In Python, scikit-learn can do this.
I'll use the air quality data set to try it.

To prepare the data, run the following code in an R console from your working directory. It writes R's built-in airquality data set to a CSV file.

write.csv(airquality, "airquality.csv", row.names=FALSE)




In Python, let's try to fill in the missing values with representative values.

import pandas as pd

airquality = pd.read_csv('airquality.csv')
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5

As you can see, the air quality data has missing values, shown as NaN. To check exactly which columns contain missing values, we can use the isnull() method together with any().

print(airquality.isnull().any())
Ozone       True
Solar.R     True
Wind       False
Temp       False
Month      False
Day        False
dtype: bool

The columns, Ozone and Solar.R, have missing values.
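
If you also want to know how many values are missing, not just whether any are, isnull() can be combined with sum() instead of any(). The snippet below is a minimal sketch. To stay self-contained it rebuilds only the five rows printed above rather than reading the CSV, so the counts it gives apply to that sample and not to the full data set. The names sample, missing_counts and columns_with_missing are mine, for illustration only.

```python
import numpy as np
import pandas as pd

# The first five rows of airquality.csv, as printed by head() above
sample = pd.DataFrame({
    'Ozone': [41.0, 36.0, 12.0, 18.0, np.nan],
    'Solar.R': [190.0, 118.0, 149.0, 313.0, np.nan],
    'Wind': [7.4, 8.0, 12.6, 11.5, 14.3],
    'Temp': [67, 72, 74, 62, 56],
    'Month': [5, 5, 5, 5, 5],
    'Day': [1, 2, 3, 4, 5],
})

# isnull() marks each NaN cell as True; sum() counts the True cells per column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Keep only the names of the columns with at least one missing value
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)
```

On the real data you would call the same methods on airquality directly.
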
To focus on handling the missing values, I'll keep only those two columns.

data = airquality[['Ozone', 'Solar.R']]
print(data.head())
   Ozone  Solar.R
0   41.0    190.0
1   36.0    118.0
2   12.0    149.0
3   18.0    313.0
4    NaN      NaN

The Imputer class of scikit-learn works well for imputation. (Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; in newer versions, SimpleImputer from sklearn.impute plays the same role.)
The code below is one example of how to use it.
In this case, I replaced each missing value with the mean of the column it belongs to.

The fit_transform() method can be split into fit() and transform(). The role of fit() is to learn the fill values from the data (here, the mean of each column), and the role of transform() is to replace the missing values with them. The fit_transform() method does both in one call.
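
To make the split between fit() and transform() concrete, here is a small sketch that calls them one after the other. It assumes scikit-learn 0.20 or later and therefore uses SimpleImputer from sklearn.impute, not the Imputer class used in the rest of this post. The array holds only the five rows of Ozone and Solar.R printed above, so the means differ from those of the full data set. The variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# First five rows of the Ozone and Solar.R columns (assumption: a small
# sample stands in for the full data to keep the sketch self-contained)
sample = np.array([
    [41.0, 190.0],
    [36.0, 118.0],
    [12.0, 149.0],
    [18.0, 313.0],
    [np.nan, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Step 1: fit() only learns the per-column means, ignoring NaN
imputer.fit(sample)
column_means = imputer.statistics_
print(column_means)

# Step 2: transform() returns a copy with each NaN replaced by its column mean
filled = imputer.transform(sample)
print(filled)
```

Keeping the two steps apart is useful when the fill values should be learned from one data set and then applied to another, for example training data and test data.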

from sklearn.preprocessing import Imputer

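# axis=0 imputes along columns, so each NaN gets the mean of its own column
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
# fit_transform() learns the column means and fills the NaN values in one call
imputed_data = imr.fit_transform(data)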
print(imputed_data[:10])
[[  41.          190.        ]
 [  36.          118.        ]
 [  12.          149.        ]
 [  18.          313.        ]
 [  42.12931034  185.93150685]
 [  28.          185.93150685]
 [  23.          299.        ]
 [  19.           99.        ]
 [   8.           19.        ]
 [  42.12931034  194.        ]]

The missing values were replaced with the mean of each column: about 42.13 for Ozone and 185.93 for Solar.R.
By changing the strategy parameter, we can choose another imputation strategy, such as median or most_frequent.
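
As a rough illustration of switching strategies, the sketch below runs the mean and the median side by side. It assumes scikit-learn 0.20 or later, so it uses SimpleImputer from sklearn.impute in place of Imputer, and it works on only the five rows of Ozone and Solar.R shown earlier. The results dictionary is my own helper for the comparison.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# First five rows of the Ozone and Solar.R columns (a sample, not the full data)
sample = np.array([
    [41.0, 190.0],
    [36.0, 118.0],
    [12.0, 149.0],
    [18.0, 313.0],
    [np.nan, np.nan],
])

# Impute the same data once per strategy and keep the results for comparison
results = {}
for strategy in ('mean', 'median'):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    results[strategy] = imputer.fit_transform(sample)

# The last row was entirely NaN, so it shows the fill values of each strategy
for strategy, filled in results.items():
    print(strategy, filled[-1])
```

The median is less sensitive to extreme values than the mean, which can matter for skewed columns.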

For details, see the official documentation page for sklearn.preprocessing.Imputer.