Imputing Missing Values (scikit-learn SimpleImputer)
Python
Published: 2019-08-12

What I did

I fill in missing values with scikit-learn's SimpleImputer, the replacement for the deprecated Imputer class.

Environment

$ ipython --version
6.1.0
$ jupyter --version
4.3.0
$ python --version
Python 3.6.2 :: Anaconda custom (64-bit)
import sklearn
print(sklearn.__version__)

Output

0.21.2

Investigation

Imputer (Deprecated)

import pandas as pd
from sklearn.preprocessing import Imputer
df = pd.DataFrame({'A':[1,2,3,4,5], 'B':[1,2,None,None,5], 'C':[None, None, 3, None, 4]})

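# Imputer has been deprecated since 0.20 and is scheduled for removal in 0.22
# axis=0 imputes column by column
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imr.transform(df.values)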

Output

/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:66: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
  warnings.warn(msg, category=DeprecationWarning)

array([[1.        , 1.        , 3.5       ],
       [2.        , 2.        , 3.5       ],
       [3.        , 2.66666667, 3.        ],
       [4.        , 2.66666667, 3.5       ],
       [5.        , 5.        , 4.        ]])
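As a sanity check on the array above, each imputed value should equal the mean of the non-missing entries in its column. A minimal sketch using only pandas; the names `col_means` and `filled` are introduced here for illustration and do not appear in the original post:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [1, 2, None, None, 5],
                   'C': [None, None, 3, None, 4]})

# DataFrame.mean skips NaN by default (skipna=True), so each mean is
# taken over the observed values only: B = (1 + 2 + 5) / 3, C = (3 + 4) / 2
col_means = df.mean()

# Passing a Series to fillna fills each column with its matching value
filled = df.fillna(col_means)

print(col_means)
print(filled.values)
```

Column B is filled with 8/3 (about 2.667) and column C with 3.5, which matches the imputer output.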

SimpleImputer

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A':[1,2,3,4,5], 'B':[1,2,None,None,5], 'C':[None, None, 3, None, 4]})
# missing_values defaults to np.nan; SimpleImputer has no axis argument and always imputes column by column
imr = SimpleImputer(strategy='mean')
imr.fit(df)
imr.transform(df.values)

Output

array([[1.        , 1.        , 3.5       ],
       [2.        , 2.        , 3.5       ],
       [3.        , 2.66666667, 3.        ],
       [4.        , 2.66666667, 3.5       ],
       [5.        , 5.        , 4.        ]])
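Because `transform` returns a plain NumPy array, the column labels are lost. The following is a minimal sketch, assuming you want the result back as a DataFrame and want to inspect the values the imputer learned. The variable name `filled` is hypothetical, and `missing_values=np.nan` is spelled out only to make the default explicit:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [1, 2, None, None, 5],
                   'C': [None, None, 3, None, 4]})

imr = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform learns the column means and fills the gaps in one step;
# wrapping the array restores the original column names and index
filled = pd.DataFrame(imr.fit_transform(df),
                      columns=df.columns, index=df.index)

# statistics_ holds the fill value learned for each column
# (here the column means: A=3.0, B=8/3, C=3.5)
print(imr.statistics_)
print(filled)
```

Besides `'mean'`, the `strategy` parameter also accepts `'median'`, `'most_frequent'`, and `'constant'` (the last one together with `fill_value`).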

References