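What I did
Count missing values with pandas `isnull`, then fill them in with scikit-learn's `Imputer` and its replacement, `SimpleImputer`.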
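As a minimal sketch of the counting step, assuming the same sample DataFrame that the investigation uses: `isnull()` returns a boolean mask, and summing it gives the number of missing values per column. The names `missing_per_column` and `total_missing` are only for illustration.

```python
import pandas as pd

# Same sample data as in the investigation; None becomes NaN in numeric columns
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [1, 2, None, None, 5],
    'C': [None, None, 3, None, 4],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing_per_column = df.isnull().sum()
# Summing again gives the total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

With this data, column A has no missing values, B has 2, and C has 3, for 5 in total.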
Environment
$ ipython --version
6.1.0
$ jupyter --version
4.3.0
$ python --version
Python 3.6.2 :: Anaconda custom (64-bit)
import sklearn
print(sklearn.__version__)
Output
0.21.2
Investigation
Imputer (Deprecated)
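import pandas as pd
from sklearn.preprocessing import Imputer  # deprecated in 0.20, removed in 0.22
# Sample data: columns B and C contain missing values (None becomes NaN)
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, None, None, 5], 'C': [None, None, 3, None, 4]})
# Replace each NaN with the mean of its column (axis=0)
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imr.transform(df.values)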
Output
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:66: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
  warnings.warn(msg, category=DeprecationWarning)
array([[1.        , 1.        , 3.5       ],
       [2.        , 2.        , 3.5       ],
       [3.        , 2.66666667, 3.        ],
       [4.        , 2.66666667, 3.5       ],
       [5.        , 5.        , 4.        ]])
SimpleImputer
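import pandas as pd
from sklearn.impute import SimpleImputer
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, None, None, 5], 'C': [None, None, 3, None, 4]})
# SimpleImputer has no axis argument (it always works column-wise),
# and missing_values defaults to np.nan
imr = SimpleImputer(strategy='mean')
imr.fit(df)
imr.transform(df.values)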
Output
array([[1.        , 1.        , 3.5       ],
       [2.        , 2.        , 3.5       ],
       [3.        , 2.66666667, 3.        ],
       [4.        , 2.66666667, 3.5       ],
       [5.        , 5.        , 4.        ]])
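As a quick sanity check (a sketch that was not part of the original run), the means that `SimpleImputer` learned can be read from its `statistics_` attribute and compared with pandas' own column means, which skip NaN by default. The result is also wrapped back into a DataFrame so the column names are kept. The names `filled`, `means_match`, and `remaining_missing` are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [1, 2, None, None, 5],
    'C': [None, None, 3, None, 4],
})

imr = SimpleImputer(strategy='mean')
# fit_transform returns a NumPy array, so wrap it to keep the column names and index
filled = pd.DataFrame(imr.fit_transform(df), columns=df.columns, index=df.index)

# statistics_ holds the per-column fill values learned during fit
means_match = np.allclose(imr.statistics_, df.mean().values)
# After imputation no missing values should remain
remaining_missing = int(filled.isnull().sum().sum())

print(imr.statistics_)
print(means_match, remaining_missing)
```

If `means_match` is true and `remaining_missing` is 0, the imputer filled every gap with the same column means that pandas computes.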