One-hot encoding with sklearn's DictVectorizer

Python

Published: 2019-08-08

What I did

Encode categorical features as one-hot vectors with scikit-learn's DictVectorizer.
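To make the idea concrete before bringing in the library, here is a minimal pure-Python sketch of this kind of encoding. The helper `one_hot_encode` is a hypothetical name of my own, not part of scikit-learn, and it only assumes that string values become `key=value` indicator columns while numeric values are passed through unchanged.

```python
def one_hot_encode(records):
    """Hypothetical helper: one-hot encode string values, pass numbers through."""
    # Column names: "key=value" for string values, plain "key" for numbers.
    names = sorted({
        f"{key}={value}" if isinstance(value, str) else key
        for record in records
        for key, value in record.items()
    })
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                # Set the indicator column for this category to 1.
                row[index[f"{key}={value}"]] = 1.0
            else:
                # Numeric features keep their value.
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows


names, rows = one_hot_encode([
    {'name': 'ABC', 'age': 23},
    {'name': 'DEF', 'age': 35},
])
print(names)
print(rows)
```

Each distinct string value gets its own 0/1 column, so no artificial ordering is imposed on the categories.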

Environment

$ ipython --version
6.1.0
$ jupyter --version
4.3.0
$ python --version
Python 3.6.2 :: Anaconda custom (64-bit)

import sklearn
print(sklearn.__version__)

0.19.0

Investigation

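from sklearn.feature_extraction import DictVectorizer

# Each record is a dict: 'name' is categorical (string), 'age' is numeric.
D = [
    {'name': 'ABC', 'age': 23},
    {'name': 'DEF', 'age': 35},
    {'name': 'XYZ', 'age': 66},
    {'name': 'AAA', 'age': 5},
]

# Dense output (numpy.ndarray)
v = DictVectorizer(sparse=False)
X = v.fit_transform(D)
# get_feature_names() matches scikit-learn 0.19.0 used here;
# it was removed in 1.2 in favor of get_feature_names_out().
print(v.get_feature_names())
print(type(X))
print(X.shape)
print(X)

print('---')

# Sparse output (scipy.sparse matrix)
v2 = DictVectorizer(sparse=True)
X2 = v2.fit_transform(D)
print(X2)
print(v2.get_feature_names())
print(type(X2))
print(X2.shape)
print(X2.toarray())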

Output

['age', 'name=AAA', 'name=ABC', 'name=DEF', 'name=XYZ']
<class 'numpy.ndarray'>
(4, 5)
[[23.  0.  1.  0.  0.]
 [35.  0.  0.  1.  0.]
 [66.  0.  0.  0.  1.]
 [ 5.  1.  0.  0.  0.]]
---
  (0, 0)	23.0
  (0, 2)	1.0
  (1, 0)	35.0
  (1, 3)	1.0
  (2, 0)	66.0
  (2, 4)	1.0
  (3, 0)	5.0
  (3, 1)	1.0
['age', 'name=AAA', 'name=ABC', 'name=DEF', 'name=XYZ']
<class 'scipy.sparse.csr.csr_matrix'>
(4, 5)
[[23.  0.  1.  0.  0.]
 [35.  0.  0.  1.  0.]
 [66.  0.  0.  0.  1.]
 [ 5.  1.  0.  0.  0.]]
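As a follow-up, here is a small sketch of what happens when a fitted DictVectorizer meets a category it has not seen, and how rows can be mapped back to dicts. The `train` and `new` records are made-up data for illustration. It uses the `feature_names_` attribute instead of `get_feature_names()` so that it should run on both old and recent scikit-learn versions.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up training records for illustration.
train = [{'name': 'ABC', 'age': 23}, {'name': 'DEF', 'age': 35}]
v = DictVectorizer(sparse=False)
v.fit(train)

# 'ZZZ' was not seen during fit, so it is silently ignored:
# no 'name=...' column is set for that row.
new = [{'name': 'DEF', 'age': 40}, {'name': 'ZZZ', 'age': 7}]
X_new = v.transform(new)
print(v.feature_names_)
print(X_new)

# inverse_transform maps each row back to a dict of its non-zero features,
# keyed by the encoded feature names (e.g. 'name=DEF').
restored = v.inverse_transform(X_new)
print(restored)
```

Because unseen categories produce an all-zero set of indicator columns, it is worth checking new data against `v.vocabulary_` if that silent behavior would be a problem.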

References

sklearn.feature_extraction.DictVectorizer — scikit-learn 0.21.3 documentation