Using sklearn's CountVectorizer

Python

Published: 2019-08-07

What I did

To count the words in a piece of text, I tried out sklearn's CountVectorizer.

Environment

$ ipython --version
6.1.0
$ jupyter --version
4.3.0
$ python --version
Python 3.6.2 :: Anaconda custom (64-bit)

import sklearn
print(sklearn.__version__)

0.19.0

Investigation

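from sklearn.feature_extraction.text import CountVectorizer

# A small corpus of four documents
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vec = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix
X = vec.fit_transform(corpus)
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(vec.get_feature_names())
print(type(X))
print(X.toarray())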

Output

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
<class 'scipy.sparse.csr.csr_matrix'>
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
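
Each column of the matrix corresponds to the word at the same position in the feature-name list, and each row to one document. As a minimal sketch of how to read one row, the following pairs words with their counts for the second document. It relies on the fitted vocabulary_ attribute, which maps each word to its column index; the variable names words, row, and counts are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)

# vocabulary_ maps each word to its column index in X;
# sorting the words by that index recovers the column order
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Counts for the second document (row index 1), as a dense 1-D array
row = X[1].toarray().ravel()

# Keep only the words that actually occur in this document
counts = {word: int(n) for word, n in zip(words, row) if n > 0}
print(counts)
# {'document': 2, 'is': 1, 'second': 1, 'the': 1, 'this': 1}
```

Going through vocabulary_ rather than get_feature_names() keeps the sketch independent of which scikit-learn version is installed.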

toarray() converts the sparse matrix into a dense array (numpy.ndarray).
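
To see what that conversion does in isolation, here is a small sketch that builds a csr_matrix by hand and converts it back. The 2x3 matrix is made-up illustration data, not output from CountVectorizer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small count matrix, mostly zeros
dense = np.array([[0, 2, 0],
                  [1, 0, 0]])

# The sparse form stores only the non-zero entries
sparse = csr_matrix(dense)
print(sparse.nnz)    # number of stored non-zero entries: 2
print(sparse.shape)  # (2, 3)

# toarray() materialises every cell, zeros included, as a numpy.ndarray
restored = sparse.toarray()
print(type(restored))
print(np.array_equal(restored, dense))  # True
```

For a large vocabulary the dense array can be far bigger than the sparse matrix, so toarray() is mainly useful for inspecting small examples like the one in this post.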

References

sklearn.feature_extraction.text.CountVectorizer — scikit-learn 0.21.3 documentation
scipy.sparse.csr_matrix.toarray — SciPy v1.3.0 Reference Guide