やったこと
TF-IDF を用いて、文書内の単語の重み付けをします。
確認環境
$ ipython --version
6.1.0
$ jupyter --version
4.3.0
$ python --version
Python 3.6.2 :: Anaconda custom (64-bit)
import sklearn
print(sklearn.__version__)
0.19.0
調査
TF-IDF とは
TF-IDF とは、term frequency-inverse のことで、
TF (単語の出現頻度) X IDF (逆文書頻度) となります。
$$ tf-idf(t,d) = tf(t, d) * idf(t, d) $$
使ってみる
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
print(vec.get_feature_names())
print(type(X))
print(X.toarray())
出力結果
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
<class 'scipy.sparse.csr.csr_matrix'>
[[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]
[0. 0.6876236 0. 0.28108867 0. 0.53864762
0.28108867 0. 0.28108867]
[0.51184851 0. 0. 0.26710379 0.51184851 0.
0.26710379 0.51184851 0.26710379]
[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]]