TF-IDF with sklearn

Python

Published: 2019-08-09

What I did

Use TF-IDF to weight the words in a document.

Environment

$ ipython --version
6.1.0
$ jupyter --version
4.3.0
$ python --version
Python 3.6.2 :: Anaconda custom (64-bit)

import sklearn
print(sklearn.__version__)

    0.19.0

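Investigation

What is TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It is the product of TF (how often a term appears in a document) and IDF (the inverse document frequency, which depends on the term and the corpus as a whole rather than on a single document).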

$$ \text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t) $$
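The formula alone does not pin down the numbers a library prints, so here is a rough sketch that computes TF-IDF by hand using only the standard library. It assumes what I understand to be the TfidfVectorizer defaults, none of which are stated in this post: lowercased tokens of two or more word characters, a smoothed idf of ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The helper names `tokenize` and `tfidf_by_hand` are made up for illustration. The corpus is the same one used in the example further down.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(doc):
    # Assumption: lowercase, keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


def tfidf_by_hand(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n = len(docs)
    # df(t): number of documents that contain term t at least once
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Assumption: smoothed idf, ln((1 + n) / (1 + df(t))) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        # Assumption: each row is scaled to unit L2 norm
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf_by_hand(corpus)
print(vocab)
print([round(w, 8) for w in rows[0]])
```

Under these assumptions, the first row should agree with the first row of the matrix that TfidfVectorizer prints in the next section. The smoothing and the normalization are why the values are not a plain count times a logarithm.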

Trying it out

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
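vec = TfidfVectorizer()
# Learn the vocabulary and idf weights, then build the document-term matrix
X = vec.fit_transform(corpus)
# Vocabulary in column order. On scikit-learn 1.0 and later, use
# get_feature_names_out(); get_feature_names() was removed in 1.2.
print(vec.get_feature_names())
print(type(X))      # a SciPy sparse matrix
print(X.toarray())  # dense array, for display only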

Output

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
<class 'scipy.sparse.csr.csr_matrix'>
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
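To make the matrix easier to read, each column can be paired with its feature name and sorted by weight. The sketch below does this in plain Python for the second and third rows, with the values copied from the output above. `top_terms` is a hypothetical helper written for this post, not part of scikit-learn.

```python
features = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# Rows 2 and 3 of the printed matrix, keyed by their source sentence
rows = {
    'This document is the second document.':
        [0.0, 0.6876236, 0.0, 0.28108867, 0.0, 0.53864762, 0.28108867, 0.0, 0.28108867],
    'And this is the third one.':
        [0.51184851, 0.0, 0.0, 0.26710379, 0.51184851, 0.0, 0.26710379, 0.51184851, 0.26710379],
}


def top_terms(weights, names, k=3):
    """Return the k highest-weighted (term, weight) pairs, skipping zero weights."""
    pairs = [(name, w) for name, w in zip(names, weights) if w > 0]
    # sorted() is stable, so tied terms keep their column order
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]


for doc, weights in rows.items():
    print(doc, top_terms(weights, features))
```

In the second sentence, "document" appears twice and gets the largest weight. In the third, the words unique to that sentence ("and", "one", "third") outrank "is", "the", and "this", which occur in every document.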

References

sklearn.feature_extraction.text.TfidfVectorizer — scikit-learn 0.21.3 documentation