How to Win a Data Science Competition (Week1-6 part1) – ためすう

Kaggle, Machine Learning

Published: 2019-05-22

Feature extraction from text and images

Learning objectives

Explain how employed model impacts choice of preprocessing
Summarize feature preprocessings for numeric and categorical features
Summarize feature generation approaches for datetime and coordinates
Summarize approaches to deal with missing values
Outline the pipeline of applying Bag of Words
Compare Bag of Words and Word2vec
Explain how to extract CNN descriptors from images

Bag of words

Extracting features from text

There are two main approaches:

Apply bag of words
Use embeddings that map words to vectors (such as Word2vec)

bag of words

Count the number of occurrences of each word
In sklearn, this can be done with CountVectorizer
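
As a minimal sketch of the counting step, assuming scikit-learn is installed. The two toy sentences and the variable names are made up for illustration and are not from the course.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only)
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per word

# vocabulary_ maps each word to its column index
vocab = vectorizer.vocabulary_
dense = counts.toarray()

print(sorted(vocab))              # the distinct words found in the corpus
print(dense[0, vocab["the"]])     # how many times "the" occurs in the first document
```

Each row is one document and each column is one word, so two documents can be compared column by column.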

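Postprocessing

Make samples comparable (helps models that depend on feature scaling): normalize by term frequency (tf), dividing each word count by the total number of words in the document.
Boost important features and suppress unimportant ones: normalize features by inverse document frequency (idf), so that features for frequent words are scaled down more than features for rare words.
In sklearn, TfidfVectorizer is used for this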
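
A rough sketch of both normalizations with TfidfVectorizer, assuming scikit-learn is installed. The toy corpus is made up. Note that the sklearn defaults (smoothed idf, L2-normalized rows) differ slightly from plain "divide by the total count", so the first vectorizer sets `use_idf=False, norm="l1"` to get exactly that behavior.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only)
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]

# Term-frequency normalization only: each count is divided by the
# total number of counted words in its document (L1 norm, no idf).
tf_only = TfidfVectorizer(use_idf=False, norm="l1")
tf = tf_only.fit_transform(docs).toarray()
row_sums = tf.sum(axis=1)  # every row sums to 1

# Full TF-IDF with the sklearn defaults (smoothed idf, L2-normalized rows).
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_

idf_the = tfidf.idf_[vocab["the"]]    # "the" appears in all 3 documents
idf_bird = tfidf.idf_[vocab["bird"]]  # "bird" appears in only 1 document
print(idf_the < idf_bird)             # the frequent word gets the smaller idf weight
```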

Ngram

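Add columns not only for single words but also for sequences of consecutive words
Sometimes it is cheaper to build features from characters instead of words
Character Ngrams help with rarely seen words
sklearn's CountVectorizer has an ngram_range parameter for using Ngrams
The analyzer parameter switches from word Ngrams to character Ngrams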
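
A small sketch of the two parameters, assuming scikit-learn is installed. The example strings are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word Ngrams: unigrams and bigrams (N from 1 to 2)
word_ngrams = CountVectorizer(ngram_range=(1, 2))
word_ngrams.fit(["this is the sentence"])
word_features = sorted(word_ngrams.vocabulary_)
print(word_features)

# Character Ngrams: analyzer="char" switches from words to characters
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 3))
char_ngrams.fit(["word"])
char_features = sorted(char_ngrams.vocabulary_)
print(char_features)
```

Because character Ngrams are shared between related word forms, a rare or unseen word can still activate some columns.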

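Preprocessing

Lowercase
Lemmatization (reduce a word to its dictionary form)
Stemming
- Chops off word endings to produce a stem
Stopwords
- Remove words that carry no important information
- sklearn's CountVectorizer also has a max_df parameter, which drops words whose document frequency exceeds a threshold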
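
A sketch of lowercasing, stopword removal, and max_df with CountVectorizer, assuming scikit-learn is installed. The corpus and the two-word stopword list are made up for illustration. Stemming and lemmatization are left out because they need an external NLP library.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only)
docs = [
    "The movie was great",
    "The movie was boring",
    "The plot was thin",
]

# lowercase=True (the default) maps "The" to "the";
# stop_words takes an explicit list of words to drop (hypothetical list here).
stop_vec = CountVectorizer(lowercase=True, stop_words=["the", "was"])
stop_vec.fit(docs)
kept_after_stopwords = sorted(stop_vec.vocabulary_)

# max_df=0.9 drops words that appear in more than 90% of the documents,
# which removes "the" and "was" here without any stopword list.
max_df_vec = CountVectorizer(max_df=0.9)
max_df_vec.fit(docs)
kept_after_max_df = sorted(max_df_vec.vocabulary_)

print(kept_after_stopwords)
print(kept_after_max_df)
```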

Summary

Preprocessing
Split into Ngrams
Postprocessing (normalize with TF-IDF)
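
The three steps can be combined in a single TfidfVectorizer call. This is only a sketch under the assumption that scikit-learn and NumPy are installed. The corpus is made up, and lowercasing stands in for the whole preprocessing step.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only)
docs = ["Great movie", "Boring movie", "Great plot"]

pipeline = TfidfVectorizer(
    lowercase=True,      # preprocessing
    ngram_range=(1, 2),  # split into word unigrams and bigrams
)                        # postprocessing: TF-IDF weighting, L2-normalized rows (defaults)
features = pipeline.fit_transform(docs)

row_norms = np.linalg.norm(features.toarray(), axis=1)
print(features.shape)
print(sorted(pipeline.vocabulary_))
```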

References

How to Win a Data Science Competition: Learn from Top Kagglers | Coursera