How to Win a Data Science Competition (Week1-5 part2) – ためすう

Kaggle, Machine Learning

Published: 2019-05-19

Introduction

These are my notes on the Coursera course "How to Win a Data Science Competition".

The course only had English subtitles, so some of my translations may be hard to follow.

Feature Preprocessing and Generation with Respect to Models

Learning objectives

Explain how employed model impacts choice of preprocessing
Summarize feature preprocessings for numeric and categorical features
Summarize feature generation approaches for datetime and coordinates
Summarize approaches to deal with missing values
Outline the pipeline of applying Bag of Words
Compare Bag of Words and Word2vec
Explain how to extract CNN descriptors from images

Categorical and ordinal features

Preprocessing

Label encoding

Maps each unique value to a number.

Effective for tree-based models
May have little effect for non-tree-based models

Types of label encoding

Alphabetical (sorted alphabetically)

sklearn.preprocessing.LabelEncoder

Note: I suspect the ordering shown in the video is wrong.

Order of appearance

pandas.factorize
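
As a minimal sketch of the difference between the two orderings, the snippet below encodes the same toy column both ways. The column values are made up for illustration and are not from the course.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (values invented for illustration).
embarked = pd.Series(["S", "C", "Q", "S", "C"])

# Alphabetical: LabelEncoder sorts the unique values first,
# so the mapping is C -> 0, Q -> 1, S -> 2.
alphabetical = LabelEncoder().fit_transform(embarked)

# Order of appearance: factorize numbers values as they are first seen,
# so the mapping is S -> 0, C -> 1, Q -> 2.
by_appearance, uniques = pd.factorize(embarked)

print(list(alphabetical))   # [2, 0, 1, 2, 0]
print(list(by_appearance))  # [0, 1, 2, 0, 1]
```

Both results are valid label encodings. They differ only in which integer each category receives.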

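frequency encoding (encodes each category by how often it occurs)
- Useful for linear models and for other kinds of models as well
- Useful when the frequency is correlated with the target
- Helps reduce the number of splits needed (in tree-based models)
- In the generated feature, categories that share the same frequency cannot be told apart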
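
One possible way to compute a frequency encoding with pandas is sketched below. The data and the column name `embarked_freq` are assumptions made for this example.

```python
import pandas as pd

# Toy data (invented for illustration): S appears 4 times, C and Q twice each.
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S", "C", "S", "S", "Q"]})

# Fraction of rows in which each category appears.
freq = df["embarked"].value_counts(normalize=True)

# Replace each category with its frequency.
df["embarked_freq"] = df["embarked"].map(freq)

print(df["embarked_freq"].tolist())
# [0.5, 0.25, 0.25, 0.5, 0.25, 0.5, 0.5, 0.25]
```

Here C and Q both become 0.25, which illustrates the last point above: categories with equal frequency collapse into the same value.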

one-hot encoding

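Useful for linear models, kNN, and neural networks
The result is already scaled, with a minimum of 0 and a maximum of 1
With hundreds of binary features, tree-based models become slow
Use sparse matrices to save memory
Sparse matrices are convenient when working with categorical features and text
Libraries that can handle sparse matrices
- XGBoost
- LightGBM
- sklearn
Sparse matrices pay off when the number of non-zero values is far less than half of all values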
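
The following is a small sketch of one-hot encoding that keeps the result sparse. It assumes scikit-learn's `OneHotEncoder`, which returns a SciPy sparse matrix by default. The toy column is invented for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (values invented for illustration).
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S", "C"]})

# By default the encoder returns a SciPy sparse matrix,
# storing only the non-zero entries.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(df[["embarked"]])

print(sparse.issparse(onehot))  # True
print(onehot.shape)             # (5, 3): one column per category (C, Q, S)

# Share of non-zero entries; the smaller this is, the more a sparse format saves.
density = onehot.nnz / (onehot.shape[0] * onehot.shape[1])

# Dense view, only for inspecting this tiny example.
dense = onehot.toarray()
```

With only three categories the saving is negligible. The benefit shows up when a column has hundreds or thousands of distinct values and each row still has a single non-zero entry.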

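Feature generation

Interactions between several categorical features are useful for non-tree-based models (e.g. linear models, kNN).

For example, suppose that in the Titanic dataset the target depends on the following: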

Pclass
sex

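If this is true, a linear model can adjust its predictions for every possible combination of these two features and get good results.

How to generate

Concatenate the string values of both features and apply one-hot encoding to the result.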
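
A minimal sketch of this interaction feature is shown below. The rows are invented rather than taken from the real Titanic data, and the `_` separator and the column name `pclass_sex` are choices made for this example.

```python
import pandas as pd

# Toy rows (invented for illustration, not the actual Titanic data).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "sex": ["male", "female", "female", "male", "female"],
})

# Concatenate both features as strings; "_" is an arbitrary separator.
df["pclass_sex"] = df["Pclass"].astype(str) + "_" + df["sex"]

# One-hot encode the combined feature: one column per observed combination.
interaction = pd.get_dummies(df["pclass_sex"])

print(list(interaction.columns))
# ['1_female', '1_male', '2_female', '3_female', '3_male']
```

Each column now marks one (Pclass, sex) pair, so a linear model can learn a separate weight for every combination that appears in the data.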

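Summary

Ordinal features are a special case of categorical features
- Ordinal values are sorted in a meaningful order
Label encoding replaces categorical values with numbers
Frequency encoding encodes categories by how often they occur
Label encoding and frequency encoding are often used with tree-based models
One-hot encoding is used with non-tree-based models
Applying one-hot encoding to combinations of categorical features lets non-tree-based models take interactions between features into account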

References

How to Win a Data Science Competition: Learn from Top Kagglers | Coursera