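Python: Getting started with brute force feature engineering using featuretools

This time I'll write about brute force feature engineering with a package called featuretools. Brute force feature engineering means applying every conceivable operation to the explanatory variables, one after another, to generate features, regardless of whether they actually turn out to be useful. It is quite a different approach from what usually comes to mind, where you build features by hand based on things like exploratory data analysis. featuretools is a package that serves as a framework for doing this kind of brute force feature engineering.

The environment I used is as follows.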

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.14.6
BuildVersion:   18G1012
$ python -V
Python 3.7.5

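Contents
Preparation
Trying it with a single data frame
- Creating features
- Creating further combined features
About the operations used to build features (Primitives)
Trying it with multiple data frames
- Creating Aggregation features
- Combining Aggregation and Transform
Aggregating within a single data frame
About the data types featuretools handles
About the built-in Primitives
- Transform
- Aggregation

Preparation

First, install featuretools.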

$ pip install featuretools

Then start the Python interpreter.

$ python

Trying it with a single data frame

First, prepare a sample data frame.

>>> import pandas as pd
>>> data = {
...     'name': ['a', 'b', 'c'],
...     'x': [1, 2, 3],
...     'y': [2, 4, 6],
...     'z': [3, 6, 9],
... }
>>> df = pd.DataFrame(data)

In addition to a name, this data frame contains features that look like three-dimensional coordinates. Let's extract some features from it.

>>> df
  name  x  y  z
0    a  1  2  3
1    b  2  4  6
2    c  3  6  9

First, import featuretools.

>>> import featuretools as ft

In featuretools, an object called EntitySet is the starting point for processing. It lets you handle multiple data frames together. That said, the current sample uses only one data frame, so it doesn't buy us much here.

>>> es = ft.EntitySet(id='example')

Add the data frame to the EntitySet. In featuretools terms, this means adding an Entity to the EntitySet.

>>> es = es.entity_from_dataframe(entity_id='locations',
...                               dataframe=df,
...                               index='name',  # use the name as a stand-in index for convenience
...                               )

The Entity is now registered in the EntitySet.

>>> es
Entityset: example
  Entities:
    locations [Rows: 3, Columns: 4]
  Relationships:
    No relationships

Each Entity can be accessed dictionary-style.

>>> es['locations']
Entity: locations
  Variables:
    name (dtype: index)
    x (dtype: numeric)
    y (dtype: numeric)
    z (dtype: numeric)
  Shape:
    (Rows: 3, Columns: 4)

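Notice the unfamiliar values such as index and numeric shown next to dtype above. Details come later, but featuretools classifies column types more finely than pandas does. This is used to decide which operations are appropriate to apply to each column.

The data frame stored inside can also be accessed as follows.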

>>> es['locations'].df
  name  x  y  z
a    a  1  2  3
b    b  2  4  6
c    c  3  6  9

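Creating features

Now that everything is ready, let's actually create some features. Features are generated with an API called featuretools.dfs(); dfs stands for Deep Feature Synthesis. You pass featuretools.dfs() the EntitySet, the Entity to start from, and the operations to apply. Below, starting from es['locations'], we apply the add_numeric and subtract_numeric operations.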

>>> feature_matrix, feature_defs = ft.dfs(entityset=es,
...                                       target_entity='locations',
...                                       trans_primitives=['add_numeric', 'subtract_numeric'],
...                                       agg_primitives=[],
...                                       max_depth=1,
...                                       )

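Let's check the generated features. The original data frame had 4 columns (name plus x, y, z), and the result now has 9 feature columns. As the column names and values show, the additional ones were made by adding or subtracting pairs of columns.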

>>> feature_matrix
      x  y  z  x + y  y + z  x + z  x - y  y - z  x - z
name                                                   
a     1  2  3      3      5      4     -1     -1     -2
b     2  4  6      6     10      8     -2     -2     -4
c     3  6  9      9     15     12     -3     -3     -6
>>> feature_matrix.shape
(3, 9)
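
The count of 9 can be checked by hand. The following is a minimal plain-Python sketch, not featuretools's actual implementation, that enumerates every pair of the three numeric columns and adds one sum and one difference per pair, using the values of row a. The ordering of the generated names is an artifact of this sketch and may differ from what featuretools prints.

```python
from itertools import combinations

# Values of row 'a' from the sample data frame.
row = {'x': 1, 'y': 2, 'z': 3}

# Start from the original numeric columns ...
features = dict(row)
# ... then add one sum and one difference for every unordered pair.
for left, right in combinations(row, 2):
    features[f'{left} + {right}'] = row[left] + row[right]
    features[f'{left} - {right}'] = row[left] - row[right]

for name, value in features.items():
    print(name, value)
print(len(features))  # 3 originals + 2 features for each of the 3 pairs
```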

The other return value holds information about the feature definitions.

>>> feature_defs
[<Feature: x>, <Feature: y>, <Feature: z>, <Feature: x + y>, <Feature: y + z>, <Feature: x + z>, <Feature: x - y>, <Feature: y - z>, <Feature: x - z>]

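Creating further combined features

Next, let's set the max_depth option, which was 1 before, to 2. This represents the depth of DFS; in short, the same operations are applied again to the features created in the previous round.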

>>> feature_matrix, feature_defs = ft.dfs(entityset=es,
...                                       target_entity='locations',
...                                       trans_primitives=['add_numeric', 'subtract_numeric'],
...                                       agg_primitives=[],
...                                       max_depth=2,
...                                       )

Checking the generated features, the count has grown to 21. Looking at the contents, new features are built by further combining the features created in the first round.

>>> feature_matrix
      x  y  z  x + y  y + z  x + z  ...  x + y - y  x + z - y + z  x + y - z  y - y + z  x - x + y  x - y + z
name                                ...                                                                      
a     1  2  3      3      5      4  ...          1             -1          0         -3         -2         -4
b     2  4  6      6     10      8  ...          2             -2          0         -6         -4         -8
c     3  6  9      9     15     12  ...          3             -3          0         -9         -6        -12

[3 rows x 21 columns]

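About the operations used to build features (Primitives)

In the DFS runs above we specified two kinds of operations, add_numeric and subtract_numeric. In featuretools, such feature-building operations are called Primitives.

Primitives are broadly divided into Transform and Aggregation. As the name suggests, a Transform is an operation such as adding or subtracting that keeps the original shape. An Aggregation, by contrast, is based on summarizing rows, like a GroupBy on some column.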
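
To make the distinction concrete, here is a small plain-Python sketch with made-up rows (the keys and values are hypothetical and not part of this article's data). A transform returns one value per input row, while an aggregation returns one value per group.

```python
from collections import defaultdict

# Hypothetical rows: (group key, value).
rows = [('a', 1), ('b', 2), ('a', 4)]

# Transform: computed row by row, so the output has as many entries as the input.
negated = [-value for _, value in rows]

# Aggregation: rows are grouped by key first, then each group is reduced to one value.
totals = defaultdict(int)
for key, value in rows:
    totals[key] += value

print(negated)       # one entry per row
print(dict(totals))  # one entry per group
```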

The list of Primitives available by default can be obtained as follows.

>>> primitives = ft.list_primitives()
>>> primitives.head()
               name         type                                        description
0  time_since_first  aggregation  Calculates the time elapsed since the first da...
1          num_true  aggregation                Counts the number of `True` values.
2               all  aggregation     Calculates if all values are 'True' in a list.
3              last  aggregation               Determines the last value in a list.
4               std  aggregation  Computes the dispersion relative to the mean v...

As shown below, we can confirm that Primitives are divided into Aggregation and Transform.

>>> primitives.type.unique()
array(['aggregation', 'transform'], dtype=object)

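Trying it with multiple data frames

Next, let's try a case made up of multiple data frames. It helps to imagine being given data with a table design that you would JOIN in SQL.

Prepare sample data that looks like it would be joined on the item_id column, as follows. Think of it as product master data plus sales transaction data.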

>>> data = {
...     'item_id': [1, 2, 3],
...     'name': ['apple', 'banana', 'cherry'],
...     'price': [100, 200, 300],
... }
>>> item_df = pd.DataFrame(data)
>>> 
>>> from datetime import datetime
>>> data = {
...     'transaction_id': [10, 20, 30, 40],
...     'time': [
...         datetime(2016, 1, 2, 3, 4, 5),
...         datetime(2017, 2, 3, 4, 5, 6),
...         datetime(2018, 3, 4, 5, 6, 7),
...         datetime(2019, 4, 5, 6, 7, 8),
...     ],
...     'item_id': [1, 2, 3, 1],
...     'amount': [1, 2, 3, 4],
... }
>>> tx_df = pd.DataFrame(data)

Register these in a newly created EntitySet.

>>> es = ft.EntitySet(id='example')
>>> es = es.entity_from_dataframe(entity_id='items',
...                               dataframe=item_df,
...                               index='item_id',
...                               )
>>> es = es.entity_from_dataframe(entity_id='transactions',
...                               dataframe=tx_df,
...                               index='transaction_id',
...                               time_index='time',
...                               )

The Entities have been registered as follows.

>>> es
Entityset: example
  Entities:
    items [Rows: 3, Columns: 3]
    transactions [Rows: 4, Columns: 4]
  Relationships:
    No relationships

Next, tell featuretools how to join the Entities by defining a Relationship between them.

>>> relationship = ft.Relationship(es['items']['item_id'], es['transactions']['item_id'])
>>> es = es.add_relationship(relationship)

The Relationship is now registered in the EntitySet.

>>> es
Entityset: example
  Entities:
    items [Rows: 3, Columns: 3]
    transactions [Rows: 4, Columns: 4]
  Relationships:
    transactions.item_id -> items.item_id
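
Conceptually, the Relationship says that each row of transactions points to exactly one row of items via item_id. A rough plain-Python sketch of that lookup, using the same sample values, looks like this. It only illustrates the parent/child link and is not how featuretools stores it internally.

```python
# Parent table: one row per item, keyed by item_id.
items = {
    1: {'name': 'apple', 'price': 100},
    2: {'name': 'banana', 'price': 200},
    3: {'name': 'cherry', 'price': 300},
}
# Child table: each transaction refers to a parent through item_id.
transactions = [
    {'transaction_id': 10, 'item_id': 1, 'amount': 1},
    {'transaction_id': 20, 'item_id': 2, 'amount': 2},
    {'transaction_id': 30, 'item_id': 3, 'amount': 3},
    {'transaction_id': 40, 'item_id': 1, 'amount': 4},
]

# Following the link from child to parent is the equivalent of a JOIN on item_id.
joined = [{**tx, **items[tx['item_id']]} for tx in transactions]
for row in joined:
    print(row['transaction_id'], row['name'], row['amount'])
```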

Creating Aggregation features

Now let's run DFS in this state. This time we specify the Aggregation Primitives count, sum, and mean. Note that Aggregation does not work unless the Entity has a Relationship.

>>> feature_matrix, feature_defs = ft.dfs(entityset=es,
...                                       target_entity='items',
...                                       trans_primitives=[],
...                                       agg_primitives=['count', 'sum', 'mean'],
...                                       max_depth=1,
...                                       )

Checking the generated features, we can see they summarize the transactions per item.

>>> feature_matrix
           name  price  COUNT(transactions)  SUM(transactions.amount)  MEAN(transactions.amount)
item_id                                                                                         
1         apple    100                    2                         5                        2.5
2        banana    200                    1                         2                        2.0
3        cherry    300                    1                         3                        3.0
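
For reference, the three aggregate columns can be reproduced by hand. This is a minimal plain-Python sketch of the same count, sum, and mean per item_id over the sample transactions; it mirrors the result, not featuretools's internals.

```python
from collections import defaultdict
from statistics import mean

# (item_id, amount) pairs taken from the transactions sample data.
transactions = [(1, 1), (2, 2), (3, 3), (1, 4)]

# Group the amounts by the item they belong to.
amounts_by_item = defaultdict(list)
for item_id, amount in transactions:
    amounts_by_item[item_id].append(amount)

# Reduce each group to the three aggregates.
summary = {
    item_id: {
        'COUNT': len(amounts),
        'SUM': sum(amounts),
        'MEAN': mean(amounts),
    }
    for item_id, amounts in amounts_by_item.items()
}
for item_id, stats in sorted(summary.items()):
    print(item_id, stats)
```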

Combining Aggregation and Transform

Next, let's specify both Aggregation and Transform.

>>> feature_matrix, feature_defs = ft.dfs(entityset=es,
...                                       target_entity='items',
...                                       trans_primitives=['add_numeric', 'subtract_numeric'],
...                                       agg_primitives=['count', 'sum', 'mean'],
...                                       max_depth=1,
...                                       )

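However, the result is the same as before. The reason is that max_depth is set to 1: the items Entity has only one numeric column (price), so in the first round there is nothing for the Transforms to be applied to.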

>>> feature_matrix
           name  price  COUNT(transactions)  SUM(transactions.amount)  MEAN(transactions.amount)
item_id                                                                                         
1         apple    100                    2                         5                        2.5
2        banana    200                    1                         2                        2.0
3        cherry    300                    1                         3                        3.0

Let's try increasing max_depth to 2.

>>> feature_matrix, feature_defs = ft.dfs(entityset=es,
...                                       target_entity='items',
...                                       trans_primitives=['add_numeric', 'subtract_numeric'],
...                                       agg_primitives=['count', 'sum', 'mean'],
...                                       max_depth=2,
...                                       )

This time, we can see that the Transform operations are additionally applied to the features created by Aggregation.

>>> feature_matrix
           name  price  ...  COUNT(transactions) - MEAN(transactions.amount)  price - SUM(transactions.amount)
item_id                 ...                                                                                   
1         apple    100  ...                                             -0.5                                95
2        banana    200  ...                                             -1.0                               198
3        cherry    300  ...                                             -2.0                               297

[3 rows x 17 columns]

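Some columns are truncated in the output above, so here are the definitions. The 17 features are the 5 from before plus a sum and a difference for each of the 6 pairs among the 4 numeric features (price, COUNT, SUM, MEAN).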

>>> from pprint import pprint
>>> pprint(feature_defs)
[<Feature: name>,
 <Feature: price>,
 <Feature: COUNT(transactions)>,
 <Feature: SUM(transactions.amount)>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: COUNT(transactions) + price>,
 <Feature: COUNT(transactions) + SUM(transactions.amount)>,
 <Feature: MEAN(transactions.amount) + price>,
 <Feature: MEAN(transactions.amount) + SUM(transactions.amount)>,
 <Feature: COUNT(transactions) + MEAN(transactions.amount)>,
 <Feature: price + SUM(transactions.amount)>,
 <Feature: COUNT(transactions) - price>,
 <Feature: COUNT(transactions) - SUM(transactions.amount)>,
 <Feature: MEAN(transactions.amount) - price>,
 <Feature: MEAN(transactions.amount) - SUM(transactions.amount)>,
 <Feature: COUNT(transactions) - MEAN(transactions.amount)>,
 <Feature: price - SUM(transactions.amount)>]

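Aggregating within a single data frame

From the examples so far, you might get the impression that Aggregation features cannot be used when you are given only a single data frame. That is not the case. As a test, prepare the following data frame.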

>>> data = {
...     'item_id': [1, 2, 3, 4, 5],
...     'name': ['apple', 'broccoli', 'cabbage', 'dorian', 'eggplant'],
...     'category': ['fruit', 'vegetable', 'vegetable', 'fruit', 'vegetable'],
...     'price': [100, 200, 300, 4000, 500],
... }
>>> item_df = pd.DataFrame(data)

Create an EntitySet from it.

>>> es = ft.EntitySet(id='example')
>>> es = es.entity_from_dataframe(entity_id='items',
...                               dataframe=item_df,
...                               index='item_id',
...                               )

Once that is done, use EntitySet#normalize_entity() to create a new Entity.

>>> es = es.normalize_entity(base_entity_id='items',
...                          new_entity_id='category',
...                          index='category',
...                          )

The EntitySet ends up in the following state.

>>> es
Entityset: example
  Entities:
    items [Rows: 5, Columns: 4]
    category [Rows: 2, Columns: 1]
  Relationships:
    items.category -> category.category
>>> es['category']
Entity: category
  Variables:
    category (dtype: index)
  Shape:
    (Rows: 2, Columns: 1)
>>> es['category'].df
            category
fruit          fruit
vegetable  vegetable

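An Entity consisting only of the values that appear in the category column has been created, with a Relationship attached. In SQL terms, it is like creating a separate table for a foreign key constraint, apart from the master table.

Let's apply Aggregation to this.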

>>> feature_matrix, feature_defs = ft.dfs(entityset=es,
...                                       target_entity='items',
...                                       trans_primitives=[],
...                                       agg_primitives=['count', 'sum', 'mean'],
...                                       max_depth=2,
...                                       )

We can see that features aggregated per value of the category column have been created.

>>> feature_matrix
             name   category  price  category.COUNT(items)  category.SUM(items.price)  category.MEAN(items.price)
item_id                                                                                                          
1           apple      fruit    100                      2                       4100                 2050.000000
2        broccoli  vegetable    200                      3                       1000                  333.333333
3         cabbage  vegetable    300                      3                       1000                  333.333333
4          dorian      fruit   4000                      2                       4100                 2050.000000
5        eggplant  vegetable    500                      3                       1000                  333.333333
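
The same numbers can be worked out in three steps: group by category, aggregate once per category, then copy the aggregates back onto each item row. The sketch below does this in plain Python with the sample values. It is only meant to show what the category.COUNT, category.SUM, and category.MEAN columns contain, under the assumption that the result above is a per-category aggregate broadcast to every row.

```python
from collections import defaultdict
from statistics import mean

# (name, category, price) rows from the items sample data.
items = [
    ('apple', 'fruit', 100),
    ('broccoli', 'vegetable', 200),
    ('cabbage', 'vegetable', 300),
    ('dorian', 'fruit', 4000),
    ('eggplant', 'vegetable', 500),
]

# Step 1: group prices by category (the role played by the new category Entity).
prices_by_category = defaultdict(list)
for _, category, price in items:
    prices_by_category[category].append(price)

# Step 2: aggregate once per category.
stats = {
    category: (len(prices), sum(prices), mean(prices))
    for category, prices in prices_by_category.items()
}

# Step 3: copy the per-category aggregates back onto every item row.
for name, category, price in items:
    count, total, average = stats[category]
    print(name, category, price, count, total, round(average, 6))
```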

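About the data types featuretools handles

As mentioned earlier, featuretools handles data types at a finer granularity than pandas in order to select which operations to apply. If a column does not have the appropriate type, unintended operations get applied, which causes wasted computation or leakage. In the current version, the type definitions live in the following module.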

github.com

From a quick look, they are as follows.

Type	Parent	Description
Variable	-	Base of all types
Unknown	Variable	Unknown
Discrete	Variable	Base for nominal and ordinal scales
Boolean	Variable	Boolean value
Categorical	Discrete	Unordered categorical variable
Id	Categorical	Identifier
Ordinal	Discrete	Ordered categorical variable
Numeric	Variable	Numeric value
Index	Variable	Index
Datetime	Variable	Date and time
TimeIndex	Variable	Time index
NumericTimeIndex	TimeIndex, Numeric	Time index in numeric form
DatetimeTimeIndex	TimeIndex, Datetime	Time index in datetime form
Timedelta	Variable	Time difference
Text	Variable	String
LatLong	Variable	Coordinates (latitude and longitude)
ZIPCode	Categorical	Postal code
IPAddress	Variable	IP address
FullName	Variable	Name
EmailAddress	Variable	Email address
URL	Variable	URL
PhoneNumber	Variable	Phone number
DateOfBirth	Datetime	Date of birth
CountryCode	Categorical	Country code
SubRegionCode	Categorical	Region code
FilePath	Variable	File path

About the built-in Primitives

Next, partly as a learning exercise, I took a rough look at the Primitives built into featuretools by default. The ones included in the current version can be checked below.

github.com

Note that some of them require specific parameters in order to work.

Transform

Transform first.

Name	Input type	Output type	Description
is_null	Variable	Boolean	Whether the value is null (pandas.isnull)
absolute	Numeric	Numeric	Absolute value (np.absolute)
time_since_previous	DatetimeTimeIndex	Numeric	Time elapsed since the previous entry
time_since	DatetimeTimeIndex, Datetime	Numeric	Time difference from a given time
year	Datetime	Ordinal	Year
month	Datetime	Ordinal	Month
day	Datetime	Ordinal	Day
hour	Datetime	Ordinal	Hour
minute	Datetime	Numeric	Minute
second	Datetime	Numeric	Second
week	Datetime	Ordinal	Week
is_weekend	Datetime	Boolean	Whether the date falls on a weekend
weekday	Datetime	Ordinal	Day of the week (Mon: 0 to Sun: 6)
num_characters	Text	Numeric	Number of characters
num_words	Text	Numeric	Number of words
diff	Numeric	Numeric	Difference between consecutive values
negate	Numeric	Numeric	Multiply by -1
percentile	Numeric	Numeric	Convert to percentile
latitude	LatLong	Numeric	Latitude
longitude	LatLong	Numeric	Longitude
haversine	(LatLong, LatLong)	Numeric	Distance between two points
not	Boolean	Boolean	Negation
isin	Variable	Boolean	Whether the value is in a given list
greater_than	(Numeric, Numeric), (Datetime, Datetime), (Ordinal, Ordinal)	Boolean	Whether greater (np.greater)
greater_than_equal_to	(Numeric, Numeric), (Datetime, Datetime), (Ordinal, Ordinal)	Boolean	Whether greater or equal (np.greater_equal)
less_than	(Numeric, Numeric), (Datetime, Datetime), (Ordinal, Ordinal)	Boolean	Whether less (np.less)
less_than_equal_to	(Numeric, Numeric), (Datetime, Datetime), (Ordinal, Ordinal)	Boolean	Whether less or equal (np.less_equal)
greater_than_scalar	Numeric, Datetime, Ordinal	Boolean	Whether greater than a given value
greater_than_equal_to_scalar	Numeric, Datetime, Ordinal	Boolean	Whether greater than or equal to a given value
less_than_scalar	Numeric, Datetime, Ordinal	Boolean	Whether less than a given value
less_than_equal_to_scalar	Numeric, Datetime, Ordinal	Boolean	Whether less than or equal to a given value
equal	(Variable, Variable)	Boolean	Whether equal (np.equal)
not_equal	(Variable, Variable)	Boolean	Whether not equal (np.not_equal)
equal_scalar	Variable	Boolean	Whether equal to a given value
not_equal_scalar	Variable	Boolean	Whether not equal to a given value
add_numeric	(Numeric, Numeric)	Numeric	Addition (np.add)
subtract_numeric	(Numeric, Numeric)	Numeric	Subtraction (np.subtract)
multiply_numeric	(Numeric, Numeric)	Numeric	Multiplication (np.multiply)
divide_numeric	(Numeric, Numeric)	Numeric	Division (np.divide)
modulo_numeric	(Numeric, Numeric)	Numeric	Modulo (np.mod)
add_numeric_scalar	Numeric	Numeric	Add a given value
subtract_numeric_scalar	Numeric	Numeric	Subtract a given value
scalar_subtract_numeric_feature	Numeric	Numeric	Subtract from a given value
multiply_numeric_scalar	Numeric	Numeric	Multiply by a given value
divide_numeric_scalar	Numeric	Numeric	Divide by a given value
divide_by_feature	Numeric	Numeric	Divide a given value by the feature
modulo_numeric_scalar	Numeric	Numeric	Remainder after dividing by a given value
modulo_by_feature	Numeric	Numeric	Remainder after dividing a given value by the feature
multiply_boolean	(Boolean, Boolean)	Boolean	Bitwise AND (np.bitwise_and)
and	(Boolean, Boolean)	Boolean	Logical AND (np.logical_and)
or	(Boolean, Boolean)	Boolean	Logical OR (np.logical_or)

Aggregation

Next, Aggregation.

Name	Input type	Output type	Description
count	Index	Numeric	Number of elements
num_unique	Discrete	Numeric	Number of unique elements
sum	Numeric	Numeric	Sum
mean	Numeric	Numeric	Mean
std	Numeric	Numeric	Standard deviation
median	Numeric	Numeric	Median
mode	Discrete	-	Most frequent value
min	Numeric	Numeric	Minimum
max	Numeric	Numeric	Maximum
first	Variable	-	First element
last	Variable	-	Last element
skew	Numeric	Numeric	Skewness
num_true	Boolean	Numeric	Number of true elements
percent_true	Boolean	Numeric	Proportion of true elements
n_most_common	Discrete	Discrete	Top n most frequent elements
avg_time_between	DatetimeTimeIndex	Numeric	Average interval
any	Boolean	Boolean	Whether any value is true
all	Boolean	Boolean	Whether all values are true
time_since_last	DatetimeTimeIndex	Numeric	Time difference from the last element
time_since_first	DatetimeTimeIndex	Numeric	Time difference from the first element
trend	(Numeric, DatetimeTimeIndex)	Numeric	Slope of a linear regression fit
entropy	Categorical	Numeric	Entropy

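That's all. With many kinds of operations or a deep max_depth, the computation may not fit within realistic time and space budgets depending on the data, so be careful. Personally, to save memory I would like the generated features to be handed out incrementally through something like a generator, but from skimming the code there doesn't seem to be such an API. It is built to join data frames one after another and hand over the final result all at once. I suppose that can't be helped given that the computation is recursive. Hmm.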
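
To get a feel for how quickly this blows up, here is a back-of-the-envelope sketch. It assumes the simplest case from the first example: one round of add_numeric and subtract_numeric over n numeric columns, which keeps the n originals and adds a sum and a difference per pair, for n + 2 * C(n, 2) = n * n columns. The helper names and the dense 8-bytes-per-value assumption are mine, not featuretools's, and real memory use will differ.

```python
def columns_after_one_round(n_numeric):
    """Originals plus one sum and one difference per unordered pair."""
    n_pairs = n_numeric * (n_numeric - 1) // 2
    return n_numeric + 2 * n_pairs


def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense matrix, assuming 8-byte values by default."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Hypothetical data set of one million rows with 3, 10, or 100 numeric columns.
for n in (3, 10, 100):
    cols = columns_after_one_round(n)
    print(n, cols, round(dense_size_gib(1_000_000, cols), 2))
```

Even a single round grows quadratically in the number of numeric columns, and each extra level of max_depth compounds on top of that.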