Python: The relationship between the amount of training data and the optimal number of iterations in LightGBM

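With XGBoost, it is empirically known that, for the same dataset and parameters, the amount of training data (number of rows) and the optimal number of iterations are linearly related ¹. In this post, I check by experiment whether that rule of thumb also holds for LightGBM, another GBDT (Gradient Boosting Decision Tree) method.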

The environment used is as follows.

$ sw_vers          
ProductName:    macOS
ProductVersion: 11.2.3
BuildVersion:   20D91
$ python -V           
Python 3.9.2
$ pip list | grep -i lightgbm
lightgbm        3.2.0

Preparation

Install the required packages beforehand.

$ pip install lightgbm scikit-learn seaborn

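Experiment

The sample code for the experiment is shown below. It uses a synthetic binary classification dataset generated with sklearn.datasets.make_classification(). From the generated dataset, training data is randomly sampled at a fixed rate, and the behavior of a LightGBM model trained on that sample is examined. To be safe, performance is evaluated with Nested Validation (outer: stratified hold-out, inner: stratified 5-fold cv). Predictions for the outer hold-out are obtained by averaging the predictions of the models trained in the inner folds.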

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from __future__ import annotations

import time

import numpy as np
import pandas as pd
import lightgbm as lgb
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss


def main():
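    # Parameters for generating the synthetic labeled data
    dist_args = {
        # Number of data points
        'n_samples': 100_000,
        # Number of dimensions
        'n_features': 100,
        # Number of informative features among them
        'n_informative': 20,
        # No redundant or repeated features
        'n_redundant': 0,
        'n_repeated': 0,
        # Task difficulty
        'class_sep': 0.65,
        # Binary classification problem
        'n_classes': 2,
        # Random seed used for generation
        'random_state': 42,
        # Do not shuffle the feature order (the leading dimensions are informative)
        'shuffle': False,
    }
    # Generate the labeled data
    x, y = make_classification(**dist_args)
    # Nested Validation (stratified hold-out -> stratified 5 fold cv)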
    train_x, test_x, train_y, test_y = train_test_split(x, y,
                                                        test_size=0.3,
                                                        stratify=y,
                                                        shuffle=True,
                                                        random_state=42,
                                                        )
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Training parameters
    lgb_params = {
        # Task setting
        'objective': 'binary',
        # Metric
        'metric': 'binary_logloss',
        # Random seed
        'seed': 42,
    }

    # Set the random seed
    np.random.seed(42)

    sampled_rows = []
    best_iterations = []
    test_metrics = []
    learning_times = []
    sampling_rates = np.arange(0.1, 1.0 + 1e-2, 0.1)
    for sampling_rate in sampling_rates:
        train_len = len(train_x)
        sampled_len = int(train_len * sampling_rate)
        sampled_rows.append(sampled_len)

        # Sample at random without replacement (ideally this would be stratified as well)
        sampled_indices = np.random.choice(np.arange(train_len),
                                           size=sampled_len,
                                           replace=False)
        sampled_train_x = train_x[sampled_indices]
        sampled_train_y = train_y[sampled_indices]
        train_dataset = lgb.Dataset(sampled_train_x, sampled_train_y)

        # Cross-validation
        start_time = time.time()
        cv_result = lgb.cv(params=lgb_params,
                           train_set=train_dataset,
                           num_boost_round=10_000,
                           early_stopping_rounds=100,
                           verbose_eval=100,
                           folds=folds,
                           return_cvbooster=True,
                           )
        end_time = time.time()
        learning_time = end_time - start_time
        learning_times.append(learning_time)

        cvbooster = cv_result['cvbooster']
        best_iterations.append(cvbooster.best_iteration)

        # Compute the test-data metric with fold averaging
        pred_y_folds = cvbooster.predict(test_x)
        pred_y_avg = np.array(pred_y_folds).mean(axis=0)
        test_metric = log_loss(test_y, pred_y_avg)
        test_metrics.append(test_metric)

    # Raw values
    data = {
        'sampling_rates': sampling_rates,
        'sampled_rows': sampled_rows,
        'best_iterations': best_iterations,
        'learning_times': learning_times,
        'test_metrics': test_metrics,
    }
    df = pd.DataFrame(data)
    print(df)

    # Plot the results
    fig = plt.figure(figsize=(8, 12))
    ax1 = fig.add_subplot(3, 1, 1)
    sns.lineplot(data=df,
                 x='sampling_rates',
                 y='best_iterations',
                 label='best iteration',
                 ax=ax1,
                 )
    ax1.grid()
    ax1.legend()
    ax2 = fig.add_subplot(3, 1, 2)
    sns.lineplot(data=df,
                 x='sampling_rates',
                 y='learning_times',
                 label='learning time (sec)',
                 ax=ax2,
                 )
    ax2.grid()
    ax2.legend()
    ax3 = fig.add_subplot(3, 1, 3)
    sns.lineplot(data=df,
                 x='sampling_rates',
                 y='test_metrics',
                 label='test metric (logloss)',
                 ax=ax3,
                 )
    ax3.grid()
    ax3.legend()

    plt.show()


if __name__ == '__main__':
    main()
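The sampling step in the script carries a comment saying it would ideally be stratified. A minimal sketch of one way to do that follows, assuming a hypothetical helper named stratified_sample_indices (not part of the original script) that draws the same fraction from each class with NumPy. The toy label array stands in for train_y.

```python
import numpy as np


def stratified_sample_indices(y, sampling_rate, rng):
    """Return shuffled row indices that keep each class's share roughly unchanged.

    Hypothetical helper for illustration; not part of the original script.
    """
    y = np.asarray(y)
    picked = []
    for label in np.unique(y):
        # Rows belonging to this class
        class_indices = np.flatnonzero(y == label)
        # Number of rows to keep for this class
        n_keep = int(round(len(class_indices) * sampling_rate))
        # Draw without replacement within the class
        picked.append(rng.choice(class_indices, size=n_keep, replace=False))
    indices = np.concatenate(picked)
    # Shuffle so that rows are not grouped by class
    rng.shuffle(indices)
    return indices


rng = np.random.default_rng(42)
# Toy labels standing in for train_y: 700 negatives and 300 positives
toy_y = np.array([0] * 700 + [1] * 300)
sampled_indices = stratified_sample_indices(toy_y, 0.1, rng)
print(len(sampled_indices), toy_y[sampled_indices].mean())
```

In the script, the returned indices could take the place of the np.random.choice() result. Because the rounding is done per class, the total may differ by a row or so from int(train_len * sampling_rate).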

Now let's run the script. Depending on your compute resources, it will probably take a while.

$ python lgbiter.py 
[LightGBM] [Info] Number of positive: 2764, number of negative: 2836
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003497 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 25500

...

[1500] cv_agg's binary_logloss: 0.118727 + 0.00302704
[1600] cv_agg's binary_logloss: 0.118301 + 0.00281247
[1700] cv_agg's binary_logloss: 0.117938 + 0.00278925
   sampling_rates  sampled_rows  best_iterations  learning_times  test_metrics
0             0.1          7000              342        6.734761      0.189618
1             0.2         14000              634       12.412657      0.157727
2             0.3         21000              849       18.421927      0.134406
3             0.4         28000             1018       22.645187      0.129939
4             0.5         35000             1162       27.784236      0.122941
5             0.6         42000             1327       33.731716      0.115750
6             0.7         49000             1567       42.821615      0.113003
7             0.8         56000             1614       48.171218      0.109459
8             0.9         63000             1650       60.064258      0.107337
9             1.0         70000             1681       63.199017      0.104814

When it finishes, you get a graph like the following.

[Figure: Relationship between the amount of training data and the optimal number of iterations]

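The graph confirms that, in LightGBM as well, the amount of training data and the optimal number of iterations are roughly linearly related: the best iteration rises from 342 at 7,000 rows to 1,681 at 70,000 rows, although in this run the increase slows once the sampling rate exceeds about 0.7. The amount of training data and the training time also look roughly linear. Prediction accuracy, on the other hand, improves only nonlinearly as the training data grows (test logloss goes from 0.190 to 0.105, with diminishing gains), which does not go against intuition either.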
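To put a number on "roughly linear", one option is to fit a straight line to the values in the result table and look at the coefficient of determination. The sketch below does this with plain least squares using only the standard library. The helper fit_line is a name I made up for illustration, and the two lists are copied from the table above.

```python
# Values copied from the result table of the run above
sampled_rows = [7000, 14000, 21000, 28000, 35000,
                42000, 49000, 56000, 63000, 70000]
best_iterations = [342, 634, 849, 1018, 1162,
                   1327, 1567, 1614, 1650, 1681]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R^2.

    Hypothetical helper for illustration.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared correlation
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_line(sampled_rows, best_iterations)
print(f'slope={slope:.4f} intercept={intercept:.1f} R^2={r_squared:.3f}')
```

For this run the fit comes out at a slope of about 0.021 iterations per row with R^2 of about 0.95. That is consistent with "roughly linear", while the nonzero intercept and the flattening at the high end suggest the relationship is not a clean proportionality.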

That's all.