Python: What gensim's FAST_VERSION constant means

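Python's gensim ships implementations of many natural language processing (NLP) techniques, and several of its modules define a constant named FAST_VERSION. The value of this constant depends on the environment, and performance can differ greatly depending on which value you get.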

This post looks into what the number actually represents. To state the conclusion first, the values map as follows.

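-1
- The extension module written in Cython is not available
0
- The extension module written in Cython is available
- The BLAS dot-product function (dsdot()) returns a double-precision floating-point value (double)
1
- The extension module written in Cython is available
- The BLAS dot-product function (dsdot()) returns a single-precision floating-point value (float)
2
- The extension module written in Cython is available
- The BLAS dot-product function (dsdot()) cannot be used
  - A Cython loop is used instead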

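In terms of performance, the order is presumably 0 > 1 > 2 > -1. Note that this only covers the values known at the time of writing: the meanings may change in future releases, and a different module may assign different meanings.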
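The correspondence above can be written down as a small lookup helper. This is only a sketch: the names FAST_VERSION_MEANINGS and describe_fast_version are made up for illustration and are not part of gensim. The helper simply takes whatever integer you read from FAST_VERSION.

```python
# Hypothetical helper (not part of gensim): maps a FAST_VERSION value
# to the meaning described in this post.
FAST_VERSION_MEANINGS = {
    -1: "Cython extension not available",
    0: "Cython extension available; BLAS dsdot() returns double",
    1: "Cython extension available; BLAS dsdot() returns float",
    2: "Cython extension available; BLAS dsdot() unusable, Cython loop used",
}


def describe_fast_version(value):
    """Return a human-readable description for a FAST_VERSION value."""
    # Unknown values are reported as such rather than guessed at, since
    # other modules or future releases may define them differently.
    return FAST_VERSION_MEANINGS.get(value, "unknown value: {}".format(value))


if __name__ == "__main__":
    for v in (-1, 0, 1, 2):
        print(v, describe_fast_version(v))
```

With gensim installed, the value to pass in would be gensim.models.word2vec.FAST_VERSION.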

The environment used is as follows.

$ sw_vers               
ProductName:    Mac OS X
ProductVersion: 10.14.6
BuildVersion:   18G4032
$ python -V                                                                       
Python 3.7.7
$ pip list | grep -i gensim                                           
gensim          3.8.3

Preparation

First, install gensim.

$ pip install gensim

About gensim's FAST_VERSION

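For example, the Word2Vec and fastText modules each define a constant named FAST_VERSION. Let's trace the value for the Word2Vec one. First, at the following location, the constant is set to -1 when a particular module cannot be imported.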

gensim/word2vec.py at 3.8.3 · RaRe-Technologies/gensim · GitHub
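The "set to -1 when the import fails" behavior is an ordinary import-fallback pattern. The following is a minimal sketch of that pattern, not gensim's actual code: load_fast_version is a hypothetical helper, importlib stands in for the plain import statement, and the module name in the usage example is deliberately nonexistent.

```python
import importlib


def load_fast_version(module_name):
    """Return the module's FAST_VERSION, or -1 if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The compiled extension is missing, so fall back to the sentinel.
        return -1
    # In this sketch, a module without the constant is treated the same
    # way as a missing extension.
    return getattr(module, "FAST_VERSION", -1)


if __name__ == "__main__":
    # 'no_such_extension_module' is a hypothetical, nonexistent name.
    print(load_fast_version("no_such_extension_module"))  # prints -1
```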

The module it tries to import is implemented as a Cython extension. The constant is initialized by the function at the location below, which also documents what each value means.

gensim/word2vec_inner.pyx at 3.8.3 · RaRe-Technologies/gensim · GitHub

The meanings are as described above: with 0 or 1, the Cython extension can use the BLAS dsdot() function. With 2 it cannot, so the dot product is computed with a Cython loop instead.
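One plausible way to tell the three cases apart is to compute a dot product with a known answer and see whether the returned bytes make sense as a double, as a float, or as neither. The following is a pure-Python simulation under that assumption. It is not gensim's code, and classify_dsdot_result, the expected value 0.1, and the tolerance are all chosen for illustration.

```python
import struct


def classify_dsdot_result(raw, expected=0.1, tol=1e-4):
    """Classify 8 raw bytes standing in for what a dsdot() call returned.

    Returns 0 if the bytes decode to `expected` as a double, 1 if the first
    four bytes decode to it as a float, and 2 otherwise.
    """
    as_double = struct.unpack("<d", raw)[0]
    as_float = struct.unpack("<f", raw[:4])[0]
    if abs(as_double - expected) < tol:
        return 0  # result is a usable double
    if abs(as_float - expected) < tol:
        return 1  # result only makes sense as a float
    return 2  # neither interpretation works: do not rely on BLAS


if __name__ == "__main__":
    print(classify_dsdot_result(struct.pack("<d", 0.1)))
    print(classify_dsdot_result(struct.pack("<f", 0.1) + b"\x00" * 4))
    print(classify_dsdot_result(b"\x00" * 8))
```

The three calls simulate a BLAS that returns a double, one that returns a float, and one that returns garbage.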

Performance when the Cython extension is unavailable (FAST_VERSION == -1)

As a test, I prepared an environment where the Cython extension cannot be used.

$ python -c "import gensim;print(gensim.models.word2vec.FAST_VERSION)" 2>/dev/null
-1
$ python -c "import gensim;print(gensim.models.fasttext.FAST_VERSION)" 2>/dev/null
-1

Let's train Word2Vec on the Text8 corpus. First, download and extract the corpus.

$ wget http://mattmahoney.net/dc/text8.zip
$ unzip text8.zip

Prepare a benchmark module like the following.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import logging

from gensim.models.word2vec import Text8Corpus
from gensim.models import Word2Vec


def main():
    # Log the training progress
    log_format = "%(threadName)s - %(name)s - %(levelname)s - %(message)s"
    logging.basicConfig(format=log_format, level=logging.INFO)

    # Train Word2Vec on the Text8 corpus
    corpus = Text8Corpus('text8')
    Word2Vec(
        corpus,
        sg=1,  # Optimize the skip-gram task
        hs=1,  # Use hierarchical softmax when computing the loss
        negative=5,  # Number of negative samples (noise words) per positive sample
        size=50,  # Number of embedding dimensions
        iter=3,  # Number of training epochs
    )


if __name__ == "__main__":
    main()

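Save the above under a suitable name (benchmark.py here) and run it. There is no need to wait for the run to finish; it is enough to check the throughput printed during training.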

$ python benchmark.py 
unable to import 'smart_open.gcs', disabling that module
MainThread - gensim.models.base_any2vec - WARNING - consider setting layer size to a multiple of 4 for greater performance
/Users/amedama/.virtualenvs/py37/lib/python3.7/site-packages/gensim/models/base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.
  "C extension not loaded, training will be slow. "
MainThread - gensim.models.word2vec - INFO - collecting all words and their counts
MainThread - gensim.models.word2vec - INFO - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
...
MainThread - gensim.models.base_any2vec - INFO - training model with 3 workers on 71290 vocabulary and 50 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.06% examples, 151 words/s, in_qsize 5, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.18% examples, 447 words/s, in_qsize 6, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.24% examples, 309 words/s, in_qsize 5, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.29% examples, 379 words/s, in_qsize 5, out_qsize 0

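Judging from the above, the throughput is somewhere around 300 to 400 words/s. Note also that right after startup a warning says that training will be slow because the C extension is not loaded.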

Performance when the Cython extension is available (FAST_VERSION == 0)

Next is the case where the extension is available, specifically an environment where the BLAS dsdot() function returns a double-precision floating-point value (double).

$ python -c "import gensim;print(gensim.models.word2vec.FAST_VERSION)" 2>/dev/null
0

I think you can get such an environment by, for example, building gensim from source on a machine where a C compiler and OpenBLAS are available. Probably.

$ xcode-select --install
$ brew install openblas
$ pip install -U --no-binary gensim gensim

Let's run the same benchmark module in this environment too.

$ python benchmark.py                                                             
unable to import 'smart_open.gcs', disabling that module
MainThread - gensim.models.base_any2vec - WARNING - consider setting layer size to a multiple of 4 for greater performance
MainThread - gensim.models.word2vec - INFO - collecting all words and their counts
MainThread - gensim.models.word2vec - INFO - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
...
MainThread - gensim.models.base_any2vec - INFO - training model with 3 workers on 71290 vocabulary and 50 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 1.47% examples, 177603 words/s, in_qsize 5, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 3.06% examples, 187492 words/s, in_qsize 5, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 4.64% examples, 189657 words/s, in_qsize 5, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 6.23% examples, 190525 words/s, in_qsize 5, out_qsize 0
...

This environment reaches a throughput of 180k to 190k words/s. Between using and not using the Cython extension, that is a difference of roughly 500x.
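The rough speedup can be checked directly from the log lines. The sketch below pulls the "N words/s" figures out of the progress lines quoted in this post (shortened to the relevant part) and compares their averages. The helper name mean_throughput and the regular expression are my own; averaging only four early samples per run gives nothing more than an order-of-magnitude estimate.

```python
import re

# Matches the throughput figure in gensim's progress log lines.
WORDS_PER_SEC = re.compile(r"(\d+) words/s")


def mean_throughput(log_lines):
    """Average the 'N words/s' figures found in the given log lines."""
    rates = []
    for line in log_lines:
        match = WORDS_PER_SEC.search(line)
        if match:
            rates.append(int(match.group(1)))
    if not rates:
        raise ValueError("no 'words/s' figures found")
    return sum(rates) / len(rates)


# Progress lines from the FAST_VERSION == -1 run.
slow_log = [
    "EPOCH 1 - PROGRESS: at 0.06% examples, 151 words/s, in_qsize 5, out_qsize 0",
    "EPOCH 1 - PROGRESS: at 0.18% examples, 447 words/s, in_qsize 6, out_qsize 0",
    "EPOCH 1 - PROGRESS: at 0.24% examples, 309 words/s, in_qsize 5, out_qsize 0",
    "EPOCH 1 - PROGRESS: at 0.29% examples, 379 words/s, in_qsize 5, out_qsize 0",
]

# Progress lines from the FAST_VERSION == 0 run.
fast_log = [
    "EPOCH 1 - PROGRESS: at 1.47% examples, 177603 words/s, in_qsize 5, out_qsize 0",
    "EPOCH 1 - PROGRESS: at 3.06% examples, 187492 words/s, in_qsize 5, out_qsize 0",
    "EPOCH 1 - PROGRESS: at 4.64% examples, 189657 words/s, in_qsize 5, out_qsize 0",
    "EPOCH 1 - PROGRESS: at 6.23% examples, 190525 words/s, in_qsize 5, out_qsize 0",
]

if __name__ == "__main__":
    ratio = mean_throughput(fast_log) / mean_throughput(slow_log)
    print("speedup: about {:.0f}x".format(ratio))
```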

Bonus: using all logical CPU cores when training Word2Vec or fastText

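By the way, when you train Word2Vec or fastText with the current version, the number of worker threads defaults to 3 (the logs above show "3 workers"). On a machine with many CPU cores, that default becomes the training bottleneck.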

To use all the CPU cores, import the multiprocessing module...

import multiprocessing

...and pass the number of logical cores in the environment as the workers argument of Word2Vec or FastText.

    workers=multiprocessing.cpu_count(),  # Train with one worker thread per logical core
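Put together, the benchmark's keyword arguments with workers added might be built as below. This is a sketch: word2vec_params is a hypothetical helper, and size and iter are the parameter names used by the gensim 3.8.3 benchmark in this post. gensim itself is not imported, so the snippet runs anywhere.

```python
import multiprocessing


def word2vec_params(workers=None):
    """Build the benchmark's Word2Vec keyword arguments, plus `workers`."""
    if workers is None:
        # Number of logical cores reported by the OS.
        workers = multiprocessing.cpu_count()
    return {
        "sg": 1,
        "hs": 1,
        "negative": 5,
        "size": 50,
        "iter": 3,
        "workers": workers,
    }


if __name__ == "__main__":
    # In the benchmark this would be passed as:
    #   Word2Vec(corpus, **word2vec_params())
    print(word2vec_params())
```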

Run the benchmark again in the same way.

$ python benchmark.py
...
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 2.41% examples, 294657 words/s, in_qsize 16, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 5.11% examples, 312823 words/s, in_qsize 15, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 8.00% examples, 323193 words/s, in_qsize 15, out_qsize 0
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 10.58% examples, 321791 words/s, in_qsize 15, out_qsize 1
MainThread - gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 13.35% examples, 324695 words/s, in_qsize 15, out_qsize 0
...

This machine has 8 logical CPU cores, so the throughput went up even further.

And they all lived happily ever after.