Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions

Victor Ferrari; Rafael Sousa; Marcio Pereira; João P. L. de Carvalho; José Nelson Amaral; José Moreira; Guido Araujo

Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA), a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO), a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP), an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that eliminating the Im2Col transformation and using fast packing routines reduces total packing time, on full model inference, by 2.0x to 3.9x on Intel x86 and by 3.6x to 7.2x on IBM POWER10. The speedup over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 9% to 25% on Intel x86 and 10% to 42% on IBM POWER10. The total convolution speedup for model inference is 12% to 27% on Intel x86 and 26% to 46% on IBM POWER10. SConv also outperforms BLAS GEMM when computing pointwise convolutions in more than 83% of the 219 tested instances.
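To make the contrast between the Im2Col + BLAS method and direct convolution concrete, here is a minimal pure-Python sketch. It is not SConv and applies none of CSA, CSO, or VBP. It assumes a single channel, unit stride, and no padding, and every function name is hypothetical. The Im2Col path first copies each kernel-sized patch into a row of a larger matrix and then performs a matrix-vector product (the step a BLAS GEMM would handle), whereas the direct path reads the input in place.

```python
def im2col(x, kh, kw):
    """Copy every kh-by-kw patch of x into one row (illustrative helper)."""
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [x[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(oh)
        for j in range(ow)
    ]


def conv_im2col(x, k):
    """Convolution as patch-matrix times flattened kernel (GEMM-like step)."""
    kh, kw = len(k), len(k[0])
    ow = len(x[0]) - kw + 1
    flat_k = [v for row in k for v in row]
    patches = im2col(x, kh, kw)  # extra buffer: oh * ow * kh * kw elements
    flat_out = [sum(a * b for a, b in zip(p, flat_k)) for p in patches]
    return [flat_out[r:r + ow] for r in range(0, len(flat_out), ow)]


def conv_direct(x, k):
    """Direct convolution: no intermediate patch matrix is materialized."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += x[i + di][j + dj] * k[di][dj]
            out[i][j] = acc
    return out


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
k = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
assert conv_im2col(x, k) == conv_direct(x, k)
```

In this toy case the 16-element input expands to a 4 x 9 patch matrix of 36 elements. That expansion is the packing work a direct-convolution approach avoids.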

updated: Wed Mar 08 2023 17:23:39 GMT+0000 (UTC)

published: Wed Mar 08 2023 17:23:39 GMT+0000 (UTC)

arXiv
