Attentional-Biased Stochastic Gradient Descent

Qi Qi; Yi Xu; Rong Jin; Wotao Yin; Tianbao Yang

In this paper, we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD in which an attentional mechanism assigns an individual importance weight to each gradient in the mini-batch. Unlike many existing heuristic-driven methods for tackling data imbalance, our method is grounded in theoretically justified distributionally robust optimization (DRO) and is guaranteed to converge to a stationary point of an information-regularized DRO problem. The individual-level weight of a sampled data point is proportional to the exponential of its scaled loss value, where the scaling factor is interpreted as the regularization parameter in the framework of information-regularized DRO. Compared with existing class-level weighting schemes, our method can capture the diversity between individual examples within each class. Compared with existing individual-level weighting methods based on meta-learning, which require three backward propagations to compute a mini-batch stochastic gradient, our method is more efficient: it needs only one backward propagation per iteration, as in standard deep learning methods. To balance the learning of the feature extraction layers against the learning of the classifier layer, we employ a two-stage method that uses SGD for pretraining, followed by ABSGD for learning a robust classifier and fine-tuning the lower layers. Our empirical studies on several benchmark datasets demonstrate the effectiveness of the proposed method.
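
As a rough illustration of the weighting rule described in the abstract, the following minimal Python sketch computes per-sample weights proportional to the exponential of a scaled loss and uses them to combine per-sample gradients. It is not the authors' implementation. The function names `attentional_weights` and `weighted_gradient`, the parameter name `lam`, the choice to scale the loss as `loss / lam`, and the normalization over the current mini-batch are all assumptions made for this example; the momentum update and the two-stage training schedule are omitted.

```python
import math


def attentional_weights(losses, lam):
    """Return weights proportional to exp(loss / lam), normalized to sum to 1.

    `lam` is a hypothetical name for the scaling factor; dividing the loss by
    it and normalizing within the mini-batch are assumptions of this sketch.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Subtract the largest loss before exponentiating for numerical stability;
    # the normalized weights are unchanged by this shift.
    max_loss = max(losses)
    exps = [math.exp((loss - max_loss) / lam) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gradient(per_sample_grads, weights):
    """Combine per-sample gradients (lists of floats) into one weighted gradient."""
    dim = len(per_sample_grads[0])
    return [
        sum(w * grad[k] for w, grad in zip(weights, per_sample_grads))
        for k in range(dim)
    ]


# Toy mini-batch: two low-loss examples and one high-loss example.
losses = [0.1, 0.1, 2.0]
weights = attentional_weights(losses, lam=1.0)
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(weights)
print(weighted_gradient(grads, weights))
```

In this sketch a small `lam` concentrates the weight on the highest-loss examples, while a very large `lam` approaches uniform weighting, that is, an ordinary mini-batch average. In an automatic-differentiation framework, the same effect could plausibly be obtained with a single backward pass by treating the weights as constants and back-propagating the weighted sum of per-sample losses.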

updated: Sun Dec 25 2022 18:39:33 GMT+0000 (UTC)

published: Sun Dec 13 2020 03:41:52 GMT+0000 (UTC)

arXiv
