Attentional Biased Stochastic Gradient for Imbalanced Classification

Qi Qi; Yi Xu; Rong Jin; Wotao Yin; Tianbao Yang

In this paper, we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD in which an attentional mechanism assigns an individual importance weight to each gradient in the mini-batch. Unlike many existing heuristic-driven methods for tackling data imbalance, our method is grounded in theoretically justified distributionally robust optimization (DRO) and is guaranteed to converge to a stationary point of an information-regularized DRO problem. The individual-level weight of a sampled data point is proportional to the exponential of its scaled loss value, where the scaling factor is interpreted as the regularization parameter in the framework of information-regularized DRO. Compared with existing class-level weighting schemes, our method can capture the diversity between individual examples within each class. Compared with existing individual-level weighting methods based on meta-learning, which require three backward propagations to compute each mini-batch stochastic gradient, our method is more efficient: it needs only one backward propagation per iteration, as in standard deep learning methods. To balance the learning of the feature extraction layers against the learning of the classifier layer, we employ a two-stage method that uses SGD for pretraining, followed by ABSGD for learning a robust classifier and finetuning the lower layers. Our empirical studies on several benchmark datasets demonstrate the effectiveness of the proposed method.
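The weighting rule described in the abstract can be illustrated with a short sketch. The abstract states only that each weight is proportional to the exponential of the scaled loss; the function names `attentional_weights` and `weighted_gradient`, the parameter name `lam` for the scaling factor, and the choice to normalize the weights so they sum to 1 within the mini-batch are all assumptions made for illustration, not the authors' implementation. Momentum and the two-stage training schedule are left out.

```python
import math


def attentional_weights(losses, lam):
    """Weights proportional to exp(loss / lam) for one mini-batch.

    Assumption: weights are normalized to sum to 1 inside the batch.
    The abstract only states the proportionality, not the normalizer.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    scaled = [loss / lam for loss in losses]
    shift = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gradient(per_example_grads, weights):
    """Combine per-example gradients (lists of floats) using the weights."""
    dim = len(per_example_grads[0])
    return [
        sum(w * g[j] for w, g in zip(weights, per_example_grads))
        for j in range(dim)
    ]


# Hypothetical mini-batch: the third example has a much larger loss,
# as a hard or minority-class example might.
losses = [0.1, 0.2, 2.5]
weights = attentional_weights(losses, lam=1.0)

# Hypothetical per-example gradients for a two-parameter model.
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
update_direction = weighted_gradient(grads, weights)
```

In this sketch a small `lam` concentrates the weight on the highest-loss examples, while a very large `lam` makes the weights nearly uniform, which recovers the ordinary mini-batch average.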

updated: Sun Oct 10 2021 06:45:56 GMT+0000 (UTC)

published: Sun Dec 13 2020 03:41:52 GMT+0000 (UTC)

arXiv
