Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition

Xinda Liu; Lili Wang; Xiaoguang Han

Fine-grained image recognition is challenging because discriminative clues are usually fragmented, whether they come from a single image or from multiple images. Despite significant improvements, most existing methods still focus on the most discriminative parts of a single image, ignoring informative details in other regions and failing to consider clues from other associated images. In this paper, we analyze the difficulties of fine-grained image recognition from a new perspective and propose a transformer architecture with a peak suppression module and a knowledge guidance module, which respects both the diversification of discriminative features within a single image and the aggregation of discriminative clues across multiple images. Specifically, the peak suppression module first uses a linear projection to convert the input image into sequential tokens. It then blocks tokens based on the attention response generated by the transformer encoder. This module penalizes attention to the most discriminative parts during feature learning, thereby enhancing the exploitation of information in neglected regions. The knowledge guidance module compares the image-based representation generated by the peak suppression module with a learnable knowledge embedding set to obtain knowledge response coefficients. It then formalizes knowledge learning as a classification problem, using the response coefficients as the classification scores. Knowledge embeddings and image-based representations are updated during training so that the knowledge embeddings include discriminative clues from different images. Finally, we incorporate the acquired knowledge embeddings into the image-based representations to form comprehensive representations, leading to significantly higher performance. Extensive evaluations on six popular datasets demonstrate the advantage of the proposed method.
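The two modules described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the `num_blocked` parameter, zeroing out blocked tokens, the dot-product comparison, the softmax weighting, and the additive fusion are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math


def suppress_peak_tokens(tokens, attention, num_blocked=1):
    """Block the tokens that receive the highest attention response.

    tokens: list of token vectors (lists of floats).
    attention: one attention score per token.
    num_blocked: hypothetical hyperparameter, number of peak tokens to block.
    Blocking is modeled here as zeroing the token (an assumption).
    """
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    blocked = set(ranked[:num_blocked])
    return [
        [0.0] * len(tok) if i in blocked else list(tok)
        for i, tok in enumerate(tokens)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def knowledge_response(representation, knowledge_embeddings):
    """Compare a representation with each knowledge embedding.

    A dot product is assumed as the comparison; the resulting
    coefficients play the role of classification scores.
    """
    return [
        sum(r * k for r, k in zip(representation, emb))
        for emb in knowledge_embeddings
    ]


def fuse_knowledge(representation, knowledge_embeddings):
    """Add a response-weighted mix of knowledge embeddings to the representation.

    Softmax weighting and additive fusion are assumptions of this sketch.
    Returns the fused representation and the class probabilities.
    """
    probs = softmax(knowledge_response(representation, knowledge_embeddings))
    dim = len(representation)
    knowledge = [
        sum(p * emb[d] for p, emb in zip(probs, knowledge_embeddings))
        for d in range(dim)
    ]
    return [r + k for r, k in zip(representation, knowledge)], probs


# Toy example: three 2-D tokens; the second has the highest attention.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
attention = [0.1, 0.7, 0.2]
suppressed = suppress_peak_tokens(tokens, attention)

# Mean-pool the remaining tokens into an image-based representation
# (mean pooling is also an assumption of this sketch).
representation = [sum(col) / len(suppressed) for col in zip(*suppressed)]

# Two hypothetical knowledge embeddings; in training they would be learned.
knowledge_embeddings = [[1.0, 0.0], [0.0, 1.0]]
fused, probs = fuse_knowledge(representation, knowledge_embeddings)
```

In this toy run the token with the highest attention is zeroed before pooling, so the representation is built only from the remaining tokens. In a trained model, the knowledge embeddings and the encoder would be updated jointly by a classification loss on the response coefficients, which this sketch does not include.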

updated: Fri Dec 10 2021 06:14:42 GMT+0000 (UTC)

published: Wed Jul 14 2021 08:07:58 GMT+0000 (UTC)

arXiv
