QKVA grid: Attention in Image Perspective and Stacked DETR

Wenyuan Sheng

We present a new model named Stacked-DETR (SDETR), which inherits the main ideas of the canonical DETR. We improve DETR in two directions: reducing the cost of training, and introducing a stacked architecture to enhance performance. For the former, we look inside the Attention block and propose the QKVA grid, a new perspective for describing the attention process. With it, we can examine more closely how Attention works on image problems and what effect multi-head attention has. These two ideas contribute to the design of a single-head encoder layer. For the latter, SDETR improves on DETR by +1.1 AP and +3.4 APs. On small objects in particular, which were a shortcoming of DETR, SDETR achieves better results than the optimized Faster R-CNN baseline. Our changes are based on the DETR code. Training code and pretrained models are available at https://github.com/shengwenyuan/sdetr.
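For readers unfamiliar with the attention block the abstract refers to, the following is a minimal sketch of standard single-head scaled dot-product self-attention over a flattened feature map. It is not taken from the SDETR code: the function names, the identity projection matrices, and the toy input are illustrative assumptions. The abstract does not define the QKVA grid, so reading Q, K, V, and A as the query, key, value, and attention matrices is also an assumption.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(x, w_q, w_k, w_v):
    """Standard single-head self-attention (illustrative, not the SDETR layer).

    x   : N tokens (flattened image positions), each with C channels
    w_* : C x D projection matrices for queries, keys, and values
    Returns the attended output (N x D) and the attention matrix A (N x N).
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    a = [softmax([s / scale for s in row]) for row in scores]
    return matmul(a, v), a


# Hypothetical toy input: a 2x2 feature map flattened to 4 tokens of 2 channels.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed projections, chosen for readability
out, attn = single_head_attention(tokens, identity, identity, identity)
```

Each row of `attn` is a probability distribution over the four image positions, so row `i` shows how strongly position `i` attends to every other position. A multi-head layer would run several such heads on lower-dimensional projections and concatenate their outputs.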

updated: Sat Jul 09 2022 18:05:43 GMT+0000 (UTC)

published: Sat Jul 09 2022 18:05:43 GMT+0000 (UTC)

arXiv
