You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

Yuxin Fang; Bencheng Liao; Xinggang Wang; Jiemin Fang; Jiyang Qi; Rui Wu; Jianwei Niu; Wenyu Liu

Can a Transformer perform 2D object-level recognition from a pure sequence-to-sequence perspective with minimal knowledge of the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the naïve Vision Transformer with the fewest possible modifications and inductive biases. We find that YOLOS pre-trained only on the mid-sized ImageNet-1k dataset can already achieve competitive object detection performance on COCO; e.g., YOLOS-Base, directly adopted from BERT-Base, can achieve 42.0 box AP. We also discuss the impacts and limitations of current pre-training schemes and model scaling strategies for Transformers in vision through object detection. Code and model weights are available at https://github.com/hustvl/YOLOS.

updated: Mon Jun 21 2021 02:28:30 GMT+0000 (UTC)

published: Tue Jun 01 2021 17:54:09 GMT+0000 (UTC)

arXiv
