Vision Transformer Adapter for Dense Predictions

Zhe Chen; Yuchen Duan; Wenhai Wang; Junjun He; Tong Lu; Jifeng Dai; Yu Qiao

This work investigates a simple yet powerful adapter for the Vision Transformer (ViT). Unlike recent vision transformers that build vision-specific inductive biases into their architectures, the plain ViT performs worse on dense prediction tasks because it lacks image-specific prior information. To address this, we propose the Vision Transformer Adapter (ViT-Adapter), which remedies this weakness of ViT and achieves performance comparable to vision-specific models by introducing inductive biases through an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter introduces prior information about the data and the tasks into the model, making it suitable for these tasks. We verify the effectiveness of ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic segmentation, ViT-Adapter-L establishes a new state of the art of 60.5 mIoU on ADE20K val, 0.6 points higher than SwinV2-G. We hope that the proposed ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
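
The abstract gives the ViT-Adapter-L scores together with the margins over each baseline, but not the baseline scores themselves. As a quick sanity check, the short sketch below recovers the implied baseline numbers by subtraction. The dictionary layout and the helper name `implied_baseline` are illustrative assumptions; the resulting values are derived from the figures above and are not quoted from the Swin or SwinV2 papers.

```python
# Scores and margins exactly as stated in the abstract:
# benchmark -> (ViT-Adapter-L score, stated margin, baseline model name)
reported = {
    "COCO test-dev box AP": (60.1, 1.4, "Swin-L"),
    "COCO test-dev mask AP": (52.1, 1.0, "Swin-L"),
    "ADE20K val mIoU": (60.5, 0.6, "SwinV2-G"),
}


def implied_baseline(score: float, margin: float) -> float:
    """Baseline score implied by 'score is `margin` higher than baseline'.

    Rounded to one decimal place, matching the precision of the abstract.
    """
    return round(score - margin, 1)


# benchmark -> (baseline model name, implied baseline score)
baselines = {
    benchmark: (rival, implied_baseline(score, margin))
    for benchmark, (score, margin, rival) in reported.items()
}

for benchmark, (rival, value) in baselines.items():
    print(f"{benchmark}: {rival} implied at {value}")
```

Under these assumptions, the implied baselines are 58.7 box AP and 51.1 mask AP for Swin-L, and 59.9 mIoU for SwinV2-G.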

updated: Wed May 18 2022 01:27:12 GMT+0000 (UTC)

published: Tue May 17 2022 17:59:11 GMT+0000 (UTC)

arXiv
