UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Kunchang Li; Yali Wang; Junhao Zhang; Peng Gao; Guanglu Song; Yu Liu; Hongsheng Li; Yu Qiao

Learning discriminative representations from images and videos is challenging because visual data combine large local redundancy with complex global dependency. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. CNNs can efficiently reduce local redundancy by convolving within a small neighborhood, but their limited receptive field makes it hard to capture global dependency. ViTs, in contrast, can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Unlike typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local token affinity in shallow layers and global token affinity in deep layers, allowing the model to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new powerful backbone and adopt it for various vision tasks, from the image to the video domain and from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it achieves state-of-the-art performance in a broad range of downstream tasks: 82.9%/84.8% top-1 accuracy on Kinetics-400/600 and 60.9%/71.2% top-1 accuracy on Something-Something V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. Code is available at https://github.com/Sense-X/UniFormer.
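The contrast between local and global token affinity can be illustrated with a small sketch. It is not the authors' implementation (see the repository linked above for that). It assumes, for illustration only, that local affinity is a fixed set of weights over a small window, as in a convolution, and that global affinity is a softmax over scaled dot products among all tokens, as in self-attention without query/key/value projections. The function names `local_aggregate` and `global_aggregate` are hypothetical, and tokens are laid out in 1D to keep the example short.

```python
import math


def local_aggregate(tokens, weights):
    """Local relation aggregation (assumed form): each token mixes only with
    neighbors inside a small window, using content-independent weights,
    much like a depthwise convolution. `weights` must have odd length."""
    n, dim, half = len(tokens), len(tokens[0]), len(weights) // 2
    out = []
    for i in range(n):
        acc = [0.0] * dim
        for offset, w in enumerate(weights):
            j = i + offset - half
            if 0 <= j < n:  # positions outside the sequence count as zeros
                for d in range(dim):
                    acc[d] += w * tokens[j][d]
        out.append(acc)
    return out


def global_aggregate(tokens):
    """Global relation aggregation (assumed form): affinity is
    content-dependent, a softmax over scaled dot products between every
    pair of tokens. Identity projections are used for brevity."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        affinity = [e / total for e in exps]
        out.append([sum(a * v[d] for a, v in zip(affinity, tokens)) for d in range(dim)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shallow = local_aggregate(tokens, [0.25, 0.5, 0.25])  # cost grows with window size
deep = global_aggregate(tokens)                       # cost grows with token count squared
```

In this sketch the local aggregator touches only a fixed number of neighbors per token, while the global aggregator compares every token with every other one, which mirrors the redundancy versus dependency trade-off the abstract describes.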

updated: Mon Jan 24 2022 04:39:39 GMT+0000 (UTC)

published: Mon Jan 24 2022 04:39:39 GMT+0000 (UTC)

arXiv
