HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Yongming Rao; Wenliang Zhao; Yansong Tang; Jie Zhou; Ser-Nam Lim; Jiwen Lu


Recent progress in vision Transformers has brought great success in various tasks, driven by a new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (g^nConv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the second-order interactions in self-attention to arbitrary orders without introducing significant extra computation. g^nConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that g^nConv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that g^nConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet
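
The idea of building higher-order interactions by repeating a gated convolution can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that); it is a single-channel, 1D, pure-Python simplification. The function names `depthwise_conv1d` and `recursive_gated_conv1d`, the scalar `proj_weights` standing in for a learned input projection, and the shared `kernel` are all assumptions made for illustration.

```python
def depthwise_conv1d(signal, kernel):
    """Same-length 1D cross-correlation with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def recursive_gated_conv1d(x, order, kernel, proj_weights):
    """Toy single-channel recursive gating (illustrative assumption only).

    proj_weights holds order + 1 scalars that stand in for an input
    projection: the first yields the initial feature p, the remaining
    ones yield one gate q per recursion step.
    """
    if len(proj_weights) != order + 1:
        raise ValueError("need order + 1 projection weights")
    p = [proj_weights[0] * v for v in x]
    for k in range(order):
        q = [proj_weights[k + 1] * v for v in x]
        mixed = depthwise_conv1d(q, kernel)    # spatial mixing of the gate
        p = [a * b for a, b in zip(p, mixed)]  # element-wise gating
    return p


if __name__ == "__main__":
    ones = [1.0, 1.0, 1.0]
    y = recursive_gated_conv1d([1.0, 2.0, 3.0], 2, ones, ones)
    y_scaled = recursive_gated_conv1d([2.0, 4.0, 6.0], 2, ones, ones)
    print(y)
    print(y_scaled)
```

In this toy, every gating step multiplies the running feature by a quantity that is linear in the input, so after `order` steps each output value is a polynomial of degree `order + 1` in the input values. Doubling the input with `order = 2` therefore scales the output by a factor of 8, which is one simple way to see the interaction order grow with the number of recursive steps.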

updated: Tue Oct 11 2022 08:02:10 GMT+0000 (UTC)

published: Thu Jul 28 2022 17:59:02 GMT+0000 (UTC)

arXiv
