PaCC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer

Haokui Zhang; Wenze Hu; Xiaoyu Wang

Recently, vision transformers have started to show impressive results, significantly outperforming large convolution-based models. However, for small models targeting mobile or resource-constrained devices, ConvNets still have advantages in both performance and model complexity. We propose PaCC-Net, a pure ConvNet-based backbone model that further strengthens these advantages by fusing the merits of vision transformers into ConvNets. Specifically, we propose position aware circular convolution (PaCC), a lightweight convolution op that has a global receptive field while producing location-sensitive features, as local convolutions do. We combine PaCCs and squeeze-excitation ops to form a meta-former-like model block, which additionally has a transformer-like attention mechanism. This block can be used in a plug-and-play manner to replace relevant blocks in ConvNets or transformers. Experimental results show that the proposed PaCC-Net achieves better performance than popular lightweight ConvNets and vision-transformer-based models on common vision tasks and datasets, while having fewer parameters and faster inference speed. For classification on ImageNet-1k, PaCC-Net achieves 78.6% top-1 accuracy with about 5.0 million parameters. Compared with MobileViT, it saves 11% of the parameters and 13% of the computational cost while gaining 0.2% higher accuracy and 23% faster inference speed (on the ARM-based Rockchip RK3288); compared with DeiT, it uses only 0.5 times the parameters while gaining 2.7% accuracy. On MS-COCO object detection and PASCAL VOC segmentation tasks, PaCC-Net also shows better performance. Source code is available at https://github.com/hkzhang91/PaCC-Net
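The abstract does not spell out how PaCC is computed, so the following is only a toy 1-D sketch of the two properties it names, not the authors' implementation. It assumes that the kernel is as long as the input (so every output sees every input) and that position information enters as an additive embedding before the convolution. The function names and the fixed `pos_embed` values are hypothetical; in a real model the embedding would presumably be learned.

```python
# Illustrative 1-D sketch only; not taken from the PaCC-Net repository.

def circular_conv(x, kernel):
    """Circular convolution with a kernel as long as the input,
    so each output element depends on every input element."""
    n = len(x)
    assert len(kernel) == n, "kernel must span the whole signal"
    return [sum(kernel[k] * x[(i + k) % n] for k in range(n)) for i in range(n)]


def position_aware_circular_conv(x, kernel, pos_embed):
    """Add a position embedding (assumed additive), then convolve circularly."""
    with_position = [xi + pi for xi, pi in zip(x, pos_embed)]
    return circular_conv(with_position, kernel)


def roll(x, s):
    """Circularly shift a list by s positions to the right."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


x = [1.0, 0.0, 0.0, 0.0]
kernel = [0.5, 0.25, 0.125, 0.125]   # hypothetical weights
pos_embed = [0.0, 1.0, 2.0, 3.0]     # hypothetical embedding

# Without position information, shifting the input only shifts the output.
plain = circular_conv(x, kernel)
plain_rolled = circular_conv(roll(x, 1), kernel)
print(plain_rolled == roll(plain, 1))   # True

# With the embedding, the same pattern at a different location
# produces a different response, i.e. the features are location sensitive.
aware = position_aware_circular_conv(x, kernel, pos_embed)
aware_rolled = position_aware_circular_conv(roll(x, 1), kernel, pos_embed)
print(aware_rolled == roll(aware, 1))   # False
```

The first check shows why a plain circular convolution, despite its global receptive field, cannot tell where a pattern occurs; the second shows how adding position information breaks that symmetry.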

updated: Tue Jul 12 2022 06:14:21 GMT+0000 (UTC)

published: Tue Mar 08 2022 09:25:17 GMT+0000 (UTC)

arXiv
