VidConv: A modernized 2D ConvNet for Efficient Video Recognition

Chuong H. Nguyen; Su Huynh; Vinh Nguyen; Ngoc Nguyen

Since their introduction in 2020, Vision Transformers (ViTs) have steadily broken records on many vision tasks and are often described as "all you need" to replace ConvNets. Despite that, ViTs are generally computationally expensive, memory-consuming, and unfriendly to embedded devices. In addition, recent research shows that a standard ConvNet, if redesigned and trained appropriately, can compete favorably with ViTs in terms of accuracy and scalability. In this paper, we adopt the modernized structure of ConvNets to design a new backbone for action recognition. In particular, our main target is industrial product deployment, such as on FPGA boards, where only standard operations are supported. Therefore, our network consists simply of 2D convolutions, without any 3D convolutions, long-range attention plugins, or Transformer blocks. While being trained with far fewer epochs (5x-10x), our backbone surpasses methods using (2+1)D and 3D convolutions and achieves results comparable to ViTs on two benchmark datasets.

updated: Fri Jul 08 2022 09:33:46 GMT+0000 (UTC)

published: Fri Jul 08 2022 09:33:46 GMT+0000 (UTC)

arXiv
