Searching for Two-Stream Models in Multivariate Space for Video Recognition

Xinyu Gong; Heng Wang; Zheng Shou; Matt Feiszli; Zhangyang Wang; Zhicheng Yan

Conventional video models rely on a single stream to capture complex spatial-temporal features. Recent work on two-stream video models, such as the SlowFast network and AssembleNet, prescribes separate streams to learn complementary features and achieves stronger performance. However, manually designing both streams as well as the in-between fusion blocks is a daunting task that requires exploring a tremendously large design space. Such manual exploration is time-consuming and often ends up with sub-optimal architectures when computational resources are limited and the exploration is insufficient. In this work, we present a pragmatic neural architecture search approach that can efficiently search for two-stream video models in giant spaces. We design a multivariate search space with 6 search variables to capture a wide variety of choices in designing two-stream models. Furthermore, we propose a progressive search procedure that searches for the architecture of individual streams, fusion blocks, and attention blocks one after the other. We demonstrate that two-stream models with significantly better performance can be automatically discovered in our design space. Our searched two-stream models, namely Auto-TSNet, consistently outperform other models on standard benchmarks. On Kinetics, compared with the SlowFast model, our Auto-TSNet-L model reduces FLOPS by nearly 11 times while achieving the same accuracy of 78.9%. On Something-Something-V2, Auto-TSNet-M improves accuracy by at least 2% over other methods that use less than 50 GFLOPS per video.

updated: Mon Aug 30 2021 02:03:28 GMT+0000 (UTC)

published: Mon Aug 30 2021 02:03:28 GMT+0000 (UTC)

arXiv
