TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal Reasoning

Shiwen Zhang

Temporal reasoning is an important capability for visual intelligence. In the computer vision research community, temporal reasoning is usually studied in the form of video classification, for which many state-of-the-art neural network architectures and benchmark datasets have been proposed in recent years, most notably 3D CNNs and Kinetics. However, some recent works have found that current video classification benchmarks contain strong biases towards static features and thus cannot accurately reflect temporal modeling ability. New video classification benchmarks that aim to eliminate static biases have been proposed, and experiments on these benchmarks show that current clip-based 3D CNNs are outperformed by RNN structures and recent video transformers. In this paper, we find that 3D CNNs and their efficient depthwise variants, when a video-level sampling strategy is used, are in fact able to beat RNNs and recent vision transformers by significant margins on static-unbiased temporal reasoning benchmarks. Further, we propose the Temporal Fully Connected Block (TFC Block), an efficient and effective component that approximates fully connected layers along the temporal dimension to obtain a video-level receptive field, enhancing spatiotemporal reasoning ability. With TFC Blocks inserted into Video-level 3D CNNs (V3D), our proposed TFCNets establish new state-of-the-art results on the synthetic temporal reasoning benchmark CATER and the real-world static-unbiased dataset Diving48, surpassing all previous methods.
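The abstract does not spell out how video-level sampling or the TFC Block is implemented, so the following is only an illustrative sketch under stated assumptions, not the author's method. It assumes that "video-level sampling" means taking one frame from each of several equal segments spanning the whole video (in contrast to a clip of consecutive frames), and it shows an exact dense mixing along the time axis, whereas the paper describes an approximation of such a layer. The function names `video_level_indices`, `clip_indices`, and `temporal_fc` are hypothetical.

```python
def video_level_indices(num_frames, num_samples):
    """Assumed video-level sampling: one frame from the centre of each
    of num_samples equal-length segments covering the whole video."""
    seg = num_frames / num_samples
    return [min(num_frames - 1, int(seg * i + seg / 2))
            for i in range(num_samples)]


def clip_indices(start, num_samples, stride=1):
    """Clip-based sampling: num_samples frames at a fixed stride,
    covering only a local window that begins at `start`."""
    return [start + i * stride for i in range(num_samples)]


def temporal_fc(x, weight):
    """Dense mixing along time (illustration only, not the TFC Block).

    x      : T_in rows, each a list of C channel values
    weight : T_out rows, each a list of T_in mixing coefficients
    Every output time step is a weighted sum over ALL input time steps,
    so its temporal receptive field spans the full sampled video.
    """
    t_in = len(x)
    channels = len(x[0])
    assert all(len(row) == t_in for row in weight), "weight must be T_out x T_in"
    return [[sum(w_row[s] * x[s][c] for s in range(t_in))
             for c in range(channels)]
            for w_row in weight]


if __name__ == "__main__":
    print(video_level_indices(100, 4))   # frames spread over the whole video
    print(clip_indices(0, 4, stride=2))  # frames from one local window
    feats = [[1, 2], [3, 4]]             # T=2 time steps, C=2 channels
    mix = [[1, 0], [0.5, 0.5]]           # second output averages both steps
    print(temporal_fc(feats, mix))
```

Under these assumptions, the sampled indices show why a video-level model can see the start and end of an action that a single short clip would miss, and the dense temporal weight matrix shows what a "video-level receptive field" means for one layer.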

updated: Fri Mar 11 2022 13:58:05 GMT+0000 (UTC)

published: Fri Mar 11 2022 13:58:05 GMT+0000 (UTC)

arXiv

