Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video Processing

Okan Köpüklü; Stefan Hörmann; Fabian Herzog; Hakan Cevikalp; Gerhard Rigoll

Convolutional Neural Networks with 3D kernels (3D-CNNs) currently achieve state-of-the-art results in video recognition tasks because they are effective at extracting spatiotemporal features from video frames. Many successful 3D-CNN architectures have successively surpassed the state of the art. However, nearly all of them are designed to operate offline, which creates several serious handicaps during online operation. First, conventional 3D-CNNs are not dynamic, since their output features represent the complete input clip rather than the most recent frame in the clip. Second, they do not preserve temporal resolution, due to their inherent temporal downsampling. Lastly, 3D-CNNs are constrained to a fixed temporal input size, which limits their flexibility. To address these drawbacks, we propose dissected 3D-CNNs, in which the intermediate volumes of the network are dissected and propagated over the depth (time) dimension for future calculations, substantially reducing the number of computations during online operation. For action classification, the dissected versions of ResNet models perform 77-90% fewer computations during online operation while achieving ~5% better classification accuracy on the Kinetics-600 dataset than conventional 3D-ResNet models. Moreover, the advantages of dissected 3D-CNNs are demonstrated by deploying our approach on several vision tasks, where it consistently improved performance.
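The general idea of propagating intermediate volumes over time can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a one-dimensional causal temporal convolution over scalar per-frame features, with zero padding for missing past frames, and the names `clip_conv` and `StreamingConv` are hypothetical. It only shows that caching each layer's recent inputs lets an online model compute the output for the newest frame without reprocessing the whole clip.

```python
from collections import deque


def clip_conv(frames, weights):
    """Causal temporal convolution recomputed over a whole clip (offline style).

    Returns one output per frame; missing past frames are treated as zeros.
    weights[0] applies to the oldest frame in each window.
    """
    k = len(weights)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * x for w, x in zip(weights, padded[t:t + k]))
        for t in range(len(frames))
    ]


class StreamingConv:
    """The same convolution, but the last k-1 inputs are cached between frames."""

    def __init__(self, weights):
        self.weights = list(weights)
        size = len(self.weights) - 1
        self.cache = deque([0.0] * size, maxlen=size)

    def step(self, x):
        # Combine cached past inputs with the newest frame only.
        window = list(self.cache) + [x]
        y = sum(w * v for w, v in zip(self.weights, window))
        self.cache.append(x)  # propagate this input to future time steps
        return y


# Toy per-frame features and two stacked layers (illustrative values only).
frames = [1.0, 2.0, 3.0, 4.0]
w1, w2 = [0.5, 0.5], [1.0, -1.0]

# Offline: each layer processes the complete clip.
offline = clip_conv(clip_conv(frames, w1), w2)

# Online: each frame passes through both layers once, reusing cached values.
layer1, layer2 = StreamingConv(w1), StreamingConv(w2)
streamed = [layer2.step(layer1.step(x)) for x in frames]

assert streamed == offline
```

In this toy setting each new frame costs one window of multiply-adds per layer, whereas reprocessing a clip repeats that work for every frame in the clip. The savings reported in the abstract refer to the authors' full 3D-ResNet models, not to this sketch.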

updated: Mon Oct 18 2021 13:47:49 GMT+0000 (UTC)

published: Wed Sep 30 2020 12:48:52 GMT+0000 (UTC)

arXiv
