Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition

Junyan Wang; Zhenhong Sun; Yichen Qian; Dong Gong; Xiuyu Sun; Ming Lin; Maurice Pagnucco; Yang Song

3D convolutional neural networks (CNNs) have been the prevailing option for video recognition. To capture temporal information, 3D convolutions are computed along the frame sequence, which makes the computation grow cubically and become expensive. To reduce this cost, previous methods rely on manually designed 3D/2D CNN structures with approximations or on automatic search, which either sacrifice modeling ability or make training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach that is tailored to 3D CNNs and takes model complexity into account. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor that handles the discrepancy of visual information between the spatial and temporal dimensions by dynamically leveraging, at each depth, the correlation between feature map size and kernel size. Highly efficient and expressive 3D CNN architectures, i.e. entropy-based 3D CNNs (the E3D family), can then be searched efficiently by maximizing the STEntr-Score under a given computational budget, using an evolutionary algorithm and without training any network parameters. Extensive experiments on Something-Something V1&V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at https://github.com/alibaba/lightweight-neural-architecture-search.
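To make the search procedure described in the abstract more concrete, here is a minimal sketch of a training-free evolutionary search under a budget. Everything in it is an assumption for illustration: the search space (`CHANNELS`, `KERNELS`), the function names, the use of a weight count as the budget, and above all `proxy_score`, which is a simple sum of log fan-in terms standing in for the STEntr-Score. The abstract does not give the actual score or its refinement factor, so neither is reproduced here, and the loop is a plain mutation-only variant rather than the authors' algorithm.

```python
import math
import random

# Hypothetical search space: each layer is (out_channels, k_t, k_s),
# where k_t is the temporal kernel size and k_s the spatial kernel size.
CHANNELS = [16, 32, 64, 128]
KERNELS = [1, 3, 5]


def proxy_score(arch, in_channels=3):
    """Placeholder entropy-style proxy: sum of log(fan-in) over layers.

    Assumption for illustration only; this is NOT the paper's STEntr-Score
    and it has no spatio-temporal refinement factor.
    """
    score, c_in = 0.0, in_channels
    for c_out, k_t, k_s in arch:
        score += math.log(c_in * k_t * k_s * k_s)
        c_in = c_out
    return score


def cost(arch, in_channels=3):
    """Number of 3D conv weights, used here as a toy computational budget."""
    total, c_in = 0, in_channels
    for c_out, k_t, k_s in arch:
        total += c_in * c_out * k_t * k_s * k_s
        c_in = c_out
    return total


def mutate(arch, rng):
    """Return a copy of arch with one randomly chosen layer resampled."""
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = (rng.choice(CHANNELS), rng.choice(KERNELS), rng.choice(KERNELS))
    return child


def evolve(budget, depth=4, population_size=16, steps=200, seed=0):
    """Mutation-only evolutionary search; no network weights are trained."""
    rng = random.Random(seed)
    smallest = [(CHANNELS[0], KERNELS[0], KERNELS[0])] * depth
    if cost(smallest) > budget:
        raise ValueError("budget is too small for this search space")
    population = [smallest]
    for _ in range(steps):
        child = mutate(rng.choice(population), rng)
        if cost(child) > budget:
            continue  # discard candidates that exceed the budget
        population.append(child)
        # Keep only the highest-scoring candidates.
        population.sort(key=proxy_score, reverse=True)
        del population[population_size:]
    return population[0]


best = evolve(budget=200_000)
print(best, proxy_score(best), cost(best))
```

Because candidates are ranked by an analytic score alone, each step costs only a few arithmetic operations, which is the property that makes this style of search training-free. Swapping in a different score only requires replacing `proxy_score`.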

updated: Sun Mar 05 2023 15:11:53 GMT+0000 (UTC)

published: Sun Mar 05 2023 15:11:53 GMT+0000 (UTC)

arXiv
