HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

Zejia Weng; Zuxuan Wu; Hengduo Li; Jingjing Chen; Yu-Gang Jiang

Videos are multimodal in nature. Conventional video recognition pipelines typically fuse multimodal features for improved performance. However, this is not only computationally expensive but also neglects the fact that different videos rely on different modalities for predictions. This paper introduces Hierarchical and Conditional Modality Selection (HCMS), a simple multimodal learning framework for efficient video recognition. By default, HCMS operates on a low-cost modality, i.e., audio cues, and decides on the fly, on a per-input basis, whether to use computationally expensive modalities such as appearance and motion cues. This is achieved through the collaboration of three LSTMs organized in a hierarchical manner. In particular, each LSTM that operates on a high-cost modality contains a gating module, which takes lower-level features and historical information as input and adaptively determines whether to activate its corresponding modality; when the modality is not activated, the LSTM simply reuses historical information. We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate that the proposed approach can effectively explore multimodal information for improved classification performance while requiring much less computation.
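
The conditional selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `gate_probability` and `conditional_step`, the hard 0.5 threshold, and the tanh blend standing in for an LSTM update are all assumptions made for illustration. It shows only the control flow of one high-cost level: a gate looks at the lower-level feature and the previous state, and the expensive extractor is called only when the gate fires; otherwise the previous state is reused.

```python
import math
from typing import Callable, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_probability(low_feat: Sequence[float], prev_state: Sequence[float],
                     weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear gate over [lower-level feature, previous state]."""
    inputs = list(low_feat) + list(prev_state)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, inputs)))


def conditional_step(low_feat: Sequence[float], prev_state: Sequence[float],
                     extract_expensive: Callable[[], Sequence[float]],
                     gate_weights: Sequence[float], gate_bias: float,
                     threshold: float = 0.5) -> Tuple[List[float], bool]:
    """One time step of a gated high-cost modality (illustrative only).

    Returns the new state and whether the expensive modality was used.
    """
    p = gate_probability(low_feat, prev_state, gate_weights, gate_bias)
    if p < threshold:
        # Gate closed: skip the expensive extractor and reuse history.
        return list(prev_state), False
    # Gate open: pay for the expensive feature and update the state.
    feat = extract_expensive()
    # Assumption: a simple tanh blend stands in for the LSTM update.
    new_state = [math.tanh(0.5 * h + 0.5 * f) for h, f in zip(prev_state, feat)]
    return new_state, True
```

In a hierarchy like the one the abstract describes, a step of this kind could be stacked so that audio features feed the gate of the appearance level, and the appearance level's output feeds the gate of the motion level. How the gate is trained is not covered by this sketch.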

updated: Tue Dec 06 2022 04:13:50 GMT+0000 (UTC)

published: Tue Apr 20 2021 04:47:04 GMT+0000 (UTC)

arXiv

