ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Samyak Jain; Pradeep Yarlagadda; Shreyank Jyoti; Shyamgopal Karthik; Ramanathan Subramanian; Vineet Gandhi

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real time (60 fps). ViNet does not use audio as input, yet it outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge it is the first network to do so. We also explore a variation of the ViNet architecture that adds audio features to the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the audio it receives. Interestingly, we observe similar behaviour in the previous state-of-the-art model for audio-visual saliency prediction [tsiami2020stavis]. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction and suggest a clear avenue for future work on incorporating audio more effectively. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
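For readers unfamiliar with the CC and SIM metrics named in the abstract, the following is a minimal sketch of their commonly used definitions: CC as the Pearson correlation between a predicted and a ground-truth saliency map, and SIM as the histogram intersection of the two maps after each is normalised to sum to one. The function names (`flatten`, `cc`, `sim`) and the toy 2x2 maps are illustrative assumptions, not taken from the ViNet code base, and the authors' evaluation scripts may differ in details such as resizing or smoothing.

```python
import math


def flatten(saliency_map):
    """Flatten a 2D map (list of rows) into a single list of floats."""
    return [float(v) for row in saliency_map for v in row]


def cc(pred, gt):
    """Linear correlation coefficient (Pearson) between two saliency maps."""
    p, g = flatten(pred), flatten(gt)
    mean_p = sum(p) / len(p)
    mean_g = sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    std_g = math.sqrt(sum((b - mean_g) ** 2 for b in g))
    if std_p == 0.0 or std_g == 0.0:
        return 0.0  # correlation is undefined for a constant map; report 0
    return cov / (std_p * std_g)


def sim(pred, gt):
    """Similarity: sum of element-wise minima of the two normalised maps."""
    p, g = flatten(pred), flatten(gt)
    total_p, total_g = sum(p), sum(g)
    if total_p <= 0.0 or total_g <= 0.0:
        raise ValueError("maps must be non-negative with a positive sum")
    return sum(min(a / total_p, b / total_g) for a, b in zip(p, g))


# Hypothetical toy maps, for illustration only.
prediction = [[0.0, 0.2], [0.3, 0.5]]
ground_truth = [[0.0, 0.0], [0.0, 1.0]]
print(cc(prediction, ground_truth), sim(prediction, ground_truth))
```

Both scores are higher when the prediction matches the ground truth more closely: CC ranges from -1 to 1, and SIM ranges from 0 (no overlap) to 1 (identical distributions).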

updated: Thu Mar 18 2021 06:00:20 GMT+0000 (UTC)

published: Fri Dec 11 2020 07:28:02 GMT+0000 (UTC)

arXiv
