Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing

Jianning Wu; Zhuqing Jiang; Shiping Wen; Aidong Men; Haiying Wang

For multimodal tasks, a good feature extraction network should extract as much information as possible and ensure that its feature embeddings and those of the other modalities are mutually compatible. In feature fusion, the latter is often more critical than the former. Selecting the optimal combination of feature extraction networks is therefore an important subproblem in multimodal tasks. Most existing studies either ignore this problem or resort to an exhaustive (ergodic) search. This paper models it as an optimization problem. Following the general practice of extreme value conversion in mathematics, we propose a novel method that converts the optimization problem into a comparison of upper bounds, which reduces the time cost relative to the traditional method. Meanwhile, to address the common problem that feature similarity and feature semantic similarity are not aligned in multimodal time-series tasks, we draw on the idea of contrastive learning and propose a multimodal time-series contrastive loss (MTSC). We demonstrate the feasibility of our approach on the audio-visual video parsing task. Substantial analyses verify that our methods promote the fusion of features from different modalities.
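The abstract does not give the MTSC formulation, so the following is only a generic illustration of the contrastive idea it refers to, not the authors' loss. It is a minimal InfoNCE-style sketch in Python with NumPy, under the assumption that audio and visual features from the same time segment form a positive pair and all other segment pairs in the clip act as negatives. The function name `temporal_contrastive_loss`, the `temperature` parameter, and the array shapes are hypothetical choices made for this example.

```python
import numpy as np


def temporal_contrastive_loss(audio, visual, temperature=0.1):
    """Generic InfoNCE-style loss over T time-aligned segments.

    audio, visual: arrays of shape (T, D); row t of each is assumed to come
    from the same video segment. This is an illustrative stand-in, not the
    MTSC loss defined in the paper.
    """
    # L2-normalise rows so that dot products are cosine similarities.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)

    # Pairwise audio-to-visual similarities, scaled by the temperature.
    logits = a @ v.T / temperature  # shape (T, T)

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Same-time pairs sit on the diagonal and are treated as positives.
    return float(-np.mean(np.diag(log_prob)))


# Usage with random stand-in features (10 segments, 8 dimensions).
rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 8))
loss_aligned = temporal_contrastive_loss(audio, audio)
loss_misaligned = temporal_contrastive_loss(audio, audio[::-1])
```

In this toy setup the loss is lower when the two feature sequences agree at each time step than when one of them is reversed in time, which is the alignment pressure a contrastive objective is meant to provide.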

updated: Sun May 30 2021 05:13:30 GMT+0000 (UTC)

published: Sun May 30 2021 05:13:30 GMT+0000 (UTC)

arXiv
