Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

Lianyang Ma; Yu Yao; Tao Liang; Tongliang Liu

Multimodal sentiment analysis in videos is a key task in many real-world applications and usually requires integrating multimodal streams that include visual, verbal, and acoustic behaviors. To improve the robustness of multimodal fusion, some existing methods let different modalities communicate with each other and model the crossmodal interaction via transformers. However, these methods use only single-scale representations during the interaction and fail to exploit multi-scale representations that contain different levels of semantic information. As a result, the representations learned by transformers could be biased, especially for unaligned multimodal data. In this paper, we propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis. On the whole, the "multi-scale" mechanism exploits the different levels of semantic information in each modality, which are used for fine-grained crossmodal interactions. Meanwhile, each modality learns its feature hierarchies by integrating the crossmodal interactions from multiple feature levels of its source modality. In this way, each pair of modalities progressively builds its respective feature hierarchies in a cooperative manner. The empirical results show that our MCMulT model not only outperforms existing approaches on unaligned multimodal sequences but also performs strongly on aligned multimodal sequences.
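The idea of a target modality attending to several feature levels of a source modality can be illustrated with a small sketch. This is not the MCMulT architecture: the abstract does not specify the layers, learned projections, or fusion rule, so the sketch assumes plain scaled dot-product attention without learned weights and assumes that per-level outputs are simply averaged. The function names (`cross_attention`, `multi_scale_cross_attention`) are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(target, source):
    """Scaled dot-product attention without learned projections (assumption).

    Rows of `target` act as queries; rows of `source` act as keys and values.
    The two sequences may have different lengths, as with unaligned streams.
    """
    dim = len(target[0])
    output = []
    for query in target:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in source
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * value[j] for w, value in zip(weights, source)) for j in range(dim)]
        )
    return output


def multi_scale_cross_attention(target, source_levels):
    """Attend to each source feature level, then average the results.

    Averaging is an illustrative fusion choice, not taken from the paper.
    """
    per_level = [cross_attention(target, level) for level in source_levels]
    count = len(per_level)
    dim = len(target[0])
    return [
        [sum(level[i][j] for level in per_level) / count for j in range(dim)]
        for i in range(len(target))
    ]


# Hypothetical toy data: 2 target steps, source levels with 3 and 5 steps.
target_seq = [[0.5, -1.0], [2.0, 0.0]]
source_levels = [
    [[1.0, 2.0]] * 3,  # stand-in for a low-level source representation
    [[3.0, 4.0]] * 5,  # stand-in for a high-level source representation
]
fused = multi_scale_cross_attention(target_seq, source_levels)
```

The output keeps the target sequence length regardless of the source lengths, which is why attention of this kind can be applied to unaligned sequences. A single-scale variant would pass only one entry in `source_levels`.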

updated: Fri Jun 17 2022 02:58:20 GMT+0000 (UTC)

published: Thu Jun 16 2022 07:47:57 GMT+0000 (UTC)

arXiv