Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

Licai Sun; Zheng Lian; Bin Liu; Jianhua Tao

With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has recently attracted increasing attention. Despite significant progress, two major challenges remain on the way towards robust MSA: 1) inefficiency when modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to randomly missing modality features, which typically occur in realistic settings. In this paper, we propose a generic and unified framework to address both, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs utterance-level representations from each modality as the global multimodal context, which interacts with local unimodal features so that the two mutually promote each other. This not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also leads to better performance. To improve model robustness in the incomplete modality setting, DLFR works at two levels. On the one hand, it performs low-level feature reconstruction to implicitly encourage the model to learn semantic information from incomplete data. On the other hand, it regards complete and incomplete data as two different views of one sample and utilizes siamese representation learning to explicitly attract their high-level representations. Comprehensive experiments on three popular datasets demonstrate that our method achieves superior performance in both complete and incomplete modality settings.
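The scaling claim in the abstract can be illustrated with a rough count of attention scores. The sketch below is not the authors' implementation. It assumes that local-local interaction means every modality attends to the full sequence of every other modality, and that global-local interaction means each local sequence exchanges information in both directions with one utterance-level token per modality. The function names and the sequence lengths are hypothetical, chosen only for illustration.

```python
from itertools import permutations


def local_local_scores(lengths):
    """Count attention scores when each modality attends to the full
    sequence of every other modality (assumed local-local scheme)."""
    return sum(lengths[a] * lengths[b] for a, b in permutations(lengths, 2))


def global_local_scores(lengths):
    """Count attention scores when each local sequence interacts only with
    a small global context: one utterance-level token per modality
    (assumed global-local scheme)."""
    n_global = len(lengths)
    # Factor 2: local features attend to the global tokens, and the
    # global tokens attend back to the local features.
    return sum(2 * n_global * t for t in lengths.values())


# Hypothetical unaligned sequence lengths (words, audio frames, video frames).
lengths = {"text": 50, "audio": 500, "video": 500}
doubled = {name: 2 * t for name, t in lengths.items()}

print(local_local_scores(lengths))   # 600000
print(global_local_scores(lengths))  # 6300

# Doubling every sequence length quadruples the local-local count
# but only doubles the global-local count.
print(local_local_scores(doubled) // local_local_scores(lengths))    # 4
print(global_local_scores(doubled) // global_local_scores(lengths))  # 2
```

Under these assumptions the global-local count grows linearly with sequence length, while the local-local count grows with the product of the lengths, which is the quadratic cost the abstract refers to.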

updated: Mon May 22 2023 02:27:07 GMT+0000 (UTC)

published: Tue Aug 16 2022 08:02:30 GMT+0000 (UTC)

arXiv
