TFusion: Transformer based N-to-One Multimodal Fusion Block

Zecheng Liu; Jia Wei; Rui Li

People perceive the world through different senses, such as sight, hearing, smell, and touch. Processing and fusing information from multiple modalities helps Artificial Intelligence understand the world around us more easily. However, when modalities are missing, the number of available modalities varies from one situation to another, which leads to an N-to-One fusion problem. To solve this problem, we propose a transformer-based fusion block called TFusion. Unlike preset formulations or convolution-based methods, the proposed block automatically learns to fuse the available modalities without synthesizing or zero-padding the missing ones. Specifically, the feature representations extracted by the upstream processing models are projected as tokens and fed into transformer layers to generate latent multimodal correlations. Then, to reduce the dependence on particular modalities, a modal attention mechanism is introduced to build a shared representation, which can be used by the downstream decision model. The proposed TFusion block can be easily integrated into existing multimodal analysis networks. In this work, we apply TFusion to different backbone networks for multimodal human activity recognition and brain tumor segmentation tasks. Extensive experimental results show that the TFusion block achieves better performance than the competing fusion strategies.
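The N-to-One idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the modality names, the identity query/key/value projections, and the fixed `score_vector` are all assumptions made for illustration, standing in for parameters that a real block would learn. The sketch only shows the two stages the abstract names: self-attention over the tokens of whichever modalities are present, followed by a softmax over modalities that builds one shared representation, with missing modalities skipped rather than synthesized or zero-padded.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections replace the learned Q/K/V weights.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out


def modal_attention(tokens, score_vector):
    """Softmax over modalities, then a weighted sum into one vector.

    Assumption: `score_vector` is a hypothetical stand-in for a learned
    scoring layer.
    """
    weights = softmax([dot(t, score_vector) for t in tokens])
    d = len(tokens[0])
    fused = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
    return fused, weights


def fuse(features, score_vector):
    """Fuse any number (N >= 1) of available modality features into one.

    `features` maps a modality name to its feature vector, or to None when
    that modality is missing. Missing entries are dropped, not zero-padded.
    """
    available = [f for f in features.values() if f is not None]
    if not available:
        raise ValueError("at least one modality is required")
    correlated = self_attention(available)
    return modal_attention(correlated, score_vector)


if __name__ == "__main__":
    # Hypothetical modality names; "gyro" is missing in this sample.
    sample = {"accel": [1.0, 0.0, 2.0], "gyro": None, "video": [0.5, 1.5, -1.0]}
    fused, weights = fuse(sample, [0.1, 0.2, 0.3])
    print(len(fused), len(weights))
```

Because both stages operate on a list of tokens of arbitrary length, the same function handles one, two, or more available modalities, and its output size does not depend on N.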

updated: Fri Aug 26 2022 16:42:14 GMT+0000 (UTC)

published: Fri Aug 26 2022 16:42:14 GMT+0000 (UTC)

arXiv
