Brain encoding models based on multimodal transformers can transfer across language and vision

Jerry Tang; Meng Du; Vy A. Vo; Vasudev Lal; Alexander G. Huth

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.
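The cross-modal transfer described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' pipeline: the use of closed-form ridge regression, the array shapes, the noise level, and all function names (`fit_ridge`, `voxelwise_correlation`, `cross_modal_transfer_demo`) are assumptions made for illustration. The random "features" stand in for representations that a multimodal transformer would extract from stories and movies, and the simulated "voxels" are assumed to share one set of weights across both modalities.

```python
import numpy as np


def fit_ridge(features, responses, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_dims = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_dims)
    return np.linalg.solve(gram, features.T @ responses)


def voxelwise_correlation(predicted, actual):
    """Pearson correlation between predicted and actual responses, per voxel."""
    p = predicted - predicted.mean(axis=0)
    a = actual - actual.mean(axis=0)
    denom = np.linalg.norm(p, axis=0) * np.linalg.norm(a, axis=0)
    return (p * a).sum(axis=0) / denom


def cross_modal_transfer_demo(seed=0, n_story=400, n_movie=300,
                              n_dims=16, n_voxels=50):
    """Train on simulated story responses, test on simulated movie responses."""
    rng = np.random.default_rng(seed)

    # Assumption: both modalities live in one aligned feature space and each
    # voxel applies the same (hypothetical) weights to either modality.
    shared_weights = rng.standard_normal((n_dims, n_voxels))

    # Stand-ins for transformer features of story and movie stimuli.
    story_features = rng.standard_normal((n_story, n_dims))
    movie_features = rng.standard_normal((n_movie, n_dims))

    # Simulated responses: linear in the features plus unit-variance noise.
    story_responses = (story_features @ shared_weights
                       + rng.standard_normal((n_story, n_voxels)))
    movie_responses = (movie_features @ shared_weights
                       + rng.standard_normal((n_movie, n_voxels)))

    # Fit the encoding model on one modality only (stories)...
    weights = fit_ridge(story_features, story_responses, alpha=1.0)

    # ...then score its predictions on the other modality (movies).
    predicted_movie = movie_features @ weights
    return voxelwise_correlation(predicted_movie, movie_responses)


if __name__ == "__main__":
    scores = cross_modal_transfer_demo()
    print("mean cross-modal prediction correlation:", scores.mean())
```

In this toy setting, transfer succeeds only because the simulated voxels use the same weights for both feature sets. With real data, the per-voxel transfer score would be expected to vary with how well the feature space is aligned across modalities and with whether a given cortical region represents concepts in both.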

updated: Sat May 20 2023 17:38:44 GMT+0000 (UTC)

published: Sat May 20 2023 17:38:44 GMT+0000 (UTC)

arXiv

