Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features

Changde Du; Kaicheng Fu; Jinpeng Li; Huiguang He

脳視覚言語特徴のマルチモーダル学習による視覚神経表現の解読

人間の視覚神経表現を解読することは、視覚処理メカニズムを明らかにし、脳のようなインテリジェントマシンを開発する上で、科学的に大きな意義を持つ挑戦的なタスクです。ほとんどの既存の方法は、トレーニング用の対応するニューラルデータを持たない新しいカテゴリに一般化することは困難です。 2 つの主な理由は、1) ニューラルデータの基礎となるマルチモーダルセマンティックナレッジの活用不足、および 2) 対になった (刺激応答) トレーニングデータの数が少ないことです。これらの制限を克服するために、この論文では、脳視覚言語機能のマルチモーダル学習を使用する BraVL と呼ばれる一般的なニューラルデコーディング手法を紹介します。マルチモーダルな深層生成モデルを介して、脳、視覚、および言語機能間の関係をモデル化することに重点を置いています。具体的には、専門家の混合物の定式化を活用して、3つのモダリティすべての一貫した共同生成を可能にする潜在的なコードを推測します。脳活動データが限られている場合に、より一貫性のある共同表現を学習し、データ効率を向上させるために、モダリティ内およびモダリティ間の相互情報量最大化の正則化用語を利用します。特に、BraVL モデルは、さまざまな半教師付きシナリオでトレーニングして、追加のカテゴリから取得した視覚的およびテキスト的特徴を組み込むことができます。最後に、3 つのトライモーダルマッチングデータセットを構築し、大規模な実験により、いくつかの興味深い結論と認知的洞察が得られました。 2) 視覚的特徴と言語的特徴の組み合わせを使用した解読モデルは、それらのいずれかを単独で使用したものよりもはるかに優れたパフォーマンスを発揮します。 3) 視覚刺激のセマンティクスを表現するために、視覚には言語的影響が伴う場合があります。コードとデータ: https://github.com/ChangdeDu/BraVL.

Decoding human visual neural representations is a challenging task with great scientific significance in revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods are difficult to generalize to novel categories that have no corresponding neural data for training. The two main reasons are 1) the under-exploitation of the multimodal semantic knowledge underlying the neural data and 2) the small number of paired (stimuli-responses) training data. To overcome these limitations, this paper presents a generic neural decoding method called BraVL that uses multimodal learning of brain-visual-linguistic features. We focus on modeling the relationships between brain, visual and linguistic features via multimodal deep generative models. Specifically, we leverage the mixture-of-product-of-experts formulation to infer a latent code that enables a coherent joint generation of all three modalities. To learn a more consistent joint representation and improve the data efficiency in the case of limited brain activity data, we exploit both intra- and inter-modality mutual information maximization regularization terms. In particular, our BraVL model can be trained under various semi-supervised scenarios to incorporate the visual and textual features obtained from the extra categories. Finally, we construct three trimodal matching datasets, and the extensive experiments lead to some interesting conclusions and cognitive insights: 1) decoding novel visual categories from human brain activity is practically possible with good accuracy; 2) decoding models using the combination of visual and linguistic features perform much better than those using either of them alone; 3) visual perception may be accompanied by linguistic influences to represent the semantics of visual stimuli. Code and data: https://github.com/ChangdeDu/BraVL.

updated: Thu Oct 13 2022 05:49:33 GMT+0000 (UTC)

published: Thu Oct 13 2022 05:49:33 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト