Semantic scene descriptions as an objective of human vision

Adrien Doerig; Tim C Kietzmann; Emily Allen; Yihan Wu; Thomas Naselaris; Kendrick Kay; Ian Charest

人間の視覚の目的としての意味論的シーン記述

視覚的なシーンの意味を解釈するには、その構成オブジェクトの識別だけでなく、オブジェクトの相互関係の豊富なセマンティックキャラクタリゼーションも必要です。ここでは、複雑な自然シーンによって誘発される人間の脳反応の大規模な 7T fMRI データセットに最新の計算技術を適用することにより、視覚意味変換の根底にある神経メカニズムを研究します。人間が生成したシーン記述に言語ディープラーニングモデルを適用して得られたセマンティックエンベディングを使用して、セマンティックシーン記述をエンコードする脳領域の広く分散したネットワークを識別します。重要なことに、これらのセマンティック埋め込みは、従来のオブジェクトカテゴリラベルよりも、これらの領域でのアクティビティをより適切に説明します。さらに、参加者がセマンティックタスクに積極的に関与していないという事実にもかかわらず、それらは活動の効果的な予測因子であり、視覚セマンティック変換がデフォルトの視覚モードであることを示唆しています。この見解を支持して、シーンのキャプションの非常に正確な再構成が脳活動のパターンから直接線形にデコードできることを示します。最後に、セマンティック埋め込みでトレーニングされた再帰型畳み込みニューラルネットワークは、脳活動の予測においてセマンティック埋め込みよりもさらに優れており、脳の視覚意味変換の機構モデルを提供します。まとめると、これらの実験結果と計算結果は、視覚入力を豊富な意味的シーン記述に変換することが視覚系の中心的な目的である可能性があり、この新しい目的に努力を集中させることが、人間の脳における視覚情報処理のモデルの改善につながる可能性があることを示唆しています。

Interpreting the meaning of a visual scene requires not only identification of its constituent objects, but also a rich semantic characterization of object interrelations. Here, we study the neural mechanisms underlying visuo-semantic transformations by applying modern computational techniques to a large-scale 7T fMRI dataset of human brain responses elicited by complex natural scenes. Using semantic embeddings obtained by applying linguistic deep learning models to human-generated scene descriptions, we identify a widely distributed network of brain regions that encode semantic scene descriptions. Importantly, these semantic embeddings better explain activity in these regions than traditional object category labels. In addition, they are effective predictors of activity despite the fact that the participants did not actively engage in a semantic task, suggesting that visuo-semantic transformations are a default mode of vision. In support of this view, we then show that highly accurate reconstructions of scene captions can be directly linearly decoded from patterns of brain activity. Finally, a recurrent convolutional neural network trained on semantic embeddings further outperforms semantic embeddings in predicting brain activity, providing a mechanistic model of the brain's visuo-semantic transformations. Together, these experimental and computational results suggest that transforming visual input into rich semantic scene descriptions may be a central objective of the visual system, and that focusing efforts on this new objective may lead to improved models of visual information processing in the human brain.

updated: Fri Sep 23 2022 17:34:33 GMT+0000 (UTC)

published: Fri Sep 23 2022 17:34:33 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト