When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes

Philipp Bomatter; Mengmi Zhang; Dimitar Karev; Spandan Madan; Claire Tseng; Gabriel Kreiman

Context is of fundamental importance to both human and machine vision; e.g., an object in the air is more likely to be an airplane than a pig. The rich notion of context incorporates several aspects, including physics rules, statistical co-occurrences, and relative object sizes. While previous work has focused on crowd-sourced out-of-context photographs from the web to study scene context, controlling the nature and extent of contextual violations has been a daunting task. Here we introduce a diverse, synthetic Out-of-Context Dataset (OCD) with fine-grained control over scene context. By leveraging a 3D simulation engine, we systematically control gravity, object co-occurrences, and relative sizes across 36 object categories in a virtual household environment. Using OCD, we conducted a series of experiments to gain insights into the impact of contextual cues on both human and machine vision. We ran psychophysics experiments to establish a human benchmark for out-of-context recognition and then compared it with state-of-the-art computer vision models to quantify the gap between the two. We propose a context-aware recognition transformer model that fuses object and contextual information via multi-head attention. Our model captures useful information for contextual reasoning, enabling human-level performance and better robustness in out-of-context conditions compared to baseline models across OCD and other out-of-context datasets. All source code and data are publicly available at https://github.com/kreimanlab/WhenPigsFlyContext
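
As a rough illustration of what fusing object and contextual information via multi-head attention can look like, the following is a minimal sketch in plain Python. It is not the authors' architecture (their code is in the repository linked above): the function names, the single object token, the five context tokens, the residual connection, and the random weights are all assumptions chosen only to keep the example small. The object feature vector acts as the query and attends over the context feature vectors, one attention distribution per head.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def fuse_object_with_context(obj, ctx_tokens, params, num_heads):
    """Cross-attention: the object token queries the context tokens.

    Returns the fused object representation and the per-head attention
    weights over the context tokens.
    """
    d_model = len(obj)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads

    query = matvec(params["Wq"], obj)
    keys = [matvec(params["Wk"], c) for c in ctx_tokens]
    values = [matvec(params["Wv"], c) for c in ctx_tokens]

    concatenated, head_weights = [], []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        # Scaled dot-product scores between the object query and each context key.
        scores = [
            sum(q * k for q, k in zip(query[lo:hi], key[lo:hi])) / math.sqrt(head_dim)
            for key in keys
        ]
        weights = softmax(scores)
        head_weights.append(weights)
        # Weighted sum of this head's slice of the context values.
        for j in range(lo, hi):
            concatenated.append(sum(w * v[j] for w, v in zip(weights, values)))

    projected = matvec(params["Wo"], concatenated)
    # Residual connection (an assumption) so the object's own features are kept.
    fused = [p + x for p, x in zip(projected, obj)]
    return fused, head_weights


rng = random.Random(0)
d_model, num_heads, num_ctx = 8, 2, 5
params = {name: random_matrix(d_model, d_model, rng) for name in ("Wq", "Wk", "Wv", "Wo")}
obj = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
ctx = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(num_ctx)]

fused, weights = fuse_object_with_context(obj, ctx, params, num_heads)
print(len(fused), [round(sum(w), 6) for w in weights])
```

In a trained model the projection matrices would be learned and the tokens would come from image encoders; here they are random, so only the shapes and the normalization of the attention weights are meaningful.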

updated: Wed Aug 11 2021 05:43:42 GMT+0000 (UTC)

published: Tue Apr 06 2021 01:05:34 GMT+0000 (UTC)

arXiv
