TraVLR: Now You See It, Now You Don't! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning

Keng Ji Chow; Samson Tan; Min-Yen Kan

Numerous visio-linguistic (V+L) representation learning methods have been developed, yet existing datasets do not evaluate the extent to which they represent visual and linguistic concepts in a unified space. Inspired by the crosslingual transfer and psycholinguistics literature, we propose a novel evaluation setting for V+L models: zero-shot cross-modal transfer. Existing V+L benchmarks also often report global accuracy scores on the entire dataset, rendering it difficult to pinpoint the specific reasoning tasks that models fail and succeed at. To address this issue and enable the evaluation of cross-modal transfer, we present TraVLR, a synthetic dataset comprising four V+L reasoning tasks. Each example encodes the scene bimodally such that either modality can be dropped during training/testing with no loss of relevant information. TraVLR's training and testing distributions are also constrained along task-relevant dimensions, enabling the evaluation of out-of-distribution generalisation. We evaluate four state-of-the-art V+L models and find that although they perform well on the test set from the same modality, all models fail to transfer cross-modally and have limited success accommodating the addition or deletion of one modality. In alignment with prior work, we also find these models to require large amounts of data to learn simple spatial relationships. We release TraVLR as an open challenge for the research community.
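The zero-shot cross-modal transfer setting described above can be illustrated with a minimal sketch. It assumes each example stores the same scene twice, once as an image and once as text, so that either copy can be dropped without losing task-relevant information. The names `Example`, `keep_only` and `cross_modal_split` and the field layout are hypothetical, chosen for illustration only; they are not the TraVLR data format or the authors' code.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Example:
    # Hypothetical layout: both fields encode the same scene.
    image: Optional[str]    # stand-in for the visual encoding
    caption: Optional[str]  # stand-in for the textual encoding
    label: bool


def keep_only(example: Example, modality: str) -> Example:
    """Return a copy of `example` restricted to one modality (or both)."""
    if modality == "image":
        return replace(example, caption=None)
    if modality == "text":
        return replace(example, image=None)
    if modality == "both":
        return example
    raise ValueError(f"unknown modality: {modality!r}")


def cross_modal_split(
    train: List[Example],
    test: List[Example],
    train_modality: str,
    test_modality: str,
) -> Tuple[List[Example], List[Example]]:
    """Build a train/test pair whose examples expose different modalities."""
    return (
        [keep_only(e, train_modality) for e in train],
        [keep_only(e, test_modality) for e in test],
    )


# Train on images only, then test zero-shot on text only.
scene = Example(image="scene_0.png", caption="a red circle left of a blue square", label=True)
train_set, test_set = cross_modal_split([scene], [scene], "image", "text")
```

Under this assumption, a model that represents both modalities in a unified space should keep most of its accuracy when `train_modality` and `test_modality` differ; the abstract reports that the four evaluated models do not.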

updated: Sun Nov 21 2021 07:22:44 GMT+0000 (UTC)

published: Sun Nov 21 2021 07:22:44 GMT+0000 (UTC)

arXiv
