Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin; Liunian Harold Li; Ziniu Hu; Nanyun Peng; Kai-Wei Chang

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are shared only locally. For example, wedding ceremonies vary across regions because of different customs shaped by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models on non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than on Western regions. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) concern culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
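The per-region comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record layout `(region, predicted_index, gold_index)`, the function names, and the toy records are all assumptions made for illustration, and the toy numbers are placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Return {region: accuracy} for (region, predicted, gold) records.

    The record layout is a hypothetical one chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, predicted, gold in records:
        total[region] += 1
        if predicted == gold:
            correct[region] += 1
    return {region: correct[region] / total[region] for region in total}


def gap_to_reference(accuracies, reference="West"):
    """Return the accuracy drop of every other region relative to `reference`."""
    ref = accuracies[reference]
    return {r: ref - acc for r, acc in accuracies.items() if r != reference}


# Toy placeholder records, NOT data from the paper.
toy_records = [
    ("West", 0, 0), ("West", 2, 2), ("West", 1, 3), ("West", 1, 1),
    ("East Asia", 0, 0), ("East Asia", 3, 1),
]

accuracies = accuracy_by_region(toy_records)
gaps = gap_to_reference(accuracies)
print(accuracies)
print(gaps)
```

A positive gap means the model answers fewer questions correctly for that region than for the reference region. The same grouping could be applied by scenario type (for example, weddings or festivals) instead of by region.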

updated: Tue Sep 14 2021 17:52:55 GMT+0000 (UTC)

published: Tue Sep 14 2021 17:52:55 GMT+0000 (UTC)

arXiv
