Visually Grounded Commonsense Knowledge Acquisition

Yuan Yao; Tianyu Yu; Ao Zhang; Mengdi Li; Ruobing Xie; Cornelius Weber; Zhiyuan Liu; Hai-Tao Zheng; Stefan Wermter; Tat-Seng Chua; Maosong Sun

Large-scale commonsense knowledge bases empower a broad range of AI applications, for which the automatic extraction of commonsense knowledge (CKE) is a fundamental and challenging problem. CKE from text is known to suffer from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), and can serve as a promising source for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem: models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge of promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores correlate strongly with human judgment, with a Spearman coefficient of 0.78. Moreover, the extracted commonsense can also be grounded in images with reasonable interpretability. The data and code are available at https://github.com/thunlp/CLEVER.
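To make the multi-instance setup in the abstract concrete, the following is a minimal sketch of generic attention pooling over a bag of image features. It is not CLEVER's contrastive attention mechanism, whose details the abstract does not give. The function names, the 2-dimensional features, and the query vector are all hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_pool(bag, query):
    """Summarize a bag of instance vectors as an attention-weighted average.

    Each instance is scored against the query; higher-scoring instances
    contribute more to the pooled bag representation.
    """
    weights = softmax([dot(x, query) for x in bag])
    dim = len(bag[0])
    pooled = [sum(w * x[i] for w, x in zip(weights, bag)) for i in range(dim)]
    return pooled, weights


# Hypothetical features for three images of one entity pair, e.g. (person, bottle).
bag = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Hypothetical query vector standing in for one relation, e.g. can_hold.
query = [2.0, 0.0]

pooled, weights = attention_pool(bag, query)
```

In this toy bag, the first two images align with the query and receive most of the attention weight, so the pooled vector is dominated by them. No label is needed for any individual image, which is the property that the distantly supervised formulation relies on.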

updated: Sat Mar 25 2023 07:16:48 GMT+0000 (UTC)

published: Tue Nov 22 2022 07:00:16 GMT+0000 (UTC)

arXiv
