Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

Wenliang Dai; Zihan Liu; Ziwei Ji; Dan Su; Pascale Fung

Large-scale vision-language pre-trained (VLP) models are prone to hallucinating non-existent visual objects when generating text based on visual information. In this paper, we systematically study the object hallucination problem from three aspects. First, we examine recent state-of-the-art VLP models, showing that they still hallucinate frequently, and that models achieving better scores on standard metrics (e.g., CIDEr) could be more unfaithful. Second, we investigate how different types of image encoding in VLP influence hallucination, including region-based, grid-based, and patch-based encodings. Surprisingly, we find that patch-based features perform the best and that a smaller patch resolution yields a non-trivial reduction in object hallucination. Third, we decouple various VLP objectives and demonstrate that token-level image-text alignment and controlled generation are crucial to reducing hallucination. Based on these findings, we propose a simple yet effective VLP loss named ObjMLM to further mitigate object hallucination. Results show that it reduces object hallucination by up to 17.4% when tested on two benchmarks (COCO Caption for in-domain and NoCaps for out-of-domain evaluation).
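
To make the notion of object hallucination concrete, here is a minimal sketch of one way it could be scored for a single caption: the fraction of mentioned objects that are not present in the image. The abstract does not say how hallucination is measured, so this ratio, the function names, the toy vocabulary, and the example caption are all illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Iterable, Set


def mentioned_objects(caption: str, vocabulary: Set[str]) -> Set[str]:
    """Return the vocabulary objects that appear as whole words in the caption."""
    # Naive whitespace tokenization with punctuation stripped (an assumption;
    # a real pipeline would also handle plurals and synonyms).
    words = {w.strip(".,!?;:").lower() for w in caption.split()}
    return words & vocabulary


def hallucination_rate(caption: str, present: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of mentioned objects that are absent from the image."""
    mentioned = mentioned_objects(caption, vocabulary)
    if not mentioned:
        return 0.0  # no objects mentioned, so nothing can be hallucinated
    hallucinated = mentioned - set(present)
    return len(hallucinated) / len(mentioned)


# Hypothetical example: the image contains a dog and a frisbee, but no bench.
vocab = {"dog", "frisbee", "bench", "cat"}
rate = hallucination_rate("A dog catches a frisbee near a bench.", ["dog", "frisbee"], vocab)
print(rate)
```

In this toy case one of the three mentioned objects ("bench") is not in the image, so a fluent, plausible caption still receives a non-zero hallucination score. A caption-quality metric based on n-gram overlap would not necessarily penalize it.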

updated: Fri Feb 10 2023 04:11:26 GMT+0000 (UTC)

published: Fri Oct 14 2022 10:27:22 GMT+0000 (UTC)

arXiv

