Evaluating and Improving Factuality in Multimodal Abstractive Summarization

David Wan; Mohit Bansal

Current metrics for evaluating the factuality of abstractive document summarization achieve high correlations with human judgment, but they do not account for the vision modality and are therefore inadequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore that leverages the robustness and strong factuality detection performance of these metrics on image-summary and document-summary pairs, respectively. Next, because no meta-evaluation benchmark exists for assessing the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics, in the zero-shot setting, achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of CLIPBERTScore: selecting important images to focus on during training, and serving as a reward for reinforcement learning to improve the factuality of multimodal summary generation with respect to both automatic and human evaluation. Our data and code are publicly available at https://github.com/meetdavidwan/faithful-multimodal-summ
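The weighted combination described in the abstract can be illustrated with a minimal sketch. It assumes the combination is convex with a single weight, and that the CLIPScore (image-summary) and BERTScore (document-summary) values have already been computed elsewhere. The abstract does not give the exact form or the weight, so the function names, the `alpha` parameter, and its default value are all hypothetical, not the authors' implementation.

```python
from typing import Sequence


def clipbert_score(clip_score: float, bert_score: float, alpha: float = 0.5) -> float:
    """Combine an image-summary score and a document-summary score.

    Assumption: a convex combination with weight `alpha` on the
    image-summary (CLIPScore) term. The default 0.5 is a placeholder,
    not a value taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * clip_score + (1.0 - alpha) * bert_score


def select_best_image(image_summary_scores: Sequence[float]) -> int:
    """Return the index of the image with the highest image-summary score.

    Hypothetical helper illustrating the 'select important images'
    application: it simply picks the arg-max over precomputed scores.
    """
    if not image_summary_scores:
        raise ValueError("at least one score is required")
    return max(range(len(image_summary_scores)), key=lambda i: image_summary_scores[i])
```

In a reinforcement-learning setup such as the one the abstract mentions, a value like the one returned by `clipbert_score` could serve as the scalar reward for a sampled summary. How the reward is scaled or baselined is not stated in the abstract and is left open here.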

updated: Fri Nov 04 2022 16:50:40 GMT+0000 (UTC)

published: Fri Nov 04 2022 16:50:40 GMT+0000 (UTC)

arXiv
