Training Multimedia Event Extraction With Generated Images and Captions

Zilin Du; Yunxin Li; Xu Guo; Yidan Sun; Boyang Li

Contemporary news reporting increasingly features multimedia content, motivating research on multimedia event extraction. However, the task lacks annotated multimodal training data, and artificially generated training data suffer from distribution shift relative to real-world data. In this paper, we propose Cross-modality Augmented Multimedia Event Learning (CAMEL), which successfully utilizes artificially generated multimodal training data and achieves state-of-the-art performance. We start with two labeled unimodal datasets, one of text and one of images, and generate the missing modality using off-the-shelf image generators like Stable Diffusion and image captioners like BLIP. We then train the network on the resulting multimodal datasets. To learn robust features that are effective across domains, we devise an iterative and gradual training strategy. Extensive experiments show that CAMEL surpasses state-of-the-art (SOTA) baselines on the M2E2 benchmark. On multimedia events in particular, we outperform the prior SOTA by 4.2% F1 on event mention identification and by 9.8% F1 on argument identification, which indicates that CAMEL learns synergistic representations from the two modalities. Our work demonstrates a recipe for unleashing the power of synthetic training data in structured prediction.
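The data-construction step described in the abstract (start from labeled unimodal examples, then generate the missing modality) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Example` class, the `fill_missing_modality` function, the stub generators, and the event labels are all hypothetical names introduced here. In practice the two callables would wrap an image generator such as Stable Diffusion and a captioner such as BLIP. The iterative, gradual training strategy is not sketched, since the abstract gives no details of it.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Example:
    """One labeled example; exactly one modality may be missing (None)."""
    text: Optional[str]
    image: Optional[Any]
    label: str                                # event label (illustrative)
    synthetic_modality: Optional[str] = None  # which side was generated


def fill_missing_modality(
    examples: List[Example],
    text_to_image: Callable[[str], Any],
    image_to_text: Callable[[Any], str],
) -> List[Example]:
    """Return multimodal examples, generating whichever modality is absent.

    The original label is carried over unchanged, and the generated side
    is recorded so later training code can treat synthetic data differently.
    """
    completed = []
    for ex in examples:
        if ex.text is not None and ex.image is None:
            completed.append(
                Example(ex.text, text_to_image(ex.text), ex.label, "image"))
        elif ex.image is not None and ex.text is None:
            completed.append(
                Example(image_to_text(ex.image), ex.image, ex.label, "text"))
        else:
            completed.append(ex)  # already multimodal (or empty): keep as-is
    return completed


# Stub generators standing in for real models (assumptions for this sketch).
def fake_text_to_image(prompt: str) -> str:
    return f"<image for: {prompt}>"


def fake_image_to_text(image: Any) -> str:
    return f"a caption of {image}"


text_only = [Example("Protesters marched downtown.", None, "Demonstrate")]
image_only = [Example(None, "photo_001.jpg", "Attack")]

multimodal = fill_missing_modality(
    text_only + image_only, fake_text_to_image, fake_image_to_text)

for ex in multimodal:
    print(ex.label, "|", ex.text, "|", ex.image, "|", ex.synthetic_modality)
```

Recording which modality is synthetic is a design choice of this sketch. It would let a training loop weight or schedule generated data separately from real data, which is one plausible way to address the distribution shift the abstract mentions.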

updated: Fri Aug 11 2023 04:55:40 GMT+0000 (UTC)

published: Thu Jun 15 2023 09:01:33 GMT+0000 (UTC)

arXiv
