DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

Fuxiao Liu; Hao Tan; Chris Tensmeyer

Vision-language pretraining models have achieved great success in supporting multimedia applications by learning the alignment between images and text. Existing vision-language pretraining models primarily focus on understanding a single image associated with a single piece of text, and they often ignore alignment at the intra-document level, where a document consists of multiple sentences and multiple images. In this work, we propose DocumentCLIP, a salience-aware contrastive learning framework that pushes vision-language pretraining models to comprehend the interaction between images and longer text within documents. Our model benefits real-world multimodal document understanding for sources such as news articles, magazines, and product descriptions, which contain linguistically and visually richer content. To the best of our knowledge, we are the first to explore multimodal intra-document links through contrastive learning. In addition, we collect a large Wikipedia dataset for pretraining that covers diverse topics and structures. Experiments show that DocumentCLIP not only outperforms state-of-the-art baselines in the supervised setting, but also achieves the best zero-shot performance in the wild, as judged by human evaluation. Our code is available at https://github.com/FuxiaoLiu/DocumentCLIP.
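To make the contrastive objective mentioned in the abstract more concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss between image embeddings and text-section embeddings. It is not the DocumentCLIP salience-aware objective, whose details are not given on this page. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss: pair i of image_embs and text_embs is the positive.

    Hypothetical helper for illustration; assumes equal-length lists of
    non-zero vectors with the same dimensionality.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows are images, columns are text sections.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits_t))


# Toy example: two figures and two text sections in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9]]   # section i matches figure i
texts_swapped = [[0.1, 0.9], [0.9, 0.1]]   # sections linked to the wrong figures

loss_aligned = symmetric_contrastive_loss(images, texts_aligned)
loss_swapped = symmetric_contrastive_loss(images, texts_swapped)
print(loss_aligned < loss_swapped)
```

In this toy setup the correctly linked figure and section pairs give a lower loss than the swapped pairs, which is the behavior a contrastive objective rewards during training.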

updated: Fri Sep 01 2023 18:41:58 GMT+0000 (UTC)

published: Fri Jun 09 2023 23:51:11 GMT+0000 (UTC)

arXiv
