VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Souhail Bakkali; Zuheng Ming; Mickael Coustaty; Marçal Rusiñol; Oriol Ramos Terrades

Multimodal learning from document data has recently achieved great success, as it allows semantically meaningful features to be pre-trained as a prior for a learnable downstream task. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive experiments on public document classification datasets demonstrate the effectiveness and generality of our model on both small-scale and large-scale datasets.
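To make the idea of contracting positive pairs while contrasting negative ones concrete, the following is a minimal sketch using a generic InfoNCE-style loss in plain Python. It is not the authors' objective: the abstract gives no formula, so the function names (`info_nce`, `alignment_loss`), the temperature value, the equal weighting of the two terms, and the use of a second view per modality (`vision_aug`, `language_aug`) as intra-modality positives are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and candidates[i] form the positive pair; every
    candidates[j] with j != i acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [dot(a_i, c_j) / temperature for c_j in c]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(a)


def alignment_loss(vision, language, vision_aug, language_aug, temperature=0.1):
    """Hypothetical combination of inter- and intra-modality terms.

    Assumption: intra-modality positives come from a second view of the
    same document; the source does not specify how they are formed.
    """
    inter = 0.5 * (info_nce(vision, language, temperature)
                   + info_nce(language, vision, temperature))
    intra = 0.5 * (info_nce(vision, vision_aug, temperature)
                   + info_nce(language, language_aug, temperature))
    return inter + intra


if __name__ == "__main__":
    vision = [[1.0, 0.0], [0.0, 1.0]]
    language_aligned = [[0.9, 0.1], [0.1, 0.9]]
    language_swapped = [[0.1, 0.9], [0.9, 0.1]]
    print("aligned:", info_nce(vision, language_aligned))
    print("swapped:", info_nce(vision, language_swapped))
```

In this toy setup the loss is small when each vision embedding is closest to its own language embedding and large when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.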

updated: Thu May 11 2023 15:31:06 GMT+0000 (UTC)

published: Tue May 24 2022 12:28:12 GMT+0000 (UTC)

arXiv

