D^2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

Yunlong Liang; Fandong Meng; Jiaan Wang; Jinan Xu; Yufeng Chen; Jie Zhou

D^2TV: 多対多のマルチモーダル要約のための二重知識蒸留とターゲット指向ビジョンモデリング

多対多マルチモーダル要約 (M^3S) タスクは、任意の言語のドキュメント入力と対応する画像シーケンスを使用して、任意の言語で要約を生成することを目的としています。本質的にマルチモーダル単一言語要約 (MMS) とマルチモーダルクロス言語要約 (MXLS) で構成されます。タスク。 MMS または MXLS には多くの研究が費やされており、近年ますます注目を集めていますが、M^3S タスクに注目している研究はほとんどありません。さらに、既存の研究は主に、1) MMS のパフォーマンスを考慮せずに知識蒸留によって MXLS を強化するために MMS を利用すること、または 2) 暗黙的な学習または明示的に複雑なトレーニング目標を使用して概要に無関係な視覚的特徴をフィルタリングすることによって MMS モデルを改善することに焦点を当てています。この論文では、まず一般的かつ実践的なタスク、つまり M^3S を紹介します。さらに、M^3S タスクのための二重知識の蒸留とターゲット指向のビジョンモデリングフレームワークを提案します。具体的には、二重知識蒸留法により、MMS と MXLS の知識が相互に転送され、両方の情報が相互に要求されることが保証されます。ターゲット指向の視覚機能を提供するために、シンプルかつ効果的なターゲット指向の対比対物レンズが設計され、不必要な視覚情報を破棄します。多対多の設定に関する広範な実験により、提案されたアプローチの有効性が示されています。さらに、多対多のマルチモーダル要約 (M^3Sum) データセットも提供します。

Many-to-many multimodal summarization (M^3S) task aims to generate summaries in any language with document inputs in any language and the corresponding image sequence, which essentially comprises multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS and has obtained increasing attention in recent years, little research pays attention to the M^3S task. Besides, existing studies mainly focus on 1) utilizing MMS to enhance MXLS via knowledge distillation without considering the performance of MMS or 2) improving MMS models by filtering summary-unrelated visual features with implicit learning or explicitly complex training objectives. In this paper, we first introduce a general and practical task, i.e., M^3S. Further, we propose a dual knowledge distillation and target-oriented vision modeling framework for the M^3S task. Specifically, the dual knowledge distillation method guarantees that the knowledge of MMS and MXLS can be transferred to each other and thus mutually prompt both of them. To offer target-oriented visual features, a simple yet effective target-oriented contrastive objective is designed and responsible for discarding needless visual information. Extensive experiments on the many-to-many setting show the effectiveness of the proposed approach. Additionally, we will contribute a many-to-many multimodal summarization (M^3Sum) dataset.

updated: Mon Oct 16 2023 10:04:37 GMT+0000 (UTC)

published: Mon May 22 2023 06:47:35 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト