MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding

Woojeong Jin; Maziar Sanjabi; Shaoliang Nie; Liang Tan; Xiang Ren; Hamed Firooz

To reduce model size while retaining performance, we often rely on knowledge distillation (KD), which transfers knowledge from a large "teacher" model to a smaller "student" model. However, KD on multimodal datasets, such as those for vision-language tasks, is relatively unexplored, and digesting multimodal information is challenging because different modalities present different types of information. In this paper, we perform a large-scale empirical study to investigate the importance and effects of each modality in knowledge distillation. We also introduce a multimodal knowledge distillation framework, modality-specific distillation (MSD), which transfers knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality. The idea is to mimic the teacher's modality-specific predictions by introducing an auxiliary loss term for each modality. Because each modality has a different saliency for predictions, we define saliency scores for each modality and investigate saliency-based weighting schemes for the auxiliary losses. We further study a weight-learning approach to learn the optimal weights on these loss terms. In our empirical analysis, we examine the saliency of each modality in KD, demonstrate the effectiveness of the weighting scheme in MSD, and show that it achieves better performance than KD on four multimodal datasets.
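The abstract does not give the loss in closed form, so the following is only a minimal sketch of the general idea: a standard KD term on the multimodal prediction plus one weighted auxiliary term per modality. Everything specific here is an assumption made for illustration, not the authors' formulation: the function names (`softmax`, `kl_divergence`, `normalize_saliency`, `msd_style_loss`), the use of temperature-scaled KL divergence, the view keys `"multimodal"`, `"image"`, and `"text"`, and the choice to normalize saliency scores so that they sum to one.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def normalize_saliency(scores):
    """Hypothetical weighting: scale non-negative saliency scores to sum to 1."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}


def msd_style_loss(teacher_logits, student_logits, modality_weights,
                   temperature=2.0):
    """Assumed form: KD on the multimodal view plus weighted per-modality terms.

    teacher_logits / student_logits map a view name to a list of logits.
    The "multimodal" view is the ordinary KD term; every key in
    modality_weights adds one auxiliary term for that modality.
    """
    def distill(view):
        p_teacher = softmax(teacher_logits[view], temperature)
        p_student = softmax(student_logits[view], temperature)
        return kl_divergence(p_teacher, p_student)

    loss = distill("multimodal")
    for modality, weight in modality_weights.items():
        loss += weight * distill(modality)
    return loss


# Toy logits for a two-class task; the numbers are made up for illustration.
teacher = {"multimodal": [2.0, -1.0], "image": [0.5, 0.2], "text": [1.5, -0.5]}
student = {"multimodal": [1.0, 0.0], "image": [0.1, 0.3], "text": [0.8, 0.1]}
weights = normalize_saliency({"image": 0.2, "text": 0.6})
loss = msd_style_loss(teacher, student, weights)
```

In this sketch the weights are fixed inputs. The abstract also mentions learning them, which would mean treating `modality_weights` as trainable parameters rather than constants.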

updated: Thu Oct 21 2021 18:09:01 GMT+0000 (UTC)

published: Wed Jan 06 2021 05:45:07 GMT+0000 (UTC)

arXiv
