Geodesic Multi-Modal Mixup for Robust Fine-Tuning

Junhyuk So; Changdae Oh; Yongtaek Lim; Hoyoon Byun; Minchul Shin; Kyungwoo Song

Pre-trained large-scale models provide transferable embeddings and show promising performance on diverse downstream tasks. However, the learned embeddings have not been analyzed thoroughly, and their transferability to cross-modal tasks can be improved. This paper offers a perspective for understanding multi-modal embeddings in terms of uniformity and alignment. We find that the representations learned by multi-modal models such as CLIP form two separate embedding spaces, one for each heterogeneous data type, with little alignment between them. In addition, a large unexplored intermediate region lies between the two modalities, which lowers uniformity. This lack of alignment and uniformity may restrict the robustness and transferability of the representations on downstream tasks. To address this, we provide a new end-to-end fine-tuning method for robust representations that encourages better uniformity and alignment scores. First, we propose Geodesic Multi-Modal Mixup, which mixes image and text representations to generate hard negative samples on the hyperspherical embedding space. Second, we fine-tune the multi-modal model with a contrastive loss on these hard negatives together with the usual negative and positive samples. Through extensive experiments on retrieval, classification, and structure-awareness tasks, we demonstrate that Geodesic Multi-Modal Mixup learns robust representations and improves performance on various downstream tasks.
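The mixing step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the formula, so this assumes the mix is standard spherical linear interpolation (slerp), the usual way to move along a geodesic on a unit hypersphere. The function names `l2_normalize` and `geodesic_mixup`, the mixing weight `lam`, and the fallback threshold are hypothetical choices made for illustration, not the authors' code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project vectors onto the unit hypersphere along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def geodesic_mixup(image_emb, text_emb, lam):
    """Mix paired image/text embeddings along the geodesic between them.

    Assumption: slerp with weight `lam` on the image side, so lam=1
    returns the (normalized) image embedding and lam=0 the text embedding.
    """
    a = l2_normalize(image_emb)
    b = l2_normalize(text_emb)

    # Angle between each image/text pair.
    cos_theta = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    sin_theta = np.sin(theta)

    # When sin(theta) is ~0 the slerp weights are ill-defined, so fall
    # back to plain linear weights for those pairs.
    small = sin_theta < 1e-6
    safe_sin = np.where(small, 1.0, sin_theta)
    w_a = np.where(small, lam, np.sin(lam * theta) / safe_sin)
    w_b = np.where(small, 1.0 - lam, np.sin((1.0 - lam) * theta) / safe_sin)

    # Re-normalize to absorb floating-point drift off the sphere.
    return l2_normalize(w_a * a + w_b * b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(4, 8))  # stand-in for image embeddings
    text_emb = rng.normal(size=(4, 8))   # stand-in for text embeddings
    mixed = geodesic_mixup(image_emb, text_emb, lam=0.5)
    print(np.linalg.norm(mixed, axis=-1))  # each norm should be close to 1
```

Unlike a plain linear mix, which would fall inside the sphere, the interpolated points stay on the unit hypersphere and land in the region between the two modalities. In a fine-tuning loop, such mixed embeddings would be appended as extra negatives in the contrastive loss; that part is omitted here because the abstract does not specify it.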

updated: Wed Oct 19 2022 07:42:56 GMT+0000 (UTC)

published: Tue Mar 08 2022 07:34:52 GMT+0000 (UTC)

arXiv
