HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention

Shijie Geng; Jianbo Yuan; Yu Tian; Yuxiao Chen; Yongfeng Zhang

The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding. Its concise design gives CLIP an advantage in inference efficiency over other vision-language models with heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP does not explicitly capture the hierarchical nature of the high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches in CLIP with hierarchy-aware attention, yielding Hierarchy-aware CLIP (HiCLIP), which progressively discovers semantic hierarchies layer by layer from both images and texts in an unsupervised manner. As a result, such hierarchical aggregation significantly improves cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct a qualitative analysis of its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks.
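For readers unfamiliar with the contrastive objective the abstract builds on, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. It is not taken from the HiCLIP paper and does not implement the hierarchy-aware attention; the function names, the NumPy formulation, and the fixed `temperature=0.07` are assumptions made for illustration. It also shows why a dual-encoder design is cheap at inference: the two branches are encoded independently and compared with a single matrix product rather than cross-attention fusion layers.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a
    matched pair. The temperature value is an assumed constant here.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)

    # One matrix product compares every image with every text in the batch.
    logits = image_emb @ text_emb.T / temperature

    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the matching column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the matching row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, correctly paired embeddings produce a lower loss than the same embeddings with the pairing shuffled, which is the alignment signal that contrastive pretraining optimizes.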

updated: Mon Mar 06 2023 09:44:01 GMT+0000 (UTC)

published: Mon Mar 06 2023 09:44:01 GMT+0000 (UTC)

arXiv
