Improved Probabilistic Image-Text Representations

Sanghyuk Chun

Image-text matching (ITM), a fundamental vision-language (VL) task, suffers from inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not powerful enough to capture this ambiguity, which has prompted the exploration of probabilistic embeddings. However, the existing probabilistic ITM approach has two key shortcomings: the heavy computational burden of the Monte Carlo approximation, and loss saturation in the face of abundant false negatives. To overcome these issues, this paper presents improved Probabilistic Cross-Modal Embeddings (named PCME++), which introduce a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent loss saturation under massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. Finally, the potential applicability of PCME++ to automatic prompt tuning for zero-shot classification is shown. The code is available at https://naver-ai.github.io/pcmepp/.
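The abstract contrasts a Monte Carlo approximation with a closed-form distance but does not state the formula. As a rough illustration only, the sketch below assumes each embedding is a Gaussian with diagonal covariance (an assumption, not something stated in this abstract) and compares a sampled estimate of the expected squared Euclidean distance with its well-known closed form, the squared distance between the means plus the sum of the variances. The function names and the toy vectors are hypothetical, and this is not claimed to be the exact distance defined in the paper.

```python
import random


def closed_form_expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """E||Za - Zb||^2 for independent Gaussians with diagonal covariance.

    Equals ||mu_a - mu_b||^2 + sum(var_a + var_b); no sampling is needed.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    var_term = sum(va + vb for va, vb in zip(var_a, var_b))
    return mean_term + var_term


def monte_carlo_expected_sq_distance(mu_a, var_a, mu_b, var_b,
                                     n_samples=20000, seed=0):
    """Estimate the same quantity by drawing samples from both Gaussians."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        za = [rng.gauss(m, v ** 0.5) for m, v in zip(mu_a, var_a)]
        zb = [rng.gauss(m, v ** 0.5) for m, v in zip(mu_b, var_b)]
        total += sum((x - y) ** 2 for x, y in zip(za, zb))
    return total / n_samples


# Hypothetical 3-dimensional "image" and "text" embeddings (mean, variance).
mu_image, var_image = [0.5, -1.0, 2.0], [0.1, 0.2, 0.05]
mu_text, var_text = [0.0, -0.5, 1.5], [0.3, 0.1, 0.2]

exact = closed_form_expected_sq_distance(mu_image, var_image, mu_text, var_text)
approx = monte_carlo_expected_sq_distance(mu_image, var_image, mu_text, var_text)
print(f"closed form: {exact:.4f}, Monte Carlo estimate: {approx:.4f}")
```

The closed form costs one pass over the embedding dimensions, whereas the sampled estimate scales with the number of samples drawn per pair, which illustrates the kind of computational burden the abstract attributes to Monte Carlo approximation.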

updated: Mon May 29 2023 16:02:09 GMT+0000 (UTC)

published: Mon May 29 2023 16:02:09 GMT+0000 (UTC)

arXiv
