CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose

Xu Zhang; Wen Wang; Zhe Chen; Yufei Xu; Jing Zhang; Dacheng Tao

Animal pose estimation is challenging for existing image-based methods because of limited training data and large intra- and inter-species variance. Motivated by progress in visual-language research, we propose that pre-trained language models (e.g., CLIP) can facilitate animal pose estimation by providing rich prior knowledge for describing animal keypoints in text. However, we found that building effective connections between pre-trained language models and visual animal keypoints is non-trivial, since the gap between text-based descriptions and keypoint-based visual features of animal pose can be significant. To address this issue, we introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose (CLAMP). CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training. The adaptation is decomposed into spatial-aware and feature-aware processes, and two novel contrastive losses are devised correspondingly. In practice, CLAMP enables a new cross-modal animal pose estimation paradigm. Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings, outperforming image-based methods by a large margin. The source code will be made publicly available.
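
To make the idea of contrasting text prompts with keypoint features more concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) loss. It is an illustration only and is not the paper's spatial-aware or feature-aware loss, whose exact form the abstract does not give. The function names, the temperature value of 0.07, and the assumption that each keypoint has exactly one matching text embedding and one non-zero visual feature vector are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(text_emb, keypoint_feat, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the CLAMP losses).

    text_emb[k] is the (hypothetical) prompt embedding for keypoint k and
    keypoint_feat[k] is the visual feature for the same keypoint, so the
    matching pairs lie on the diagonal of the similarity matrix.
    """
    text = [l2_normalize(t) for t in text_emb]
    feats = [l2_normalize(v) for v in keypoint_feat]
    k = len(text)

    # Temperature-scaled cosine similarity between every prompt and every keypoint.
    logits = [
        [sum(a * b for a, b in zip(text[i], feats[j])) / temperature for j in range(k)]
        for i in range(k)
    ]

    # Text-to-keypoint direction: each row should pick its own column.
    text_to_kp = -sum(log_softmax(logits[i])[i] for i in range(k)) / k

    # Keypoint-to-text direction: each column should pick its own row.
    columns = [[logits[i][j] for i in range(k)] for j in range(k)]
    kp_to_text = -sum(log_softmax(columns[j])[j] for j in range(k)) / k

    return 0.5 * (text_to_kp + kp_to_text)


# Two toy keypoints with 2-D embeddings (made-up values).
prompts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(prompts, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss(prompts, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each prompt embedding points in the same direction as its keypoint feature, and it grows when the pairing is shuffled, which is the behaviour a contrastive objective of this kind is meant to encourage.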

updated: Sat Nov 19 2022 13:04:47 GMT+0000 (UTC)

published: Thu Jun 23 2022 14:51:42 GMT+0000 (UTC)

arXiv
