CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

Dingkang Liang; Jiahao Xie; Zhikang Zou; Xiaoqing Ye; Wei Xu; Xiang Bai

Supervised crowd counting relies heavily on manual labeling, which is difficult and expensive, especially in dense scenes. To alleviate this problem, we propose CrowdCLIP, a novel unsupervised framework for crowd counting. The core idea builds on two observations: 1) the recent contrastive pre-trained vision-language model (CLIP) has shown impressive performance on various downstream tasks; 2) there is a natural mapping between crowd patches and count text. To the best of our knowledge, CrowdCLIP is the first to investigate vision-language knowledge to solve the counting problem. Specifically, in the training stage, we exploit a multi-modal ranking loss by constructing ranking text prompts that match size-sorted crowd patches to guide image encoder learning. In the testing stage, to deal with the diversity of image patches, we propose a simple yet effective progressive filtering strategy that first selects the most likely crowd patches and then maps them into the language space with various counting intervals. Extensive experiments on five challenging datasets demonstrate that CrowdCLIP achieves superior performance compared to previous state-of-the-art unsupervised counting methods. Notably, CrowdCLIP even surpasses some popular fully supervised methods under the cross-dataset setting. The source code will be available at https://github.com/dk-liang/CrowdCLIP.
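The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact loss or the number of filtering steps, so the hinge-style formulation, the two-step filter, and every name below (`cosine`, `ranking_hinge_loss`, `filter_then_count`, `margin`, `crowd_emb`, `background_emb`, `count_prompts`) are assumptions made for illustration. Embeddings are plain lists of floats standing in for CLIP image and text features, and are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_hinge_loss(patch_embs, prompt_embs, margin=0.0):
    """Hypothetical ranking loss (an assumed formulation, not the paper's).

    patch_embs[i] is the embedding of the i-th smallest patch and
    prompt_embs[i] the embedding of the i-th smallest count prompt.
    Each patch is penalised when a mismatched prompt scores higher
    than its own rank-matched prompt (plus a margin).
    """
    k = len(patch_embs)
    if k < 2 or k != len(prompt_embs):
        raise ValueError("need at least two patches and one prompt per patch")
    loss = 0.0
    for i in range(k):
        matched = cosine(patch_embs[i], prompt_embs[i])
        for j in range(k):
            if j != i:
                mismatched = cosine(patch_embs[i], prompt_embs[j])
                loss += max(0.0, margin + mismatched - matched)
    return loss / (k * (k - 1))


def filter_then_count(patch_embs, crowd_emb, background_emb, count_prompts):
    """Hypothetical two-step test-time procedure (a simplification).

    Step 1 keeps only patches that are closer to a "crowd" prompt than
    to a "background" prompt. Step 2 assigns each kept patch the count
    value of its most similar count prompt and sums the values.
    count_prompts is a list of (count_value, embedding) pairs.
    """
    total = 0
    for emb in patch_embs:
        if cosine(emb, crowd_emb) <= cosine(emb, background_emb):
            continue  # filtered out: unlikely to contain a crowd
        best_count, _ = max(count_prompts, key=lambda cp: cosine(emb, cp[1]))
        total += best_count
    return total
```

In this sketch the loss is zero when every size-sorted patch is most similar to its own rank-matched prompt, and it grows as mismatched prompts overtake the matched one. The filter step keeps count prompts from being applied to patches that are unlikely to contain people.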

updated: Sun Apr 09 2023 12:56:54 GMT+0000 (UTC)

published: Sun Apr 09 2023 12:56:54 GMT+0000 (UTC)

arXiv
