Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation

Yuehai Chen; Jing Yang; Badong Chen; Shaoyi Du

In real-world crowd counting applications, crowd density varies greatly within an image. When facing such density variation, humans tend to locate and count individual targets in low-density regions and to estimate the number in high-density regions. We observe that CNNs focus on local information correlation through fixed-size convolution kernels, whereas the Transformer can effectively extract semantic crowd information through its global self-attention mechanism. Thus, a CNN can locate and count crowds accurately in low-density regions but struggles to perceive density properly in high-density regions. Conversely, a Transformer is highly reliable in high-density regions but fails to locate targets in sparse regions. Neither a CNN nor a Transformer alone handles this kind of density variation well. To address this problem, we propose a CNN and Transformer Adaptive Selection Network (CTASNet), which adaptively selects the appropriate counting branch for each density region. First, CTASNet generates predictions from both the CNN and the Transformer branches. Then, because the CNN suits low-density regions and the Transformer suits high-density regions, a density-guided adaptive selection module automatically combines the two predictions. Moreover, to reduce the influence of annotation noise, we introduce a correntropy-based optimal transport loss. Extensive experiments on four challenging crowd counting datasets validate the proposed method.
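The abstract does not spell out how the selection module or the loss is computed, so the following is only a minimal sketch of the two ideas under stated assumptions, not the CTASNet implementation. It assumes each branch outputs a per-pixel density map, that a separate density estimate (`guide_density`, a hypothetical input) drives a soft per-pixel weight, and that the noise-robust part of the loss behaves like the standard Gaussian-kernel correntropy-induced loss. The function names, `threshold`, `sharpness`, and `sigma` are all illustrative.

```python
import numpy as np


def adaptive_select(cnn_density, transformer_density, guide_density,
                    threshold=1.0, sharpness=4.0):
    """Blend two density maps per pixel (illustrative, not the paper's module).

    Pixels whose guide density is well below `threshold` keep mostly the CNN
    prediction; pixels well above it keep mostly the Transformer prediction.
    """
    # Assumption: a sigmoid of the guide density serves as the selection weight.
    weight = 1.0 / (1.0 + np.exp(-sharpness * (guide_density - threshold)))
    return (1.0 - weight) * cnn_density + weight * transformer_density


def correntropy_loss(error, sigma=1.0):
    """Gaussian-kernel correntropy-induced loss, evaluated elementwise.

    It grows like a squared error for small errors but saturates at 1 for
    large ones, which limits the effect of badly mislabeled annotations.
    """
    return 1.0 - np.exp(-(error ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    cnn = np.array([[1.0, 1.0]])          # CNN branch density map
    transformer = np.array([[3.0, 3.0]])  # Transformer branch density map
    guide = np.array([[0.0, 50.0]])       # sparse pixel, then dense pixel
    fused = adaptive_select(cnn, transformer, guide)
    # The predicted count is the sum over the fused density map.
    print("fused map:", fused, "count:", fused.sum())
    print("loss for errors 0.1 and 100:",
          correntropy_loss(np.array([0.1, 100.0])))
```

In this sketch the sparse pixel stays close to the CNN value and the dense pixel close to the Transformer value, while the saturating loss keeps a single large annotation error from dominating training.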

updated: Fri Oct 14 2022 01:15:04 GMT+0000 (UTC)

published: Tue Jun 21 2022 02:05:41 GMT+0000 (UTC)

arXiv
