TransCrowd: Weakly-Supervised Crowd Counting with Transformer

Dingkang Liang; Xiwu Chen; Wei Xu; Yu Zhou; Xiang Bai

Mainstream crowd counting methods usually use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is an expensive and laborious process. During the testing phase, point-level annotations are not used to evaluate counting accuracy, which means they are redundant. Hence, it is desirable to develop weakly-supervised counting methods that rely only on count-level annotations, a more economical way of labeling. Current weakly-supervised counting methods adopt a CNN to regress the total count of the crowd through an image-to-count paradigm. However, a limited receptive field for context modeling is an intrinsic limitation of these weakly-supervised CNN-based methods. They therefore cannot achieve satisfactory performance, which limits their application in the real world. The Transformer, a popular sequence-to-sequence prediction model in NLP, has a global receptive field. In this paper, we propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from a sequence-to-count perspective based on the Transformer. We observe that the proposed TransCrowd can effectively extract semantic crowd information by using the self-attention mechanism of the Transformer. To the best of our knowledge, this is the first work to adopt a pure Transformer for crowd counting research. Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all weakly-supervised CNN-based counting methods and highly competitive counting performance compared with some popular fully-supervised counting methods. Code is available at https://github.com/dk-liang/TransCrowd.
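
To make the sequence-to-count idea concrete, the following is a minimal, stdlib-only Python sketch and not the authors' implementation (their code is in the repository linked above). It assumes the image is split into fixed-size patches to form the token sequence, uses a single self-attention layer with identity query/key/value projections, mean-pools the tokens, and regresses one scalar count that is compared with the count-level label through an L1 loss. All function names, the patch size, the pooling choice, the untrained regression head, and the loss are illustrative assumptions; the abstract does not specify them.

```python
import math
import random


def image_to_patch_sequence(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tokens.append([image[top + r][left + c]
                           for r in range(patch) for c in range(patch)])
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (assumption).

    Every token attends to every other token, which is what gives the
    model a global receptive field over the whole image.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    mixed = [[sum(w_ij * v[c] for w_ij, v in zip(row, tokens)) for c in range(d)]
             for row in weights]
    return mixed, weights


def predict_count(tokens, head_weights, head_bias):
    """Attend, mean-pool the tokens, then apply a linear head to get one count."""
    mixed, _ = self_attention(tokens)
    d = len(mixed[0])
    pooled = [sum(t[c] for t in mixed) / len(mixed) for c in range(d)]
    return sum(w * x for w, x in zip(head_weights, pooled)) + head_bias


def count_l1_loss(predicted, target):
    """Count-level supervision: only the total count is needed, no point labels."""
    return abs(predicted - target)


random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 image
tokens = image_to_patch_sequence(image, patch=4)                 # 4 tokens of length 16
head_weights = [0.1] * 16                                        # untrained, hypothetical head
prediction = predict_count(tokens, head_weights, head_bias=0.0)
loss = count_l1_loss(prediction, target=3.0)                     # 3.0 is a made-up count label
```

In this sketch the only supervision signal is the scalar `target`, which mirrors the weakly-supervised setting described above: no density map and no per-person point annotation enters the loss.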

updated: Mon Apr 19 2021 08:12:50 GMT+0000 (UTC)

published: Mon Apr 19 2021 08:12:50 GMT+0000 (UTC)

arXiv
