TransCrowd: weakly-supervised crowd counting with transformers

Dingkang Liang; Xiwu Chen; Wei Xu; Yu Zhou; Xiang Bai

Mainstream crowd counting methods usually use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is an expensive and laborious process. At test time, counting accuracy is evaluated without point-level annotations, which makes them redundant. Hence, it is desirable to develop weakly-supervised counting methods that rely only on count-level annotations, a more economical way of labeling. Current weakly-supervised counting methods adopt a CNN to regress the total count of the crowd through an image-to-count paradigm. However, the limited receptive field available for context modeling is an intrinsic limitation of these weakly-supervised CNN-based methods. As a result, they cannot achieve satisfactory performance and have limited applications in the real world. The transformer is a popular sequence-to-sequence prediction model in natural language processing (NLP) that has a global receptive field. In this paper, we propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from a sequence-to-count perspective based on transformers. We observe that the proposed TransCrowd can effectively extract semantic crowd information by using the self-attention mechanism of the transformer. To the best of our knowledge, this is the first work to adopt a pure transformer for crowd counting research. Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods and highly competitive counting performance compared with some popular fully-supervised counting methods.
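
To make the sequence-to-count idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It splits an image into patches, treats the flattened patches as a token sequence, applies one single-head self-attention layer so that every patch can attend to every other patch (the global receptive field), and regresses a single count. The patch size, embedding width, random weights, single attention layer, average pooling, and the absolute-error loss are all assumptions made for illustration; the abstract specifies none of them.

```python
import numpy as np


def image_to_patch_sequence(image, patch):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)  # (N, patch * patch * C)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # (N, N) weights: each patch attends to every patch in the image.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


def init_params(patch=16, channels=3, dim=32, seed=0):
    """Random, untrained weights (hypothetical sizes, for illustration only)."""
    rng = np.random.default_rng(seed)
    d_in = patch * patch * channels
    return {
        "embed": rng.normal(0.0, 0.02, (d_in, dim)),
        "w_q": rng.normal(0.0, 0.02, (dim, dim)),
        "w_k": rng.normal(0.0, 0.02, (dim, dim)),
        "w_v": rng.normal(0.0, 0.02, (dim, dim)),
        "head": rng.normal(0.0, 0.02, (dim,)),
    }


def sequence_to_count(image, params, patch=16):
    """Map an image to one scalar count via a patch sequence."""
    seq = image_to_patch_sequence(image, patch)
    tokens = seq @ params["embed"]                 # linear patch embedding
    attended, weights = self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    pooled = (tokens + attended).mean(axis=0)      # residual, then average pooling
    count = float(pooled @ params["head"])         # scalar regression head
    return count, weights


def count_loss(predicted, ground_truth):
    """Absolute error against a count-level label (assumed loss)."""
    return abs(predicted - ground_truth)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    image = rng.random((64, 64, 3))
    params = init_params()
    predicted, attn = sequence_to_count(image, params)
    print("attention matrix shape:", attn.shape)
    print("loss against a label of 120 people:", count_loss(predicted, 120.0))
```

The only supervision used here is the scalar passed to `count_loss`, which is what makes the setting weakly supervised: no per-person point annotations or density maps appear anywhere in the pipeline.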

updated: Thu Sep 08 2022 07:08:18 GMT+0000 (UTC)

published: Mon Apr 19 2021 08:12:50 GMT+0000 (UTC)

arXiv
