Global-Local Temporal Representations For Video Person Re-Identification



This paper proposes the Global-Local Temporal Representation (GLTR) to exploit the multi-scale temporal cues in video sequences for video person Re-Identification (ReID). GLTR is constructed by first modeling the short-term temporal cues among adjacent frames, then capturing the long-term relations among non-consecutive frames. Specifically, the short-term temporal cues are modeled by parallel dilated convolutions with different temporal dilation rates to represent the motion and appearance of pedestrians. The long-term relations are captured by a temporal self-attention model to alleviate the occlusions and noise in video sequences. The short-term and long-term temporal cues are aggregated as the final GLTR by a simple single-stream CNN. GLTR substantially outperforms existing features learned with body part cues or metric learning on four widely used video ReID datasets. For instance, it achieves a Rank-1 accuracy of 87.02% on the MARS dataset without re-ranking, outperforming the current state of the art.
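The two temporal stages described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the dilation rates `(1, 2, 4)`, the shared scalar 3-tap kernel, the projection-free dot-product attention, and the final mean pooling over time are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math

def dilated_temporal_conv(frames, kernel, dilation):
    """3-tap temporal convolution over per-frame feature vectors (zero-padded)."""
    T, D = len(frames), len(frames[0])
    out = []
    for t in range(T):
        vec = [0.0] * D
        for k, w in enumerate(kernel):
            src = t + (k - 1) * dilation  # taps at t-d, t, t+d
            if 0 <= src < T:
                for d in range(D):
                    vec[d] += w * frames[src][d]
        out.append(vec)
    return out

def short_term_cues(frames, kernel, rates=(1, 2, 4)):
    """Run parallel dilated branches and concatenate them per frame.
    The rates are an assumption; the abstract only says they differ."""
    branches = [dilated_temporal_conv(frames, kernel, r) for r in rates]
    return [sum((b[t] for b in branches), []) for t in range(len(frames))]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(frames):
    """Scaled dot-product similarity between every pair of frames."""
    scale = math.sqrt(len(frames[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in frames])
        for q in frames
    ]

def temporal_self_attention(frames):
    """Each output frame is an attention-weighted sum over all frames.
    Assumption: no learned query/key/value projections."""
    D = len(frames[0])
    return [
        [sum(w * f[d] for w, f in zip(row, frames)) for d in range(D)]
        for row in attention_weights(frames)
    ]

def gltr_sketch(frames, kernel=(0.25, 0.5, 0.25)):
    """Short-term cues, then long-term relations, then mean pooling over time
    (the pooling step is an assumption standing in for the final aggregation)."""
    local = short_term_cues(frames, kernel)
    global_ = temporal_self_attention(local)
    T = len(global_)
    return [sum(col) / T for col in zip(*global_)]

# Four frames with 2-dimensional features yield one 6-dimensional vector
# (three dilation branches times two feature dimensions).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
video_feature = gltr_sketch(frames)
```

Because attention lets every frame draw on all others, a frame corrupted by occlusion can be smoothed by similar clean frames elsewhere in the sequence, while the dilated branches only see a fixed local neighbourhood.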

updated: Tue Aug 27 2019 06:57:03 GMT+0000 (UTC)

published: Tue Aug 27 2019 06:57:03 GMT+0000 (UTC)

arXiv

