Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network

Honghu Pan; Qiao Liu; Yongyong Chen; Yunqi He; Yuan Zheng; Feng Zheng; Zhenyu He

Existing methods for video-based person re-identification (ReID) mainly learn the appearance feature of a given pedestrian via a feature extractor and a feature aggregator. However, appearance models fail when different pedestrians look alike. Because pedestrians differ in walking posture and body proportions, we propose to learn a discriminative pose feature in addition to the appearance feature for video retrieval. Specifically, we implement a two-branch architecture that learns the appearance feature and the pose feature separately and concatenates them for inference. To learn the pose feature, we first detect the pedestrian pose in each frame with an off-the-shelf pose detector and construct a temporal graph from the pose sequence. We then use a recurrent graph convolutional network (RGCN) to learn the node embeddings of the temporal pose graph; the RGCN uses a global information propagation mechanism that simultaneously performs neighborhood aggregation over intra-frame nodes and message passing among inter-frame graphs. Finally, we propose a dual-attention method, consisting of node-attention and time-attention, to obtain the temporal graph representation from the node embeddings; it uses self-attention to learn the importance of each node and each frame. We evaluate the proposed method on three video-based ReID datasets, Mars, DukeMTMC and iLIDS-VID; the experimental results show that the learned pose feature can effectively improve the performance of existing appearance models.
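The dual-attention readout and the final concatenation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the attention scores are computed, so the sketch assumes each score is a dot product with a learned query vector. The names `attention_pool`, `dual_attention_pool`, `fuse`, `node_query` and `time_query` are hypothetical, and the node embeddings stand in for the RGCN output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted sum of `vectors`; weights are softmax(dot(vector, query)).

    Assumption: a single learned query vector scores each item. The paper's
    actual self-attention form is not given in the abstract.
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def dual_attention_pool(node_embeddings, node_query, time_query):
    """Reduce a T x N x D temporal pose graph to one D-dimensional vector.

    Node-attention pools the N joints of each frame into a frame vector,
    then time-attention pools the T frame vectors into a sequence vector.
    """
    frame_vectors = [attention_pool(frame, node_query) for frame in node_embeddings]
    return attention_pool(frame_vectors, time_query)


def fuse(appearance_feature, pose_feature):
    """Concatenate the two branch outputs for retrieval at inference time."""
    return list(appearance_feature) + list(pose_feature)


# Toy input: T=2 frames, N=2 joints, D=2 embedding dimensions.
embeddings = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# Zero queries give uniform attention, so the readout is a plain mean.
pose_feature = dual_attention_pool(embeddings, [0.0, 0.0], [0.0, 0.0])
descriptor = fuse([0.1, 0.2, 0.3], pose_feature)
```

With non-zero queries, joints and frames whose embeddings align with the query receive larger weights, which is the sense in which the attention learns the importance of each node and each frame.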

updated: Fri Sep 23 2022 13:20:33 GMT+0000 (UTC)

published: Fri Sep 23 2022 13:20:33 GMT+0000 (UTC)

arXiv
