TVPR: Text-to-Video Person Retrieval and a New Benchmark

Fan Ni; Xu Zhang; Jianhui Wu; Guan-Nan Dong; Aichun Zhu; Hui Liu; Yue Zhang

Most existing methods for text-based person retrieval focus on text-to-image person retrieval. However, isolated frames carry no dynamic information, so performance suffers when the person is occluded in a given frame or when the textual description includes variable motion details. In this paper, we propose a new task, Text-to-Video Person Retrieval (TVPR), which aims to overcome the limitations of isolated frames. Since no dataset or benchmark describes person videos in natural language, we construct a large-scale cross-modal person video dataset with detailed natural language annotations, such as each person's appearance, actions, and interactions with the environment, termed the Text-to-Video Person Re-identification (TVPReid) dataset. For this task, we propose a Text-to-Video Person Retrieval Network (TVPRN). Specifically, TVPRN obtains video representations by fusing the visual and motion representations of person videos, which helps it handle temporal occlusion and the absence of variable motion details in isolated frames. Meanwhile, we employ pre-trained BERT to obtain caption representations and the relationship between caption and video representations, which identifies the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, we conduct extensive experiments on the TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be made publicly available to benefit future research.
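
The retrieval idea described in the abstract (fuse visual and motion representations into one video representation, then rank videos against a caption representation) can be illustrated with a toy sketch. This is not the TVPRN implementation: the abstract does not specify the fusion operator or the matching function, so concatenation followed by L2 normalization and cosine similarity are assumptions made here, and the names `fuse`, `cosine`, and `rank_videos` are hypothetical. In a real system the vectors would come from a pre-trained BERT and from video encoders rather than being written by hand.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, motion):
    """Assumed fusion: concatenate the normalized visual and motion vectors.

    The paper's actual fusion may differ; this is only a stand-in.
    """
    return l2_normalize(l2_normalize(visual) + l2_normalize(motion))


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(caption_vec, videos):
    """Return (video_id, score) pairs sorted from most to least similar.

    `videos` maps a video id to a (visual_vec, motion_vec) pair. The caption
    vector must have the same length as the fused video vector.
    """
    scored = [
        (video_id, cosine(caption_vec, fuse(visual, motion)))
        for video_id, (visual, motion) in videos.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hand-written toy vectors standing in for learned embeddings.
    toy_videos = {
        "clip_a": ([1.0, 0.0], [0.0, 1.0]),
        "clip_b": ([0.0, 1.0], [1.0, 0.0]),
    }
    toy_caption = [1.0, 0.0, 0.0, 1.0]
    for video_id, score in rank_videos(toy_caption, toy_videos):
        print(video_id, round(score, 3))
```

Keeping the caption and fused video vectors in a shared space of equal dimension is what makes a single similarity score meaningful; how TVPRN learns that shared space is not described in the abstract.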

updated: Fri Jul 14 2023 06:34:00 GMT+0000 (UTC)

published: Fri Jul 14 2023 06:34:00 GMT+0000 (UTC)

arXiv
