2nd Place Solution in the HC-STVG Track of the Person in Context Challenge 2021

Yi Yu; Xinying Wang; Wei Hu; Xun Luo; Cheng Li

In this technical report, we present our solution for spatio-temporally localizing a person in an untrimmed video based on a sentence description. Our solution achieves the second-best vIoU (0.30025) in the HC-STVG track of the 3rd Person in Context (PIC) Challenge. It consists of three parts: 1) Human attribute information is extracted from the sentence; it is used to filter out tube proposals in the testing phase and to supervise our classifier in learning appearance information in the training phase. 2) We detect humans with YoloV5 and track them with the DeepSort framework, replacing the original ReID network with FastReID. 3) A visual transformer is used to extract cross-modal representations for localizing the spatio-temporal tube of the target person.
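The abstract reports a vIoU score without defining it. As a minimal sketch, assuming the definition commonly used for spatio-temporal video grounding benchmarks (the sum of per-frame box IoU over frames where prediction and ground truth overlap in time, divided by the number of frames in the temporal union), the metric for a single tube could be computed as follows. The names `box_iou` and `viou` and the frame-indexed dictionary layout are illustrative assumptions, not taken from the report or the challenge toolkit.

```python
from typing import Dict, Tuple

# A box is (x1, y1, x2, y2) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU of one predicted tube against one ground-truth tube.

    Both tubes map frame index -> box. Frames present in only one
    tube contribute zero IoU but still count in the denominator.
    """
    frames_union = set(pred) | set(gt)
    frames_inter = set(pred) & set(gt)
    if not frames_union:
        return 0.0
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


if __name__ == "__main__":
    pred = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
    gt = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
    print(viou(pred, gt))
```

Under this assumed definition, a prediction is penalized both for loose boxes and for temporal boundaries that are too long or too short, since every extra or missing frame enlarges the denominator. A benchmark score would be the mean of this value over all test samples.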

updated: Mon Jun 14 2021 05:18:34 GMT+0000 (UTC)

published: Mon Jun 14 2021 05:18:34 GMT+0000 (UTC)

arXiv
