Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Ziyun Zeng; Yuying Ge; Xihui Liu; Bin Chen; Ping Luo; Shu-Tao Xia; Yixiao Ge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that this is because they capture only pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., we leverage naturally transcribed speech to provide noisy but useful semantics over time. Our method forces the vision model to contextualize what is happening over time so that it can re-organize the narrative transcripts, and it applies seamlessly to large-scale uncurated video data in the real world. Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing. The code is available at https://github.com/TencentARC/TVTS.
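
To make the transcript-sorting idea concrete, the following is a minimal sketch of how a single training example for such a pretext task might be built: time-ordered ASR segments are shuffled, and the label for each shuffled segment is its original chronological position. This is an illustration only, not the authors' implementation. The function names `make_sorting_example` and `restore_order` are hypothetical, and the video encoder and the attention between transcripts and video features described in the abstract are deliberately left out.

```python
import random
from typing import List, Sequence, Tuple


def make_sorting_example(
    segments: Sequence[str], rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Shuffle time-ordered transcript segments (hypothetical helper).

    Returns the shuffled segments and, for each shuffled segment, its
    original chronological index. A sorting head would be trained to
    predict these indices.
    """
    order = list(range(len(segments)))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]
    return shuffled, order


def restore_order(shuffled: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Put each shuffled segment back at its (predicted) position."""
    restored = [""] * len(shuffled)
    for text, pos in zip(shuffled, positions):
        restored[pos] = text
    return restored


if __name__ == "__main__":
    # Made-up ASR segments, assumed to be in chronological order.
    asr_segments = [
        "first crack two eggs into the bowl",
        "now whisk them until smooth",
        "pour the mixture into the hot pan",
        "and fold the omelette in half",
    ]
    shuffled, labels = make_sorting_example(asr_segments, random.Random(0))
    # With perfect predictions, the original narrative is recovered.
    print(restore_order(shuffled, labels) == asr_segments)
```

In a full system, the predicted positions would come from a model that reads the shuffled segments while attending to video features, so recovering the order requires understanding what happens in the video over time.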

updated: Sun Mar 12 2023 13:49:09 GMT+0000 (UTC)

published: Fri Sep 30 2022 07:39:48 GMT+0000 (UTC)

arXiv
