HierVL: Learning Hierarchical Video-Language Embeddings

Kumar Ashutosh; Rohit Girdhar; Lorenzo Torresani; Kristen Grauman

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art (SotA) results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, and HowTo100M) in both zero-shot and fine-tuned settings.
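
To make the two-level objective described in the abstract concrete, the following is a minimal NumPy sketch of a hierarchical contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature value, mean pooling of clip embeddings into a video embedding, and the names `symmetric_info_nce`, `hierarchical_loss`, and `video_weight` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(visual, text, temperature=0.07):
    """Contrastive loss over N matched pairs; row i of each input is a pair.

    visual, text: arrays of shape (N, D).
    Assumption: a standard symmetric InfoNCE loss, used here for illustration.
    """
    v = l2_normalize(visual)
    t = l2_normalize(text)
    logits = v @ t.T / temperature              # (N, N) similarity matrix
    idx = np.arange(len(v))
    v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # visual -> text
    t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> visual
    return 0.5 * (v2t + t2v)


def hierarchical_loss(clip_emb, narration_emb, summary_emb, video_weight=1.0):
    """Clip-level plus video-level alignment.

    clip_emb:      (V, C, D) embeddings of C clips from each of V long videos
    narration_emb: (V, C, D) embeddings of the per-clip text descriptions
    summary_emb:   (V, D)    embeddings of the per-video text summaries
    """
    V, C, D = clip_emb.shape
    # Clip level: align every clip with its own timestamped narration.
    clip_loss = symmetric_info_nce(
        clip_emb.reshape(V * C, D), narration_emb.reshape(V * C, D)
    )
    # Video level: aggregate clips into one video embedding and align it
    # with the summary text. Assumption: simple mean pooling as aggregator.
    video_emb = clip_emb.mean(axis=1)
    video_loss = symmetric_info_nce(video_emb, summary_emb)
    return clip_loss + video_weight * video_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(4, 3, 16))       # 4 videos, 3 clips each
    narrations = rng.normal(size=(4, 3, 16))  # stand-in text embeddings
    summaries = rng.normal(size=(4, 16))
    print(hierarchical_loss(clips, narrations, summaries))
```

In this sketch the clip-level term plays the role of the "what is happening" constraint and the video-level term the "why it is happening" constraint; `video_weight` is a hypothetical knob for balancing the two.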

updated: Thu Jun 08 2023 14:29:35 GMT+0000 (UTC)

published: Thu Jan 05 2023 21:53:19 GMT+0000 (UTC)

arXiv
