TubeDETR: Spatio-Temporal Video Grounding with Transformers

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.
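To make the two central notions in the abstract concrete, the following is a minimal sketch of what a "spatio-temporal tube" and "sparsely sampled frames" could look like as data. It is not the authors' implementation: the `Tube` class, the `(x1, y1, x2, y2)` box convention, and the uniform-stride `sample_sparse` helper are all assumptions made for illustration, since the abstract does not specify how tubes are represented or how frames are sampled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Illustrative spatio-temporal tube: one bounding box per frame."""

    boxes: Dict[int, Box]  # frame index -> box for that frame

    @property
    def start(self) -> int:
        """First frame index covered by the tube (temporal start)."""
        return min(self.boxes)

    @property
    def end(self) -> int:
        """Last frame index covered by the tube (temporal end)."""
        return max(self.boxes)


def sample_sparse(num_frames: int, stride: int) -> List[int]:
    """Return every `stride`-th frame index, starting at frame 0.

    Uniform striding is an assumption; it is only one simple way to
    obtain a sparse subset of frames.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(range(0, num_frames, stride))


# A tube spanning frames 3 to 5 of a hypothetical 10-frame video.
tube = Tube(boxes={
    3: (10.0, 20.0, 50.0, 80.0),
    4: (12.0, 21.0, 52.0, 81.0),
    5: (14.0, 22.0, 54.0, 82.0),
})
sparse_frames = sample_sparse(num_frames=10, stride=4)
```

Under this representation, a grounding model would take the video and the text query as input and predict both the temporal extent (`start`, `end`) and the per-frame boxes.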

updated: Thu Jun 09 2022 13:22:50 GMT+0000 (UTC)

published: Wed Mar 30 2022 16:31:49 GMT+0000 (UTC)

arXiv
