ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

Kun Li; Dan Guo; Meng Wang

The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions. Existing proposal-free methods are trapped in complex interactions between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. In particular, we present a simple but effective proposal-free framework, namely Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are twofold. (1) The token is unrelated to the video or the query, and so avoids data bias toward the original video and query. (2) The token simultaneously performs global context aggregation from video and query features. First, we employ a shared feature encoder to project both video and query into a joint feature space, and then perform cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Next, we concatenate a learnable regression token [REG] with the video and query features as the input of a vision-language transformer. Finally, we use the [REG] token to predict the target moment, and use the visual features to constrain the foreground and background probabilities at each timestamp. The proposed ViGT performs well on three public datasets: ANet Captions, TACoS and YouCookII. Extensive ablation studies and qualitative analysis further validate the interpretability of ViGT.
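The regression-token idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single self-attention layer with random, untrained weights, omits the shared feature encoder, the co-attention step and the foreground/background constraint, and assumes the output is a normalized (start, end) pair squashed by a sigmoid. All names (`init_params`, `predict_boundary`, `w_head`, and so on) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(dim, seed=0):
    """Random, untrained weights. Every name here is hypothetical."""
    rng = np.random.default_rng(seed)
    scale = dim ** -0.5
    return {
        "reg": rng.normal(size=dim) * scale,            # the [REG] token
        "w_q": rng.normal(size=(dim, dim)) * scale,     # attention projections
        "w_k": rng.normal(size=(dim, dim)) * scale,
        "w_v": rng.normal(size=(dim, dim)) * scale,
        "w_head": rng.normal(size=(dim, 2)) * scale,    # regression head
        "b_head": np.zeros(2),
    }


def predict_boundary(video, query, params):
    """Regress a normalized boundary from the [REG] position only.

    video: (T, dim) clip features; query: (L, dim) word features.
    Both are assumed to already live in a joint feature space.
    """
    # Sequence layout: [REG], then T video features, then L query features.
    seq = np.concatenate([params["reg"][None, :], video, query], axis=0)

    # One single-head self-attention layer with a residual connection,
    # so [REG] can aggregate context from both modalities at once.
    q = seq @ params["w_q"]
    k = seq @ params["w_k"]
    v = seq @ params["w_v"]
    attn = softmax(q @ k.T / np.sqrt(seq.shape[1]))
    out = seq + attn @ v

    # Only the [REG] output (row 0) feeds the boundary regressor.
    logits = out[0] @ params["w_head"] + params["b_head"]
    return 1.0 / (1.0 + np.exp(-logits))  # two values in [0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    video = rng.normal(size=(32, 16))   # 32 clips, 16-d features
    query = rng.normal(size=(8, 16))    # 8 words, 16-d features
    print(predict_boundary(video, query, init_params(16)))
```

Because the weights are untrained, the two outputs carry no meaning and the sketch does not enforce start < end; the point is only that the prediction is read from the token's position rather than from the fused video or query features.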

updated: Fri Aug 11 2023 08:30:08 GMT+0000 (UTC)

published: Fri Aug 11 2023 08:30:08 GMT+0000 (UTC)

arXiv
