Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention

Cristian Rodriguez-Opazo; Edison Marrese-Taylor; Fatemeh Sadat Saleh; Hongdong Li; Stephen Gould

This paper studies the problem of temporal moment localization in a long untrimmed video using natural language as the query. Given an untrimmed video and a query sentence, the goal is to determine the start and end of the visual moment in the video that corresponds to the query. While previous works have tackled this task with a propose-and-rank approach, we introduce a more efficient, end-to-end trainable, proposal-free approach that relies on three key components: a dynamic filter that transfers language information to the visual domain, a new loss function that guides the model to attend to the most relevant parts of the video, and soft labels that model annotation uncertainty. We evaluate our method on two benchmark datasets, Charades-STA and ActivityNet-Captions. Experimental results show that our approach outperforms state-of-the-art methods on both datasets.
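
To make the idea of soft labels concrete, the following is a minimal sketch and not the authors' implementation. It assumes that annotation uncertainty around an annotated boundary can be represented by a normalized Gaussian-shaped distribution over discrete video time steps. The function name `soft_labels`, the parameter `sigma`, and the Gaussian shape itself are illustrative assumptions that the abstract does not specify.

```python
import math


def soft_labels(num_steps, center, sigma=1.0):
    """Return a probability distribution over time steps peaked at `center`.

    Hypothetical helper: the Gaussian shape and `sigma` are assumptions made
    for illustration; the abstract only says that soft labels are used.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Unnormalized Gaussian weight for every time step.
    weights = [
        math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
        for t in range(num_steps)
    ]
    total = sum(weights)
    # Normalize so the weights sum to 1 and can act as a target distribution.
    return [w / total for w in weights]


# Example: a 10-step video whose annotated start falls on step 3.
start_target = soft_labels(num_steps=10, center=3, sigma=1.0)
```

Compared with a one-hot target, a distribution like `start_target` gives partial credit to time steps near the annotated boundary, which is one plausible way to reflect disagreement between annotators about exactly where a moment starts or ends.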

updated: Thu Mar 12 2020 10:02:31 GMT+0000 (UTC)

published: Tue Aug 20 2019 09:22:54 GMT+0000 (UTC)

arXiv
