A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

Bowen Zhang; Hexiang Hu; Joonseok Lee; Ming Zhao; Sheide Chammas; Vihan Jain; Eugene Ie; Fei Sha

Identifying a short segment in a long video that semantically matches a text query is a challenging task with important potential applications in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but localizing undefined segments in untrimmed and unsegmented videos is difficult because exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER), which encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely video retrieval, segment temporal localization, and masked language modeling. We conduct extensive experiments to evaluate our model on moment localization in video corpus on the ActivityNet Captions and TVR datasets. Our approach outperforms previous methods as well as strong baselines, establishing a new state of the art for this task.
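
To make the intractability claim concrete, the following is a minimal sketch, not the HAMMER model itself. It counts the contiguous candidate segments in a single untrimmed video and shows how the same frames can be grouped into coarser clips. The function names, the 3600-frame video, and the clip length of 32 frames are all illustrative assumptions that do not come from the abstract.

```python
from typing import List, Tuple


def count_segments(num_frames: int) -> int:
    """Count contiguous segments (start <= end) in a video of num_frames frames."""
    # There are n choices of start and, for each, the remaining ends: n(n+1)/2.
    return num_frames * (num_frames + 1) // 2


def split_into_clips(num_frames: int, clip_len: int) -> List[Tuple[int, int]]:
    """Partition frame indices into consecutive clips as (start, end) with end exclusive.

    clip_len is a hypothetical hyperparameter chosen only for illustration.
    """
    return [
        (start, min(start + clip_len, num_frames))
        for start in range(0, num_frames, clip_len)
    ]


if __name__ == "__main__":
    num_frames = 3600  # assumed: e.g. a one-hour video sampled at 1 frame per second
    clips = split_into_clips(num_frames, clip_len=32)
    print("candidate segments:", count_segments(num_frames))
    print("coarse-grained clips:", len(clips))
```

The number of candidate segments grows quadratically with video length, and that count applies to every video in the corpus, whereas the number of clips grows only linearly. This is one way to see why a representation at more than one temporal granularity is attractive.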

updated: Wed Nov 18 2020 02:42:36 GMT+0000 (UTC)

published: Wed Nov 18 2020 02:42:36 GMT+0000 (UTC)

arXiv
