End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

Mengze Li; Tianbao Wang; Haoyu Zhang; Shengyu Zhang; Zhou Zhao; Jiaxu Miao; Wenqiao Zhang; Wenming Tan; Jin Wang; Peng Wang; Shiliang Pu; Fei Wu



Natural language spatial video grounding aims to detect the relevant objects in video frames with descriptive sentences as the query. Despite great advances, most existing methods rely on dense video frame annotations, which require a tremendous amount of human effort. To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding and learn to ground natural language in all video frames with only one frame labeled, in an end-to-end manner. One major challenge of end-to-end one-shot video grounding is the existence of video frames that are irrelevant to either the language query or the labeled frames. Another challenge relates to the limited supervision, which might result in ineffective representation learning. To address these challenges, we designed an end-to-end model via Information Tree for One-Shot video grounding (IT-OS). Its key module, the information tree, can eliminate the interference of irrelevant frames based on branch search and branch cropping techniques. In addition, several self-supervised tasks are proposed based on the information tree to improve representation learning under insufficient labeling. Experiments on the benchmark dataset demonstrate the effectiveness of our model.

updated: Tue Mar 15 2022 15:50:45 GMT+0000 (UTC)

published: Tue Mar 15 2022 15:50:45 GMT+0000 (UTC)

arXiv
