Connecting Vision and Language with Video Localized Narratives

Paul Voigtlaender; Soravit Changpinyo; Jordi Pont-Tuset; Radu Soricut; Vittorio Ferrari

We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question answering tasks, and provide reference results from strong baseline models. Our annotations are available at https://google.github.io/video-localized-narratives/.
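
The word grounding described in the abstract (each spoken word paired with the mouse trace segment drawn while it was spoken) can be illustrated with a minimal sketch. The class names, field names, and half-open time-interval rule below are assumptions made for illustration only; they are not the authors' annotation format or tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TracePoint:
    """One mouse sample (hypothetical layout)."""
    t: float  # seconds since the start of the recording
    x: float  # horizontal position, assumed normalized to [0, 1]
    y: float  # vertical position, assumed normalized to [0, 1]


@dataclass
class TimedWord:
    """One spoken word with its time span (hypothetical layout)."""
    text: str
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive


def ground_words(words: List[TimedWord],
                 trace: List[TracePoint]) -> Dict[int, List[TracePoint]]:
    """Map each word index to the trace points recorded while it was spoken.

    A point belongs to a word if start <= t < end, so a point that falls
    exactly on a word boundary is assigned to the later word only.
    """
    return {
        i: [p for p in trace if w.start <= p.t < w.end]
        for i, w in enumerate(words)
    }


# Toy data: two words and three mouse samples.
words = [TimedWord("a", 0.0, 0.2), TimedWord("dog", 0.2, 0.6)]
trace = [
    TracePoint(0.1, 0.30, 0.40),
    TracePoint(0.3, 0.50, 0.55),
    TracePoint(0.5, 0.52, 0.60),
]
segments = ground_words(words, trace)
```

Here `segments[1]` holds the two trace points recorded while "dog" was spoken, which is the trace segment that grounds that word.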

updated: Wed Mar 15 2023 10:30:18 GMT+0000 (UTC)

published: Wed Feb 22 2023 09:04:00 GMT+0000 (UTC)

arXiv
