Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Mathew Monfort; SouYoung Jin; Alexander Liu; David Harwath; Rogerio Feris; James Glass; Aude Oliva

When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that the observer deems unimportant. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g. actions, objects, scenes and sentiment) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain. To address this, we present the Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events. We collect our descriptions using audio recordings to ensure that they remain as natural and concise as possible while allowing us to scale to the size of a large classification dataset. To make use of our proposed dataset, we present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and evaluate our models on video/caption retrieval on multiple datasets. We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.

updated: Mon May 10 2021 16:30:46 GMT+0000 (UTC)

published: Mon May 10 2021 16:30:46 GMT+0000 (UTC)

arXiv
