Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions

Michael Joannou; Pia Rotshtein; Uta Noppeney

We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time (MIT) dataset. On each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes annotations for 57,177 audiovisual videos, each independently evaluated by 3 of the 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes with 60 videos each (960 videos in total). We also provide 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored whether AVMIT annotations and feature embeddings improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) was trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, training exclusively on audiovisual events increased top-1 accuracy by 2.71-5.94%, even outweighing the benefit of a three-fold increase in training data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, particularly for research questions in which audiovisual correspondence is of critical importance.
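As a rough illustration of how per-video ratings from 3 annotators might be turned into an "AVMIT-filtered" training subset, the sketch below keeps a video when a majority of its raters marked the labelled event as both present and most prominent. The record layout, the function name `filter_audiovisual`, and the majority threshold are all assumptions made for this example; the abstract does not state the filtering rule or the release format.

```python
from collections import defaultdict

# Hypothetical annotation records:
# (video_id, rater_id, event_present, most_prominent).
# Field names and values are illustrative, not the AVMIT release format.
annotations = [
    ("v1", "p01", True, True),
    ("v1", "p04", True, True),
    ("v1", "p09", True, False),
    ("v2", "p02", True, False),
    ("v2", "p05", False, False),
    ("v2", "p11", False, False),
    ("v3", "p03", True, True),
    ("v3", "p06", True, True),
    ("v3", "p07", False, False),
]


def filter_audiovisual(records, min_votes=2):
    """Return sorted video ids for which at least `min_votes` raters marked
    the event as both present and the most prominent feature.

    The threshold is an assumed majority-of-3 rule, not the authors' criterion.
    """
    votes = defaultdict(int)
    seen = set()
    for video_id, _rater, present, prominent in records:
        seen.add(video_id)
        if present and prominent:
            votes[video_id] += 1
    return sorted(v for v in seen if votes[v] >= min_votes)


kept = filter_audiovisual(annotations)
print(kept)  # ['v1', 'v3']
```

Raising `min_votes` to 3 would require unanimous agreement and, on this toy data, would keep no videos; the appropriate threshold depends on how the released annotations are meant to be used.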

updated: Fri Aug 18 2023 17:13:45 GMT+0000 (UTC)

published: Fri Aug 18 2023 17:13:45 GMT+0000 (UTC)

arXiv
