ASOD60K: An Audio-Induced Salient Object Detection Dataset for Panoramic Videos

Yi Zhang

Understanding what humans pay attention to in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-/object-level saliency detection tasks, we focus on audio-induced salient object detection (SOD), where the salient objects are labeled with the guidance of audio-induced eye movements. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy and thus stands out in richness, diversity, and quality. Specifically, each sequence is labeled with its super-class and sub-class, and the objects of each sub-class are further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modelling, e.g., determining the major challenges for existing SOD models and predicting scanpaths to study the long-term eye fixation behaviors of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study can serve as a good starting point for advancing SOD research towards panoramic videos. The dataset and benchmark will be made publicly available at https://github.com/PanoAsh/ASOD60K.

updated: Fri Nov 12 2021 07:14:34 GMT+0000 (UTC)

published: Sat Jul 24 2021 15:14:20 GMT+0000 (UTC)

arXiv
