Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Kyle Min; Sourya Roy; Subarna Tripathi; Tanaya Guha; Somdeb Majumdar

Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded as a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to their explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL.
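
The graph construction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SPELL implementation in the linked repository: each node is reduced to a hypothetical (frame_index, person_id) pair with no audio-visual features, and the temporal-window parameter `max_frame_gap` and every function name are hypothetical. The sketch builds intra-frame (inter-person) edges and cross-frame (same-person) edges, compares the edge count with a fully connected graph over the same nodes, and uses a breadth-first search to show how far in time a node's k-hop neighbourhood reaches.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical node record: (frame_index, person_id). Real SPELL nodes also
# carry audio-visual features; those are omitted in this sketch.
Node = Tuple[int, str]


def build_edges(nodes: List[Node], max_frame_gap: int) -> List[Tuple[int, int]]:
    """Return undirected edges (i, j) with i < j between node indices.

    - Spatial edge: two nodes in the same frame (inter-person relations).
    - Temporal edge: same person, frames at most max_frame_gap apart
      (temporal dynamics). max_frame_gap is an assumed, hypothetical knob.
    """
    edges = []
    for i, j in combinations(range(len(nodes)), 2):
        (fi, pi), (fj, pj) = nodes[i], nodes[j]
        same_frame = fi == fj
        same_person_nearby = pi == pj and abs(fi - fj) <= max_frame_gap
        if same_frame or same_person_nearby:
            edges.append((i, j))
    return edges


def fully_connected_edge_count(num_nodes: int) -> int:
    """Edge count of a complete undirected graph, for comparison."""
    return num_nodes * (num_nodes - 1) // 2


def k_hop_frames(nodes: List[Node], edges: List[Tuple[int, int]],
                 start: int, hops: int) -> Set[int]:
    """Frames of all nodes within `hops` edges of node `start` (BFS).

    After k rounds of message passing, a node's state can depend only on
    its k-hop neighbourhood, so this approximates its temporal context.
    """
    adjacency: Dict[int, List[int]] = {i: [] for i in range(len(nodes))}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop budget
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return {nodes[i][0] for i in seen}


if __name__ == "__main__":
    # Toy clip: 3 people visible in each of 100 frames -> 300 nodes.
    people = ["A", "B", "C"]
    nodes = [(f, p) for f in range(100) for p in people]
    edges = build_edges(nodes, max_frame_gap=5)
    print(len(nodes), len(edges), fully_connected_edge_count(len(nodes)))
    # prints: 300 1755 44850

    # Temporal reach of person A at frame 50 after 1 and 3 hops.
    start = nodes.index((50, "A"))
    for hops in (1, 3):
        frames = k_hop_frames(nodes, edges, start, hops)
        print(hops, min(frames), max(frames))
    # prints: 1 45 55
    #         3 35 65
```

In this toy graph each hop extends a node's temporal reach by at most `max_frame_gap` frames, while the edge count stays far below that of the fully connected graph. How SPELL itself chooses its window and layers is described in the paper and repository, not in this sketch.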

updated: Fri Jul 15 2022 23:43:17 GMT+0000 (UTC)

published: Fri Jul 15 2022 23:43:17 GMT+0000 (UTC)

arXiv
