The Right to Talk: An Audio-Visual Transformer Approach

Thanh-Dat Truong; Chi Nhan Duong; The De Vu; Hoang Anh Pham; Bhiksha Raj; Ngan Le; Khoa Luu

Turn-taking plays an essential role in structuring the regulation of a conversation. Identifying the main speaker (who is properly taking his/her turn to speak) and the interrupters (who are interrupting or reacting to the main speaker's utterances) remains challenging. Although prior methods have partially addressed this task, several limitations remain. Firstly, directly associating audio and visual features may limit the correlations that can be extracted, because the two modalities differ. Secondly, the relationships across temporal segments, which help maintain the consistency of localization, separation, and conversation contexts, are not effectively exploited. Finally, the interactions between speakers, which usually involve tracking and anticipatory decisions about the transition to a new speaker, are typically ignored. Therefore, this work introduces a new Audio-Visual Transformer approach to localizing and highlighting the main speaker in both the audio and visual channels of a multi-speaker conversation video in the wild. The proposed method exploits different types of correlations present in both visual and audio signals. The temporal audio-visual relationships across the spatial-temporal space are anticipated and optimized via the self-attention mechanism in a Transformer structure. Moreover, a newly collected dataset is introduced for main speaker detection. To the best of our knowledge, this is one of the first studies able to automatically localize and highlight the main speaker in both the visual and audio channels of multi-speaker conversation videos.

updated: Fri Aug 06 2021 18:04:24 GMT+0000 (UTC)

published: Fri Aug 06 2021 18:04:24 GMT+0000 (UTC)

arXiv
