DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation

Yichao Yan; Zanwei Zhou; Zi Wang; Jingnan Gao; Xiaokang Yang

Conversation is an essential component of virtual avatar activities in the metaverse. With the development of natural language processing, textual and vocal conversation generation has achieved a significant breakthrough. Face-to-face conversations, however, account for the vast majority of daily conversations, and most existing methods focus on single-person talking head generation. In this work, we take a step further and consider generating realistic face-to-face conversation videos. Conversation generation is more challenging than single-person talking head generation, since it requires not only generating photo-realistic individual talking heads but also having the listener respond to the speaker. We propose a novel unified framework based on neural radiance fields (NeRF) to address this task. Specifically, we model both the speaker and the listener with a NeRF framework, using different conditions to control each individual's expressions. The speaker is driven by the audio signal, while the listener's response depends on both visual and acoustic information. In this way, face-to-face conversation videos are generated between human avatars, with all the interlocutors modeled within the same network. Moreover, to facilitate future research on this task, we collect a new human conversation dataset containing 34 video clips. Quantitative and qualitative experiments evaluate our method in several aspects, e.g., image quality, pose sequence trend, and naturalness of the rendered videos. Experimental results demonstrate that the avatars in the resulting videos are able to carry out a realistic conversation while maintaining their individual styles. All the code, data, and models will be made publicly available.

updated: Sat Aug 12 2023 14:45:58 GMT+0000 (UTC)

published: Tue Mar 15 2022 14:16:49 GMT+0000 (UTC)

arXiv
