Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Yucheng Suo; Zhedong Zheng; Xiaohan Wang; Bang Zhang; Yi Yang

Sign language is a window through which differently-abled people express their feelings and emotions. However, learning sign language in a short time remains challenging. To address this real-world challenge, we study a motion transfer system that turns a user's photo into a sign language video for specific words. The appearance of the output video comes from the provided user image, while its motion is extracted from a specified tutorial video. We observe two primary limitations in applying state-of-the-art motion transfer methods to sign language generation: (1) existing motion transfer works ignore prior geometrical knowledge of the human body; (2) previous image animation methods take only image pairs as input during training, and so cannot fully exploit the temporal information within videos. To address these limitations, we propose the Structure-aware Temporal Consistency Network (STCNet), which jointly optimizes the prior structure of the human body and temporal consistency for sign language video generation. This paper makes two main contributions. (1) We harness a fine-grained skeleton detector to provide prior knowledge of the body keypoints. This keeps keypoint movement within a valid range and makes the model more explainable and robust. (2) We introduce two cycle-consistency losses, a short-term cycle loss and a long-term cycle loss, to ensure the continuity of the generated video. We optimize the two losses and the keypoint detector network in an end-to-end manner.
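
The abstract does not spell out how the short-term and long-term cycle losses are computed, so the following is only a generic sketch of a cycle-consistency penalty, not the STCNet formulation. It assumes a hypothetical `animate(source, driving)` function that re-poses `source` to match `driving`, and it treats "short-term" as a gap of one frame and "long-term" as a larger gap; the names `cycle_loss`, `animate`, and `gap` are all illustrative assumptions.

```python
import numpy as np

def cycle_loss(animate, frames, gap):
    """Generic cycle-consistency penalty (illustrative, not the paper's loss).

    For each frame t, re-pose it to frame t+gap, re-pose the result back to
    frame t, and measure the mean absolute error against the original frame.
    `animate(source, driving)` is a hypothetical stand-in for a generator.
    """
    errors = []
    for t in range(len(frames) - gap):
        source, driving = frames[t], frames[t + gap]
        forward = animate(source, driving)   # source appearance, driving pose
        back = animate(forward, source)      # return to the original pose
        errors.append(np.abs(back - source).mean())
    return float(np.mean(errors))

# Toy data: five 2x2 "frames" whose pixel values equal the frame index.
frames = np.arange(5, dtype=float).reshape(5, 1, 1) * np.ones((5, 2, 2))

# A perfect animator reproduces the driving frame, so the cycle closes exactly.
perfect = lambda source, driving: driving

# An imperfect animator moves only halfway towards the driving frame.
halfway = lambda source, driving: source + 0.5 * (driving - source)

short_term = cycle_loss(halfway, frames, gap=1)  # adjacent frames (assumed)
long_term = cycle_loss(halfway, frames, gap=3)   # distant frames (assumed)
total = short_term + long_term                   # equal weighting is an assumption
```

In this toy setting the imperfect animator leaves a residual that grows with the frame gap, which suggests why a longer-range term could complement an adjacent-frame term when enforcing continuity.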

updated: Fri Jul 08 2022 07:10:28 GMT+0000 (UTC)

published: Fri Jul 08 2022 07:10:28 GMT+0000 (UTC)

arXiv
