Vector Quantized Diffusion Model with CodeUnet for Text-to-Sign Pose Sequences Generation

Pan Xie; Qipeng Zhang; Zexian Li; Hao Tang; Yao Du; Xiaohui Hu

Sign Language Production (SLP) aims to automatically translate spoken languages into sign sequences. The core process of SLP is to transform sign gloss sequences into their corresponding sign pose sequences (G2P). Most existing G2P models perform this conditional long-range generation autoregressively, which inevitably leads to an accumulation of errors. To address this issue, we propose PoseVQ-Diffusion, a vector quantized diffusion method for conditional pose sequence generation that is iterative and non-autoregressive. Specifically, we first introduce a vector quantized variational autoencoder (Pose-VQVAE) model to represent a pose sequence as a sequence of latent codes. We then model the discrete latent space with an extension of the recently developed diffusion architecture. To better leverage spatial-temporal information, we introduce a novel architecture, CodeUnet, to generate higher-quality pose sequences in the discrete space. Moreover, taking advantage of the learned codes, we develop a novel sequential k-nearest-neighbours method to predict the variable lengths of pose sequences for the corresponding gloss sequences. Consequently, compared with autoregressive G2P models, our model samples faster and produces significantly better results. Compared with previous non-autoregressive G2P methods, PoseVQ-Diffusion improves the predicted results through iterative refinement, thus achieving state-of-the-art results on the SLP evaluation benchmark.
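As a rough illustration of what "representing a pose sequence as a sequence of latent codes" means, the following minimal sketch shows only the nearest-codebook lookup at the heart of vector quantization. It is not the authors' Pose-VQVAE: the encoder, decoder, and training are omitted, and the function names (`quantize`, `dequantize`), the 2-D feature vectors, and the toy codebook values are all assumptions made up for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each per-frame feature to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(codes: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look the discrete codes back up in the codebook."""
    return [codebook[k] for k in codes]


# Toy values, invented for illustration (hypothetical 2-D pose features).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.8), (0.8, 0.9)]

codes = quantize(frames, codebook)            # discrete latent sequence
reconstruction = dequantize(codes, codebook)  # quantized features
print(codes)
```

In a setup like the one the abstract describes, a discrete sequence such as `codes` would be the object that the diffusion model generates, conditioned on the gloss sequence.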

updated: Fri Aug 19 2022 03:49:13 GMT+0000 (UTC)

published: Fri Aug 19 2022 03:49:13 GMT+0000 (UTC)

arXiv
