CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

Jiangbin Zheng; Yile Wang; Cheng Tan; Siyuan Li; Ge Wang; Jun Xia; Yidong Chen; Stan Z. Li

Sign language recognition (SLR) is a weakly supervised task that annotates sign videos with textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale sign datasets, is the main bottleneck for SLR. Most SLR work therefore adopts pretrained visual modules and follows one of two mainstream approaches. Multi-stream architectures extend multi-cue visual features and yield the current SOTA performance, but they require complex designs and may introduce noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge and introduce a complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge, as a traditional contextual module does. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that the proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.
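
The abstract does not specify the form of the contrastive cross-modal alignment objective. As a rough illustration only, the following sketch shows a generic symmetric InfoNCE-style loss between paired visual and textual embeddings, which is one common way to express such a consistency constraint. The function name `contrastive_alignment_loss`, the `temperature` value, and the use of cosine similarity are all assumptions made for this example and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax_diag(sim):
    """For each row i, return log softmax(sim[i])[i], computed stably."""
    out = []
    for i, row in enumerate(sim):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[i] - log_z)
    return out


def contrastive_alignment_loss(visual, textual, temperature=0.1):
    """Hypothetical symmetric InfoNCE loss over a batch of paired embeddings.

    visual[i] and textual[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    """
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in textual]
    # Cosine similarity matrix, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]
    visual_to_text = -sum(_log_softmax_diag(sim)) / len(sim)
    text_to_visual = -sum(_log_softmax_diag(sim_t)) / len(sim_t)
    return 0.5 * (visual_to_text + text_to_visual)
```

With toy inputs, correctly paired embeddings such as `contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` give a loss close to zero, while swapping the textual rows gives a much larger value. A real SLR model would compute this on learned feature tensors inside a deep learning framework rather than on Python lists.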

updated: Thu Mar 23 2023 12:00:33 GMT+0000 (UTC)

published: Fri Mar 10 2023 06:12:36 GMT+0000 (UTC)

arXiv