CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

Jiangbin Zheng; Yile Wang; Cheng Tan; Siyuan Li; Ge Wang; Jun Xia; Yidong Chen; Stan Z. Li

Sign language recognition (SLR) is a weakly supervised task that annotates sign videos with textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale sign language datasets, is the main bottleneck for SLR. Most SLR works therefore adopt pretrained visual modules and have developed two mainstream solutions. Multi-stream architectures extend multi-cue visual features, yielding the current SOTA performance but requiring complex designs and potentially introducing noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is proposed to further strengthen the explicit consistency constraints. Extensive experiments on the two most popular public datasets, PHOENIX-2014 and PHOENIX-2014T, demonstrate that the proposed SLR framework not only consistently outperforms existing single-cue methods but also outperforms SOTA multi-cue methods.
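
To make the idea of an explicit contrastive consistency constraint between modalities concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over paired visual and textual feature vectors. It is not the CVT-SLR algorithm itself, whose details the abstract does not give. The function names (`cosine`, `log_softmax_at`, `contrastive_alignment_loss`), the `temperature` value, and the assumption that row i of each modality forms the matched pair are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (assumed non-zero here)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(row, idx):
    """Log-softmax of `row` evaluated at position `idx`, computed stably."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_sum_exp


def contrastive_alignment_loss(visual, textual, temperature=0.1):
    """Symmetric InfoNCE-style loss (illustrative, not the paper's definition).

    visual[i] and textual[i] are assumed to be a matched pair; every other
    combination in the batch is treated as a negative.
    """
    n = len(visual)
    # Temperature-scaled similarity matrix: rows are visual, columns textual.
    sims = [[cosine(v, t) / temperature for t in textual] for v in visual]
    # Visual-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax_at(sims[i], i) for i in range(n)) / n
    # Text-to-visual direction: each column should peak on its diagonal entry.
    t2v = -sum(
        log_softmax_at([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    visual_feats = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_alignment_loss(visual_feats, aligned_text))
    print(contrastive_alignment_loss(visual_feats, swapped_text))
```

In this toy setup the loss is close to zero when each visual vector matches its paired textual vector and grows large when the pairs are swapped, which is the behaviour an explicit alignment constraint is meant to reward. In a real SLR model the vectors would come from the visual and language modules rather than being written by hand.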

updated: Tue Mar 21 2023 13:28:49 GMT+0000 (UTC)

published: Fri Mar 10 2023 06:12:36 GMT+0000 (UTC)

arXiv
