LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Feng Xue; Yu Li; Deyin Liu; Yincen Xie; Lin Wu; Richang Hong

Lipreading refers to understanding the speech of a speaker in a video and translating it into natural language. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, owing to the limited number of speakers in the training set and the evident visual variations caused by the differing lip shapes and colors of different speakers. Therefore, relying solely on the visible changes of the lips tends to cause model overfitting. To address this problem, we propose to use multi-modal features spanning visual appearance and facial landmarks, which can describe lip motion irrespective of speaker identity. We then develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion module. The embeddings from the two streams are produced by self-attention and fed to a cross-attention module to align the visual and landmark features. Finally, the fused features are decoded into output text by a cascaded seq2seq model. Experiments demonstrate that our method effectively enhances model generalization to unseen speakers.
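The two-stream design described above (self-attention within each stream, then cross-attention between them) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names (`attention`, `fuse`), the choice of visual frames as queries over landmark embeddings, the concatenation step, and the sizes `T` and `D` are all assumptions made for illustration. Learned projections, multiple heads, and the seq2seq decoder are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(len(values[0]))])
    return output


def fuse(visual, landmark):
    """Hypothetical fusion: self-attention per stream, then cross-attention."""
    visual_emb = attention(visual, visual, visual)          # lip motion stream
    landmark_emb = attention(landmark, landmark, landmark)  # landmark stream
    # Visual frames query the landmark embeddings (direction is an assumption).
    aligned = attention(visual_emb, landmark_emb, landmark_emb)
    # Concatenate per frame; the paper's actual fusion operator may differ.
    return [v + a for v, a in zip(visual_emb, aligned)]


random.seed(0)
T, D = 5, 4  # number of frames and feature width, both arbitrary
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
landmark = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
fused = fuse(visual, landmark)
print(len(fused), len(fused[0]))  # 5 frames, each with 2 * D = 8 features
```

In this sketch each fused frame carries both the visual embedding and a landmark summary aligned to it, which is one plausible input for a downstream sequence decoder.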

updated: Sat Feb 04 2023 10:22:18 GMT+0000 (UTC)

published: Sat Feb 04 2023 10:22:18 GMT+0000 (UTC)

arXiv
