Lip-reading with Densely Connected Temporal Convolutional Networks

Pingchuan Ma; Yujiang Wang; Jie Shen; Stavros Petridis; Maja Pantic

In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCNs) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a lightweight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all the baseline methods and setting a new state of the art on both datasets.
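As a rough illustration of the two ingredients named in the abstract, the following pure-Python sketch chains dilated temporal convolutions with dense (concatenating) connections and then applies a Squeeze-and-Excitation style channel gate. It is not the authors' implementation: the function names, kernel sizes, dilations, weights, and toy input are all hypothetical choices made only to keep the example small and runnable.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, length-preserving 1D convolution (odd kernel size assumed)."""
    centre = (len(kernel) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - centre) * dilation
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def temporal_layer(channels, weights, dilation):
    """weights[o][c] is the kernel linking input channel c to output channel o."""
    outputs = []
    for per_input_kernels in weights:
        summed = [0.0] * len(channels[0])
        for x, kernel in zip(channels, per_input_kernels):
            for t, v in enumerate(dilated_conv1d(x, kernel, dilation)):
                summed[t] += v
        outputs.append([max(0.0, v) for v in summed])  # ReLU
    return outputs


def dense_block(channels, layer_weights, dilations):
    """Dense connectivity: each layer sees the input plus all earlier outputs."""
    features = list(channels)
    for weights, dilation in zip(layer_weights, dilations):
        # Concatenate the new channels instead of replacing the old ones.
        features = features + temporal_layer(features, weights, dilation)
    return features


def squeeze_excite(channels, w_reduce, w_expand):
    """Channel gate: temporal average pool, FC + ReLU, FC + sigmoid, rescale."""
    squeezed = [sum(ch) / len(ch) for ch in channels]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    return [[g * v for v in ch] for g, ch in zip(gates, channels)], gates


# Hypothetical toy input: 2 channels, 8 time steps.
x = [[float(t) for t in range(8)], [1.0] * 8]
avg3 = [1 / 3, 1 / 3, 1 / 3]  # illustrative kernel, not a trained weight
layer_weights = [
    [[avg3, avg3]],        # layer 1: 2 input channels -> 1 new channel
    [[avg3, avg3, avg3]],  # layer 2: 3 input channels -> 1 new channel
]
dense_out = dense_block(x, layer_weights, dilations=[1, 2])

w_reduce = [[0.25, 0.25, 0.25, 0.25]]    # 4 channels -> 1 hidden unit
w_expand = [[0.1], [0.2], [0.3], [0.4]]  # 1 hidden unit -> 4 channel gates
gated, gates = squeeze_excite(dense_out, w_reduce, w_expand)
print(len(dense_out), [round(g, 3) for g in gates])
```

Because every layer's output is appended to the feature list rather than overwriting it, later layers combine features computed at several dilations, which is one plausible reading of how dense connections give denser temporal coverage. The gate then rescales each channel by a value between 0 and 1 computed from its temporal average.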

updated: Wed Nov 11 2020 20:15:49 GMT+0000 (UTC)

published: Tue Sep 29 2020 18:08:15 GMT+0000 (UTC)

arXiv
