Looking for the Signs: Identifying Isolated Sign Instances in Continuous Video Footage

Tao Jiang; Necati Cihan Camgoz; Richard Bowden

In this paper, we focus on the task of one-shot sign spotting: given an example of an isolated sign (the query), we want to identify whether and where this sign appears in a continuous, co-articulated sign language video (the target). To achieve this goal, we propose a transformer-based network called SignLookup. We employ 3D Convolutional Neural Networks (CNNs) to extract spatio-temporal representations from video clips. To resolve the temporal scale discrepancies between the query and the target videos, we construct multiple queries from a single video clip using different frame-level strides. Self-attention is applied across these query clips to simulate a continuous scale space. We also apply a second self-attention module to the target video to learn contextual information within the sequence. Finally, mutual attention is used to match the temporal scales and localize the query within the target sequence. Extensive experiments demonstrate that the proposed approach can not only reliably identify isolated signs in continuous videos, regardless of the signers' appearance, but can also generalize to different sign languages. By taking advantage of the attention mechanism and the adaptive features, our model achieves state-of-the-art performance on the sign spotting task, with accuracy as high as 96% on challenging benchmark datasets, significantly outperforming other approaches.
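The multi-stride query construction and the query-to-target matching described in the abstract can be illustrated with a minimal NumPy sketch. This is not the SignLookup implementation: the function names `strided_queries` and `cross_attention`, the stride values, and the feature sizes are assumptions chosen for illustration, random vectors stand in for the 3D CNN features, and plain scaled dot-product attention stands in for the learned attention modules.

```python
import numpy as np

def strided_queries(clip, strides=(1, 2, 3)):
    """Return one temporally subsampled copy of `clip` per stride.

    `clip` has shape (frames, features); the stride values are illustrative.
    """
    return [clip[::s] for s in strides]

def cross_attention(query, target):
    """Scaled dot-product attention weights of each query step over target steps."""
    d = query.shape[-1]
    scores = query @ target.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
clip = rng.normal(size=(16, 8))    # 16 query frames, 8-dim stand-in features
target = rng.normal(size=(40, 8))  # 40 target frames

queries = strided_queries(clip)
lengths = [q.shape[0] for q in queries]     # one length per temporal scale
attn = cross_attention(queries[1], target)  # rows: query steps, cols: target steps
```

Each row of `attn` is a distribution over target frames, so peaks in it hint at where a query at that temporal scale would align with the target. In the paper this matching is learned end to end rather than computed from raw dot products.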

updated: Sat Nov 20 2021 19:33:38 GMT+0000 (UTC)

published: Wed Jul 21 2021 12:49:44 GMT+0000 (UTC)

arXiv
