Video Text Tracking With a Spatio-Temporal Complementary Model

Yuzhe Gao; Xing Li; Jiajian Zhang; Yu Zhou; Dian Jin; Jing Wang; Shenggao Zhu; Xiang Bai

Text tracking aims to track multiple texts in a video and construct a trajectory for each text. Existing methods tackle this task with the tracking-by-detection framework, i.e., detecting the text instances in each frame and associating the corresponding text instances in consecutive frames. We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios. For example, missed detections of text instances, caused by motion blur and similar factors, break the text trajectory. In addition, different text instances with similar appearance are easily confused, leading to incorrect association of the text instances. To this end, a novel spatio-temporal complementary text tracking model is proposed in this paper. We leverage a Siamese Complementary Module to fully exploit the continuity of text instances in the temporal dimension, which effectively alleviates missed detections and hence ensures the completeness of each text trajectory. We further integrate the semantic cues and the visual cues of the text instance into a unified representation via a text similarity learning network, which supplies high discriminative power in the presence of text instances with similar appearance, and thus avoids mis-association between them. Our method achieves state-of-the-art performance on several public benchmarks. The source code is available at https://github.com/lsabrinax/VideoTextSCM.
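To make the association step of the tracking-by-detection framework concrete, here is a minimal sketch of how detections in a new frame might be linked to existing trajectories by comparing embedding vectors. It is not the authors' implementation: the function names, the greedy matching strategy, the cosine similarity measure, and the 0.5 threshold are all illustrative assumptions. In the paper, the embeddings would come from the text similarity learning network that fuses semantic and visual cues; here they are plain lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(tracks, detections, threshold=0.5):
    """Greedily link existing tracks to detections in the next frame.

    tracks:     dict mapping a track id to its embedding (hypothetical layout)
    detections: list of embeddings for the text instances found in the new frame
    threshold:  assumed minimum similarity for a valid link

    Returns (matches, unmatched), where matches maps track id -> detection
    index and unmatched lists detection indices that could start new tracks.
    """
    # Score every track/detection pair that clears the threshold.
    candidates = []
    for track_id, track_emb in tracks.items():
        for det_idx, det_emb in enumerate(detections):
            score = cosine_similarity(track_emb, det_emb)
            if score >= threshold:
                candidates.append((score, track_id, det_idx))

    # Take the most similar pairs first; each track and detection is used once.
    candidates.sort(key=lambda c: c[0], reverse=True)
    matches = {}
    used = set()
    for _, track_id, det_idx in candidates:
        if track_id in matches or det_idx in used:
            continue
        matches[track_id] = det_idx
        used.add(det_idx)

    unmatched = [i for i in range(len(detections)) if i not in used]
    return matches, unmatched


if __name__ == "__main__":
    tracks = {1: [1.0, 0.0], 2: [0.0, 1.0]}
    detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
    print(associate(tracks, detections))
```

A track that finds no match in a frame simply has no entry in `matches`; this is the situation the abstract describes, where a missed detection breaks the trajectory. The more discriminative the embeddings, the less likely two similar-looking text instances are to be swapped at this step.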

updated: Tue Nov 09 2021 08:23:06 GMT+0000 (UTC)

published: Tue Nov 09 2021 08:23:06 GMT+0000 (UTC)

arXiv
