A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer

Weijia Wu; Yuanqiang Cai; Debing Zhang; Sibo Wang; Zhuang Li; Jiahong Li; Yejun Tang; Hong Zhou

Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. First, it provides more than 2,000 videos with more than 1,750,000 frames, 25 times larger than the largest existing dataset of videos with incidental text. Second, it covers more than 30 open categories across a wide range of scenarios, e.g., Life Vlog, Driving, and Movie. Third, it annotates the type of each text instance (title, caption, or scene text), since these types carry different representational meanings in video. Fourth, it provides bilingual text annotation to support life and communication across cultures. In addition, we propose an end-to-end video text spotting framework with Transformer, termed TransVTSpotter, which solves multi-oriented text spotting in video with a simple but efficient attention-based query-key mechanism. It applies object features from the previous frame as a tracking query for the current frame and introduces a rotation angle prediction to fit multi-oriented text instances. On ICDAR2015 (video), TransVTSpotter achieves state-of-the-art performance, with 44.1% MOTA at 9 fps. The dataset and the code of TransVTSpotter can be found at github.com/weijiawu/BOVText and github.com/weijiawu/TransVTSpotter, respectively.
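For readers unfamiliar with the MOTA figure quoted above, the following is a minimal sketch of the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, where the counts are summed over all frames. The function name `mota` and the counts in the usage lines are illustrative assumptions and are not taken from the paper or its evaluation code.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy (standard CLEAR MOT definition).

    All counts are totals over every frame of the evaluated videos.
    num_gt is the total number of ground-truth text instances.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Illustrative counts only (hypothetical, not results from the paper).
score = mota(false_negatives=30, false_positives=15, id_switches=5, num_gt=200)
print(f"MOTA = {score:.1%}")
```

Because the three error counts are summed before normalising, MOTA can be negative when a tracker makes more errors than there are ground-truth instances.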

updated: Thu Dec 09 2021 13:21:26 GMT+0000 (UTC)

published: Thu Dec 09 2021 13:21:26 GMT+0000 (UTC)

arXiv
