Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

Qi Zhu; Chenyu Gao; Peng Wang; Qi Wu

Text appearing in everyday scenes, which can be recognized by OCR (Optical Character Recognition) tools, carries significant information such as street names, product brands, and prices. Two tasks that extend existing vision-language applications with text, text-based visual question answering and text-based image captioning, are catching on rapidly. To address these problems, many sophisticated multi-modality encoding frameworks (such as heterogeneous graph structures) are being used. In this paper, we argue that a simple attention mechanism can do the same or even a better job without any bells and whistles. Under this mechanism, we simply split OCR token features into separate visual- and linguistic-attention branches and send them to a popular Transformer decoder to generate answers or captions. Surprisingly, we find this simple baseline model is rather strong: it consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three tasks of ST-VQA, although these SOTA models use far more complex encoding mechanisms. Transferring it to text-based image captioning, we also surpass the TextCaps Challenge 2020 winner. We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking on multi-modality encoder design. Code is available at https://github.com/ZephyrZhuQi/ssbaseline
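
The split-branch idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (their code is in the repository linked above); it only assumes that each OCR token has a visual feature vector and a linguistic feature vector of the same size, and that a single query vector (for example, a pooled question embedding) attends to each set separately using scaled dot-product attention. The function names `attend` and `split_ocr_branches`, the feature dimensions, and the concatenation at the end are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(query, features):
    # Scaled dot-product attention of one query vector over feature rows.
    # query: (d,), features: (n_tokens, d)
    scores = features @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)              # (n_tokens,), sums to 1
    return weights @ features, weights     # attended summary has shape (d,)

def split_ocr_branches(query, ocr_visual, ocr_linguistic):
    # Hypothetical helper: attend to the two OCR feature sets in separate
    # branches, then concatenate the two summaries for a downstream decoder.
    vis_summary, vis_w = attend(query, ocr_visual)
    lin_summary, lin_w = attend(query, ocr_linguistic)
    fused = np.concatenate([vis_summary, lin_summary])
    return fused, vis_w, lin_w

# Toy inputs (assumed sizes): 5 OCR tokens, 8-dimensional features.
rng = np.random.default_rng(0)
d, n_tokens = 8, 5
query = rng.normal(size=d)
ocr_visual = rng.normal(size=(n_tokens, d))
ocr_linguistic = rng.normal(size=(n_tokens, d))

fused, vis_w, lin_w = split_ocr_branches(query, ocr_visual, ocr_linguistic)
print(fused.shape, vis_w.shape, lin_w.shape)
```

In a full model, a vector like `fused` would be one of the inputs passed to a Transformer decoder that generates the answer or caption; that stage is omitted here.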

updated: Wed Dec 09 2020 16:43:39 GMT+0000 (UTC)

published: Wed Dec 09 2020 16:43:39 GMT+0000 (UTC)

arXiv
