Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection

Xugong Qin; Yu Zhou; Youhui Guo; Dayan Wu; Zhihong Tian; Ning Jiang; Hongbin Wang; Weiping Wang

Owing to its success in object detection and instance segmentation, Mask R-CNN has attracted great attention and is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting. However, two issues remain unresolved. The first is the dense text case, which is easily neglected but of real practical importance. A single proposal may contain multiple instances, which makes it difficult for the mask head to distinguish between them and degrades performance. In this work, we argue that this degradation results from learning confusion in the mask head. We propose replacing the "deconv-conv" decoder in the mask head with an MLP decoder, which alleviates the issue and significantly improves robustness. We also propose instance-aware mask learning, in which the mask head learns to predict the shape of the whole instance rather than classifying each pixel as text or non-text. With instance-aware mask learning, the mask branch learns separated and compact masks. The second issue is that, because of large variations in scale and aspect ratio, the RPN needs complicated anchor settings that are hard to maintain and transfer across datasets. To address this, we propose an adaptive label assignment in which all instances, especially those with extreme aspect ratios, are guaranteed to be associated with enough anchors. Equipped with these components, the proposed method, named MAYOR, achieves state-of-the-art performance on five benchmarks: DAST1500, MSRA-TD500, ICDAR2015, CTW1500, and Total-Text.
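
The abstract does not spell out the adaptive label assignment rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes axis-aligned boxes, an ordinary IoU threshold for the first pass, and a hypothetical "top-up" step that gives each ground-truth instance its highest-IoU unassigned anchors until it has at least `min_pos` positives. The names `iou`, `assign_labels`, `pos_thr`, and `min_pos` are illustrative and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def assign_labels(
    anchors: Sequence[Box],
    gts: Sequence[Box],
    pos_thr: float = 0.7,  # assumed threshold, not from the paper
    min_pos: int = 2,      # assumed minimum positives per instance
) -> List[int]:
    """Return, per anchor, the index of its matched ground truth, or -1."""
    assigned = [-1] * len(anchors)
    best = [0.0] * len(anchors)
    ious = [[iou(a, g) for a in anchors] for g in gts]

    # Pass 1: conventional fixed-threshold matching.
    for gi, row in enumerate(ious):
        for ai, v in enumerate(row):
            if v >= pos_thr and v > best[ai]:
                assigned[ai], best[ai] = gi, v

    # Pass 2 (assumed rule): top up instances that have too few anchors,
    # which mostly affects boxes with extreme aspect ratios.
    for gi, row in enumerate(ious):
        have = sum(1 for x in assigned if x == gi)
        ranked = sorted(range(len(anchors)), key=lambda ai: row[ai], reverse=True)
        for ai in ranked:
            if have >= min_pos:
                break
            if assigned[ai] == -1 and row[ai] > 0:
                assigned[ai], best[ai] = gi, row[ai]
                have += 1
    return assigned
```

In this sketch, a long thin text line such as `(0, 0, 100, 10)` overlaps square 32x32 anchors with an IoU well below the threshold, so the first pass assigns it nothing. The top-up pass still gives it `min_pos` anchors, while anchors that do not overlap it at all stay unassigned.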

updated: Wed Sep 08 2021 04:32:29 GMT+0000 (UTC)

published: Wed Sep 08 2021 04:32:29 GMT+0000 (UTC)

arXiv
