Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images

Tahira Shehzadi; Khurram Azeem Hashmi; Didier Stricker; Marcus Liwicki; Muhammad Zeshan Afzal

This paper takes an important step toward bridging the performance gap between DETR and R-CNN for graphical object detection. Existing graphical object detection approaches have made remarkable progress by building on recent advances in CNN-based object detection. Recently, Transformer-based detectors have considerably boosted generic object detection performance; by using object queries, they eliminate the need for hand-crafted features and for post-processing steps such as Non-Maximum Suppression (NMS). However, the effectiveness of these enhanced transformer-based detection algorithms has yet to be verified for graphical object detection. Inspired by the latest advancements in DETR, we employ the existing detection transformer with few modifications for graphical object detection. We modify object queries in different ways: using points, using anchor boxes, and adding positive and negative noise to the anchors to boost performance. These modifications allow better handling of objects with varying sizes and aspect ratios, more robustness to small variations in object position and size, and improved discrimination between objects and non-objects. We evaluate our approach on four graphical datasets: PubTables, TableBank, NTable and PubLayNet. After integrating the query modifications into DETR, we outperform prior works and achieve new state-of-the-art results, with mAP of 96.9%, 95.7% and 99.3% on TableBank, PubLayNet and PubTables, respectively. Results from extensive ablations show that, as in other applications, transformer-based methods are more effective for document analysis. We hope this study draws more attention to research on detection transformers in document image analysis.
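
The abstract mentions adding positive and negative noise to anchor boxes but gives no details. The sketch below is one plausible reading, loosely modeled on contrastive denoising in DETR variants, and is not the authors' implementation. The function names (`jitter_box`, `make_noised_anchors`), the normalized `(cx, cy, w, h)` box format, and the thresholds `pos_scale` and `neg_scale` are all assumptions made for illustration.

```python
import random


def jitter_box(box, low, high, rng):
    """Perturb a normalized (cx, cy, w, h) box.

    Each coordinate is shifted by a random fraction in [low, high]
    of the box extent, with a random sign. Hypothetical helper.
    """
    cx, cy, w, h = box

    def offset(scale):
        magnitude = rng.uniform(low, high)
        sign = rng.choice((-1.0, 1.0))
        return sign * magnitude * scale

    new_cx = cx + offset(w / 2)      # shift center by a fraction of half-width
    new_cy = cy + offset(h / 2)      # shift center by a fraction of half-height
    new_w = w * (1.0 + offset(1.0))  # rescale width
    new_h = h * (1.0 + offset(1.0))  # rescale height

    # Keep every coordinate inside the normalized image range.
    return tuple(min(max(v, 0.0), 1.0) for v in (new_cx, new_cy, new_w, new_h))


def make_noised_anchors(gt_boxes, pos_scale=0.2, neg_scale=0.4, seed=0):
    """Build lightly noised (positive) and heavily noised (negative) anchors.

    Assumption: positives use noise below pos_scale and would be trained to
    reconstruct the ground-truth box; negatives use noise between pos_scale
    and neg_scale and would be trained to predict "no object".
    """
    rng = random.Random(seed)
    positives = [jitter_box(b, 0.0, pos_scale, rng) for b in gt_boxes]
    negatives = [jitter_box(b, pos_scale, neg_scale, rng) for b in gt_boxes]
    return positives, negatives
```

In a setup like this, positive anchors stay close to the ground truth while negative anchors land nearby but noticeably off, which is one way a detector could learn to separate objects from near-miss non-objects.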

updated: Fri Jun 23 2023 14:46:03 GMT+0000 (UTC)

published: Fri Jun 23 2023 14:46:03 GMT+0000 (UTC)

arXiv
