Document Image Cleaning using Budget-Aware Black-Box Approximation

Ganesh Tata; Katyani Singh; Eric Van Oeveren; Nilanjan Ray

Recent work has shown that by approximating the behaviour of a non-differentiable black-box function using a neural network, the black box can be integrated into a differentiable training pipeline for end-to-end training. This methodology is termed "differentiable bypass," and a successful application of this method involves training a document preprocessor to improve the performance of a black-box OCR engine. However, a good approximation of an OCR engine requires querying it for all samples throughout the training process, which can be computationally and financially expensive. Several zeroth-order optimization (ZO) algorithms have been proposed in the black-box attack literature to find adversarial examples for a black-box model by computing its gradient in a query-efficient manner. However, the query complexity and convergence rate of such algorithms make them infeasible for our problem. In this work, we propose two sample selection algorithms to train an OCR preprocessor with less than 10% of the original system's OCR engine queries, resulting in more than 60% reduction of the total training time without significant loss of accuracy. We also show an improvement of 4% in the word-level accuracy of a commercial OCR engine with only 2.5% of the total queries and a 32x reduction in monetary cost. Further, we propose a simple ranking technique to prune 30% of the document images from the training dataset without affecting the system's performance.
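
The query-budget idea in the abstract can be illustrated with a minimal sketch. It assumes uniform random selection as a stand-in, since the abstract does not describe the two proposed sample selection algorithms; the names `select_queries`, `run_epoch`, `budget_fraction`, and the toy `black_box` callable are all hypothetical and not taken from the paper.

```python
import random


def select_queries(sample_ids, budget_fraction, rng):
    """Choose which samples are sent to the black-box engine.

    Uniform random choice is a placeholder assumption; it is not
    one of the selection algorithms proposed in the paper.
    """
    if not 0.0 < budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in (0, 1]")
    # Always query at least one sample so the approximator gets a label.
    k = max(1, round(len(sample_ids) * budget_fraction))
    return set(rng.sample(sample_ids, k))


def run_epoch(sample_ids, budget_fraction, black_box, cache, rng):
    """Query the black box only for the selected samples.

    Outputs are stored in `cache`, so samples that are not selected
    can reuse an earlier output if one exists. Returns the number of
    black-box queries spent in this epoch.
    """
    chosen = select_queries(sample_ids, budget_fraction, rng)
    for sample_id in chosen:
        cache[sample_id] = black_box(sample_id)
    return len(chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    ids = list(range(1000))
    cache = {}
    # Toy stand-in for an OCR engine call (hypothetical).
    spent = run_epoch(ids, 0.10, lambda i: f"text-{i}", cache, rng)
    print(f"queries spent: {spent} of {len(ids)}")
```

With a budget fraction of 0.10, each epoch spends one tenth of the queries that querying every sample would require; how the selected samples are chosen is where the paper's algorithms would replace the uniform placeholder.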

updated: Thu Jun 22 2023 23:07:31 GMT+0000 (UTC)

published: Thu Jun 22 2023 23:07:31 GMT+0000 (UTC)

arXiv
