Sim2Real Docs: Domain Randomization for Documents in Natural Scenes using Ray-traced Rendering

Nikhil Maddikunta; Huijun Zhao; Sumit Keswani; Alfy Samuel; Fu-Ming Guo; Nishan Srishankar; Vishwa Pardeshi; Austin Huang

In the past, computer vision systems for digitized documents could rely on systematically captured, high-quality scans. Today, transactions involving digital documents are more likely to start as mobile phone photos uploaded by non-professionals. As such, computer vision for document automation must now account for documents captured in natural scene contexts. An additional challenge is that task objectives for document processing can be highly use-case specific, which limits the utility of publicly available datasets, while manual data labeling is costly and translates poorly between use cases. To address these issues we created Sim2Real Docs, a framework for synthesizing datasets and performing domain randomization of documents in natural scenes. Sim2Real Docs enables programmatic 3D rendering of documents using Blender, an open source tool for 3D modeling and ray-traced rendering. By using rendering that simulates the physical interactions of light, geometry, camera, and background, we synthesize datasets of documents in a natural scene context. Each render is paired with use-case specific ground truth data specifying the latent characteristics of interest, producing unlimited fit-for-task training data. The role of machine learning models is then to solve the inverse problem posed by the rendering pipeline. Such models can be further iterated upon with real-world data, either by fine-tuning or by adjusting the domain randomization parameters.
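The pairing of randomized scene parameters with ground truth described above can be illustrated with a minimal sketch. It is not the Sim2Real Docs implementation and makes no Blender calls: the `SceneParams` fields, the sampling ranges, the file naming, and the choice of document rotation as the latent characteristic of interest are all assumptions made for illustration. In a real pipeline, each sampled record would be handed to a renderer such as Blender to produce the image.

```python
import random
from dataclasses import dataclass, asdict


@dataclass
class SceneParams:
    # Hypothetical randomized scene properties; names and units are illustrative.
    light_energy_w: float
    light_azimuth_deg: float
    camera_tilt_deg: float
    camera_distance_m: float
    doc_rotation_deg: float
    background_id: int


def sample_scene(rng: random.Random, n_backgrounds: int = 10) -> SceneParams:
    """Draw one random scene configuration (ranges are assumed, not from the paper)."""
    return SceneParams(
        light_energy_w=rng.uniform(100.0, 1000.0),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
        camera_tilt_deg=rng.uniform(-30.0, 30.0),
        camera_distance_m=rng.uniform(0.3, 1.0),
        doc_rotation_deg=rng.uniform(-180.0, 180.0),
        background_id=rng.randrange(n_backgrounds),
    )


def make_dataset(n: int, seed: int = 0) -> list:
    """Build n records, each pairing a planned render with its ground truth."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        params = sample_scene(rng)
        records.append({
            # File name the (hypothetical) renderer would write to.
            "image": f"render_{i:05d}.png",
            # Use-case specific label: here, the document's in-plane rotation.
            "ground_truth": {"doc_rotation_deg": params.doc_rotation_deg},
            # Full scene description, kept so the render can be reproduced.
            "scene": asdict(params),
        })
    return records


if __name__ == "__main__":
    for record in make_dataset(3, seed=42):
        print(record["image"], record["ground_truth"])
```

Because the label is read directly from the sampled parameters, it costs nothing to produce, and changing the task only means choosing a different field (or a derived quantity) as ground truth. A model trained on such pairs learns the inverse mapping from rendered image back to the latent parameter.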

updated: Thu Dec 16 2021 22:07:48 GMT+0000 (UTC)

published: Thu Dec 16 2021 22:07:48 GMT+0000 (UTC)

arXiv
