On-Device Document Classification using multimodal features

Sugam Garg; Harichandana; Sumit Kumar

From small screenshots to large videos, documents take up the bulk of the space on a modern smartphone. Documents on a phone accumulate from various sources, and given the high storage capacity of mobile devices, hundreds of documents can pile up in a short period. However, searching or managing documents remains an onerous task, since most search methods rely only on meta-information or on the text in a document. In this paper, we show that a single modality is insufficient for classification and present a novel pipeline that classifies documents on-device, thus preventing any transfer of private user data to a server. For this task, we integrate an open-source library for Optical Character Recognition (OCR) and our novel model architecture into the pipeline. We optimise the model for size, a necessary metric for on-device inference. We benchmark our classification model on FOOD-101, a standard multimodal dataset, and show results competitive with the previous state of the art at 30% model compression.
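The idea of combining two modalities can be illustrated with a minimal sketch. The abstract does not describe the model architecture, so the code below is not the authors' method: it assumes a simple late-fusion scheme in which an image embedding and an embedding of the OCR text are concatenated and passed to a linear softmax classifier. All names (`fuse`, `classify`, `make_random_classifier`), the feature values, and the randomly initialised weights are hypothetical and serve only to show the data flow.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(image_features, text_features):
    """Late fusion by concatenation (an assumption, not the paper's design)."""
    return list(image_features) + list(text_features)


def classify(fused, weights, biases):
    """Apply one linear layer followed by softmax to the fused vector."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def make_random_classifier(n_inputs, n_classes, seed=0):
    """Build untrained placeholder weights; a real model would be trained."""
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
        for _ in range(n_classes)
    ]
    biases = [0.0] * n_classes
    return weights, biases


# Hypothetical embeddings: in a real pipeline these would come from an
# image encoder and from a text encoder applied to the OCR output.
image_features = [0.2, 0.7, 0.1, 0.5]
text_features = [0.9, 0.0, 0.3]

fused = fuse(image_features, text_features)
weights, biases = make_random_classifier(len(fused), n_classes=3)
probs = classify(fused, weights, biases)
print(probs)  # one probability per document class
```

Concatenation is used here only because it is the simplest way to let a classifier see both modalities at once; the paper's own architecture may fuse them differently.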

updated: Wed Jan 06 2021 05:36:58 GMT+0000 (UTC)

published: Wed Jan 06 2021 05:36:58 GMT+0000 (UTC)

arXiv
