Towards Flexible Multi-modal Document Models

Naoto Inoue; Kotaro Kikuchi; Edgar Simo-Serra; Mayu Otani; Kota Yamaguchi

Creative workflows for generating graphical documents involve complex, interrelated tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors. In this work, we attempt to build a holistic model that can jointly solve many different design tasks. Our model, which we call FlexDM, treats vector graphic documents as a set of multi-modal elements and learns to predict masked fields, such as element type, position, styling attributes, image, or text, using a unified architecture. Through explicit multi-task learning and in-domain pre-training, the model better captures the multi-modal relationships among the different document fields. Experimental results corroborate that a single FlexDM can solve a multitude of different design tasks while achieving performance competitive with task-specific and costly baselines.
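To make the masked-field formulation concrete, the following is a minimal Python sketch of how a document could be represented as a set of multi-modal elements and how different design tasks could be posed as different masking patterns. It is an illustration under stated assumptions, not the authors' implementation: the field names, the `MASK` token, and the helpers `mask_fields`, `attribute_pattern`, `element_pattern`, and `random_pattern` are all hypothetical, and no model is included.

```python
import random
from typing import Any, Dict, List, Tuple

MASK = "<MASK>"  # hypothetical placeholder for a hidden field

Element = Dict[str, Any]           # one element: field name -> value
Document = List[Element]           # a document: a set of elements
Position = Tuple[int, str]         # (element index, field name)


def mask_fields(doc: Document, positions: List[Position]):
    """Return a masked copy of `doc` and the hidden values to be predicted."""
    masked = [dict(el) for el in doc]  # copy so the input stays untouched
    targets: Dict[Position, Any] = {}
    for idx, field in positions:
        targets[(idx, field)] = masked[idx][field]
        masked[idx][field] = MASK
    return masked, targets


def attribute_pattern(doc: Document, field: str) -> List[Position]:
    """Hide one attribute on every element that has it (e.g. all positions)."""
    return [(i, field) for i, el in enumerate(doc) if field in el]


def element_pattern(doc: Document, idx: int) -> List[Position]:
    """Hide every field of a single element (filling in a missing element)."""
    return [(idx, field) for field in doc[idx]]


def random_pattern(doc: Document, rate: float, rng: random.Random) -> List[Position]:
    """Hide each field independently with probability `rate`."""
    return [(i, f) for i, el in enumerate(doc) for f in el if rng.random() < rate]


# A toy two-element document with assumed field names.
doc: Document = [
    {"type": "text", "position": (0.1, 0.1), "font": "Lato", "text": "Summer Sale"},
    {"type": "image", "position": (0.0, 0.3), "image": "photo.png"},
]

# Layout-style task: hide every position, keep everything else visible.
masked, targets = mask_fields(doc, attribute_pattern(doc, "position"))
```

In this sketch, a single predictor would receive `masked` and be trained to recover `targets`; switching from `attribute_pattern` to `element_pattern` or `random_pattern` changes the task without changing the data format, which is the sense in which one model can serve many design tasks.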

updated: Fri Mar 31 2023 17:59:56 GMT+0000 (UTC)

published: Fri Mar 31 2023 17:59:56 GMT+0000 (UTC)

arXiv
