XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation

Wei Liu; Fangyue Liu; Fei Ding; Qian He; Zili Yi

Generating a new font library is a labor-intensive and time-consuming job for glyph-rich scripts. Few-shot font generation is therefore desirable, as it requires only a few glyph references and no fine-tuning at test time. Existing methods follow the style-content disentanglement paradigm and expect novel fonts to be produced by combining the style codes of the reference glyphs with the content representations of the source. However, these few-shot font generation methods either fail to capture content-independent style representations or employ localized component-wise style representations, which are insufficient to model the many Chinese font styles that involve hyper-component features such as inter-component spacing and "connected-stroke". To resolve these drawbacks and make the style representations more reliable, we propose a self-supervised cross-modality pre-training strategy and a cross-modality transformer-based encoder that is conditioned jointly on the glyph image and the corresponding stroke labels. The cross-modality encoder is pre-trained in a self-supervised manner to capture cross- and intra-modality correlations effectively, which facilitates content-style disentanglement and the modeling of style representations at all scales (stroke-level, component-level and character-level). The pre-trained encoder is then applied to the downstream font generation task without fine-tuning. Experimental comparisons with state-of-the-art methods demonstrate that our method successfully transfers styles of all scales. In addition, it requires only one reference glyph and achieves the lowest rate of bad cases in the few-shot font generation task, 28% lower than the second best.
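
To make the phrase "conditioned jointly on the glyph image and the corresponding stroke labels" concrete, the following is a minimal sketch of how a joint input sequence for such an encoder might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify patch size, tokenization, or stroke vocabulary, and the names `split_into_patches`, `build_joint_sequence`, `IMAGE`, and `STROKE` are hypothetical.

```python
from typing import List, Tuple, Union

# Hypothetical modality ids; the abstract does not define an encoding.
IMAGE, STROKE = 0, 1

Token = Tuple[str, Union[List[float], int]]


def split_into_patches(image: List[List[float]], patch: int) -> List[List[float]]:
    """Flatten a square grayscale image into row-major, non-overlapping patches."""
    size = len(image)
    if size % patch != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square with a side divisible by patch")
    patches = []
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def build_joint_sequence(image: List[List[float]],
                         stroke_labels: List[int],
                         patch: int = 2) -> Tuple[List[Token], List[int]]:
    """Concatenate image-patch tokens and stroke-label tokens into one sequence.

    The parallel list of modality ids marks which tokens come from the image
    and which from the stroke labels (an assumed convention), so that an
    attention-based encoder could relate tokens within and across modalities.
    """
    patches = split_into_patches(image, patch)
    tokens: List[Token] = [("patch", p) for p in patches]
    tokens += [("stroke", s) for s in stroke_labels]
    modality_ids = [IMAGE] * len(patches) + [STROKE] * len(stroke_labels)
    return tokens, modality_ids
```

In a real model, each token would be embedded and the modality id would typically select an additional learned embedding; those steps are omitted here because the abstract gives no details about them.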

updated: Mon Apr 11 2022 13:34:40 GMT+0000 (UTC)

published: Mon Apr 11 2022 13:34:40 GMT+0000 (UTC)

arXiv
