Static and Animated 3D Scene Generation from Free-form Text Descriptions

Faria Huq; Nafees Ahmed; Anindya Iqbal

Generating coherent and useful image or video scenes from a free-form textual description is a technically difficult problem. Textual descriptions of the same scene can vary greatly from person to person, and sometimes even for the same person from one time to the next. Because the choice of words and syntax varies between descriptions, it is challenging for a system to reliably produce a consistent, desirable output from different forms of language input. Prior work on scene generation has mostly been confined to rigid sentence structures for the text input, which restricts users' freedom in writing descriptions. In our work, we study a new pipeline that aims to generate static as well as animated 3D scenes from different types of free-form textual scene descriptions without any major restriction. In particular, to keep our study practical and tractable, we focus on a small subspace of all possible 3D scenes, containing various combinations of cubes, cylinders, and spheres. We design a two-stage pipeline. In the first stage, we encode the free-form text using an encoder-decoder neural architecture. In the second stage, we generate a 3D scene based on the generated encoding. Our neural architecture uses a state-of-the-art language model as the encoder to leverage rich contextual encoding, and a new multi-head decoder to predict multiple features of an object in the scene simultaneously. For our experiments, we generate a large synthetic dataset that contains 1,300,000 and 1,400,000 samples of unique static and animated scene descriptions, respectively. We achieve 98.427% accuracy on the test dataset in detecting the features of the 3D objects. Our work shows a proof of concept of one approach towards solving the problem, and we believe that, with enough training data, the same pipeline can be extended to handle an even broader set of 3D scene generation problems.
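The two-stage idea in the abstract can be illustrated with a small, purely hypothetical sketch. It is not the authors' implementation: the text encoder is replaced by a hand-written encoding vector, each decoder "head" is a toy linear scorer followed by an argmax, and the `size` feature, the weight values, and the names `Head`, `decode_object`, and `build_scene` are all assumptions made for illustration. Only the three shape labels come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

SHAPES = ("cube", "cylinder", "sphere")   # shape vocabulary named in the abstract
SIZES = ("small", "medium", "large")      # hypothetical feature, for illustration only


def linear(weights: List[List[float]], x: Sequence[float]) -> List[float]:
    """Score each label as the dot product of its weight row with the encoding."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


@dataclass
class Head:
    """One decoder head: predicts a single categorical feature of an object."""
    name: str
    labels: Sequence[str]
    weights: List[List[float]]  # one row of weights per label (assumed values)

    def predict(self, encoding: Sequence[float]) -> str:
        return self.labels[argmax(linear(self.weights, encoding))]


def decode_object(encoding: Sequence[float], heads: Sequence[Head]) -> Dict[str, str]:
    """Stage 1 (toy): every head reads the same encoding, so all features
    of the object are predicted from one shared representation."""
    return {head.name: head.predict(encoding) for head in heads}


def build_scene(objects: Sequence[Dict[str, str]]) -> List[Dict[str, object]]:
    """Stage 2 (toy): turn predicted features into a list of scene objects."""
    return [{"id": i, **features} for i, features in enumerate(objects)]


# Usage with a stand-in encoding; a real system would obtain it from a language model.
heads = [
    Head("shape", SHAPES, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    Head("size", SIZES, [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]),
]
encoding = [0.2, 0.9, 0.1]
scene = build_scene([decode_object(encoding, heads)])
print(scene)
```

The point of the sketch is only the structure: one shared encoding feeds several independent heads, and the resulting feature dictionary is what a scene builder would consume.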

updated: Sat Nov 28 2020 19:28:30 GMT+0000 (UTC)

published: Sun Oct 04 2020 11:31:21 GMT+0000 (UTC)

arXiv
