LISA: Localized Image Stylization with Audio via Implicit Neural Representation

Seung Hyun Lee; Chanyoung Kim; Wonmin Byeon; Sang Ho Yoon; Jinkyu Kim; Sangpil Kim

We present a novel framework, Localized Image Stylization with Audio (LISA), which performs audio-driven localized image stylization. Sound often provides information about the specific context of a scene and is closely related to a certain part of the scene or to an object in it. However, existing image stylization work has focused on stylizing the entire image using an image or text input. Stylizing a particular part of an image based on audio input is natural but challenging. In this work, we propose a framework in which a user provides one audio input to localize the sound source in the input image and another to locally stylize the target object or scene. LISA first produces a delicate localization map with an audio-visual localization network by leveraging the CLIP embedding space. We then use an implicit neural representation (INR) together with the predicted localization map to stylize the target object or scene based on sound information. The proposed INR can manipulate the localized pixel values so that they are semantically consistent with the provided audio input. Through a series of experiments, we show that the proposed framework outperforms other audio-guided stylization methods. Moreover, LISA constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
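The second stage described above, an INR whose output is restricted by the localization map, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `make_inr` and `stylize_locally`, the tiny sine-activated coordinate network, its random untrained weights, and the `strength` parameter are all assumptions made for illustration. In the paper the INR would be optimized against the audio input; that objective is omitted here, and the sketch only shows how a per-pixel map in [0, 1] can confine a coordinate-based edit to the localized region.

```python
import math
import random


def make_inr(hidden=8, seed=0):
    """Hypothetical tiny coordinate network: (x, y) in [0, 1]^2 -> RGB offset in [-1, 1]^3.

    Weights are random and untrained; a real INR would be optimized so the
    edited region matches the audio input.
    """
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(3)]

    def inr(x, y):
        # Sine activation on the hidden layer, tanh to bound the RGB offset.
        h = [math.sin(w[0] * x + w[1] * y + b) for w, b in zip(w1, b1)]
        return [math.tanh(sum(wi * hi for wi, hi in zip(row, h))) for row in w2]

    return inr


def stylize_locally(image, loc_map, inr, strength=0.5):
    """Add the INR's RGB offset to each pixel, weighted by the localization map.

    image:   H x W x 3 nested lists with values in [0, 1]
    loc_map: H x W nested lists with values in [0, 1] (1 = sound source)
    """
    height, width = len(image), len(image[0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            # Normalize pixel indices to continuous coordinates in [0, 1].
            x = j / max(width - 1, 1)
            y = i / max(height - 1, 1)
            offset = inr(x, y)
            m = loc_map[i][j]
            # Pixels with m == 0 are left unchanged; results are clamped to [0, 1].
            pixel = [min(1.0, max(0.0, c + m * strength * d))
                     for c, d in zip(image[i][j], offset)]
            row.append(pixel)
        out.append(row)
    return out


# Usage: a 2 x 2 gray image where only the top-right pixel is "localized".
image = [[[0.5, 0.5, 0.5] for _ in range(2)] for _ in range(2)]
loc_map = [[0.0, 1.0], [0.0, 0.0]]
result = stylize_locally(image, loc_map, make_inr())
```

Multiplying the offset by the map value, rather than thresholding the map, lets a soft localization map fade the edit out smoothly at the region boundary.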

updated: Mon Nov 21 2022 11:51:48 GMT+0000 (UTC)

published: Mon Nov 21 2022 11:51:48 GMT+0000 (UTC)

arXiv
