Learning Structured Representations of Visual Scenes

Meng-Jiun Chiou

As intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairs of objects, have been shown not only to benefit compositional models in learning to reason along with the structures but also to provide higher interpretability for model decisions. Nevertheless, these representations receive much less attention than traditional recognition tasks, leaving numerous challenges unsolved. In this thesis, we study how machines can describe the content of an individual image or video with visual relationships as the structured representations. Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both the static-image and video settings, with improvements resulting from external knowledge incorporation, bias-reducing mechanisms, and enhanced representation models. At the end of this thesis, we also discuss some open challenges and limitations to shed light on future directions of structured representation learning for visual scenes.

updated: Sat Jul 09 2022 05:40:08 GMT+0000 (UTC)

published: Sat Jul 09 2022 05:40:08 GMT+0000 (UTC)

arXiv
