Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships

Chao Lou; Wenjuan Han; Yuhuan Lin; Zilong Zheng

Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling, comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees) individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. We introduce a new task, more challenging but worthwhile, that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than using labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures, followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships on fine-grained VL structure construction.

updated: Wed Jun 01 2022 11:14:36 GMT+0000 (UTC)

published: Sun Mar 27 2022 09:51:34 GMT+0000 (UTC)

arXiv
