Video-aided Unsupervised Grammar Induction

Songyang Zhang; Linfeng Song; Lifeng Jin; Kun Xu; Dong Yu; Jiebo Luo

We investigate video-aided grammar induction, which learns a constituency parser from unlabeled text paired with its corresponding video. Existing methods for multi-modal grammar induction focus on learning syntactic grammars from text-image pairs, with promising results showing that information from static images is useful for induction. However, videos provide even richer information, including not only static objects but also actions and state changes that are useful for inducing verb phrases. In this paper, we explore rich features (e.g., action, object, scene, audio, face, OCR, and speech) from videos, taking the recent Compound PCFG model as the baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities. Our proposed MMC-PCFG is trained end-to-end and outperforms each individual modality and previous state-of-the-art systems on three benchmarks (DiDeMo, YouCook2, and MSRVTT), confirming the effectiveness of leveraging video information for unsupervised grammar induction.

updated: Tue May 04 2021 00:23:28 GMT+0000 (UTC)

published: Fri Apr 09 2021 14:01:36 GMT+0000 (UTC)

arXiv
