MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

Jianghui Wang; Yuxuan Wang; Dongyan Zhao; Zilong Zheng

We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding. Despite notable progress in video understanding, most prior work does not present tasks or models that address holistic video understanding and the innate visual narrative structure of long-form videos. To tackle this problem, we propose the MoviePuzzle task, which strengthens the temporal feature learning and structure learning of video models by reshuffling the shot, frame, and clip layers of movie segments in the presence of video-dialogue information. We first build a carefully refined dataset based on MovieNet by dissecting movies into hierarchical layers and randomly permuting their order. Besides benchmarking MoviePuzzle with prior art on movie understanding, we devise a Hierarchical Contrastive Movie Clustering (HCMC) model that considers the underlying structure and visual semantic order for movie reordering. Specifically, through a pairwise and contrastive learning approach, we train models to predict the correct order of each layer. This equips them to decipher the visual narrative structure of movies and to handle the disorder present in video data. Experiments show that our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark, underscoring its efficacy.
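
To make the task setup concrete, the following is a minimal sketch of the two ideas the abstract names: permuting a movie segment at each hierarchical layer, and recovering an order from pairwise "comes before" predictions. It is not the HCMC model. The nesting (frames inside shots inside clips), the function names `shuffle_hierarchy` and `order_by_pairwise_votes`, the string frame ids, and the vote-counting rule are all assumptions made for illustration; in practice the `precedes` callable would be a trained pairwise model rather than the oracle used in the usage note.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
# Assumed layout: a clip is a list of shots, a shot is a list of frame ids.
Clip = List[List[str]]


def shuffle_hierarchy(clips: List[Clip], rng: random.Random) -> List[Clip]:
    """Return a copy with frames, shots, and clips each randomly permuted.

    Items only move within their parent: a frame never leaves its shot,
    and a shot never leaves its clip. The input is not modified.
    """
    shuffled: List[Clip] = []
    for clip in clips:
        # Permute frames inside every shot (sample returns a new list).
        shots = [rng.sample(shot, len(shot)) for shot in clip]
        # Permute the shots inside this clip.
        rng.shuffle(shots)
        shuffled.append(shots)
    # Permute the clips themselves.
    rng.shuffle(shuffled)
    return shuffled


def order_by_pairwise_votes(
    items: Sequence[T], precedes: Callable[[T, T], float]
) -> List[T]:
    """Order items by how many other items each is predicted to precede.

    `precedes(a, b)` is a hypothetical scorer returning a value above 0.5
    when `a` is believed to come before `b`.
    """
    n = len(items)
    wins = {
        i: sum(precedes(items[i], items[j]) > 0.5 for j in range(n) if j != i)
        for i in range(n)
    }
    # The item that precedes the most others is placed first.
    return [items[i] for i in sorted(wins, key=wins.get, reverse=True)]
```

With an oracle scorer such as `lambda a, b: 1.0 if a < b else 0.0` over sortable frame ids, `order_by_pairwise_votes` restores the original order of a shuffled shot; applying it layer by layer (frames, then shots, then clips) mirrors the per-layer order prediction described above.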

updated: Wed Jun 14 2023 10:11:38 GMT+0000 (UTC)

published: Sun Jun 04 2023 03:51:54 GMT+0000 (UTC)

arXiv
