TrUMAn: Trope Understanding in Movies and Animations

Hung-Ting Su; Po-Wei Shen; Bing-Chen Tsai; Wen-Feng Cheng; Ke-Jyun Wang; Winston H. Hsu

Understanding video content is crucial for many real-world applications such as search and recommendation systems. While recent progress in deep learning has boosted performance on various tasks using visual cues, deep cognition for reasoning about intentions, motivation, or causality remains challenging. Existing datasets that aim to examine video reasoning capability either focus on visual signals such as actions, objects, and relations, or can be answered by exploiting text bias. Observing this, we propose a novel task, along with a new dataset, Trope Understanding in Movies and Animations (TrUMAn), intended to evaluate and develop learning systems beyond visual signals. Tropes are storytelling devices frequently used in creative works. By tackling the trope understanding task and enabling deep cognition skills in machines, we are optimistic that data mining applications and algorithms can be taken to the next level. To address the challenging TrUMAn dataset, we present the Trope Understanding and Storytelling (TrUSt) model with a new Conceptual Storyteller module, which guides the video encoder by performing video storytelling in a latent space. The generated story embedding is then fed into the trope understanding model to provide further signals. Experimental results demonstrate that state-of-the-art learning systems for existing tasks reach only 12.01% accuracy with raw input signals. Even in the oracle case with human-annotated descriptions, BERT contextual embeddings achieve at most 28% accuracy. Our proposed TrUSt boosts model performance and reaches 13.94% accuracy. We also provide detailed analysis to pave the way for future research. TrUMAn is publicly available at: https://www.cmlab.csie.ntu.edu.tw/project/trope

updated: Tue Aug 10 2021 09:34:14 GMT+0000 (UTC)

published: Tue Aug 10 2021 09:34:14 GMT+0000 (UTC)

arXiv
