A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

Jiaxin Deng; Dong Shen; Haojie Pan; Xiangyu Wu; Ximan Liu; Gaofeng Meng; Fan Yang; Size Li; Ruiji Fu; Zhongyuan Wang

Video understanding is an important task on short-video business platforms, with wide application in video recommendation and classification. Most existing video understanding work focuses only on information that appears within the video content itself, including video frames, audio, and text. However, introducing common-sense knowledge from an external Knowledge Graph (KG) dataset is essential for video understanding when referring to content that is less directly related to the video. Owing to the lack of video knowledge graph datasets, work that integrates video understanding and KGs is rare. In this paper, we propose a heterogeneous dataset that contains multi-modal video entities and rich common-sense relations. This dataset also provides multiple novel video inference tasks, such as the Video-Relation-Tag (VRT) and Video-Relation-Video (VRV) tasks. Furthermore, based on this dataset, we propose an end-to-end model that jointly optimizes the video understanding objective with knowledge graph embedding, which not only better injects factual knowledge into video understanding but also generates effective multi-modal entity embeddings for the KG. Comprehensive experiments indicate that combining video understanding embeddings with factual knowledge benefits content-based video retrieval performance. Moreover, it helps the model generate better knowledge graph embeddings, which outperform traditional KGE-based methods on the VRT and VRV tasks with improvements of at least 42.36% and 17.73% in HITS@10, respectively.
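For readers unfamiliar with the HITS@10 metric cited above, the following is a minimal sketch of how a Hits@k score could be computed for a VRT-style query (video, relation, ?tag). It assumes a TransE-style L1 distance as the scoring function purely for illustration; the abstract does not state which scoring function the model or the baseline KGE methods use, and the function names, the "has_topic" relation, and the toy embeddings are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def transe_distance(head: Vector, relation: Vector, tail: Vector) -> float:
    """L1 distance ||h + r - t||_1; a smaller value means a more plausible triple.

    Assumption: TransE-style scoring, chosen only as a familiar KGE example.
    """
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def rank_of_true_tail(
    head: Vector,
    relation: Vector,
    true_tail: str,
    candidates: Dict[str, Vector],
) -> int:
    """Return the 1-based rank of the true tail among all candidate tails.

    Ties are resolved optimistically: only strictly better candidates
    push the true tail down the ranking.
    """
    true_score = transe_distance(head, relation, candidates[true_tail])
    better = sum(
        1
        for name, vec in candidates.items()
        if name != true_tail and transe_distance(head, relation, vec) < true_score
    )
    return 1 + better


def hits_at_k(ranks: List[int], k: int = 10) -> float:
    """Fraction of test queries whose true answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical 2-d embeddings for one video, one relation, and three tags.
video = [1.0, 0.0]
has_topic = [0.0, 1.0]
tags = {
    "cooking": [1.0, 1.0],  # distance 0.0 from video + has_topic
    "sports": [3.0, 1.0],   # distance 2.0
    "music": [1.0, 3.0],    # distance 2.0
}

ranks = [
    rank_of_true_tail(video, has_topic, "cooking", tags),
    rank_of_true_tail(video, has_topic, "sports", tags),
]
score_at_1 = hits_at_k(ranks, k=1)
score_at_10 = hits_at_k(ranks, k=10)
```

In a real evaluation the candidate set would be every tag (for VRT) or every video (for VRV) in the knowledge graph, and the ranks would be collected over all test triples before averaging.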

updated: Sat Nov 19 2022 09:00:45 GMT+0000 (UTC)

published: Sat Nov 19 2022 09:00:45 GMT+0000 (UTC)

arXiv
