Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference

Riko Suzuki; Hitomi Yanaka; Koji Mineshima; Daisuke Bekki

This paper introduces a new video-and-language dataset with human actions for multimodal logical inference, which focuses on intentional and aspectual expressions that describe dynamic human actions. The dataset consists of 200 videos, 5,554 action labels, and 1,942 action triplets of the form <subject, predicate, object> that can be translated into logical semantic representations. The dataset is expected to be useful for evaluating multimodal inference systems between videos and semantically complicated sentences including negation and quantification.

updated: Sun Jun 27 2021 03:57:36 GMT+0000 (UTC)

published: Sun Jun 27 2021 03:57:36 GMT+0000 (UTC)

arXiv
