Contextual Explainable Video Representation: Human Perception-based Understanding

Khoa Vo; Kashu Yamazaki; Phong X. Nguyen; Phat Nguyen; Khoa Luu; Ngan Le

Video understanding is a growing field and a subject of intense research. It includes many tasks that require understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting contextual visual representations from untrimmed videos, which is difficult because of the long and complicated temporal structure of unconstrained videos. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore, it is crucial to design a contextual, explainable video representation extraction method that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.

updated: Sat Dec 17 2022 06:29:37 GMT+0000 (UTC)

published: Mon Dec 12 2022 19:29:07 GMT+0000 (UTC)

arXiv
