Learning Triadic Belief Dynamics in Nonverbal Communication from Videos

Lifeng Fan; Shuwen Qiu; Zilong Zheng; Tao Gao; Song-Chun Zhu; Yixin Zhu

Humans possess a unique social cognition capability; nonverbal communication can convey rich social information among agents. In contrast, such crucial social characteristics are mostly missing from the existing scene understanding literature. In this paper, we incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to represent, model, learn, and infer agents' mental states from pure visual inputs. Crucially, such a mental representation takes the agents' beliefs into account: it represents the true world state and infers the beliefs in each agent's mental state, which may differ from the true world state. By aggregating different beliefs and true world states, our model essentially forms "five minds" during the interactions between two agents. This "five minds" model differs from prior works that infer beliefs in an infinite recursion; instead, agents' beliefs converge into a "common mind". Based on this representation, we further devise a hierarchical energy-based model that jointly tracks and predicts all five minds. From this new perspective, a social event is interpreted as a series of nonverbal communication and belief dynamics, which goes beyond the classic keyframe video summary. In the experiments, we demonstrate that such a social account provides a better video summary on videos with rich social interactions than state-of-the-art keyframe video summary methods.
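To make the "five minds" idea concrete, here is a minimal Python sketch of the bookkeeping it implies. Everything in it is an assumption for illustration: the abstract does not specify a data layout, so the names (`FiveMinds`, `observe`, `diverging`, and the labels `m1`, `m2`, `m12`, `m21`, `mc`) are hypothetical, as is the reading of the five minds as each agent's own belief, each agent's estimate of the other's belief, and a shared common mind. The hand-written update rule stands in for the learned hierarchical energy-based model, which it does not attempt to reproduce.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Hypothetical labels: m1/m2 are each agent's own belief, m12 is agent 1's
# estimate of agent 2's belief (m21 the reverse), and mc is the common mind.
MINDS = ("m1", "m2", "m12", "m21", "mc")


def _empty_minds() -> Dict[str, Dict[str, str]]:
    return {name: {} for name in MINDS}


@dataclass
class FiveMinds:
    # True world state: object name -> location.
    world: Dict[str, str] = field(default_factory=dict)
    # Each mind's belief about the same mapping.
    minds: Dict[str, Dict[str, str]] = field(default_factory=_empty_minds)

    def observe(self, obj: str, location: str, seen_by: Iterable[int]) -> None:
        """Move obj to location; only the witnessing agents update beliefs."""
        witnesses = set(seen_by)
        self.world[obj] = location
        if 1 in witnesses:
            self.minds["m1"][obj] = location
        if 2 in witnesses:
            self.minds["m2"][obj] = location
        # Assumption: co-witnessing counts as joint attention, so the
        # second-order beliefs and the common mind update together.
        if witnesses == {1, 2}:
            for name in ("m12", "m21", "mc"):
                self.minds[name][obj] = location

    def diverging(self, obj: str) -> List[str]:
        """Return the minds whose belief about obj differs from the world."""
        truth = self.world[obj]
        return [n for n in MINDS if self.minds[n].get(obj) != truth]


state = FiveMinds()
state.observe("ball", "box", seen_by=[1, 2])   # both agents watch
state.observe("ball", "basket", seen_by=[1])   # agent 2 misses the move
print(state.diverging("ball"))
```

In this toy run, agent 2's belief, both second-order beliefs, and the common mind still place the ball in the box, while the world and agent 1's own belief place it in the basket. This is the kind of divergence between beliefs and the true world state that the abstract describes; in the paper such states are inferred from visual cues rather than supplied by hand.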

updated: Wed Apr 07 2021 00:52:04 GMT+0000 (UTC)

published: Wed Apr 07 2021 00:52:04 GMT+0000 (UTC)

arXiv
