One-shot Visual Reasoning on RPMs with an Application to Video Frame Prediction

Wentao He; Jianfeng Ren; Ruibin Bai

Raven's Progressive Matrices (RPMs) are frequently used to evaluate human visual reasoning ability. Researchers have devoted considerable effort to developing systems that solve RPM problems automatically, often using a black-box, end-to-end Convolutional Neural Network (CNN) for both visual recognition and logical reasoning. To obtain a highly explainable solution, we propose the One-shot Human-Understandable ReaSoner (Os-HURS), a two-step framework consisting of a perception module and a reasoning module, which address real-world visual recognition and the subsequent logical reasoning, respectively. For the reasoning module, we propose a "2+1" formulation that is easier for humans to understand and significantly reduces model complexity. As a result, a precise reasoning rule can be deduced from a single RPM sample, which is not feasible with existing methods. The proposed reasoning module can also yield a set of reasoning rules that precisely model the human knowledge used in solving RPM problems. To validate the proposed method on real-world applications, we construct an RPM-like One-shot Frame-prediction (ROF) dataset, in which visual reasoning is performed on RPMs built from real-world video frames instead of synthetic images. Experimental results on various RPM-like datasets demonstrate that the proposed Os-HURS achieves a significant and consistent performance gain over state-of-the-art models.
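
The abstract does not spell out the "2+1" formulation. One possible reading, assumed here purely for illustration, is that the third panel of each row is treated as a function of the first two, so that candidate rules can be checked against the two complete rows of a single RPM and then applied to the incomplete row. The sketch below follows that reading on integer attribute values; the rule set, the function names `infer_rules` and `solve`, and the integer encoding are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A rule maps the first two attribute values of a row to the third ("2+1").
Rule = Callable[[int, int], int]

# Hypothetical candidate rules (an assumption, not the paper's rule set).
# A constant row is the special case of "progression" with step 0.
CANDIDATE_RULES: Dict[str, Rule] = {
    "progression": lambda a, b: 2 * b - a,  # equal step between panels
    "sum": lambda a, b: a + b,              # third = first + second
    "difference": lambda a, b: a - b,       # third = first - second
}


def infer_rules(complete_rows: List[Tuple[int, int, int]]) -> List[str]:
    """Return the names of all rules consistent with every complete row."""
    return [
        name
        for name, rule in CANDIDATE_RULES.items()
        if all(rule(a, b) == c for a, b, c in complete_rows)
    ]


def solve(
    complete_rows: List[Tuple[int, int, int]],
    partial_row: Tuple[int, int],
    candidates: List[int],
) -> Optional[int]:
    """Pick the index of the candidate answer predicted by a consistent rule."""
    for name in infer_rules(complete_rows):
        predicted = CANDIDATE_RULES[name](*partial_row)
        if predicted in candidates:
            return candidates.index(predicted)
    return None  # no candidate rule explains this sample


# One RPM sample: two complete rows, one incomplete row, four answer choices.
rows = [(1, 2, 3), (2, 3, 4)]
rules = infer_rules(rows)
answer = solve(rows, (3, 4), [4, 5, 6, 7])
```

In this toy setting the rule is recovered from one sample because only two rows need to agree with it, and the surviving rule name doubles as a human-readable explanation of the answer.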

updated: Wed Nov 24 2021 06:51:38 GMT+0000 (UTC)

published: Wed Nov 24 2021 06:51:38 GMT+0000 (UTC)

arXiv
