Learning to Perceive in Deep Model-Free Reinforcement Learning

Gonçalo Querido; Alberto Sardinha; Francisco Melo

This work proposes a novel model-free Reinforcement Learning (RL) agent that learns to complete an unknown task while having access to only part of the input observation. We take inspiration from visual attention and active perception, both characteristic of humans, and apply them to our agent to create a hard attention mechanism. In this mechanism, the model first decides which region of the input image to look at, and only then gains access to the pixels of that region. Current RL agents do not follow this principle, and we have not seen these mechanisms applied for the same purpose as in this work. In our architecture, we adapt an existing model, the recurrent attention model (RAM), and combine it with the proximal policy optimization (PPO) algorithm. We investigate whether a model with these characteristics can achieve performance similar to that of state-of-the-art model-free RL agents that access the full input observation. The analysis is carried out in two Atari games, Pong and SpaceInvaders, which have discrete action spaces, and in CarRacing, which has a continuous action space. Besides assessing performance, we also analyze how our model moves its attention and compare it with an example of human behavior. Even with this visual limitation, we show that our model matches the performance of PPO+LSTM in two of the three games tested.
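
The "decide where to look, then see only that region" step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `to_pixel` and `extract_glimpse`, the location convention of coordinates in [-1, 1] with (0, 0) at the image center, the square single-scale patch, and the zero padding at the borders are all assumptions made for illustration.

```python
import numpy as np


def to_pixel(loc, shape):
    """Map a location in [-1, 1]^2 to integer (row, col) pixel indices.

    Assumed convention: (-1, -1) is the top-left pixel, (1, 1) the bottom-right.
    """
    height, width = shape
    row = int(round((loc[0] + 1.0) / 2.0 * (height - 1)))
    col = int(round((loc[1] + 1.0) / 2.0 * (width - 1)))
    return row, col


def extract_glimpse(frame, loc, size):
    """Return a size x size patch of a 2D frame around the chosen location.

    Pixels outside the frame are filled with zeros, so the agent never
    receives anything beyond the patch it selected.
    """
    half = size // 2
    # Pad every side by `half` so patches near the border stay in bounds.
    padded = np.pad(frame, half, mode="constant")
    row, col = to_pixel(loc, frame.shape)
    # Pixel (row, col) sits at (row + half, col + half) in the padded frame,
    # so the patch starts at (row, col) in padded coordinates.
    return padded[row:row + size, col:col + size]


# Hypothetical 5 x 5 "observation" standing in for a game frame.
frame = np.arange(25).reshape(5, 5)
center_patch = extract_glimpse(frame, (0.0, 0.0), 3)
corner_patch = extract_glimpse(frame, (-1.0, -1.0), 3)
```

In a full agent, the location passed to `extract_glimpse` would come from the policy itself at the previous step, and only the returned patch (not `frame`) would be fed to the recurrent network.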

updated: Tue Jan 10 2023 00:31:57 GMT+0000 (UTC)

published: Tue Jan 10 2023 00:31:57 GMT+0000 (UTC)

arXiv
