Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning

Haichao Zhang; Wei Xu; Haonan Yu

Standard model-free reinforcement learning algorithms optimize a policy that generates the action to be taken at the current time step in order to maximize the expected future return. While flexible, this approach suffers from inefficient exploration because of its single-step nature. In this work, we present the Generative Planning Method (GPM), which generates actions not only for the current step but also for a number of future steps (hence the term generative planning). This brings several benefits. First, since GPM is trained by maximizing value, the plans it generates can be regarded as intentional action sequences for reaching high-value regions. GPM can therefore leverage its generated multi-step plans for temporally coordinated exploration towards high-value regions, which is potentially more effective than a sequence of actions generated by perturbing each action at the single-step level, whose consistent movement decays exponentially with the number of exploration steps. Second, starting from a crude initial plan generator, GPM can refine it to adapt to the task, which in turn benefits future exploration. This is potentially more effective than the commonly used action-repeat strategy, whose plans are non-adaptive in form. Additionally, since a multi-step plan can be interpreted as the agent's intent from the present over a span of time into the future, it offers a more informative and intuitive signal for interpretation. Experiments on several benchmark environments demonstrate its effectiveness compared with several baseline methods.
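The remark about consistent movement decaying exponentially can be made concrete with a toy model. The sketch below is not the GPM algorithm: it assumes a hypothetical 1-D agent whose exploration noise is an independent, symmetric choice of -1 or +1 at each step, and compares it with a plan that samples one direction and holds it (closer to action repeat than to a learned plan). The function names and parameters are illustrative assumptions, not taken from the paper.

```python
import random


def prob_consistent_direction(n_steps: int) -> float:
    """Probability that n independent, symmetric +/-1 perturbations
    all point the same way (toy assumption, not from the paper)."""
    return 2.0 ** (1 - n_steps)


def mean_abs_displacement(n_steps: int, per_step: bool,
                          n_trials: int = 2000, seed: int = 0) -> float:
    """Average distance travelled after n_steps in the toy 1-D model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        if per_step:
            # A fresh random direction is drawn at every step.
            position = sum(rng.choice((-1, 1)) for _ in range(n_steps))
        else:
            # One direction is drawn once and held for the whole plan.
            position = rng.choice((-1, 1)) * n_steps
        total += abs(position)
    return total / n_trials


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n,
              prob_consistent_direction(n),
              mean_abs_displacement(n, per_step=True),
              mean_abs_displacement(n, per_step=False))
```

Under these assumptions, the chance of a coherent n-step push halves with every extra step for per-step noise, while a committed plan always covers the full n steps. GPM, as described in the abstract, additionally learns which plans to commit to by maximizing value, which this sketch does not model.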

updated: Mon Jan 24 2022 15:53:32 GMT+0000 (UTC)

published: Mon Jan 24 2022 15:53:32 GMT+0000 (UTC)

arXiv
