Towards Robust Bisimulation Metric Learning

Mete Kemertas; Tristan Aumentado-Armstrong

Learned representations in deep reinforcement learning (DRL) have to extract task-relevant information from complex observations, balancing robustness to distraction against informativeness for the policy. Such stable and rich representations, often learned via modern function approximation techniques, can enable practical application of the policy improvement theorem, even in high-dimensional continuous state-action spaces. Bisimulation metrics offer one solution to this representation learning problem by collapsing functionally similar states together in representation space, which promotes invariance to noise and distractors. In this work, we generalize value function approximation bounds for on-policy bisimulation metrics to non-optimal policies and approximate environment dynamics. Our theoretical results help us identify embedding pathologies that may occur in practical use. In particular, we find that these issues stem from an underconstrained dynamics model and an unstable dependence of the embedding norm on the reward signal in environments with sparse rewards. Further, we propose a set of practical remedies: (i) a norm constraint on the representation space, and (ii) an extension of prior approaches with intrinsic rewards and latent space regularization. Finally, we provide evidence that the resulting method is not only more robust to sparse reward functions, but also able to solve challenging continuous control tasks with observational distractions, where prior methods fail.
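To make the abstract's two central ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It illustrates (a) one possible norm constraint, projecting embeddings onto the unit L2 sphere, and (b) a bisimulation-style regression target that combines a reward difference with a discounted distance between predicted next-state embeddings. The function names, the choice of the L2 distance, the assumption of deterministic latent dynamics (which lets the transition term reduce to a distance between two points), and the default `gamma` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def normalize_embeddings(z, eps=1e-8):
    """Project each row of z onto the unit L2 sphere.

    Assumption: this is only one way to impose a norm constraint on the
    representation space; the paper's exact constraint may differ.
    """
    norms = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norms, eps)


def bisimulation_target(r_i, r_j, z_next_i, z_next_j, gamma=0.99):
    """Reward gap plus discounted distance between predicted next embeddings.

    Assumption: latent dynamics are deterministic, so the distance between
    next-state distributions is just the distance between two points.
    """
    reward_gap = np.abs(r_i - r_j)
    transition_gap = np.linalg.norm(z_next_i - z_next_j, axis=-1)
    return reward_gap + gamma * transition_gap


def bisimulation_loss(z_i, z_j, target):
    """Mean squared error between the embedding distance and the target."""
    dist = np.linalg.norm(z_i - z_j, axis=-1)
    return float(np.mean((dist - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical batch: 4 state pairs with 8-dimensional embeddings.
    z_i = normalize_embeddings(rng.normal(size=(4, 8)))
    z_j = normalize_embeddings(rng.normal(size=(4, 8)))
    z_next_i = normalize_embeddings(rng.normal(size=(4, 8)))
    z_next_j = normalize_embeddings(rng.normal(size=(4, 8)))
    r_i, r_j = rng.uniform(size=4), rng.uniform(size=4)

    target = bisimulation_target(r_i, r_j, z_next_i, z_next_j)
    print(bisimulation_loss(z_i, z_j, target))
```

With unit-norm embeddings, the L2 distance between any two states is bounded by 2, so the embedding scale can no longer grow or shrink freely with the reward signal. This is the kind of effect a norm constraint is meant to have under the assumptions above.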

updated: Wed Oct 27 2021 00:32:07 GMT+0000 (UTC)

published: Wed Oct 27 2021 00:32:07 GMT+0000 (UTC)

arXiv
