Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels

Tianxin Tao; Daniele Reda; Michiel van de Panne

Vision Transformers (ViT) have recently demonstrated the significant potential of transformer architectures for computer vision. To what extent can image-based deep reinforcement learning also benefit from ViT architectures, as compared to standard convolutional neural network (CNN) architectures? To answer this question, we evaluate ViT training methods for image-based reinforcement learning (RL) control tasks and compare these results to a leading convolutional-network architecture method, RAD. For training the ViT encoder, we consider several recently proposed self-supervised losses that are treated as auxiliary tasks, as well as a baseline with no additional loss terms. We find that the CNN architectures trained using RAD still generally provide superior performance. For the ViT methods, all three types of auxiliary tasks that we consider provide a benefit over plain ViT training. Furthermore, ViT reconstruction-based tasks are found to significantly outperform ViT contrastive learning.

updated: Sun May 15 2022 18:42:33 GMT+0000 (UTC)

published: Mon Apr 11 2022 07:10:58 GMT+0000 (UTC)

arXiv
