Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation

Chia-Wen Kuo; Chih-Yao Ma; Judy Hoffman; Zsolt Kira

In Vision-and-Language Navigation (VLN), researchers typically use an image encoder pre-trained on ImageNet without fine-tuning it on the environments in which the agent will be trained or tested. However, the distribution shift between the ImageNet training images and the views in the navigation environments may render the ImageNet pre-trained image encoder suboptimal. Therefore, in this paper, we design a set of structure-encoding auxiliary tasks (SEA) that leverage the data in the navigation environments to pre-train and improve the image encoder. Specifically, we design and customize (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification to pre-train the image encoder. Through rigorous ablations, our SEA pre-trained features are shown to better encode the structural information of the scenes, which ImageNet pre-trained features fail to encode properly but which is crucial for the target navigation task. The SEA pre-trained features can be easily plugged into existing VLN agents without any tuning. For example, on Test-Unseen environments, VLN agents combined with our SEA pre-trained features achieve absolute success rate improvements of 12% for Speaker-Follower, 5% for Env-Dropout, and 4% for AuxRN.
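
The claim that the features can be plugged into existing agents without any tuning can be pictured with a minimal sketch. It assumes, as an illustration only, that an agent reads frozen, precomputed per-view features from a lookup table, so changing encoders amounts to changing the table. The names `FeatureStore`, `ToyAgent`, `swap_store`, and the `(scan_id, viewpoint_id)` key are hypothetical and are not taken from the paper or from any VLN codebase.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (scan_id, viewpoint_id) identifies one panoramic location.
ViewKey = Tuple[str, str]
# One feature vector per discretized view of the panorama.
Features = List[List[float]]


class FeatureStore:
    """Lookup table of precomputed, frozen image features (illustrative only)."""

    def __init__(self, name: str, table: Dict[ViewKey, Features]) -> None:
        self.name = name
        self._table = table

    def dim(self) -> int:
        # Feature dimensionality, read from any stored entry.
        first = next(iter(self._table.values()))
        return len(first[0])

    def lookup(self, key: ViewKey) -> Features:
        return self._table[key]


def mean_pool(features: Features) -> List[float]:
    """Average the per-view vectors into one panorama-level vector."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


class ToyAgent:
    """Stand-in for a VLN agent; it only ever reads features from its store."""

    def __init__(self, store: FeatureStore) -> None:
        self.store = store

    def observe(self, key: ViewKey) -> List[float]:
        return mean_pool(self.store.lookup(key))

    def swap_store(self, new_store: FeatureStore) -> None:
        # The swap needs no change to the agent as long as the dimensionality
        # matches, which is the assumption behind "without any tuning" here.
        if new_store.dim() != self.store.dim():
            raise ValueError("feature dimensionality must match")
        self.store = new_store


# Toy values, made up for the example; real features would be much larger.
key = ("scan0", "viewpoint0")
imagenet_store = FeatureStore("imagenet", {key: [[1.0, 2.0], [3.0, 4.0]]})
sea_store = FeatureStore("sea", {key: [[0.0, 4.0], [2.0, 0.0]]})

agent = ToyAgent(imagenet_store)
before = agent.observe(key)
agent.swap_store(sea_store)
after = agent.observe(key)
```

In this sketch the agent code is identical before and after the swap; only the table behind `lookup` changes.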

updated: Sun Nov 20 2022 23:04:39 GMT+0000 (UTC)

published: Sun Nov 20 2022 23:04:39 GMT+0000 (UTC)

arXiv

