PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation

Liuyi Wang; Chengju Liu; Zongtao He; Shu Li; Qingqing Yan; Huiyi Chen; Qijun Chen

Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task. One powerful technique for improving generalization in VLN is to use an independent speaker model that provides pseudo instructions for data augmentation. However, current speaker models based on long short-term memory (LSTM) cannot attend to the features that are relevant at different locations and time steps. To address this, we propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network. PASTS uses a spatio-temporal encoder to fuse panoramic representations and to encode the intermediate connections across steps. In addition, to avoid the misalignment problem that could result in incorrect supervision, a speaker progress monitor (SPM) is proposed that enables the model to estimate the progress of instruction generation and facilitates more fine-grained captions. A multi-feature dropout (MFD) strategy is also introduced to alleviate overfitting. The proposed PASTS can be flexibly combined with existing VLN models. Experimental results demonstrate that PASTS outperforms all existing speaker models and improves the performance of previous VLN models, achieving state-of-the-art performance on the standard Room-to-Room (R2R) dataset.
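The speaker-based data augmentation mentioned in the abstract can be sketched roughly as follows. This is only an illustration of the general idea, not the PASTS implementation: the names `augment_with_speaker` and `toy_speaker`, the representation of a path as a tuple of viewpoint ids, and the choice to skip paths that already have a human instruction are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a path is a sequence of viewpoint ids, and a sample is
# (path, instruction, is_pseudo). None of these names come from the paper.
Path = Tuple[str, ...]
Sample = Tuple[Path, str, bool]


def augment_with_speaker(
    annotated: List[Tuple[Sequence[str], str]],
    sampled_paths: List[Sequence[str]],
    speaker: Callable[[Path], str],
) -> List[Sample]:
    """Combine human-annotated pairs with speaker-generated pseudo pairs."""
    # Keep every human-written instruction and mark it as not pseudo.
    augmented: List[Sample] = [(tuple(p), text, False) for p, text in annotated]
    seen = {path for path, _, _ in augmented}

    for raw_path in sampled_paths:
        path = tuple(raw_path)
        # Assumption for this sketch: paths that already have a human
        # instruction (or were already augmented) are not labeled again.
        if path in seen:
            continue
        augmented.append((path, speaker(path), True))
        seen.add(path)
    return augmented


def toy_speaker(path: Path) -> str:
    """Stand-in for a trained speaker model; it just verbalizes the ids."""
    return "walk through " + ", then ".join(path)


if __name__ == "__main__":
    human = [(["a", "b"], "go from a to b")]
    sampled = [["a", "b"], ["b", "c"]]
    for sample in augment_with_speaker(human, sampled, toy_speaker):
        print(sample)
```

In practice the `speaker` argument would be a trained model such as PASTS, and the combined list would be used to train a navigation agent; keeping the `is_pseudo` flag lets the training code weight or filter generated instructions separately from human ones.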

updated: Fri May 19 2023 02:25:56 GMT+0000 (UTC)

published: Fri May 19 2023 02:25:56 GMT+0000 (UTC)

arXiv
