Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation

Haitian Zeng; Xiaohan Wang; Wenguan Wang; Yi Yang

We introduce Kefa, a novel speaker model for navigation instruction generation. Existing speaker models in Vision-and-Language Navigation suffer from a large domain gap in visual features between different environments and from insufficient temporal grounding capability. To address these challenges, we propose a Knowledge Refinement Module that enhances the feature representation with external knowledge facts, and an Adaptive Temporal Alignment method that enforces fine-grained alignment between the generated instructions and the observation sequences. Moreover, we propose a new metric, SPICE-D, for navigation instruction evaluation, which is aware of the correctness of direction phrases. Experimental results on the R2R and UrbanWalk datasets show that the proposed Kefa speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.

updated: Tue Jul 25 2023 09:39:59 GMT+0000 (UTC)

published: Tue Jul 25 2023 09:39:59 GMT+0000 (UTC)

arXiv
