CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Long Bai; Mobarakol Islam; Hongliang Ren

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without indicating where the answer is located. In addition, vision-language (ViL) embedding remains relatively unexplored for these kinds of tasks. A surgical Visual Question Localized-Answering (VQLA) system would therefore help medical students and junior surgeons learn from and understand recorded surgical videos. We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The CAT-ViL embedding module is designed to fuse multimodal features from visual and textual sources. The fused embedding is fed into a standard Data-Efficient Image Transformer (DeiT) module, followed by a parallel classifier and detector for joint prediction. We conduct experimental validation on public surgical videos from the MICCAI EndoVis Challenge 2017 and 2018. The experimental results highlight the superior performance and robustness of our proposed model compared to state-of-the-art approaches. Ablation studies further demonstrate the strong performance of all the proposed components. The proposed method provides a promising solution for surgical scene understanding and marks a first step toward Artificial Intelligence (AI)-based VQLA systems for surgical training. Our code is publicly available.
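
The idea of gating two modalities during fusion can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gated_fusion`, the per-dimension parameters `w_visual`, `w_text`, and `bias`, and the use of plain Python lists are all assumptions made for illustration, and the co-attention step, the DeiT module, and the prediction heads are omitted. The sketch only shows how a learned sigmoid gate could decide, per feature dimension, how much of the visual feature versus the text feature passes into the fused embedding.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, text, w_visual, w_text, bias):
    """Blend two equal-length feature vectors with a per-dimension gate.

    Hypothetical illustration: for each dimension the gate is
    sigmoid(w_visual * v + w_text * t + bias), and the fused value is
    gate * v + (1 - gate) * t.
    """
    lengths = {len(visual), len(text), len(w_visual), len(w_text), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for v, t, wv, wt, b in zip(visual, text, w_visual, w_text, bias):
        gate = sigmoid(wv * v + wt * t + b)  # close to 1 favors the visual feature
        fused.append(gate * v + (1.0 - gate) * t)
    return fused


if __name__ == "__main__":
    visual_feat = [1.0, 2.0]
    text_feat = [3.0, 4.0]
    zeros = [0.0, 0.0]
    # With all gate parameters at zero the gate is 0.5 everywhere,
    # so the fused vector is the element-wise average of the two inputs.
    print(gated_fusion(visual_feat, text_feat, zeros, zeros, zeros))
```

In a trained model the gate parameters would be learned, so the balance between modalities could differ per dimension and per input rather than being fixed in advance.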

updated: Sat Aug 19 2023 22:23:36 GMT+0000 (UTC)

published: Tue Jul 11 2023 11:35:40 GMT+0000 (UTC)

arXiv
