SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

Lalithkumar Seenivasan; Mobarakol Islam; Gokul Kannan; Hongliang Ren

Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing, and their use across various domains is growing exponentially. By incorporating unidirectional attention, these autoregressive LLMs can generate long and coherent paragraphs. However, visual question answering (VQA) tasks require both vision and language processing, so models with bidirectional attention or models employing fusion techniques are often used to capture the context of multiple modalities at once. Because GPT does not natively process vision tokens, to exploit the advancements in GPT models for VQA in robotic surgery, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that expands the GPT2 model to include vision input (image). The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embedding (token type and pose). Given the limitations of unidirectional attention in GPT models and their ability to generate coherent long paragraphs, we carefully sequence the word tokens before the vision tokens, mimicking the human thought process of understanding the question to infer an answer from an image. Quantitatively, we show that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets (based on the Endoscopic Vision Challenge Robotic Scene Segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets with question types to allow sub-type analysis. Finally, we extensively study and present the effects of token sequencing, token type, and pose embedding for vision tokens in the LV-GPT model.
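
The token-ordering argument in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the names `WORD`, `VISION`, `build_token_types`, `causal_mask`, and `visible_counts` are hypothetical, the token counts are arbitrary, and pose embedding is not modeled. The sketch only shows that, under a causal (unidirectional) attention mask, placing word tokens first lets every vision token attend to the whole question, while the reverse order lets no vision token see any question word.

```python
from typing import List

# Hypothetical token-type ids (an assumption for illustration only).
WORD, VISION = 0, 1


def build_token_types(n_word: int, n_vision: int, words_first: bool = True) -> List[int]:
    """Return token-type ids for one question-image pair in the chosen order."""
    if words_first:
        return [WORD] * n_word + [VISION] * n_vision
    return [VISION] * n_vision + [WORD] * n_word


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def visible_counts(token_types: List[int], mask: List[List[bool]],
                   query_type: int, key_type: int) -> List[int]:
    """For each token of query_type, count the key_type tokens it can attend to."""
    counts = []
    for i, q in enumerate(token_types):
        if q != query_type:
            continue
        counts.append(sum(1 for j, k in enumerate(token_types)
                          if k == key_type and mask[i][j]))
    return counts


if __name__ == "__main__":
    for words_first in (True, False):
        types = build_token_types(n_word=4, n_vision=3, words_first=words_first)
        mask = causal_mask(len(types))
        seen = visible_counts(types, mask, query_type=VISION, key_type=WORD)
        order = "word -> vision" if words_first else "vision -> word"
        print(order, "| question words visible to each vision token:", seen)
```

With four word tokens and three vision tokens, the word-first order gives each vision token access to all four question words, whereas the vision-first order gives none. This matches the motivation stated in the abstract for sequencing word tokens before vision tokens.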

updated: Sat Jul 22 2023 15:43:46 GMT+0000 (UTC)

published: Wed Apr 19 2023 21:22:52 GMT+0000 (UTC)

arXiv

