Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

Kun Yuan; Vinkle Srivastav; Tong Yu; Joel L. Lavanchy; Pietro Mascagni; Nassir Navab; Nicolas Padoy

Recent advancements in surgical computer vision applications have been driven by fully-supervised methods that primarily use only visual data. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, which limits their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that surgical video lectures available through open surgical e-learning platforms can provide effective supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP (Surgical Vision Language Pre-training), for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective that aligns video clip embeddings with their multiple corresponding text embeddings by bringing them together within a joint latent space. To demonstrate the representation capability of the learned joint latent space, we introduce several vision-and-language tasks for surgery, such as text-based video retrieval, temporal activity grounding, and video captioning, as evaluation benchmarks. We further demonstrate that, without using any labeled ground truth, our approach can be employed for traditional vision-only surgical downstream tasks, such as surgical tool, phase, and triplet recognition. The code will be made available at https://github.com/CAMMA-public/SurgVLP.
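
As an illustration of the kind of objective the abstract describes, the sketch below computes a generic symmetric InfoNCE loss between video clip embeddings and several text "views" of the same clips (for example, transcripts from different speech recognition systems), averaged over the views. The abstract does not give the exact form of the SurgVLP loss, so this is an assumed, simplified stand-in and not the authors' implementation. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, targets, temperature):
    """Mean cross-entropy where anchors[i] should match targets[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidate targets.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def multi_text_contrastive_loss(video_emb, text_views, temperature=0.07):
    """Symmetric InfoNCE averaged over several text views (assumed form).

    video_emb:  list of N clip embeddings.
    text_views: list of views; each view is a list of N text embeddings,
                where entry i describes clip i.
    """
    video = [l2_normalize(v) for v in video_emb]
    loss = 0.0
    for view in text_views:
        text = [l2_normalize(t) for t in view]
        video_to_text = info_nce(video, text, temperature)
        text_to_video = info_nce(text, video, temperature)
        loss += 0.5 * (video_to_text + text_to_video)
    return loss / len(text_views)


# Toy 2-D embeddings: two clips and two text views per clip (made-up values).
video = [[1.0, 0.0], [0.0, 1.0]]
view_a = [[0.9, 0.1], [0.1, 0.9]]
view_b = [[1.0, 0.2], [0.2, 1.0]]

aligned = multi_text_contrastive_loss(video, [view_a, view_b])
shuffled = multi_text_contrastive_loss(video, [view_a[::-1], view_b[::-1]])
print(aligned, shuffled)
```

On the toy data, the loss is lower when each clip is paired with its own texts than when the pairing is reversed, which is the behavior a contrastive alignment objective is meant to reward.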

updated: Sat Jan 13 2024 13:56:32 GMT+0000 (UTC)

published: Thu Jul 27 2023 22:38:12 GMT+0000 (UTC)

arXiv
