Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train

Zhao Wang; Chang Liu; Shaoting Zhang; Qi Dou

Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer that captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model using global and local views in a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from the Baoshan Branch of Renji Hospital in Shanghai, China. Overall, our dataset consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted as the backbone for a given downstream task via fine-tuning. In experiments on 3 different types of downstream tasks (classification, segmentation, and detection), Endo-FM surpasses the current state-of-the-art (SOTA) self-supervised pre-training and adapter-based transfer learning methods by a significant margin: it outperforms VCL by 3.1% F1, 4.8% Dice, and 5.5% F1, and ST-Adapter by 5.9% F1, 9.6% Dice, and 9.9% F1, on classification, segmentation, and detection, respectively. Code, datasets, and models are released at https://github.com/med-air/Endo-FM.
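The abstract does not specify how the global and local views are drawn, so the following is only a minimal sketch of one plausible reading: global views cover most of each frame over more frames, while local views are shorter and more tightly cropped. The function names (`sample_view`, `make_views`), the frame counts, and the crop fractions are all illustrative assumptions, not the authors' settings; the released code at the URL above is the authoritative reference.

```python
import random


def sample_view(clip, num_frames, crop_frac, rng):
    """Sample a sub-clip: `num_frames` frames from a random temporal window,
    each cropped to the same random spatial region.

    clip: list of frames; each frame is a list of rows of pixel values.
    crop_frac: fraction of height/width kept by the crop (0 < crop_frac <= 1).
    Assumes num_frames <= len(clip). All parameters are hypothetical.
    """
    t, h, w = len(clip), len(clip[0]), len(clip[0][0])

    # Temporal sampling: pick a random window, then spread frames evenly in it.
    span = rng.randint(num_frames, t)        # window length (inclusive bounds)
    start = rng.randint(0, t - span)         # window start
    step = (span - 1) / max(1, num_frames - 1)
    idx = [start + round(i * step) for i in range(num_frames)]

    # Spatial sampling: one random crop shared by every frame in the view.
    ch, cw = max(1, int(h * crop_frac)), max(1, int(w * crop_frac))
    top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
    return [[row[left:left + cw] for row in clip[i][top:top + ch]] for i in idx]


def make_views(clip, rng, n_global=2, n_local=4):
    """Build global and local views of one clip (counts/sizes are assumptions)."""
    global_views = [sample_view(clip, 8, 0.9, rng) for _ in range(n_global)]
    local_views = [sample_view(clip, 4, 0.4, rng) for _ in range(n_local)]
    return global_views, local_views


# Usage on a dummy 16-frame, 64x64 single-channel clip.
clip = [[[0] * 64 for _ in range(64)] for _ in range(16)]
g_views, l_views = make_views(clip, random.Random(0))
```

In a self-supervised setup of this kind, the model would typically be trained so that its representations of the different views of the same clip agree; that training objective is not described in the abstract and is not sketched here.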

updated: Wed Nov 29 2023 11:54:44 GMT+0000 (UTC)

published: Thu Jun 29 2023 07:34:25 GMT+0000 (UTC)

arXiv
