Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition

Tong Yu; Didier Mutter; Jacques Marescaux; Nicolas Padoy

Vision algorithms capable of interpreting scenes from a real-time video stream are necessary for computer-assisted surgery systems to achieve context-aware behavior. In laparoscopic procedures, one particular algorithm needed for such systems is the identification of surgical phases, for which the current state of the art is a model based on a CNN-LSTM. A number of previous works using models of this kind have trained them in a fully supervised manner, requiring a fully annotated dataset. Instead, our work confronts the problem of learning surgical phase recognition in scenarios where annotated data is scarce (under 25% of all available video recordings). We propose a teacher/student type of approach, where a strong predictor called the teacher, trained beforehand on a small dataset of ground truth-annotated videos, generates synthetic annotations for a larger dataset, which another model, the student, learns from. In our case, the teacher features a novel CNN-biLSTM-CRF architecture, designed for offline inference only. The student, on the other hand, is a CNN-LSTM capable of making real-time predictions. Results for various amounts of manually annotated videos demonstrate the superiority of the new CNN-biLSTM-CRF predictor as well as improved performance from the CNN-LSTM trained using synthetic labels generated for unannotated videos. For both offline and online surgical phase recognition with very few annotated recordings available, this new teacher/student strategy provides a valuable performance improvement by efficiently leveraging the unannotated data.
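
The data flow of the teacher/student strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the CNN-biLSTM-CRF teacher and CNN-LSTM student are replaced by toy stand-ins (a nearest-centroid classifier over one scalar feature per frame, with a past-and-future majority vote playing the offline teacher role), and every function name and parameter below is hypothetical. Only the three steps are taken from the abstract: train the teacher on the small annotated set, let it generate synthetic annotations for the unannotated videos, and train the student on both.

```python
from collections import Counter, defaultdict


def fit_centroids(videos, labels):
    """Toy stand-in for training: mean scalar feature per surgical phase."""
    sums, counts = defaultdict(float), Counter()
    for feats, labs in zip(videos, labels):
        for x, y in zip(feats, labs):
            sums[y] += x
            counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}


def predict_online(centroids, feats):
    """Student role: per-frame prediction that uses the current frame only."""
    return [min(centroids, key=lambda y: abs(x - centroids[y])) for x in feats]


def predict_offline(centroids, feats, radius=2):
    """Teacher role: majority vote over past AND future frames.

    Looking at future frames is only possible offline; this smoothing is an
    assumed simplification, not the biLSTM-CRF of the paper.
    """
    raw = predict_online(centroids, feats)
    smoothed = []
    for t in range(len(raw)):
        window = raw[max(0, t - radius): t + radius + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


def teacher_student(annotated, annotations, unannotated):
    """Run the three steps: train teacher, label unannotated data, train student."""
    # 1. Train the teacher on the small manually annotated set.
    teacher = fit_centroids(annotated, annotations)
    # 2. The teacher generates synthetic annotations for the unannotated videos.
    synthetic = [predict_offline(teacher, video) for video in unannotated]
    # 3. The student learns from manual and synthetic annotations together.
    student = fit_centroids(annotated + unannotated, annotations + synthetic)
    return teacher, synthetic, student
```

Calling `teacher_student(annotated, annotations, unannotated)` with lists of per-frame feature sequences returns the teacher parameters, the synthetic labels, and the student parameters; `predict_online(student, frames)` then labels new frames one at a time, which mirrors the real-time constraint placed on the student.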

updated: Wed Sep 30 2020 14:22:43 GMT+0000 (UTC)

published: Fri Nov 30 2018 19:50:05 GMT+0000 (UTC)

arXiv
