Voice-assisted Image Labelling for Endoscopic Ultrasound Classification using Neural Networks

Ester Bonmati; Yipeng Hu; Alexander Grimwood; Gavin J. Johnson; George Goodchild; Margaret G. Keane; Kurinchi Gurusamy; Brian Davidson; Matthew J. Clarkson; Stephen P. Pereira; Dean C. Barratt

Ultrasound imaging is a commonly used technology for visualising patient anatomy in real time during diagnostic and therapeutic procedures. High operator dependency and low reproducibility make ultrasound imaging and interpretation challenging, with a steep learning curve. Automatic image classification using deep learning has the potential to overcome some of these challenges by supporting ultrasound training for novices and by aiding more experienced practitioners in interpreting ultrasound images from patients with complex pathology. However, deep learning methods require a large amount of data to provide accurate results. Labelling large ultrasound datasets is a challenging task because labels are retrospectively assigned to 2D images without the 3D spatial context that is available in vivo or that would be inferred by visually tracking structures between frames during the procedure. In this work, we propose a multi-modal convolutional neural network (CNN) architecture that labels endoscopic ultrasound (EUS) images from raw verbal comments provided by a clinician during the procedure. We use a CNN composed of two branches, one for voice data and another for image data, which are joined to predict image labels from the spoken names of anatomical landmarks. The network was trained using recorded verbal comments from expert operators. Our results show a prediction accuracy of 76% at image level on a dataset with five different labels. We conclude that the addition of spoken commentaries can increase the performance of ultrasound image classification and eliminate the burden of manually labelling large EUS datasets necessary for deep learning applications.
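
The two-branch idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' network: each branch is replaced by a single dense layer standing in for a convolutional feature extractor, the weights are random and untrained, and the names TwoBranchClassifier, voice_dim, image_dim and hidden are hypothetical. Only the count of five labels comes from the abstract. The sketch shows one common way to join two modalities, which is to concatenate the branch features and pass them to a shared classification head.

```python
import math
import random

NUM_LABELS = 5  # the abstract reports 5 labels; all other sizes are assumed


def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of scores."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoBranchClassifier:
    """Hypothetical stand-in for a voice branch and an image branch
    whose features are concatenated before a shared label head."""

    def __init__(self, voice_dim, image_dim, hidden=8, seed=0):
        rng = random.Random(seed)
        self.voice_layer = init_layer(hidden, voice_dim, rng)
        self.image_layer = init_layer(hidden, image_dim, rng)
        self.head = init_layer(NUM_LABELS, 2 * hidden, rng)

    def predict_proba(self, voice_features, image_features):
        v = relu(dense(voice_features, *self.voice_layer))
        i = relu(dense(image_features, *self.image_layer))
        fused = v + i  # list concatenation joins the two branches
        return softmax(dense(fused, *self.head))


model = TwoBranchClassifier(voice_dim=4, image_dim=6)
probs = model.predict_proba([0.1] * 4, [0.2] * 6)
print(len(probs))  # one probability per label
```

In a real system the dense stand-ins would be replaced by convolutional layers over audio and image inputs, and the weights would be learned from the recorded commentary; the concatenation step is the part this sketch is meant to show.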

updated: Tue Oct 12 2021 21:22:24 GMT+0000 (UTC)

published: Tue Oct 12 2021 21:22:24 GMT+0000 (UTC)

arXiv
