Transformer-based end-to-end classification of variable-length volumetric data

Marzieh Oghbaie; Teresa Araujo; Taha Emre; Ursula Schmidt-Erfurth; Hrvoje Bogunovic

The automatic classification of 3D medical data is memory-intensive. In addition, the number of slices commonly varies between samples. Naive solutions such as subsampling can solve these problems, but at the cost of potentially eliminating relevant diagnostic information. Transformers have shown promising performance for sequential data analysis. However, applying them to long sequences is demanding in terms of data, computation, and memory. In this paper, we propose an end-to-end Transformer-based framework that classifies volumetric data of variable length efficiently. In particular, by randomizing the slice-wise resolution of the input during training, we enhance the capacity of the learnable positional embedding assigned to each volume slice. Consequently, the positional information accumulated in each positional embedding can be generalized to the neighbouring slices, even for high-resolution volumes at test time. As a result, the model is more robust to variable volume length and amenable to different computational budgets. We evaluated the proposed approach on retinal OCT volume classification and achieved an average improvement of 21.96% in balanced accuracy on a 9-class diagnostic task, compared to state-of-the-art video transformers. Our findings show that varying the slice-wise resolution of the input during training results in a more informative volume representation than training with a fixed number of slices per volume. Our code is available at: https://github.com/marziehoghbaie/VLFAT.
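The idea of randomizing the slice-wise resolution during training can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). The function names `sample_slice_indices` and `position_ids`, the evenly spaced sampling scheme, and the proportional mapping onto a fixed-size positional embedding table are all assumptions made for illustration only.

```python
import random


def sample_slice_indices(volume_len, min_slices, max_slices, rng):
    """Pick a random number of evenly spaced slice indices from one volume.

    Hypothetical helper: the slice count is redrawn on every call, so the
    model would see a different slice-wise resolution at each training step.
    """
    hi = min(max_slices, volume_len)   # cannot take more slices than exist
    lo = min(min_slices, hi)
    n = rng.randint(lo, hi)            # inclusive on both ends
    if n == 1:
        return [volume_len // 2]       # single slice: take the middle one
    # Integer arithmetic keeps the first and last slice and avoids duplicates.
    return [(i * (volume_len - 1)) // (n - 1) for i in range(n)]


def position_ids(slice_indices, volume_len, table_size):
    """Map original slice indices onto a fixed-size positional embedding table.

    Assumed scheme: positions are scaled proportionally, so neighbouring
    slices of a long volume may share one learnable embedding.
    """
    if volume_len == 1:
        return [0 for _ in slice_indices]
    return [(idx * (table_size - 1)) // (volume_len - 1) for idx in slice_indices]


if __name__ == "__main__":
    rng = random.Random(0)
    idx = sample_slice_indices(volume_len=49, min_slices=5, max_slices=25, rng=rng)
    print(idx)
    print(position_ids(idx, volume_len=49, table_size=25))
```

Under these assumptions, each training step would draw a new slice count per volume, while the positional index of a slice depends only on its relative location within the volume. A volume with more slices than the table has entries then reuses each embedding for several adjacent slices.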

updated: Thu Jul 13 2023 10:19:04 GMT+0000 (UTC)

published: Thu Jul 13 2023 10:19:04 GMT+0000 (UTC)

arXiv
