Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos

Chinedu Innocent Nwoye; Tong Yu; Cristians Gonzalez; Barbara Seeliger; Pietro Mascagni; Didier Mutter; Jacques Marescaux; Nicolas Padoy

Among existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as combinations of instrument, verb, and target, is highly challenging to identify accurately. Triplet components can be difficult to recognize individually, and the task requires not only recognizing all three triplet components simultaneously, but also correctly establishing the data association between them. To achieve this, we introduce a new model, Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels. We first introduce a new form of spatial attention, called the Class Activation Guided Attention Mechanism (CAGAM), to capture individual action triplet components in a scene. This technique focuses on the recognition of verbs and targets using activations resulting from instruments. To solve the association problem, our RDV model adds a new form of semantic attention inspired by Transformer networks. Using multiple heads of cross- and self-attention, RDV is able to effectively capture relationships between instruments, verbs, and targets. We also introduce CholecT50, a dataset of 50 endoscopic videos in which every frame has been annotated with labels from 100 triplet classes. Our proposed RDV model improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.
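The abstract mentions cross- and self-attention inspired by Transformer networks. As a rough illustration only, the sketch below shows generic single-head scaled dot-product attention in plain Python. It is not the RDV architecture or the authors' code: the function names, the toy feature vectors, and the choice of verb features as queries over instrument features are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Generic scaled dot-product attention (single head).

    queries: list of d-dimensional vectors, one per query token
    keys, values: lists of equal length, one vector per source token
    Returns (outputs, weights): one output vector and one weight row per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical toy features (assumed for illustration, not from the paper).
instrument_feats = [[1.0, 0.0], [0.0, 1.0]]
verb_feats = [[1.0, 0.0]]

# Cross-attention: queries come from one component, keys/values from another.
cross_out, cross_w = attention(verb_feats, instrument_feats, instrument_feats)

# Self-attention: queries, keys, and values all come from the same component.
self_out, self_w = attention(instrument_feats, instrument_feats, instrument_feats)
```

In this toy setup the verb query attends more strongly to the instrument feature it is aligned with, which is the basic mechanism by which attention can associate components with one another. A multi-head variant would run several such attentions on separately projected features and combine the results.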

updated: Tue Sep 07 2021 17:52:52 GMT+0000 (UTC)

published: Tue Sep 07 2021 17:52:52 GMT+0000 (UTC)

arXiv
