Cross-Enhancement Transformer for Action Segmentation

Jiahui Wang; Zhenyou Wang; Shanna Zhuang; Hui Wang

Temporal convolutions have been the paradigm of choice in action segmentation, enlarging the long-term receptive field by stacking more convolution layers. However, the higher layers lose the local information necessary for frame recognition. To solve this problem, this paper proposes a novel encoder-decoder structure called the Cross-Enhancement Transformer. Our approach effectively learns temporal structure representations with an interactive self-attention mechanism. The convolutional feature maps of each encoder layer are concatenated with a set of decoder features produced via self-attention, so local and global information are used simultaneously across a sequence of frame actions. In addition, a new loss function that penalizes over-segmentation errors is proposed to enhance the training process. Experiments show that our framework achieves state-of-the-art performance on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities, and Breakfast.
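The idea of combining local convolutional features with global self-attention features can be illustrated with a small sketch. This is not the authors' architecture: the abstract gives no layer details, so the function names (`temporal_conv`, `self_attention`, `cross_enhance`), the fixed smoothing kernel, and the parameter-free attention (queries, keys, and values all equal to the input) are assumptions chosen only to show per-frame concatenation of a local and a global view.

```python
import math

def temporal_conv(x, kernel, dilation=1):
    """Zero-padded dilated 1D convolution over time (output length == input length).

    x: list of T frames, each a list of C floats.
    kernel: odd-length sequence of scalar taps shared by all channels
            (a depthwise simplification, assumed for illustration).
    """
    T, C = len(x), len(x[0])
    half = len(kernel) // 2
    out = []
    for t in range(T):
        frame = [0.0] * C
        for k, w in enumerate(kernel):
            s = t + (k - half) * dilation  # source frame index for this tap
            if 0 <= s < T:                 # taps outside the sequence see zeros
                for c in range(C):
                    frame[c] += w * x[s][c]
        out.append(frame)
    return out

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out

def cross_enhance(x, kernel=(0.25, 0.5, 0.25), dilation=1):
    """Concatenate local (convolution) and global (attention) features per frame."""
    local_feats = temporal_conv(x, kernel, dilation)
    global_feats = self_attention(x)
    return [l + g for l, g in zip(local_feats, global_feats)]

# Four frames with two feature channels each (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
fused = cross_enhance(frames)
print(len(fused), len(fused[0]))
```

Each output frame carries both views side by side: the first half of its channels depends only on a small temporal neighbourhood, while the second half is a weighted average over every frame in the sequence. A trained model would use learned convolution filters and learned query, key, and value projections in place of the fixed choices above.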

updated: Thu May 19 2022 10:06:30 GMT+0000 (UTC)

published: Thu May 19 2022 10:06:30 GMT+0000 (UTC)

arXiv
