Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks

Matthew Kowal; Mennatullah Siam; Md Amirul Islam; Neil D. B. Bruce; Richard P. Wildes; Konstantinos G. Derpanis

There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation relative to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks: action recognition, automatic video object segmentation (AVOS), and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static information, dynamic information, or a combination of the two. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross-connection layers compared with previous architectures.
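The abstract does not describe how StaticDropout is implemented. Purely as an illustration of what a "semantically guided" dropout might look like, the sketch below assumes that some earlier analysis has already flagged certain channels as static-biased, and drops only those channels at random while leaving the others untouched. The function name `static_dropout`, the `static_channels` set, and the `drop_prob` parameter are all hypothetical and are not taken from the paper.

```python
import random


def static_dropout(features, static_channels, drop_prob, rng=None):
    """Illustrative sketch only, not the authors' implementation.

    features:        list of channels, each a list of float activations
    static_channels: set of channel indices ASSUMED to be static-biased
    drop_prob:       probability of zeroing each flagged channel
    rng:             optional random.Random instance for reproducibility
    """
    if rng is None:
        rng = random.Random()
    output = []
    for index, channel in enumerate(features):
        # Only channels flagged as static-biased are candidates for dropping.
        if index in static_channels and rng.random() < drop_prob:
            output.append([0.0] * len(channel))
        else:
            # All other channels pass through unchanged (no rescaling here).
            output.append(list(channel))
    return output
```

Unlike standard dropout, this sketch neither selects channels uniformly nor rescales the surviving activations; both choices are simplifying assumptions made for readability.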

updated: Thu Nov 03 2022 13:17:53 GMT+0000 (UTC)

published: Thu Nov 03 2022 13:17:53 GMT+0000 (UTC)

arXiv
