TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

Michael S. Ryoo; AJ Piergiovanni; Anurag Arnab; Mostafa Dehghani; Anelia Angelova

In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This makes it possible to find a few important visual tokens efficiently and effectively, and enables modeling of pairwise attention between such tokens over a longer temporal horizon for videos, or over the spatial content of images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we achieve competitive results at a significantly reduced computational cost.
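
The abstract does not spell out how the tokens are computed, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes each token is a softmax-weighted average of patch features, with hypothetical scoring vectors (`weights`) standing in for learned parameters; the 14x14 patch grid, the feature width, and the names `learn_tokens` and `pairwise_pairs` are likewise illustrative assumptions. It also counts how many query-key pairs full pairwise attention would evaluate before and after reducing the input to 8 tokens.

```python
import math
import random


def learn_tokens(patches, weights):
    """Pool N patch features into S tokens with input-dependent weights.

    patches: list of N feature vectors (each of length C).
    weights: list of S scoring vectors (each of length C); hypothetical
             stand-ins for parameters that would be learned in training.
    """
    dim = len(patches[0])
    tokens = []
    for w in weights:
        # Score every patch for this token (dot product).
        scores = [sum(p_c * w_c for p_c, w_c in zip(p, w)) for p in patches]
        # Softmax over patches, so the pooling weights depend on the input.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Each token is the attention-weighted average of all patches.
        token = [sum(a * p[c] for a, p in zip(attn, patches)) for c in range(dim)]
        tokens.append(token)
    return tokens


def pairwise_pairs(num_tokens):
    """Number of query-key pairs that full self-attention evaluates."""
    return num_tokens * num_tokens


rng = random.Random(0)
# Assumed sizes: a 14x14 patch grid, 16 channels, 8 learned tokens.
num_patches, channels, num_learned = 196, 16, 8
patches = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(num_patches)]
weights = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(num_learned)]

tokens = learn_tokens(patches, weights)
print(len(tokens), len(tokens[0]))   # 8 16
print(pairwise_pairs(num_patches))   # 38416
print(pairwise_pairs(num_learned))   # 64
```

Under these assumptions, attention over the 8 pooled tokens touches 64 pairs instead of 38416, which is the kind of saving the abstract refers to when it mentions reduced compute.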

updated: Mon Jun 21 2021 17:55:59 GMT+0000 (UTC)

published: Mon Jun 21 2021 17:55:59 GMT+0000 (UTC)

arXiv
