Joint learning of images and videos with a single Vision Transformer

Shuki Shimizu; Toru Tamaki

In this study, we propose a method for jointly learning images and videos using a single model. Images and videos are usually handled by separate models. This paper proposes IV-ViT, a Vision Transformer that takes as input either a batch of images or a set of video frames, applying temporal aggregation by late fusion to the latter. Experimental results on two image datasets and two action recognition datasets are presented.
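The idea of late fusion described above can be illustrated with a small sketch: every video frame is passed through the same image encoder, and the per-frame outputs are combined only afterwards. This is not the authors' implementation. The names `forward_images`, `forward_videos`, `mean_pool`, and `toy_encoder` are hypothetical, the encoder is a toy stand-in for a Vision Transformer, and averaging is an assumed choice of aggregation, since the abstract does not say which function is used.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Frame = Sequence[float]  # hypothetical flat stand-in for an image tensor
Encoder = Callable[[Sequence[Frame]], List[Vector]]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average equal-length vectors element-wise (assumed fusion rule)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def forward_images(encoder: Encoder, images: Sequence[Frame]) -> List[Vector]:
    """Image path: the encoder yields one output vector per image."""
    return encoder(images)


def forward_videos(encoder: Encoder,
                   videos: Sequence[Sequence[Frame]]) -> List[Vector]:
    """Video path: encode each frame as an image, then fuse per video."""
    # Flatten all frames of all videos into a single batch so the same
    # encoder used for images processes them.
    flat = [frame for video in videos for frame in video]
    outputs = encoder(flat)

    # Late fusion: regroup the per-frame outputs by video and average them.
    fused, start = [], 0
    for video in videos:
        end = start + len(video)
        fused.append(mean_pool(outputs[start:end]))
        start = end
    return fused


def toy_encoder(batch: Sequence[Frame]) -> List[Vector]:
    """Toy stand-in for a ViT: maps each input to a 2-dimensional vector."""
    return [[sum(x), max(x)] for x in batch]
```

Because both paths call the same `encoder`, a single set of weights serves images and videos; only the regrouping and pooling step is specific to video input.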

updated: Mon Aug 21 2023 07:38:33 GMT+0000 (UTC)

published: Mon Aug 21 2023 07:38:33 GMT+0000 (UTC)

arXiv
