Global Prototype Encoding for Incremental Video Highlights Detection

Sen Pei; Shixiong Xu; Ye Yuan; Jiashi Feng; Xiaohui Shen; Xiaojie Jin

Video highlights detection has long been studied in computer vision; its goal is to extract user-appealing clips from previously unseen raw video inputs. However, in most cases, the mainstream methods in this line of research are built on the closed-world assumption: a fixed number of highlight categories must be properly defined in advance, and all training data must be available at the same time. As a result, these methods scale poorly with respect to both the number of highlight categories and the size of the dataset. To tackle this problem, we propose a video highlights detector that is able to learn incrementally, namely Global Prototype Encoding (GPE), which captures newly defined video highlights in the extended dataset via their corresponding prototypes. Alongside, we present a well-annotated and costly dataset termed ByteFood, comprising more than 5.1k gourmet videos belonging to four different domains: cooking, eating, food material, and presentation. To the best of our knowledge, this is the first time the incremental learning setting has been introduced to video highlights detection, which in turn relieves the burden of training on video inputs and improves the scalability of conventional neural networks with respect to both the size of the dataset and the number of domains. Moreover, the proposed GPE surpasses current incremental learning methods on ByteFood, reporting an improvement of at least 1.57% mAP. The code and dataset will be made available soon.

updated: Fri Dec 16 2022 06:30:21 GMT+0000 (UTC)

published: Mon Sep 12 2022 11:51:08 GMT+0000 (UTC)

arXiv
