REVECA -- Rich Encoder-decoder framework for Video Event CAptioner

Jaehyuk Heo; YongGi Jeong; Sunwoo Kim; Jaehee Kim; Pilsung Kang

We describe the approach we used in the Generic Boundary Event Captioning challenge at the Long-Form Video Understanding Workshop held at CVPR 2022. We designed a Rich Encoder-decoder framework for Video Event CAptioner (REVECA) that uses spatial and temporal information from the video to generate a caption for the corresponding event boundary. REVECA uses frame position embedding to incorporate information from before and after the event boundary. It also learns temporal information from features extracted with the temporal segment network and a temporal-based pairwise difference method. A semantic segmentation mask is used in the attentional pooling process to learn the subject of an event. Finally, LoRA is applied to fine-tune the image encoder and improve learning efficiency. REVECA achieved an average score of 50.97 on the Kinetics-GEBC test data, an improvement of 10.17 over the baseline method. Our code is available at https://github.com/TooTouch/REVECA.
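
To illustrate why LoRA can make fine-tuning cheaper, the following is a minimal, dependency-free sketch of the general low-rank adaptation idea: a frozen weight matrix W is combined with a trainable update (alpha / r) * B A of rank r. It is not taken from the REVECA code base; the class name LoRALinear, the helper matmul, and the parameters rank, alpha, and seed are assumptions made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen during fine-tuning
        # A gets small random values and B starts at zero, so the update
        # B @ A is exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + (alpha / rank) * B @ A."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to one input vector x."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the entries of A and B, the only parameters that would be trained."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    layer = LoRALinear(W, rank=1)
    print(layer.forward([1.0, 1.0, 1.0]))  # same as the frozen layer before training
    print(layer.num_trainable(), "trainable values vs", 2 * 3, "in W")
```

In this toy case the saving is negligible, but for a d x d weight the trainable count drops from d * d to 2 * d * r, which is the kind of efficiency gain the abstract attributes to LoRA when r is much smaller than d.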

updated: Sat Jun 18 2022 11:10:12 GMT+0000 (UTC)

published: Sat Jun 18 2022 11:10:12 GMT+0000 (UTC)

arXiv
