Learning Action Changes by Measuring Verb-Adverb Textual Relationships

Davide Moltisanti; Frank Keller; Hakan Bilen; Laura Sevilla-Lara

The goal of this work is to understand the way actions are performed in videos. That is, given a video, we aim to predict an adverb indicating a modification applied to the action (e.g. cut "finely"). We cast this problem as a regression task. We measure textual relationships between verbs and adverbs to generate a regression target representing the action change we aim to learn. We test our approach on a range of datasets and achieve state-of-the-art results on both adverb prediction and antonym classification. Furthermore, we outperform previous work when we lift two commonly assumed conditions: the availability of action labels during testing and the pairing of adverbs as antonyms. Existing datasets for adverb recognition are either noisy, which makes learning difficult, or contain actions whose appearance is not influenced by adverbs, which makes evaluation less reliable. To address this, we collect a new high-quality dataset: Adverbs in Recipes (AIR). We focus on instructional recipe videos, curating a set of actions that exhibit meaningful visual changes when performed differently. Videos in AIR are more tightly trimmed and were manually reviewed by multiple annotators to ensure high labelling quality. Results show that models learn better from AIR given its cleaner videos. At the same time, adverb prediction on AIR is challenging, demonstrating that there is considerable room for improvement.
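
To make the regression framing concrete, the following is a minimal sketch and not the authors' formulation. It assumes, purely for illustration, that the "textual relationship" is the difference between an adverb embedding and a verb embedding, and that an adverb is predicted by picking the candidate whose target lies closest to the regressed vector. The 3-dimensional embeddings and the names `EMBEDDINGS`, `regression_target`, and `predict_adverb` are hypothetical; a real system would use a pretrained text encoder and a video model producing the regressed vector.

```python
# Hypothetical toy text embeddings (assumption: real ones come from a text encoder).
EMBEDDINGS = {
    "cut": [1.0, 0.0, 0.0],
    "finely": [0.6, 0.8, 0.0],
    "coarsely": [0.6, -0.8, 0.0],
}


def regression_target(verb, adverb):
    """Assumed target: adverb embedding minus verb embedding (illustrative only)."""
    v, a = EMBEDDINGS[verb], EMBEDDINGS[adverb]
    return [ai - vi for ai, vi in zip(a, v)]


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def predict_adverb(verb, regressed, candidates):
    """Pick the candidate adverb whose target is closest to the regressed vector."""
    return min(
        candidates,
        key=lambda adv: squared_distance(regressed, regression_target(verb, adv)),
    )


if __name__ == "__main__":
    # Stand-in for a video model's output for a clip of someone cutting.
    regressed = [-0.3, 0.6, 0.1]
    print(predict_adverb("cut", regressed, ["finely", "coarsely"]))
```

In this sketch the candidate adverbs are compared directly, so no antonym pairing is assumed. The verb is still passed in explicitly, so the sketch does not illustrate the setting without action labels at test time.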

updated: Mon Mar 27 2023 10:53:38 GMT+0000 (UTC)

published: Mon Mar 27 2023 10:53:38 GMT+0000 (UTC)

arXiv
