What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Sunny Panchal; Apratim Bhattacharyya; Guillaume Berger; Antoine Mercier; Cornelius Bohm; Florian Dietrichkeit; Reza Pourreza; Xuanlin Li; Pulkit Madan; Mingu Lee; Mark Todorovich; Ingo Bax; Roland Memisevic

Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real time, remain an open challenge. In this work, we present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching, a task that intrinsically requires monitoring live user activity and providing immediate feedback. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real time. Our experiments reveal the limitations of existing state-of-the-art vision-language models for such asynchronous situated interactions. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.

updated: Mon Dec 23 2024 17:06:20 GMT+0000 (UTC)

published: Thu Jul 11 2024 00:10:45 GMT+0000 (UTC)

arXiv
