Estimating Presentation Competence using Multimodal Nonverbal Behavioral Cues

Ömer Sümer; Cigdem Beyan; Fabian Ruth; Olaf Kramer; Ulrich Trautwein; Enkelejda Kasneci

Public speaking and presentation competence play an essential role in many areas of social interaction in our educational, professional, and everyday life. Since our intention during a speech can differ from what is actually understood by the audience, the ability to appropriately convey our message requires a complex set of skills. Presentation competence is cultivated in the early school years and continuously developed over time. One approach that can promote efficient development of presentation competence is the automated analysis of human behavior during a speech, based on visual and audio features and machine learning. Furthermore, this analysis can be used to suggest improvements and to support the development of skills related to presentation competence. In this work, we investigate the contribution of different nonverbal behavioral cues, namely facial, body pose-based, and audio-related features, to estimating presentation competence. The analyses were performed on videos of 251 students, and the automated assessment is based on manual ratings according to the Tübingen Instrument for Presentation Competence (TIP). Our classification results reached the best performance with early fusion in the same-dataset evaluation (accuracy of 71.25%) and with late fusion of speech, face, and body pose features in the cross-dataset evaluation (accuracy of 78.11%). Similarly, the regression results were best with fusion strategies.
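
The abstract distinguishes early fusion from late fusion without describing how either was implemented. As a rough illustration of the general idea only, the sketch below treats early fusion as concatenating per-modality feature vectors before a single model, and late fusion as averaging per-modality probabilities. The function names, feature values, the averaging rule, and the 0.5 threshold are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence


def early_fusion(features: Dict[str, Sequence[float]],
                 order: Sequence[str]) -> List[float]:
    """Feature-level fusion: concatenate per-modality vectors in a fixed order.

    The fused vector would then be passed to one classifier or regressor.
    """
    fused: List[float] = []
    for name in order:
        fused.extend(features[name])
    return fused


def late_fusion(probabilities: Dict[str, float],
                weights: Optional[Dict[str, float]] = None) -> float:
    """Decision-level fusion: weighted average of per-modality probabilities.

    Averaging is an assumed combination rule, chosen here for simplicity.
    """
    if weights is None:
        weights = {name: 1.0 for name in probabilities}
    total = sum(weights[name] for name in probabilities)
    return sum(weights[name] * p for name, p in probabilities.items()) / total


# Hypothetical feature vectors for one presentation video.
features = {
    "speech": [0.2, 0.7],
    "face": [0.1],
    "body_pose": [0.5, 0.4, 0.9],
}
fused_vector = early_fusion(features, order=["speech", "face", "body_pose"])

# Hypothetical per-modality probabilities of the "high competence" class.
scores = {"speech": 0.8, "face": 0.6, "body_pose": 0.4}
fused_score = late_fusion(scores)
predicted_label = int(fused_score >= 0.5)  # assumed decision threshold
```

In this sketch, early fusion lets one model learn interactions between modalities, while late fusion keeps one model per modality and combines only their outputs. Which of the two works better is an empirical question; the abstract reports that it differed between the same-dataset and cross-dataset evaluations.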

updated: Thu May 06 2021 13:09:41 GMT+0000 (UTC)

published: Thu May 06 2021 13:09:41 GMT+0000 (UTC)

arXiv
