Fully automatic scoring of handwritten descriptive answers in Japanese language tests

Hung Tuan Nguyen; Cuong Tuan Nguyen; Haruki Oka; Tsunenori Ishioka; Masaki Nakagawa

This paper presents an experiment in automatically scoring handwritten descriptive answers from the trial tests for the new Japanese university entrance examination, which were administered to about 120,000 examinees in 2017 and 2018. There are about 400,000 answers containing more than 20 million characters. Although all answers have been scored by human examiners, the handwritten characters are not labeled. We present our attempt to adapt deep neural network-based handwriting recognizers, trained on a labeled handwriting dataset, to this unlabeled answer set. Our proposed method combines different training strategies, ensembles multiple recognizers, and uses a language model built from a large general corpus to avoid overfitting to specific data. In our experiment, the proposed method achieves a character accuracy of over 97% using about 2,000 verified labeled answers, which account for less than 0.5% of the dataset. The recognized answers are then fed into a pre-trained automatic scoring system based on the BERT model, without correcting misrecognized characters or providing rubric annotations. The automatic scoring system achieves a Quadratic Weighted Kappa (QWK) of 0.84 to 0.98. Because the QWK exceeds 0.8, it indicates acceptable agreement between the scores of the automatic scoring system and those of the human examiners. These results are promising for further research on end-to-end automatic scoring of descriptive answers.
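For readers unfamiliar with the agreement metric, the following is a minimal sketch of how Quadratic Weighted Kappa can be computed from two lists of integer scores. It is not the authors' code: the function name `quadratic_weighted_kappa`, its parameters, and the sample scores are assumptions made for illustration, and the formula is the standard textbook definition of QWK.

```python
def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Standard QWK between two lists of integer scores (illustrative sketch)."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1  # number of distinct score levels
    if n < 2:
        raise ValueError("at least two score levels are required")
    total = len(human)

    # Observed confusion matrix: rows are human scores, columns machine scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms of each rater's scores.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: larger score gaps are penalised more heavily.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Count expected by chance if the two raters were independent.
            expected = hist_h[i] * hist_m[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters gave one identical score to every answer.
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical scores for six answers on a 0-3 scale.
human_scores = [0, 1, 2, 3, 2, 1]
machine_scores = [0, 1, 2, 3, 1, 1]
print(quadratic_weighted_kappa(human_scores, machine_scores, 0, 3))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. The quadratic weight is what makes a two-point scoring gap cost four times as much as a one-point gap.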

updated: Mon Jan 10 2022 08:47:52 GMT+0000 (UTC)

published: Mon Jan 10 2022 08:47:52 GMT+0000 (UTC)

arXiv
