Margot Mieskes
Hochschule Darmstadt, Germany
Ulrike Padó
Hochschule für Technik Stuttgart, Germany
Published in: Proceedings of the 8th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2019), September 30, Turku, Finland
Linköping Electronic Conference Proceedings 164:8, p. 79-85
NEALT Proceedings Series 39:8, p. 79-85
Published: 2019-09-30
ISBN: 978-91-7929-998-9
ISSN: 1650-3686 (print), 1650-3740 (online)
Summarization Evaluation and Short-Answer Grading share the challenge of
automatically evaluating content quality.
Therefore, we explore the use of ROUGE,
a well-known Summarization Evaluation
method, for Short-Answer Grading. We
find a reliable ROUGE parametrization that
is robust across corpora and languages and
produces scores that are significantly correlated with human short-answer grades.
In a by-corpus evaluation, ROUGE adds no information beyond the NLP-based machine learning features already used for Short-Answer Grading.
However, on a question-by-question basis,
we find that the ROUGE Recall score may
outperform standard NLP features. We
therefore suggest using ROUGE within
a framework for per-question feature selection or as a reliable and reproducible
baseline for Short-Answer Grading (SAG).
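
To illustrate the kind of score discussed here, the following is a minimal sketch of ROUGE-N recall and precision between a student answer and a reference answer. It is not the parametrization reported in the paper, which the abstract does not specify: the function names `ngrams` and `rouge_n`, the lowercased whitespace tokenization, the default of unigrams (n=1), and the example sentences are all assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision) of clipped n-gram overlap.

    Assumption: plain lowercased whitespace tokenization, no stemming
    and no stop-word removal.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    return recall, precision


# Invented example: a reference answer and a shorter student answer.
reference_answer = "the cell membrane controls what enters the cell"
student_answer = "the membrane controls what enters"
recall, precision = rouge_n(student_answer, reference_answer)
print(f"ROUGE-1 recall={recall:.3f} precision={precision:.3f}")
```

In a grading setting, the recall value would be computed per student answer against the reference answer for a question and then compared with the human-assigned grades, for example by correlation within each question.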