Published: 2013-05-17
ISBN: 978-91-7519-589-6
ISSN: 1650-3686 (print), 1650-3740 (online)
This paper presents and evaluates a novel and flexible chunking method that uses Constraint Grammar (CG) rules to introduce chunk edges in corpus annotation. Our method exploits preexisting (non-constituent) morphosyntactic annotation such as part-of-speech or function tags, but can also be made to work on raw text when integrated with other CG modules. The first version of the chunker was developed for German CG-annotated interview data, with a parallel English version derived from the German one, indicating a high degree of language independence of the rules in the presence of generalized syntactic-functional tags (e.g. subject, object, modifier). Two different approaches are discussed: one for minimal, flat chunking, the other for deep, nested chunking. The system shows reasonable performance and robustness for both, achieving F-scores of 89.1 and 97.4 for nested and minimal chunking, respectively. XML markup is supported, and with a full set of rules, the tool can be used to convert CG annotation into complete constituent trees in VISL or TIGER format.
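The idea of deriving flat chunk edges from preexisting token-level tags can be illustrated with a minimal Python sketch. This is not the paper's CG rule set: the tag inventory (`DET`, `ADJ`, `N`, `V`), the function names, and the "maximal run of NP-internal tags" heuristic are all assumptions chosen for illustration only.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, POS tag)

# Hypothetical tag inventory: tags assumed to occur inside a noun chunk.
NP_INSIDE = {"DET", "ADJ", "N"}


def flat_np_chunks(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, for maximal runs of NP-internal tags."""
    chunks = []
    start = None
    for i, (_, tag) in enumerate(tokens):
        if tag in NP_INSIDE:
            if start is None:
                start = i  # left chunk edge
        elif start is not None:
            chunks.append((start, i))  # right chunk edge before token i
            start = None
    if start is not None:
        chunks.append((start, len(tokens)))  # chunk runs to end of sentence
    return chunks


def bracket(tokens: List[Token], chunks: List[Tuple[int, int]]) -> str:
    """Render the sentence with chunk edges marked as [NP ...] brackets."""
    opens = {s for s, _ in chunks}
    closes = {e - 1 for _, e in chunks}
    out = []
    for i, (word, _) in enumerate(tokens):
        piece = word
        if i in opens:
            piece = "[NP " + piece
        if i in closes:
            piece = piece + "]"
        out.append(piece)
    return " ".join(out)


sentence = [("the", "DET"), ("old", "ADJ"), ("dog", "N"),
            ("saw", "V"), ("a", "DET"), ("cat", "N")]
spans = flat_np_chunks(sentence)
print(bracket(sentence, spans))  # [NP the old dog] saw [NP a cat]
```

A sketch like this covers only the minimal, flat case; nested chunking would need edges that can open and close at several depths, which a single left-to-right pass over POS tags does not provide.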