Marcus Uneson
Centre for Languages and Literature, Lund University, Sweden
Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16
Linköping Electronic Conference Proceedings 85:36, p. 399-409
NEALT Proceedings Series 16:36, p. 399-409
Published: 2013-05-17
ISBN: 978-91-7519-589-6
ISSN: 1650-3686 (print), 1650-3740 (online)
The RWAAI (Repository and Workspace for Austroasiatic Intangible heritage) project aims at building a digital archive out of existing legacy data from the Austroasiatic language family. One aspect of the project is the preservation of analogue legacy data. In this context, we have at our hands a large number of mostly phonemic transcriptions of narrative monologues, often with accompanying sound recordings, in the unwritten Kammu language of northern Laos. Some of the transcriptions, however, lack tone marks, which for a tonal language such as Kammu makes them substantially less useful. The problem of restoring tones can be recast as one of word sense disambiguation or, more generally, lexical ambiguity resolution. We attack it with decision lists, along the lines of Yarowsky (1994), using the tone-marked part of the corpus (120k words) as training data. The performance ceiling of this corpus is uncertain: the stories were all annotated, primarily for human rather than machine consumption, by a single person over almost 40 years, with slowly emerging idiosyncratic conventions. Thus, both inter-annotator and intra-annotator agreement figures are unknown. Nevertheless, with the data from this one annotator as a gold standard, we improve from an already high baseline accuracy of 95.7% to 97.2% (by 10-fold cross-validation).
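As a rough illustration of the decision-list idea mentioned in the abstract, the following minimal Python sketch restores tone labels on toy data. It is not the paper's implementation: the function names, the feature set (only the immediately neighbouring tokens), the smoothing constant `alpha`, and the tokens themselves are all assumptions made for this example. The tokens are invented placeholders, not Kammu words, and the digits stand in for tone marks.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, i, window=1):
    """Return (offset, neighbour) pairs around position i (hypothetical feature set)."""
    feats = []
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats.append((off, tokens[j] if 0 <= j < len(tokens) else "<pad>"))
    return feats


def train_decision_lists(sentences, alpha=0.1):
    """sentences: lists of (toneless, toned) pairs. alpha is an assumed smoothing constant."""
    prior = defaultdict(Counter)      # toneless form -> counts of toned forms
    evidence = defaultdict(Counter)   # (toneless form, feature) -> counts of toned forms
    for sent in sentences:
        plain = [p for p, _ in sent]
        for i, (p, toned) in enumerate(sent):
            prior[p][toned] += 1
            for feat in context_features(plain, i):
                evidence[(p, feat)][toned] += 1

    # Fallback: the most frequent toned form of each toneless form.
    default = {p: c.most_common(1)[0][0] for p, c in prior.items()}

    rules = defaultdict(list)
    for (p, feat), c in evidence.items():
        if len(prior[p]) < 2:
            continue  # unambiguous form, no rules needed
        best, n_best = c.most_common(1)[0]
        n_rest = sum(c.values()) - n_best
        # Smoothed log-likelihood ratio of the best reading against all others.
        score = math.log((n_best + alpha) / (n_rest + alpha))
        rules[p].append((score, feat, best))
    for p in rules:
        rules[p].sort(key=lambda r: r[0], reverse=True)  # strongest evidence first
    return default, rules


def restore(tokens, default, rules):
    """Apply, for each token, the first (highest-scoring) rule whose feature is present."""
    out = []
    for i, p in enumerate(tokens):
        present = set(context_features(tokens, i))
        answer = default.get(p, p)  # unseen tokens are passed through unchanged
        for _score, feat, toned in rules.get(p, []):
            if feat in present:
                answer = toned
                break
        out.append(answer)
    return out


# Toy training data: "ma" is ambiguous between "ma1" and "ma2".
toy = [
    [("to", "to"), ("ma", "ma1"), ("ki", "ki")],
    [("so", "so"), ("ma", "ma2"), ("ki", "ki")],
    [("to", "to"), ("ma", "ma1"), ("lu", "lu")],
    [("so", "so"), ("ma", "ma2"), ("lu", "lu")],
    [("to", "to"), ("ma", "ma1"), ("ki", "ki")],
]
default, rules = train_decision_lists(toy)
print(restore(["so", "ma", "ki"], default, rules))  # ['so', 'ma2', 'ki']
```

In this sketch only the single strongest matching piece of evidence decides each token, which is the defining property of a decision list; when no rule matches, the most frequent toned form is used as a fallback.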
Word sense disambiguation; Kammu; decision lists; lexical ambiguity resolution; tone restoration; legacy data
Agirre, E. and Edmonds, P. (2006). Word sense disambiguation: Algorithms and applications, volume 33. Springer Science+Business Media.
Jurafsky, D. and Martin, J. H. (2008). An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, 2nd edition.
Lindell, K., Swahn, J.-Ö., and Tayanin, D. (1977). A Kammu story-listener’s tales. Number 33 in Scandinavian Institute of Asian Studies Monograph Series. Curzon Press, London.
Lindell, K., Swahn, J.-Ö., and Tayanin, D. (1980). Folk Tales from Kammu II: A Story-teller’s Tales, volume 40 of Scandinavian Institute of Asian Studies Monograph Series. Curzon Press, London.
Lindell, K., Swahn, J.-Ö., and Tayanin, D. (1984). Folk Tales from Kammu III: Pearls of Kammu Literature. Number 51 in Scandinavian Institute of Asian Studies Monograph Series. Curzon Press, London.
Lindell, K., Swahn, J.-Ö., and Tayanin, D. (1989). Folk Tales from Kammu IV: A Master-Teller’s Tales. Number 56 in Scandinavian Institute of Asian Studies Monograph Series. Curzon Press, London.
Lindell, K., Swahn, J.-Ö., and Tayanin, D. (1995). Folk Tales from Kammu V: A Young Story-Teller’s Tales. Number 66 in Nordic Institute of Asian Studies Monograph Series. Curzon Press, London.
Lindell, K., Swahn, J.-Ö., and Tayanin, D. (1998). Folk Tales from Kammu VI: A Teller’s Last Tales, volume 77 of Nordic Institute of Asian Studies Monograph Series. Curzon Press, London.
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. (1990). Five papers on WordNet. International Journal of Lexicography, 3(4):235–244.
Navigli, R. (2009). Word sense disambiguation: a survey. ACM Computing Surveys, 41(2):1–69.
Rivest, R. (1987). Learning decision lists. Machine Learning, 2(3):229–246.
Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
Svantesson, J.-O. (1983). Kammu Phonology and Morphology. PhD thesis, Lund University. Travaux de l’Institut de linguistique de Lund, 18.
Svantesson, J.-O. (1989). Tonogenetic mechanisms in Northern Mon-Khmer. Phonetica, 46(1-3):60–79.
Svantesson, J.-O., Tayanin, D., Lindell, K., and Lundström, H. (in press). Kammu Yùan-English dictionary.
Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 88–95. Association for Computational Linguistics.
Yarowsky, D. (1996). Homograph disambiguation in text-to-speech synthesis. pages 157–172.