Conference article

Morphological analysis with limited resources: Latvian example

Pēteris Paikens
University of Latvia, Institute of Mathematics and Computer Science, Latvia

Laura Rituma
University of Latvia, Institute of Mathematics and Computer Science, Latvia

Lauma Pretkalniņa
University of Latvia, Institute of Mathematics and Computer Science, Latvia

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:24, pp. 267-277

NEALT Proceedings Series 16:24, pp. 267-277

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

We describe an approach to morphological analysis that combines a rule-based, word-level morphological analyzer with statistical tagging, and detail its application to the Latvian language. Latvian is a highly inflective Indo-European language with a rich morphology.

The tools described here include an implementation of Latvian inflectional paradigms, a morphological analysis tool with a guessing module for out-of-vocabulary words, and a statistical POS/morphology tagger for disambiguation of multiple analysis possibilities. With a training set of only ~40 000 words, the accuracy currently achieved is 97.9% for part-of-speech tagging and 93.6% for the full morphological feature tag set, which is better than any previously publicly available tagger for Latvian.

We also describe the construction and methodology of the necessary linguistic resources, namely a morphological dictionary and an annotated morphological corpus, and evaluate the effect of resource size on analysis accuracy, showing what results can be achieved with limited linguistic resources.

Keywords

Morphology; inflective language; POS tagging; Latvian language; morphological corpus
