Published: 2013-05-17
ISBN: 978-91-7519-589-6
ISSN: 1650-3686 (print), 1650-3740 (online)
In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding language-specific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of-the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5% compared to the previously best tagger for Icelandic, which consists of linguistic rules and a Hidden Markov Model.
Averaged Perceptron; Part-of-Speech Tagging; Morphological Database; Linguistic Features; Word Embeddings
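The error reduction quoted in the abstract is relative to the errors of the previous best tagger, not to the 92.82% configuration. The sketch below, a minimal illustration rather than anything taken from the paper, shows the usual relative-error-reduction arithmetic. The function names are hypothetical, and the previous tagger's accuracy is not stated in the abstract, so the value computed for it here is only what the reported figures imply.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors removed by the new system.

    Both accuracies are given in percent.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc: float, reduction: float) -> float:
    """Baseline accuracy (percent) implied by a new accuracy and a
    relative error reduction given as a fraction (e.g. 0.055)."""
    new_err = 100.0 - new_acc
    return 100.0 - new_err / (1.0 - reduction)


# Gain from adding the morphological database and word embeddings,
# using the two accuracies stated in the abstract.
internal_gain = relative_error_reduction(92.82, 93.84)

# Accuracy of the previous best tagger as implied by the reported 5.5%
# reduction. This is derived, NOT a figure reported in the abstract.
implied_prev_best = implied_baseline_accuracy(93.84, 0.055)

print(f"92.82% -> 93.84%: {internal_gain:.1%} relative error reduction")
print(f"implied previous best accuracy: {implied_prev_best:.2f}%")
```

Under these assumptions, moving from 92.82% to 93.84% removes about 14.2% of the remaining errors, while the reported 5.5% reduction implies that the previous best tagger reached roughly 93.48%.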