Alexandre Rademaker
IBM Research and EMAp/FGV, Brazil
Fabricio Chalub
IBM Research, Brazil
Livy Real
University of São Paulo, Brazil
Cláudia Freitas
PUC-Rio, Brazil
Eckhard Bick
University of Southern Denmark, Denmark
Valeria de Paiva
Nuance Communications, USA
Published in: Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), September 18-20, 2017, Università di Pisa, Italy
Linköping Electronic Conference Proceedings 139:23, p. 197-206
Published: 2017-09-13
ISBN: 978-91-7685-467-9
ISSN: 1650-3686 (print), 1650-3740 (online)
This paper describes the creation of a Portuguese corpus following the guidelines of the Universal Dependencies Framework. Instead of starting from scratch, we invested in a conversion process from the existing Portuguese corpus, called Bosque. The conversion was done by applying a context-sensitive set of Constraint Grammar rules to its original deep linguistic analysis, which was carried out by the parser PALAVRAS, with some additional manual corrections. Universal Dependencies offer the promise of greater parallelism between languages, a plus for researchers in many areas. We report the challenges of dealing with Portuguese, a Romance language, hoping that our experience will help others.
Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. Floresta sintá(c)tica: a treebank for Portuguese. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), pages 1698–1703, Las Palmas, Spain.
Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. The Longman grammar of spoken and written English. Longman, London.
Eckhard Bick and Tino Didriksen. 2015. CG-3 – beyond classical constraint grammar. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania, pages 31–39. Linköping University Electronic Press.
Eckhard Bick. 2014. PALAVRAS – a constraint grammar-based parsing system for Portuguese. In Tony Berber Sardinha and Thelma de Lourdes São Bento Ferreira, editors, Working with Portuguese Corpora, pages 279–302. Bloomsbury Academic.
Eckhard Bick. 2016. Constraint grammar-based conversion of dependency treebanks. In Proceedings of the 13th International Conference on Natural Language Processing (ICON), pages 109–114, Varanasi, India, Dec. NLP Association of India (NLPAI).
Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, CoNLL-X ’06, pages 149–164, Stroudsburg, PA, USA. Association for Computational Linguistics.
Joakim Nivre et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
Cláudia Freitas, Paulo Rocha, and Eckhard Bick. 2008. Floresta sintá(c)tica: bigger, thicker and easier. In International Conference on Computational Processing of the Portuguese Language, pages 216–219. Springer.
Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Proceedings of the 13th conference on Computational linguistics-Volume 3, pages 168–173. Association for Computational Linguistics.
Juhani Luotolahti, Jenna Kanerva, Sampo Pyysalo, and Filip Ginter. 2015. SETS: Scalable and efficient tree search in dependency graphs. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 51–55. Association for Computational Linguistics.
R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Täckström, D. Bedini, N. Castelló, and J. Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the ACL 2013. Association for Computational Linguistics, August.
Luiza Frizzo Truggo. 2016. Classes de palavras – da Grécia antiga ao Google: um estudo motivado pela conversão de tagsets. Master’s thesis, PUC-Rio.