Samuel Läubli
Institute of Computational Linguistics, University of Zurich, Zürich, Switzerland
Mark Fishel
Institute of Computational Linguistics, University of Zurich, Zürich, Switzerland
Martin Volk
Institute of Computational Linguistics, University of Zurich, Zürich, Switzerland
Manuela Weibel
SemioticTransfer AG, Baden, Switzerland
In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16
Linköping Electronic Conference Proceedings 85:30, p. 331-341
NEALT Proceedings Series 16:30, p. 331-341
Published: 2013-05-17
ISBN: 978-91-7519-589-6
ISSN: 1650-3686 (print), 1650-3740 (online)
Since the emergence of translation memory software, translation companies and freelance translators have been accumulating translated text for various languages and domains. This data can be used to train domain-specific machine translation systems for corporate or even personal use. But while the resulting systems usually perform well in translating domain-specific language, their out-of-domain vocabulary coverage is often insufficient due to the limited size of the translation memories. In this paper, we demonstrate that small in-domain translation memories can be successfully complemented with freely available general-domain parallel corpora such that (a) the number of out-of-vocabulary (OOV) words is reduced while (b) the in-domain terminology is preserved. In our experiments, a German–French and a German–Italian statistical machine translation system geared to marketing texts of the automobile industry have been significantly improved using Europarl and OpenSubtitles data, both in terms of automatic evaluation metrics and human judgement.
Machine Translation; Translation Memory; Domain Adaptation; Perplexity Minimization
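The OOV reduction named in the abstract can be illustrated with a minimal sketch. It is not the authors' setup: the function name `oov_rate`, the whitespace tokenization, and the toy sentences are all assumptions made for illustration, and the plain concatenation of corpora stands in for whatever combination method the paper actually uses (the keywords point to perplexity minimization, which is not implemented here).

```python
def oov_rate(training_sentences, test_sentences):
    """Return the fraction of test tokens that never occur in the training data.

    Hypothetical helper: tokens are lowercased and split on whitespace,
    which is far simpler than a real SMT preprocessing pipeline.
    """
    vocab = {tok for sent in training_sentences for tok in sent.lower().split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.lower().split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data, invented for illustration only (not taken from the paper).
in_domain = [
    "der motor hat vier zylinder",
    "das fahrzeug hat einen starken motor",
]
general = [
    "das parlament hat heute abgestimmt",
    "wir fahren heute nach hause",
]
test = ["das fahrzeug fahren wir heute"]

# OOV rate with the small in-domain data alone, then with general data added.
rate_in_domain = oov_rate(in_domain, test)
rate_combined = oov_rate(in_domain + general, test)

print(f"in-domain only: {rate_in_domain:.2f}")
print(f"combined:       {rate_combined:.2f}")
```

The sketch only measures vocabulary coverage, i.e. point (a) of the abstract. It says nothing about point (b): whether in-domain terminology survives the combination depends on how the corpora are weighted, which a simple concatenation does not control.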