Data Selection For Statistical Machine Translation

Author: Amittai Axelrod
Languages : un
Pages : 124

Machine translation, the computerized translation of one human language into another, could be used to communicate across the thousands of languages used around the world. Statistical machine translation (SMT) is an approach to building these translation engines without much human intervention, and large-scale implementations by Google, Microsoft, and Facebook are used by millions daily. The quality of an SMT system depends on the example translations used to train its models. Data can come from a variety of sources, many of which are not optimal for a specific task. The goal is to find the right data with which to train a model for a particular task. This work determines the most relevant subsets of these large datasets with respect to a translation task, enabling the construction of task-specific translation systems that are more accurate and easier to train than the large-scale models.

Three methods are explored for identifying task-relevant translation training data from a general data pool. The first uses only a language model to score the training data according to lexical probabilities, improving on prior results by using a bilingual score that accounts for differences between the target domain and the general data. The second is a topic-based relevance score that is novel for SMT, using topic models to project texts into a latent semantic space. These semantic vectors are then used to compute the similarity of sentences in the general pool to the target task. This work finds that what the automatic topic models capture for some tasks is actually the style of the language, rather than task-specific content words. This motivates the third approach, a novel style-based data selection method. Hybrid word and part-of-speech (POS) representations of the two corpora are constructed by retaining the discriminative words and using POS tags as a proxy for the stylistic content of the infrequent words. Language models based on these representations can be used to quantify the underlying stylistic relevance between two texts. Experiments show that style-based data selection can outperform the current state-of-the-art method for task-specific data selection, in terms of SMT system performance and vocabulary coverage. Taken together, the experimental results indicate that it is important to characterize corpus differences when selecting data for statistical machine translation.
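To make the hybrid word and POS representation in the description above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the author's implementation: the description does not say how "discriminative" words are chosen, so a plain frequency threshold (the hypothetical `min_count` parameter) stands in for that criterion, and the input is assumed to be already POS-tagged. The function name `hybrid_representation` is likewise invented for this example.

```python
from collections import Counter


def hybrid_representation(tagged_sentences, min_count=2):
    """Keep frequent words as-is and replace infrequent words with their POS tags.

    tagged_sentences: list of sentences, each a list of (word, pos_tag) pairs.
    min_count: hypothetical frequency threshold (an assumption; the source
               does not state how retained words are selected).
    """
    # Count lowercased word frequencies over the whole corpus.
    counts = Counter(
        word.lower() for sentence in tagged_sentences for word, _ in sentence
    )
    # Frequent words are retained; rare words are abstracted to their POS tag.
    return [
        [word.lower() if counts[word.lower()] >= min_count else pos
         for word, pos in sentence]
        for sentence in tagged_sentences
    ]


# Toy pre-tagged corpus (Penn Treebank-style tags, chosen for illustration).
corpus = [
    [("the", "DT"), ("patient", "NN"), ("was", "VBD"), ("discharged", "VBN")],
    [("the", "DT"), ("committee", "NN"), ("was", "VBD"), ("adjourned", "VBN")],
]
hybrid = hybrid_representation(corpus)
```

In this toy corpus the two sentences differ in content words but share a structure, so both reduce to the same hybrid sequence of retained words and POS tags. A language model trained on sequences like these would then reflect stylistic patterns rather than topic vocabulary, which is the intuition the description gives for style-based selection.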



Data Selection Using Topic Adaptation For Statistical Machine Translation

Author: Hitokazu Matsushita
Languages : un
Pages : 81

Statistical machine translation (SMT) requires large quantities of bitexts (i.e., bilingual parallel corpora) as training data to yield good-quality translations. While obtaining a large amount of training data is critical, the similarity between training and test data also has a significant impact on SMT performance. Many SMT studies define data similarity in terms of domain overlap, and domains are treated as synonymous with data sources. Consequently, the SMT community has focused on domain adaptation techniques that augment small (in-domain) datasets with large datasets from other sources (hence out-of-domain, per the definition). However, many training datasets consist of topically diverse data, and not all data contained in a single dataset are useful for translations of a specific target task.



Language Science And Language Technology In Africa

Author: Steve Ndinga-Koumba-Binza
Publisher: AFRICAN SUN MeDIA
ISBN: 9781920338794
Category : Language Arts & Disciplines
Languages : en
Pages : 362

This book provides a broad overview of current work on South African languages, language resources and language technologies. While it provides a fairly comprehensive overview, it also ties together the most recent knowledge state here, and is therefore truly innovative ... The book is therefore informed by current international trends in the respective fields of science, and feeds back into them ... There is absolutely no doubt that the book has an academic peer audience and is directed at specialists in the field. - Prof. Axel Fleisch, University of Helsinki, Finland



Linguistically Motivated Statistical Machine Translation

Author: Deyi Xiong
Publisher: Springer
ISBN: 9812873562
Category : Language Arts & Disciplines
Languages : un
Pages : 152

This book provides a wide variety of algorithms and models for integrating linguistic knowledge into statistical machine translation (SMT). It helps advance conventional SMT to linguistically motivated SMT by enhancing three essential components: the translation, reordering and bracketing models. It also promotes in-depth study of the impact of linguistic knowledge on machine translation. Finally, it provides a systematic introduction to Bracketing Transduction Grammar (BTG) based SMT, one of the state-of-the-art SMT formalisms, as well as a case study of linguistically motivated SMT on a BTG-based platform.



Machine Translation

Author: Muyun Yang
Publisher: Springer
ISBN: 9811036357
Category : Computers
Languages : en
Pages : 125

This book constitutes the refereed proceedings of the 12th China Workshop on Machine Translation, CWMT 2016, held in Urumqi, China, in August 2016. The 10 English papers presented in this volume were carefully reviewed and selected from 76 submissions. They deal with statistical machine translation, hybrid machine translation, machine translation evaluation, post editing, alignment, and inducing bilingual knowledge from corpora.



Machine Learning In Translation Corpora Processing

Author: Krzysztof Wolk
Publisher: CRC Press
ISBN: 0429588836
Category : Computers
Languages : en
Pages : 264

This book reviews ways to improve statistical machine speech translation between Polish and English. Research has been conducted mostly on dictionary-based, rule-based, and syntax-based machine translation techniques. Most popular methodologies and tools are not well suited to the Polish language and therefore require adaptation, and language resources are lacking in both parallel and monolingual data. The main objective of this volume is to develop an automatic and robust Polish-to-English translation system that meets specific translation requirements, and to develop bilingual textual resources by mining comparable corpora.



Statistical Machine Translation

Author: Philipp Koehn
Publisher: Cambridge University Press
ISBN: 0521874157
Category : Computers
Languages : en
Pages : 433

The dream of automatic language translation is now closer thanks to recent advances in the techniques that underpin statistical machine translation. This class-tested textbook from an active researcher in the field provides a clear and careful introduction to the latest methods and explains how to build machine translation systems for any two languages. It introduces the subject's building blocks from linguistics and probability, then covers the major models for machine translation: word-based, phrase-based, and tree-based, as well as machine translation evaluation, language modeling, discriminative training and advanced methods to integrate linguistic annotation. The book also reports the latest research, presents the major outstanding challenges, and enables novices as well as experienced researchers to make novel contributions to this exciting area. It is ideal for students at undergraduate and graduate level, or for anyone interested in the latest developments in machine translation.



Latent Domain Models For Statistical Machine Translation

Author: Hoàng Cường
Languages : un
Pages : 145

"A data-driven approach to model translation suffers from the data mismatch problem and demands domain adaptation techniques. Given parallel training data originating from a specific domain, training an MT system on the data would result in a rather suboptimal translation for other domains. But does suboptimality of translation happen only in such an extreme scenario of domain mismatch? This dissertation shows that training SMT systems on heterogeneous corpora (e.g. EuroParl) may also result in suboptimal performance of statistical translation systems. Specifically, it is clear that a word/phrase could be translated in different ways when it comes to different domains. The translation statistics induced from word alignment models and phrase-based models, however, reflect translation preferences aggregated over diverse domains in heterogeneous corpora. In this sense, they can be considered as coarse and domain-confused statistics. This dissertation shows that domain-confused statistics may harm performance of both word alignment and phrase-based models. Another important contribution of this dissertation is to provide a principled way to address the problem. We focus on learning the translation statistics with respect to each of the diverse domains (i.e. domain-focused translation statistics). With our method of domain induction for translation, we present a comprehensive study of domain adaptation for statistical machine translation, including four specific case studies: Data Selection, Phrase-Based Translation, Word Alignment and Rewarding Domain Invariance in translation. Finally, we briefly describe Scorpio, the ILLC-UvA Adaptation System submitted to an adaptation task at WMT 2016, which participated with the language pair of English-Dutch. This system consolidates the ideas in the thesis on latent variable models for adaptation. Results validate the effective adaptation performance in a competitive setting."--Author's summary.



Use Of Source Language Context In Statistical Machine Translation

Author: Rejwanul Haque
Publisher: LAP Lambert Academic Publishing
ISBN: 9783847340973
Languages : un
Pages : 228

The translation features typically used in state-of-the-art statistical machine translation (SMT) model dependencies between the source and target phrases, but not among the phrases in the source language themselves. A swathe of research has demonstrated that integrating source context modelling directly into log-linear phrase-based SMT (PB-SMT) and hierarchical PB-SMT (HPB-SMT) can positively influence the weighting and selection of target phrases, and thus improve translation quality. In this book we present novel approaches to incorporating source-language contextual modelling into state-of-the-art SMT models in order to enhance the quality of lexical selection. We investigate the effectiveness of a range of contextual features, including lexical features of neighbouring words, part-of-speech tags, supertags, sentence-similarity features, dependency information, and semantic roles. We explored a series of language pairs featuring typologically different languages, and examined the scalability of our research to larger amounts of training data.