What is Machine Translation?
To process any translation, human or automated, the meaning of a text in the
original (source) language must be fully restored in the target language, i.e.
the translation. While on the surface this seems straightforward, it is far
more complex. Translation is not a mere word-for-word substitution. A
translator must interpret and analyze all of the elements in the text and know
how each word may influence another. This requires extensive expertise in
grammar, syntax (sentence structure), semantics (meanings), etc., in the source
and target languages, as well as familiarity with each local region.

Human and machine translation each have their share of challenges. For
example, no two individual translators will produce identical translations of
the same text in the same language pair, and it may take several rounds of
revisions to satisfy the customer. But the greater challenge lies in how
machine translation can produce publishable-quality translations.

Rule-Based Machine Translation Technology
Rule-based machine translation relies on countless built-in linguistic rules
and bilingual dictionaries with millions of entries for each language pair.

The software parses text and creates a transitional representation from
which the text in the target language is generated. This process requires
extensive lexicons with morphological, syntactic, and semantic information, and
large sets of rules. The software uses these complex rule sets and then
transfers the grammatical structure of the source language into the target
language.

Translations are built on gigantic dictionaries and sophisticated linguistic
rules. Users can improve the out-of-the-box translation quality by adding their
own terminology to the translation process. They create user-defined dictionaries
that override the system’s default settings.

In most cases, customization takes two steps: an initial investment that
significantly increases quality at a limited cost, and an ongoing investment
that increases quality incrementally. While rule-based MT brings companies to
the quality threshold and beyond, the quality improvement process may be long
and expensive.

Statistical Machine Translation Technology
Statistical machine translation utilizes statistical translation models
whose parameters stem from the analysis of monolingual and bilingual corpora.
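
To make "parameters that stem from the analysis of bilingual corpora" concrete, here is a minimal sketch of IBM Model 1, one classic word-level statistical model. The text above does not say which model any given system uses, so treat this purely as an illustration. The three-sentence corpus and the names `train_ibm_model1` and `best_translation` are hypothetical.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical data): German source, English target.
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t[(target, source)] with EM."""
    sentence_pairs = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {w for _, tgt in sentence_pairs for w in tgt}
    # Start from a uniform distribution over the target vocabulary.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (target, source) pair
        total = defaultdict(float)  # expected count of each source word
        # E-step: spread each target word over the source words of its sentence.
        for src, tgt in sentence_pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(
            float, {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
        )
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    model = train_ibm_model1(corpus)
    for word in ("das", "haus", "buch", "ein"):
        print(word, "->", best_translation(model, word))
```

Nothing in the sketch encodes grammar or a dictionary: every probability is inferred from co-occurrence in the corpus, which is why corpus size and domain matter so much for this approach.
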
Building statistical translation models is a quick process, but the technology
relies heavily on existing multilingual corpora. A minimum of 2 million words
for a specific domain and even more for general language are required.
Theoretically, it is possible to reach the quality threshold, but most companies
do not have such large amounts of existing multilingual corpora to build the
necessary translation models. Additionally, statistical machine translation is
CPU-intensive and requires an extensive hardware configuration to run
translation models at average performance levels.

Rule-Based MT vs. Statistical MT
Rule-based MT provides good out-of-domain quality and is by nature
predictable. Dictionary-based customization guarantees improved quality and
compliance with corporate terminology. But translation results may lack the
fluency readers expect. In terms of investment, the customization cycle needed
to reach the quality threshold can be long and costly. The performance is high
even on standard hardware.

Statistical MT provides good quality when large and qualified corpora are
available. The translation is fluent, meaning it reads well and therefore meets
user expectations. However, the translation is neither predictable nor
consistent. Training on good corpora is automated and cheaper than rule-based
customization. But results are poor when the model is trained on
general-language corpora, meaning text other than the specified domain.
Furthermore, statistical MT requires significant hardware to build and
manage large translation models.

Rule-Based MT                                 Statistical MT
+ Consistent and predictable quality          – Unpredictable translation quality
+ Out-of-domain translation quality           – Poor out-of-domain quality
+ Knows grammatical rules                     – Does not know grammar
+ High performance and robustness             – High CPU and disk space requirements
+ Consistency between versions                – Inconsistency between versions
– Lack of fluency                             + Good fluency
– Hard to handle exceptions to rules          + Good for catching exceptions to rules
– High development and customization costs    + Rapid and cost-effective development costs
                                                provided the required corpus exists

Given the overall requirements, there is a clear need for a third approach
through which users would reach better translation quality and high performance
(similar to rule-based MT), with less investment (similar to statistical MT).
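
As a rough illustration of the rule-based pipeline described earlier (analysis of the source text, transfer of its grammatical structure, generation of the target text, with a user-defined dictionary overriding the system defaults), the following toy sketch translates a few English words into Spanish. The dictionary entries, the single adjective-noun reordering rule, and all function names are hypothetical. A real system relies on far larger lexicons and rule sets.

```python
# Default system dictionary (hypothetical entries):
# source word -> (target word, part of speech)
DEFAULT_DICTIONARY = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "fast": ("rápido", "ADJ"),
    "car": ("coche", "NOUN"),
}


def analyze(sentence, lexicon):
    """Analysis: tag each source word with its part of speech.

    Raises KeyError for words missing from the lexicon.
    """
    return [(word, lexicon[word][1]) for word in sentence.lower().split()]


def transfer(tagged):
    """Transfer: English ADJ NOUN order becomes Spanish NOUN ADJ order."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "ADJ" and result[i + 1][1] == "NOUN":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair that was just reordered
        else:
            i += 1
    return result


def generate(tagged, lexicon):
    """Generation: look up the target word for each source word."""
    return " ".join(lexicon[word][0] for word, _ in tagged)


def translate(sentence, user_dictionary=None):
    """Run the pipeline; user entries override the default dictionary."""
    lexicon = {**DEFAULT_DICTIONARY, **(user_dictionary or {})}
    return generate(transfer(analyze(sentence, lexicon)), lexicon)


if __name__ == "__main__":
    print(translate("the red car"))  # el coche rojo
    # A user-defined entry replaces the default translation of "car".
    print(translate("the red car", {"car": ("automóvil", "NOUN")}))  # el automóvil rojo
```

The output is fully determined by the rules and dictionaries, which is the predictability credited to rule-based MT above. The same property is why an input the rules do not anticipate, here any unknown word, fails outright rather than degrading gracefully.
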