Machine translation quality estimation
Machine translation quality estimation (MTQE or QE) is the task of automatically estimating the quality of a machine translation, without using a reference translation.
This research area within machine translation and machine learning gave birth to the quality prediction models used in production systems like ModelFront.
Evaluation vs. estimation
Quality evaluation metrics, like BLEU, are for comparing machine translation systems. They basically work like average edit distance.
They are only intended to be directionally correct, not accurate at the sentence level — BLEU doesn't even consider the source text.
But to calculate a score, they require (human) reference translations.
So they cannot be used for new content in production, for which there are no human translations yet.
Quality estimation scores a translation, based on the original and the translation. No reference translation. So it can be used for new input.
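To make the difference concrete, here is a minimal Python sketch. It assumes the sacrebleu library for the evaluation side; the estimate_quality function is a hypothetical stand-in for a trained quality estimation model, not a real API.

```python
import sacrebleu

# Evaluation (e.g. BLEU): compares the machine translation to a human reference.
hypothesis = "The cat sat on the mat."         # machine translation
reference = "The cat was sitting on the mat."  # human reference translation
bleu = sacrebleu.sentence_bleu(hypothesis, [reference])
print(f"BLEU: {bleu.score:.1f}")  # 0-100; only directionally meaningful at the sentence level

# Estimation: scores the translation from the source and the translation alone,
# so it also works for new content that has no human translation yet.
source = "Le chat était assis sur le tapis."   # original source text

def estimate_quality(source: str, translation: str) -> float:
    """Hypothetical stand-in for a trained quality estimation model."""
    return 0.9  # placeholder; a real model would return a learned score from 0.0 to 1.0

print(f"Estimated quality: {estimate_quality(source, hypothesis):.2f}")
```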
Estimation vs. prediction
A raw quality estimation model outputs only a score, typically 0.0 to 1.0, that estimates the quality of a translation.
But to convert scores like 0.9 (90%) into decisions, thresholds need to be carefully chosen, for each language and content type.
So in practice, raw scores were not usable in production workflows by translation buyers, translation teams or professional human translators, who often confused them with fuzzy match scores.
Quality prediction, by contrast, outputs a boolean flag: ✓ or ✗.
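As a rough sketch of how a score becomes a decision, the Python snippet below converts a raw 0.0-1.0 score into a ✓ or ✗ flag using per-language-pair, per-content-type thresholds. The threshold values and the language and content-type labels are illustrative assumptions, not values from any production system.

```python
# Illustrative thresholds per (language pair, content type).
# These values are assumptions for the sketch, not production settings.
THRESHOLDS = {
    ("en-de", "support"): 0.85,
    ("en-ja", "support"): 0.90,
    ("en-de", "marketing"): 0.95,
}

def predict(score: float, language_pair: str, content_type: str) -> bool:
    """Convert a raw quality estimation score into a boolean quality prediction."""
    threshold = THRESHOLDS[(language_pair, content_type)]
    return score >= threshold

# The same 0.9 score clears the bar for en-de support content,
# but not for en-de marketing content.
print(predict(0.9, "en-de", "support"))    # True  -> ✓ approve as-is
print(predict(0.9, "en-de", "marketing"))  # False -> ✗ send to human review
```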
Timeline
The evolution of quality estimation models mirrored the evolution of machine translation and machine learning models in general.
Quality estimation research began at the Workshop on Statistical Machine Translation (WMT) in 2013, where researchers defined the task.
They competed using the feature engineering approaches that were state of the art at the time, at the end of the statistical machine translation era.
| 2013 QuEst | 2016 QuEst++ | 2019 OpenKiwi | 2020 ModelFront |
|---|---|---|---|
| Open-source library | Open-source library | Open-source library | Production system |
| Feature engineering | Feature engineering | Deep learning | Multilingual large language models (LLMs) |
| Python scripts and Java program | Python scripts and Java program | Python framework | API and integrations |
| Score 0.0-1.0 | Score 0.0-1.0 | Score 0.0-1.0 | Flag ✓ or ✗ |
The early research community was driven by researchers like Lucia Specia, then at the University of Sheffield and Imperial College London, and Radu Soricut from Google Research. Unbabel researchers like Fábio Kepler and André Martins led the development of OpenKiwi.
Learn More
- Quality estimation in A practical guide to quality prediction
- Quality estimation on machinetranslate.org ↗
- Quality estimation in r/machinetranslation ↗
- Quality estimation in Slator ↗
- Quality estimation on Stack Overflow ↗
- Quality estimation on arXiv ↗
Join the mission
Are you interested in joining the mission to accelerate human-quality translation?
Browse jobs at ModelFront