Human quality as a benchmark

The bar for ModelFront is exceptionally high. It is unlike the bar for machine translation or chatbots, and more like the bar for Waymo or for AI in cancer detection.

ModelFront was only able to roll out AI to check and fix AI in the real world because ModelFront's AI maintains human quality.

Verification vs. generation

For unverified generation, 80% accuracy can be valuable, because the alternative is either nothing at all, or output that humans manually check and fix anyway. For example, a bad machine translation is better than no translation at all, and users understand that it is untrusted.

But in applied verification, the actual check and fix, the whole point is to scale while also keeping human quality.

Accuracy is the wrong metric, and even 90% is not good enough.

The content automated by ModelFront is valuable content that buyers and users expect to be correct, for example technical or even legal documents.

One bad miss hurts much more than a few false alarms.

So it's not about accuracy on evaluation sets, but about recall in the real world. And 80% or even 90% is not good enough.
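To make the gap between accuracy and recall concrete, here is a minimal sketch with made-up numbers. It assumes that a "positive" is a segment that contains an error, and that a miss is an error the checker approved. The counts are hypothetical and are not ModelFront data.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all decisions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp: int, fn: int) -> float:
    """Share of real errors that were caught."""
    return tp / (tp + fn)


# Hypothetical evaluation set: 1,000 segments, 100 of which contain an error.
tp, fn = 20, 80   # errors caught vs. errors missed
fp, tn = 0, 900   # false alarms vs. good segments correctly approved

acc = accuracy(tp, fp, fn, tn)
rec = recall(tp, fn)

print(f"accuracy = {acc:.0%}")  # accuracy = 92%
print(f"recall   = {rec:.0%}")  # recall   = 20%
```

In this made-up case the checker is above 90% accuracy, yet it lets through 80 of the 100 bad segments. Because most segments are fine, accuracy is dominated by the easy cases, while recall counts only the misses that actually hurt.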

More nuances

There are other nuances to using human quality as a benchmark.

Humans aren't perfect either, so matching or beating the quality of the humans is enough to launch in production. Human quality is not perfection, and perfection is not required.
It's not just about the number of errors, but also the type of errors, and humans and machines make different types of mistakes.
In practice, machines doing critical work are held to a higher standard than humans are, and monitored more closely.

Inside ModelFront, we've developed tools and methodologies for automatic evaluation and human evaluation, to measure against human quality. And these tools and methodologies are accessible to ModelFront customers, to help them test before production launch, and monitor in production.
