The God Metric is Dead: Why Words Are Smarter Than Numbers
How the localization industry's obsession with numbers masks catastrophic failures, and what to do instead
There is a seduction in the number 86.23%.
It looks clean. It looks objective. It looks safe. If I walk into a board meeting and tell the C-suite, “Our translation quality index improved by 2% this quarter,” everyone nods. We have successfully turned the chaotic, messy, subjective reality of human language into a tidy engineering problem.
But if we are honest with ourselves, we know the truth: That number is a hallucination.
Why the Score is a Lie
I don’t use the word “hallucination” lightly. In AI, a hallucination is a confident response that is factually grounded in nothing. The “Quality Score” is the industry’s collective hallucination.
Why? Because it attempts to impose a linear, mathematical scale on a multi-dimensional, subjective reality.
When we see “86%,” we intuitively believe it is “better” than “82%.” We assume it means the translation is “mostly good.” But language doesn’t work like a math test. You cannot get partial credit for a sentence that kills the user.
If a medical manual is translated perfectly for 500 pages, but on page 501 it omits the word “not” in the sentence “Do not cut the red wire,” what is the quality score?
Mathematically, it might still be 99.9% accurate.
In reality, the quality is zero. The translation is fatal.
The number hallucinates safety. It tells you the file is “High Quality” because the aggregate data looks good, masking the specific, catastrophic failure hiding on page 501. It gives you a false sense of certainty in a domain that is inherently uncertain.
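To make the arithmetic concrete, here is a toy sketch in Python (the segment list and scores are invented for illustration): average a thousand near-perfect segments and the one fatal error vanishes into the decimals.

```python
# Invented numbers: one fatal segment buried among 999 perfect ones.
segments = [{"score": 1.0, "fatal": False} for _ in range(999)]
segments.append({"score": 0.8, "fatal": True})  # "Do [not] cut the red wire."

aggregate = sum(s["score"] for s in segments) / len(segments)
print(f"Aggregate quality: {aggregate:.2%}")                    # 99.98% -- looks safe
print("Actually safe:", not any(s["fatal"] for s in segments))  # False
```

The mean is a summary statistic. Safety is a property of the worst segment, not the average one.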
The Original Sin: Counting Matches, Missing Meaning
For the last forty years, we have been obsessed with quantifying the unquantifiable. We convinced ourselves that if we could just find the right formula—BLEU, TER, COMET, MTQE—we could compress “meaning” into a floating-point number.
But the DNA of these metrics is flawed. They were born in an era when computers couldn’t understand words; they could only count them.
Consider the classic failure mode of n-gram (word-overlap) metrics:
Reference: “The system is unavailable.”
Machine Output: “The system is available.”
A metric like BLEU looks at this and sees four out of five matching words. It calculates an 80% accuracy rate. It gives this translation a B+.
But semantically? It is a disaster. It states the exact opposite of the truth. Yet, purely quantitative data will report this as a success. This is why we must stop trusting the aggregate number. It measures surface similarity, not semantic truth.
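If you want to see that failure in code, here is a minimal sketch of clipped unigram precision, the word-overlap idea at the core of BLEU (real BLEU adds higher-order n-grams and a brevity penalty; the tokenization here treats the period as its own token, giving five tokens per sentence):

```python
from collections import Counter

def unigram_precision(reference: str, candidate: str) -> float:
    """Clipped fraction of candidate tokens that also appear in the reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    matches = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matches / sum(cand_counts.values())

reference = "The system is unavailable ."
candidate = "The system is available ."

print(unigram_precision(reference, candidate))  # 0.8 -- a B+ for the exact opposite
```

The metric is doing exactly what it was built to do: count surface overlap. Negation is simply invisible to it.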
The Crisis of Actionability
The problem with the “God Metric” isn’t just that it can be deceptive. The deeper problem is that it isn’t actionable.
Imagine you are a Localization Manager. You receive a report stating that your French translation quality dropped from 92% to 84%.
What do you do with that information?
Do you fire the vendor?
Do you retrain the engine?
Do you update the style guide?
You don’t know. The number is an abstraction. It creates anxiety, but it doesn’t offer a path to resolution. You are drowning in data, but you have zero control.
To give you control, data must tell you what is wrong and how to fix it. This is where we need to pivot from Quantitative Scoring to Qualitative Diagnostics.
Enter “Translation Smells”
In software engineering, developers talk about “Code Smells.” A code smell isn’t necessarily a bug that crashes the system. The code compiles. But the code looks “smelly”: maybe a function is too complex, or the logic is circular. It indicates a deeper weakness.
At Bureau Works, we believe localization needs to adopt this same philosophy. We call them Translation Smells.
We use Generative AI not to “score” text, but to sniff out these specific semantic issues. And unlike a number, a “smell” is immediately actionable.
Consider the difference in utility between a Score and a Smell:
The Score Approach:
“This segment has a Quality Estimation (MTQE) score of 0.6.”
Action: Unknown. Check manually? Ignore? Panic?
The Smell Approach:
“Caution: Gender Bias Detected. The source text is neutral (‘The doctor’), but the target text defaults to masculine (‘El médico’).”
Action: Rewrite the target with an inclusive or gender-neutral form.
“Caution: Tonal Clash. The source is playful and casual, but the translation is bureaucratic and formal.”
Action: Rewrite to match the brand voice.
In these instances, the language-based data empowers the user. It respects the user’s intelligence and gives them the tools to act, rather than just a grade to fear.
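The difference is easy to see as data. Here is an illustrative sketch (the field names are mine, not Bureau Works’ actual schema): a score is a bare float, while a smell carries everything the editor needs to act.

```python
from dataclasses import dataclass

@dataclass
class TranslationSmell:
    category: str     # e.g. "gender_bias", "tonal_clash"
    segment_id: int   # where in the file the problem lives
    explanation: str  # what was detected, in plain language
    suggestion: str   # a concrete next step for the editor

score = 0.6  # the MTQE approach: a number with no instructions attached

smell = TranslationSmell(
    category="gender_bias",
    segment_id=42,
    explanation='Source is neutral ("The doctor"); target defaults to masculine ("El médico").',
    suggestion="Rewrite the target with an inclusive or neutral form.",
)

print(f"Score says: {score}. Now what?")
print(f"Smell says: [{smell.category}] {smell.suggestion}")
```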
The Confidence Score: A Security Blanket
Now, if you log into Bureau Works today, you will still see a score. We call it the Confidence Score.
Why is it there, if I’ve just spent 600 words arguing against it?
Because the market needs predictability. Procurement departments need numbers for spreadsheets to track trends over time. Executives need simple KPIs to verify that the system is stable.
We view the score not as a driver of value, but as a layer of predictability. It is an innocuous feature, a “security blanket” that lets stakeholders feel comfortable that the system is healthy.
But for the actual user—the translator, the editor, the localization manager—the score doesn’t do much. It doesn’t fix the translation. It doesn’t clarify the tone.
The value isn’t that the file is “85% Confident.” The value is knowing exactly where the remaining 15% of uncertainty lives: in three specific ambiguous legal terms that require human review.
The score is the map. The smells are the territory. Do not confuse the two.
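The two can coexist without confusion as long as the map is computed from the territory. A hedged sketch, with penalty weights invented for illustration: the score is just a roll-up of the smells, so the detail always sits one level below the summary.

```python
# Sketch only: a confidence score derived FROM the diagnostics.
# Penalty weights are invented for illustration.
PENALTIES = {"ambiguous_term": 0.05, "tonal_clash": 0.03, "gender_bias": 0.04}

def confidence_score(smells: list[str]) -> float:
    penalty = sum(PENALTIES.get(s, 0.02) for s in smells)
    return max(0.0, 1.0 - penalty)

smells = ["ambiguous_term"] * 3  # three ambiguous legal terms
print(f"Confidence: {confidence_score(smells):.0%}")  # 85% -- and you know
# exactly which three problems account for the missing 15%.
```

The spreadsheet still gets its number, but the number now decomposes into named, fixable problems instead of floating free.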
Authorship is the Ultimate Metric
Why does this distinction matter so much? Why fight for “Smells” over “Scores”?
Because of the Last Mile.
As AI gets better at translation, the role of the human linguist is shifting. We are no longer just translating from scratch; we are verifying, editing, and curating.
If we rely on black-box scores, we surrender our agency to the machine. We accept “86%” because the computer said so.
But when we use Translation Smells, we reclaim Authorship.
Authorship isn’t about typing every word. Authorship is about Choice. When the system flags a “Tone Smell,” saying the text is too aggressive, the human has a choice to make.
They can say: “Oh, good catch. I’ll soften that.”
Or they can say: “No, this is a warning label. It needs to sound aggressive. Ignore.”
That moment, the decision to Ignore or Apply, is where human value lives. That is the spark of creativity and context that no algorithm can fully replicate.
Data is only good if it empowers you to make that choice with confidence.
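In workflow terms, that means no smell is ever auto-applied; each one waits for a human verdict. A minimal sketch of the loop (an illustrative structure, not our production code):

```python
# Illustrative only: every flagged smell requires an explicit verdict.
smells = [
    {"category": "tonal_clash",
     "note": "Source is playful; target is bureaucratic.",
     "suggestion": "Rewrite to match the brand voice."},
]

for smell in smells:
    verdict = input(f"[{smell['category']}] {smell['note']} Apply or ignore? ")
    if verdict.strip().lower() == "apply":
        print(f"  -> {smell['suggestion']}")
    else:
        print("  -> Keeping the translator's wording. The human decided.")
```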
The Future is Semantic
We are at a transition point in the industry. The people clinging to MTQE and BLEU scores are doing so because it’s what they know. It feels rigorous because it involves math. It feels like “Science.”
But the future isn’t about better math. It’s about better linguistics.
It’s about moving from “Is this 86% correct?” to “Does this smell like our brand?”
We need to stop hiding behind the safety of the aggregate number. We need to expose the specific, messy, beautiful details of language. Because when we strip away the score and look at the words, we finally gain the one thing we’ve been chasing all along:
Control.