Lost in Translation: Exploiting Unicode Normalization

 

At Black Hat USA 2025, my daughter and I had the privilege of presenting a topic that sits at the uncomfortable intersection of application security, text encoding, and real‑world defensive blind spots: Unicode normalization abuse. What started as a collection of “weird edge cases” has grown into a repeatable class of vulnerabilities that attackers actively use to bypass modern security controls, especially WAFs and input validation logic.

This post distills the core ideas, examples, and lessons from our talk into a single narrative for defenders, bug bounty hunters, and anyone who handles untrusted text.

References:

Black Hat video: https://www.youtube.com/watch?v=ETB2w-f3pM4

Slides: https://i.blackhat.com/BH-USA-25/Presentations/USA-25-Barnett-Lost-In-Translation-Exploiting-Unicode-compressed.pdf


Why Unicode Still Breaks Security Logic

Unicode was designed to be universal — to support every language, symbol, and writing system. Security logic, on the other hand, is often written with ASCII‑centric assumptions baked in.

That mismatch creates an attack surface.

In modern architectures, input data is rarely processed once. It flows through multiple systems:

  • Browsers

  • CDNs and WAFs

  • Load balancers

  • Application frameworks

  • Databases

Each layer may decode, normalize, truncate, or reinterpret text differently. When security decisions are made before all transformations are complete, attackers can exploit the gaps.

We call this an impedance mismatch — and Unicode is one of the most reliable ways to trigger it.


The Real‑World Context: WAFs and Bug Bounties

As bug bounty adoption has increased, many organizations rely heavily on CDN‑based WAFs as their first line of defense. This creates a pattern we see repeatedly:

  1. Researcher finds a bypass

  2. Bug bounty report is shared with the CDN/WAF vendor

  3. A virtual patch reduces duplicate reports

  4. The underlying application bug remains unfixed or delayed

If the WAF and the application do not process Unicode identically, virtual patches can introduce a false sense of safety.


Abuse Class 1: Unicode‑Aware URL Decoding

A surprising number of URL decoders are not fully multi-byte aware.

Example:

  • Unicode character: U+0391 (Greek Capital Letter Alpha)

  • UTF‑8 encoding: %CE%91

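Some decoders treat each percent‑encoded byte as an independent single‑byte character, then down‑convert to 7‑bit ASCII by dropping the most significant bit. Under that scheme, %CE%91 becomes 0x4E ("N") followed by 0x11, a control character, instead of a single Greek Alpha.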

The result? Data mutates after inspection.
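To make the mutation concrete, here is a minimal Python sketch. The naive_decode function is a hypothetical stand-in for a decoder that is not multi-byte aware; it is not taken from any specific product. It compares the standard library's UTF-8 aware decoding with the "one byte per character, drop the high bit" behavior described above.

```python
from urllib.parse import unquote, unquote_to_bytes


def naive_decode(value: str) -> str:
    """Hypothetical broken decoder: one byte per character, high bit dropped."""
    raw = unquote_to_bytes(value)
    return "".join(chr(b & 0x7F) for b in raw)


encoded = "%CE%91"
correct = unquote(encoded)       # UTF-8 aware: U+0391 Greek Capital Letter Alpha
mutated = naive_decode(encoded)  # 0xCE -> 0x4E ("N"), 0x91 -> 0x11 (control char)

# The same 7-bit stripping turns high bytes into markup characters:
# 0xBC -> 0x3C ("<") and 0xBE -> 0x3E (">").
smuggled = naive_decode("%BCscript%BE")

print(repr(correct), repr(mutated), repr(smuggled))
```

An inspection layer that decodes "%BCscript%BE" as UTF-8 never sees an angle bracket, while a backend behaving like naive_decode receives "<script>".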

This enables:

  • Filter bypasses

  • Signature evasion

  • Payload smuggling

We demonstrated how tools like Burp, CyberChef, and Caido help visualize these transformations — and how attackers chain them together.


Abuse Class 2: Overlong UTF‑8 Encodings

Although overlong UTF‑8 encodings are invalid per the standard, some decoders still accept them.

Attackers exploit this to:

  • Encode forbidden characters in non‑canonical forms

  • Evade pattern‑based detection

  • Trigger differential decoding between layers

If one layer rejects overlong sequences but another accepts them, security logic becomes inconsistent.
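A small Python sketch of that inconsistency follows. The lenient_utf8_decode function is a hypothetical decoder written for illustration only: it handles one- and two-byte sequences and skips the minimum-value check that a conforming decoder uses to reject overlong forms. The byte pair C0 AF is the well-known overlong encoding of "/" (0x2F).

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Hypothetical lenient decoder (1- and 2-byte sequences only).

    It never checks that a 2-byte sequence encodes a value >= 0x80,
    so overlong forms are silently accepted.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            code_point = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            out.append(chr(code_point))
            i += 2
        else:
            raise ValueError("sequence not handled in this sketch")
    return "".join(out)


overlong_slash = b"..\xc0\xaf.."  # C0 AF is an overlong encoding of "/"

# Layer A: Python's built-in decoder is strict and rejects the sequence.
try:
    overlong_slash.decode("utf-8")
    strict_result = "accepted"
except UnicodeDecodeError:
    strict_result = "rejected"

# Layer B: the lenient decoder produces a path traversal fragment.
lenient_result = lenient_utf8_decode(overlong_slash)

print(strict_result, repr(lenient_result))
```

If the inspecting layer behaves like the lenient decoder's opposite (rejecting or ignoring the bytes) while the backend behaves like lenient_utf8_decode, a filter looking for "../" never sees it.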


Abuse Class 3: Unicode Visual Confusables

Not all attacks rely on byte‑level tricks. Some rely on what humans think they are seeing.

Unicode contains thousands of confusable characters — symbols that look identical or nearly identical to ASCII characters.

Examples:

  • Greek Alpha (Α) vs Latin A (A)

  • Curly quotes vs straight quotes

  • Mathematical symbols masquerading as operators

We explored how confusables enable:

  • SSTI bypasses

  • Source code review blind spots

  • Logic errors in sanitization routines

Modern editors like VS Code now warn about confusables — but runtime environments usually do not.
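As a rough illustration of the kind of check a runtime could perform, the Python sketch below uses the standard unicodedata module to show that look-alike characters are distinct code points. The non_ascii_report helper and the sample strings are hypothetical names chosen for this example, and the check is far simpler than a real confusable detection library.

```python
import unicodedata

# ASCII character paired with a visually similar non-ASCII character.
pairs = [
    ("A", "\u0391"),  # Greek Capital Letter Alpha
    ("'", "\u2019"),  # Right Single Quotation Mark (curly quote)
    ("/", "\u2215"),  # Division Slash (mathematical operator)
]

for ascii_char, lookalike in pairs:
    print(
        f"U+{ord(ascii_char):04X} {unicodedata.name(ascii_char)}"
        f"  vs  U+{ord(lookalike):04X} {unicodedata.name(lookalike)}"
        f"  equal={ascii_char == lookalike}"
    )


def non_ascii_report(text: str) -> list:
    """Hypothetical helper: list every non-ASCII character with its name."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 0x7F
    ]


looks_like_admin = "\u0391dmin"  # renders like "Admin" in most fonts
report = non_ascii_report(looks_like_admin)
print(report)
```

A string comparison against "Admin" fails for looks_like_admin even though a human reviewer would likely read the two as identical.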


Unicode Normalization and Order‑of‑Operations Bugs

One of the most critical lessons from the talk:

Normalization order matters more than normalization itself.

We showed multiple examples where:

  1. Input is sanitized

  2. Unicode is normalized afterward

or

  1. Unicode is normalized

  2. Sanitization is applied to the wrong representation

A concrete example involved best‑fit mapping on Windows code pages, where Unicode characters are silently mapped to ASCII equivalents — sometimes removing security‑relevant characters like quotes.

Changing the order of operations completely changes the security outcome.
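The effect of ordering can be reproduced in a few lines of Python. This sketch uses NFKC normalization as a portable stand-in for the Windows best-fit mapping mentioned above (the two are different mechanisms, but both map look-alike characters to ASCII). The strip_angle_brackets sanitizer is a hypothetical, deliberately simplistic filter.

```python
import unicodedata


def strip_angle_brackets(text: str) -> str:
    """Hypothetical sanitizer that only knows about ASCII angle brackets."""
    return text.replace("<", "").replace(">", "")


# U+FF1C and U+FF1E are the fullwidth less-than and greater-than signs.
payload = "\uff1cscript\uff1e"

# Order 1: sanitize first, normalize afterward. The sanitizer sees no ASCII
# brackets, then normalization creates them.
sanitize_then_normalize = unicodedata.normalize(
    "NFKC", strip_angle_brackets(payload)
)

# Order 2: normalize first, then sanitize the final representation.
normalize_then_sanitize = strip_angle_brackets(
    unicodedata.normalize("NFKC", payload)
)

print(repr(sanitize_then_normalize))  # '<script>'
print(repr(normalize_then_sanitize))  # 'script'
```

The same two functions produce a dangerous or a harmless result depending only on which one runs first.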


Abuse Class 4: Unicode Truncation

Some applications accept Unicode input but later truncate it into ASCII by:

  • Taking only the low byte of a codepoint

  • Assuming 1 byte = 1 character

This enables attacks where:

  • %uXXXX values collapse into dangerous ASCII characters

  • Filters pass benign‑looking Unicode

  • Execution contexts receive malicious ASCII
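A minimal Python sketch of low-byte truncation is shown below. Both helpers are hypothetical: decode_percent_u models a decoder that understands the non-standard %uXXXX syntax, and truncate_to_low_byte models a later layer that keeps only the low byte of each code point.

```python
import re


def decode_percent_u(value: str) -> str:
    """Hypothetical decoder for the non-standard %uXXXX escape syntax."""
    return re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda match: chr(int(match.group(1), 16)),
        value,
    )


def truncate_to_low_byte(text: str) -> str:
    """Hypothetical lossy conversion: keep only the low byte of each code point."""
    return "".join(chr(ord(ch) & 0xFF) for ch in text)


request_value = "%u013Cscript%u013E"

# What a filter sees: U+013C and U+013E are Latin small letters (l with
# cedilla, l with caron), so there is no angle bracket to match.
inspected = decode_percent_u(request_value)

# What the execution context receives after truncation:
# 0x013C -> 0x3C ("<") and 0x013E -> 0x3E (">").
delivered = truncate_to_low_byte(inspected)

print(repr(inspected), repr(delivered))
```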

We tied this behavior to real‑world vulnerabilities, including research presented at DEF CON on Unicode overflows and email parsing.


Case Folding

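Case folding is an additional attack surface. Unicode upper/lowercase transformations differ from ASCII rules, so a character that passes a filter can turn into a different character, sometimes an ASCII one, when a later layer changes its case.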
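Python's built-in string methods implement the Unicode case mappings, which makes the surprises easy to observe. In this sketch the keyword filter and the later uppercasing step are hypothetical; the individual case mappings are standard Unicode behavior.

```python
long_s = "\u017f"       # Latin Small Letter Long S
kelvin = "\u212a"       # Kelvin Sign
dotless_i = "\u0131"    # Latin Small Letter Dotless I
sharp_s = "\u00df"      # Latin Small Letter Sharp S

print(repr(long_s.upper()))     # 'S'
print(repr(kelvin.lower()))     # 'k'
print(repr(dotless_i.upper()))  # 'I'
print(repr(sharp_s.upper()))    # 'SS' (the length changes too)

# Hypothetical filter: lowercase the input and look for a blocked keyword.
payload = long_s + "cript"
filter_hit = "script" in payload.lower()

# Hypothetical later layer: uppercases the same value.
after_upper = payload.upper()

print(filter_hit, repr(after_upper))
```

The filter does not match because the long s stays a long s when lowercased, yet the uppercased value is plain ASCII "SCRIPT".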


Tooling, Labs, and Research References

During the talk we shared tools and references that help defenders and researchers explore Unicode safely:

  • Unicode IDNA utilities

  • UTF‑8 visualizers

  • Confusable detection libraries

  • Normalization tables

  • XSS and SSTI Unicode labs

We strongly recommend defenders test their applications using Unicode‑aware fuzzing, not just ASCII payloads.
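One simple way to seed such a fuzzer is to enumerate every code point that normalizes to a character your filters care about. The sketch below does this for NFKC using only the Python standard library. The nfkc_variants function is a hypothetical helper, and NFKC is only one of several transformations a target may apply, so treat the output as a starting corpus rather than complete coverage.

```python
import unicodedata


def nfkc_variants(target: str) -> list[str]:
    """Return non-ASCII characters whose NFKC normalization equals target."""
    found = []
    for code_point in range(0x80, 0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogates are not valid standalone characters
        ch = chr(code_point)
        if unicodedata.normalize("NFKC", ch) == target:
            found.append(ch)
    return found


variants = nfkc_variants("<")
for ch in variants:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

Substituting each variant for the ASCII character in an existing payload list yields test cases that exercise normalization differences between the WAF and the application.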


Final Takeaways

Unicode vulnerabilities are not exotic.

They are:

  • Common

  • Repeatable

  • Systemic

If your security logic assumes:

  • One character == one byte

  • ASCII equivalence

  • Single‑pass decoding

…then Unicode will eventually betray you.

The fix is not “block Unicode,” but to:

  • Normalize early

  • Normalize consistently

  • Validate after all transformations

  • Align WAF and application decoding logic
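A rough sketch of "validate after all transformations" in Python follows. The canonicalize and is_allowed functions are hypothetical, and the choices made here (one strict UTF-8 percent-decoding pass, NFKC normalization, a bracket check) are assumptions for illustration. The real pipeline must mirror whatever the application actually does to the text.

```python
import unicodedata
from urllib.parse import unquote


def canonicalize(raw: str) -> str:
    """Decode strictly, then normalize, before any security decision."""
    # errors="strict" raises UnicodeDecodeError on invalid UTF-8,
    # including overlong sequences, instead of silently replacing bytes.
    decoded = unquote(raw, encoding="utf-8", errors="strict")
    return unicodedata.normalize("NFKC", decoded)


def is_allowed(raw: str) -> bool:
    """Hypothetical validator that inspects only the canonical form."""
    try:
        text = canonicalize(raw)
    except UnicodeDecodeError:
        return False  # reject input that is not well-formed UTF-8
    return "<" not in text and ">" not in text


print(is_allowed("hello"))                      # True
print(is_allowed("%EF%BC%9Cscript%EF%BC%9E"))   # False: fullwidth brackets
print(is_allowed("%C0%AF"))                     # False: overlong encoding
```

The validator only helps if the value passed downstream is the canonical form it inspected, not the raw input.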

If you do not control every step of text processing, attackers will.


Thanks to everyone who attended our Black Hat USA 2025 session, and especially to the community that continues to explore the strange and wonderful ways text breaks security.

