lang_tools.language.normalization

Accent / diacritic normalization helpers.

Pattern rules

Stateless utility module. By default it relies on unicodedata NFD decomposition followed by combining-character filtering. Per-language overrides (via Language.normalization_map) handle special cases such as French œ -> oe or German ß -> ss, which NFD does not decompose.
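
The default behaviour described above can be sketched with the standard library alone. The helper name strip_marks is hypothetical and used only for illustration; it shows why ligatures and ß need a per-language map:

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose with NFD, then drop every combining mark (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("café"))    # cafe   (é decomposes into e + combining acute)
print(strip_marks("cœur"))    # cœur   (œ has no canonical decomposition)
print(strip_marks("Straße"))  # Straße (ß has no canonical decomposition)
```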

Functions:

  • extract_accented_chars

    List every accented or composed character present in text.

  • has_accent

    Return True if text contains any accented or composed character.

  • normalize

    Strip diacritics and lowercase a string.

extract_accented_chars

extract_accented_chars(
    text: str, language: Language | None = None
) -> list[str]

List every accented or composed character present in text.

Parameters:

  • text (str) –

    Input string.

  • language (Language | None, default: None ) –

    Optional language whose normalization_map is applied to the per-character check.

Returns:

  • list[str]

    Characters from text whose normalized form differs from themselves, preserving original order. Duplicates are kept (caller can deduplicate).

Source code in src/lang_tools/language/normalization.py
def extract_accented_chars(text: str, language: Language | None = None) -> list[str]:
    """List every accented or composed character present in `text`.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` is applied to the
            per-character check.

    Returns:
        Characters from `text` whose normalized form differs from themselves,
        preserving original order. Duplicates are kept (caller can deduplicate).
    """
    return [ch for ch in text if normalize(ch, language) != ch.lower()]
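
As a usage illustration, here is a minimal self-contained sketch that copies the logic of the listing without the optional language argument, so it runs without lang_tools installed. It shows that order and duplicates are preserved, and one possible way for a caller to deduplicate:

```python
import unicodedata


def normalize(text: str) -> str:
    # Same steps as the module's normalize() when no language is supplied.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def extract_accented_chars(text: str) -> list[str]:
    # Mirrors the listing above, minus the optional language argument.
    return [ch for ch in text if normalize(ch) != ch.lower()]


# "Ação e coração", written with escapes so the precomposed characters are explicit.
sample = "A\u00e7\u00e3o e cora\u00e7\u00e3o"
found = extract_accented_chars(sample)
print(found)    # ['ç', 'ã', 'ç', 'ã']

# Order-preserving deduplication, left to the caller.
unique = list(dict.fromkeys(found))
print(unique)   # ['ç', 'ã']
```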

has_accent

has_accent(
    text: str, language: Language | None = None
) -> bool

Return True if text contains any accented or composed character.

Parameters:

  • text (str) –

    Input string.

  • language (Language | None, default: None ) –

    Optional language whose normalization_map participates in the comparison (so ligatures count as accents).

Returns:

  • bool

    True when normalize(text, language) differs from text.lower().

Source code in src/lang_tools/language/normalization.py
def has_accent(text: str, language: Language | None = None) -> bool:
    """Return True if `text` contains any accented or composed character.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` participates in
            the comparison (so ligatures count as accents).

    Returns:
        True when `normalize(text, language)` differs from `text.lower()`.
    """
    return normalize(text, language) != text.lower()
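
The effect of the language argument can be illustrated with a small sketch. The real Language class is not shown in this page, so a types.SimpleNamespace carrying only a normalization_map attribute stands in for it here; that stand-in and its map contents are assumptions made for the example:

```python
import unicodedata
from types import SimpleNamespace


def normalize(text, language=None):
    # Copy of the module's normalize() logic so the sketch is self-contained.
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def has_accent(text, language=None):
    return normalize(text, language) != text.lower()


# Hypothetical stand-in for a French Language object.
french = SimpleNamespace(normalization_map={"œ": "oe"})

print(has_accent("CAFÉ"))           # True: É loses its acute accent
print(has_accent("cœur"))           # False: NFD alone leaves œ untouched
print(has_accent("cœur", french))   # True: the map rewrites œ, so the strings differ
print(has_accent("coeur", french))  # False: nothing to rewrite or strip
```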

normalize

normalize(
    text: str, language: Language | None = None
) -> str

Strip diacritics and lowercase a string.

Uses Unicode NFD decomposition followed by combining-mark removal. If a Language is supplied, its normalization_map is applied first to handle ligatures and other characters NFD cannot decompose.

Parameters:

  • text (str) –

    Input string.

  • language (Language | None, default: None ) –

    Optional language whose normalization_map is applied first.

Returns:

  • str

    Lowercased, accent-stripped form of text.

Example

Stripping Portuguese diacritics:

normalize("Ação")  # -> "acao"

Source code in src/lang_tools/language/normalization.py
def normalize(text: str, language: Language | None = None) -> str:
    """Strip diacritics and lowercase a string.

    Uses Unicode NFD decomposition followed by combining-mark removal. If a
    `Language` is supplied, its `normalization_map` is applied first to handle
    ligatures and other characters NFD cannot decompose.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` is applied first.

    Returns:
        Lowercased, accent-stripped form of ``text``.

    Example:
        Stripping Portuguese diacritics::

            normalize("A\u00e7\u00e3o")  # -> "acao"
    """
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()
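
One consequence of the listing is that the map is applied before lowercasing, so a map with only lowercase keys will not catch uppercase ligatures. The sketch below demonstrates this; the SimpleNamespace objects are hypothetical stand-ins for Language, and the map contents are assumptions rather than the library's shipped data:

```python
import unicodedata
from types import SimpleNamespace


def normalize(text, language=None):
    # Same logic as the listing above, repeated so the sketch runs on its own.
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Hypothetical stand-ins for Language objects.
german = SimpleNamespace(normalization_map={"ß": "ss"})
french_lower_only = SimpleNamespace(normalization_map={"œ": "oe"})
french_both_cases = SimpleNamespace(normalization_map={"œ": "oe", "Œ": "OE"})

print(normalize("Straße", german))            # strasse
print(normalize("Œuvre", french_lower_only))  # œuvre  (uppercase Œ is not in the map)
print(normalize("Œuvre", french_both_cases))  # oeuvre
```

If uppercase input is expected, the map presumably needs entries for both cases, as in the last stand-in.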