lang_tools.language.normalization
Accent / diacritic normalization helpers.
Pattern rules
Stateless utility module. Uses Unicode NFD decomposition (via unicodedata)
followed by combining-character filtering as the default backbone.
Per-language overrides (via Language.normalization_map) handle special
cases such as French œ -> oe or German ß -> ss, which NFD does not
decompose.
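To illustrate why the per-language overrides are needed, the following minimal sketch uses only the standard library. The helper name strip_marks is hypothetical and is not part of this module; it assumes "combining-character filtering" means dropping every character with a non-zero canonical combining class.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose with NFD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Characters with a canonical decomposition lose their marks...
print(strip_marks("é"))  # -> "e"
# ...but œ and ß have no canonical decomposition, so NFD leaves them untouched.
print(strip_marks("œ"))  # -> "œ"
print(strip_marks("ß"))  # -> "ß"
```

This is why a mapping such as œ -> oe has to be supplied separately rather than derived from Unicode data.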
Functions:
- extract_accented_chars – List every accented or composed character present in text.
- has_accent – Return True if text contains any accented or composed character.
- normalize – Strip diacritics and lowercase a string.
extract_accented_chars
List every accented or composed character present in text.
Parameters:
- text (str) – Input string.
- language (Language | None, default: None) – Optional language whose normalization_map is applied to the per-character check.
Returns:
- list[str] – Characters from text whose normalized form differs from themselves, preserving original order. Duplicates are kept (caller can deduplicate).
Source code in src/lang_tools/language/normalization.py
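The per-character check described above might look roughly like the sketch below. It is not the module's source: normalize_sketch and extract_accented_chars_sketch are hypothetical names, a plain dict stands in for Language.normalization_map, and each character is compared against its lowercase form (an assumption, chosen so that unaccented capitals are not reported).

```python
import unicodedata


def normalize_sketch(text, normalization_map=None):
    """Hypothetical stand-in for normalize(); the dict mimics normalization_map."""
    for src, dst in (normalization_map or {}).items():
        text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def extract_accented_chars_sketch(text, normalization_map=None):
    """Keep characters whose normalized form differs, in order, with duplicates."""
    return [
        ch for ch in text
        if normalize_sketch(ch, normalization_map) != ch.lower()
    ]


print(extract_accented_chars_sketch("Crème brûlée"))          # -> ['è', 'û', 'é']
print(extract_accented_chars_sketch("été"))                   # -> ['é', 'é']
print(extract_accented_chars_sketch("cœur", {"œ": "oe"}))     # -> ['œ']
```

A caller that only needs the distinct characters can wrap the result in dict.fromkeys(...) to deduplicate while preserving order.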
has_accent
Return True if text contains any accented or composed character.
Parameters:
- text (str) – Input string.
- language (Language | None, default: None) – Optional language whose normalization_map participates in the comparison (so ligatures count as accents).
Returns:
- bool – True when normalize(text) differs from text.lower().
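The documented comparison can be sketched as follows. The names normalize_sketch and has_accent_sketch are hypothetical, and a plain dict is assumed in place of Language.normalization_map; the real function takes a Language object instead.

```python
import unicodedata


def normalize_sketch(text, normalization_map=None):
    """Hypothetical stand-in for normalize(); the dict mimics normalization_map."""
    for src, dst in (normalization_map or {}).items():
        text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def has_accent_sketch(text, normalization_map=None):
    """True when the normalized text differs from the merely lowercased text."""
    return normalize_sketch(text, normalization_map) != text.lower()


print(has_accent_sketch("Ação"))                  # -> True
print(has_accent_sketch("ACAO"))                  # -> False (case alone is not an accent)
print(has_accent_sketch("cœur"))                  # -> False (no map, NFD leaves œ alone)
print(has_accent_sketch("cœur", {"œ": "oe"}))     # -> True (ligature counts via the map)
```

Comparing against text.lower() rather than text keeps uppercase input from being flagged as accented.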
normalize
Strip diacritics and lowercase a string.
Uses Unicode NFD decomposition followed by combining-mark removal. If a
Language is supplied, its normalization_map is applied first to handle
ligatures and other characters NFD cannot decompose.
Parameters:
- text (str) – Input string.
- language (Language | None, default: None) – Optional language whose normalization_map is applied first.
Returns:
- str – Lowercased, accent-stripped form of text.
Example
Stripping Portuguese diacritics::
    normalize("Ação")  # -> "acao"
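The order of operations described for normalize (map first, then NFD, then mark removal and lowercasing) can be approximated with the sketch below. It is an illustration under assumptions, not the library's implementation: normalize_sketch is a hypothetical name, and the dicts GERMAN_MAP and FRENCH_MAP are made-up stand-ins for what a Language.normalization_map might contain.

```python
import unicodedata

# Hypothetical override tables; the real ones live on Language objects.
GERMAN_MAP = {"ß": "ss"}
FRENCH_MAP = {"œ": "oe", "Œ": "OE"}


def normalize_sketch(text, normalization_map=None):
    """Apply overrides, decompose with NFD, drop combining marks, lowercase."""
    # 1. Language-specific replacements that NFD cannot perform.
    for src, dst in (normalization_map or {}).items():
        text = text.replace(src, dst)
    # 2. Canonical decomposition separates base letters from their marks.
    decomposed = unicodedata.normalize("NFD", text)
    # 3. Discard the marks and lowercase the remainder.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(normalize_sketch("Straße", GERMAN_MAP))   # -> "strasse"
print(normalize_sketch("Œuvre", FRENCH_MAP))    # -> "oeuvre"
print(normalize_sketch("Straße"))               # -> "straße" (no map supplied)
```

Applying the map before lowercasing means the table needs entries for both cases of a character, which is why the French sketch lists Œ as well as œ.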