lang_tools.language
Language configuration and accent / normalization utilities.
This subpackage provides the canonical Language model (code, name, accented
characters, normalization map, keyboard layout) and stateless helpers for
diacritic stripping and accent inspection.
Public API
Language: per-language configuration model.
LANGUAGE_PRESETS: mapping of ISO 639-1 code to preset Language.
get_language: lookup helper that raises UnknownLanguageError on miss.
UnknownLanguageError: raised when an unknown language code is requested.
normalize: strip diacritics and lowercase.
has_accent: boolean check for any accented character in text.
extract_accented_chars: list accented characters present in text.
Modules:
- language – Language model and per-language presets.
- normalization – Accent / diacritic normalization helpers.
Classes:
- Language – Per-language configuration shared across exercises and ingestion.
- UnknownLanguageError – Raised when a language code is not present in LANGUAGE_PRESETS.
Functions:
- extract_accented_chars – List every accented or composed character present in text.
- get_language – Return the preset Language for the given ISO 639-1 code.
- has_accent – Return True if text contains any accented or composed character.
- normalize – Strip diacritics and lowercase a string.
Language
Bases: BaseModel
Per-language configuration shared across exercises and ingestion.
Attributes:
- code (str) – ISO 639-1 code (e.g. "pt").
- name (str) – English name (e.g. "Portuguese").
- native_name (str) – Native spelling (e.g. "Português").
- accented_chars (set[str]) – Set of accented/composed characters used by the language.
- normalization_map (dict[str, str]) – Explicit per-character map from accented to base form (covers ligatures and special cases NFD cannot decompose).
- keyboard_rows (list[list[str]]) – Rows of base letters for the on-screen keyboard layout hint (used by on-screen input widgets in wordle and diacritic-typing exercises).
- accent_keys (list[str]) – Sorted list of accented characters; derived from accented_chars. Consumed by on-screen input widgets as the extra diacritic key row.
accent_keys (property)
Sorted accented characters for the on-screen diacritic key row.
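A derived, sorted key row like this can be sketched with a plain dataclass. The class below is a hypothetical stand-in (`LanguageSketch` is not part of the package, and the real `Language` is a `BaseModel`). Sorting by Unicode code point via `sorted()` is an assumption, since the reference does not state the sort key.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageSketch:
    """Hypothetical stand-in for Language, reduced to two fields."""

    code: str
    accented_chars: set = field(default_factory=set)

    @property
    def accent_keys(self) -> list:
        # Derived on access from accented_chars; sorted() orders by code point
        # (assumed here, the real ordering may differ).
        return sorted(self.accented_chars)


pt = LanguageSketch(code="pt", accented_chars={"ç", "ã", "é", "á"})
print(pt.accent_keys)  # ['á', 'ã', 'ç', 'é']
```

Deriving the list from the set keeps a single source of truth, so the key row cannot drift from `accented_chars`.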
UnknownLanguageError
Raised when a language code is not present in LANGUAGE_PRESETS.
extract_accented_chars
List every accented or composed character present in text.
Parameters:
- text (str) – Input string.
- language (Language | None, default: None) – Optional language whose normalization_map is applied to the per-character check.
Returns:
- list[str] – Characters from text whose normalized form differs from themselves, preserving original order. Duplicates are kept (caller can deduplicate).
Source code in src/lang_tools/language/normalization.py
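The per-character check can be sketched with the standard library alone. Everything below is illustrative: `_normalize_char` and `extract_accented_chars_sketch` are hypothetical names, the function takes a plain mapping instead of a `Language`, and comparing against the lowercased character (so that unaccented capitals are not reported) is an assumption borrowed from the `has_accent` description.

```python
import unicodedata


def _normalize_char(ch, normalization_map=None):
    # Hypothetical helper: explicit map first, then NFD strip, then lowercase.
    mapped = (normalization_map or {}).get(ch, ch)
    decomposed = unicodedata.normalize("NFD", mapped)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def extract_accented_chars_sketch(text, normalization_map=None):
    # Keep input order and duplicates; the caller can deduplicate if needed.
    return [ch for ch in text if _normalize_char(ch, normalization_map) != ch.lower()]


found = extract_accented_chars_sketch("Ação é ação")
print(found)                       # ['ç', 'ã', 'é', 'ç', 'ã']
print(list(dict.fromkeys(found)))  # ['ç', 'ã', 'é'] (order-preserving dedup)
```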
get_language
Return the preset Language for the given ISO 639-1 code.
Parameters:
- code (str) – ISO 639-1 code (case-insensitive).
Returns:
- Language – The matching Language preset.
Raises:
- UnknownLanguageError – If the code is not in LANGUAGE_PRESETS.
Source code in src/lang_tools/language/language.py
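A case-insensitive lookup that raises on a miss could look roughly like the sketch below. All names ending in `Sketch` or `_sketch` are hypothetical, the preset table holds a single made-up entry, and both lowercasing the code before lookup and the error's base class are assumptions, since the reference does not specify them.

```python
from dataclasses import dataclass


class UnknownLanguageErrorSketch(Exception):
    """Stand-in for UnknownLanguageError; the real base class is not documented."""


@dataclass(frozen=True)
class LanguageSketch:
    code: str
    name: str


# Hypothetical preset table keyed by lowercase ISO 639-1 code.
PRESETS_SKETCH = {"pt": LanguageSketch(code="pt", name="Portuguese")}


def get_language_sketch(code: str) -> LanguageSketch:
    try:
        # Lowercasing makes "PT", "Pt" and "pt" resolve to the same preset.
        return PRESETS_SKETCH[code.lower()]
    except KeyError:
        raise UnknownLanguageErrorSketch(f"unknown language code: {code!r}") from None


print(get_language_sketch("PT").name)  # Portuguese
```

Callers that accept user-supplied codes would typically wrap the call in `try` / `except` and fall back to a default language.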
has_accent
Return True if text contains any accented or composed character.
Parameters:
- text (str) – Input string.
- language (Language | None, default: None) – Optional language whose normalization_map participates in the comparison (so ligatures count as accents).
Returns:
- bool – True when normalize(text) differs from text.lower().
Source code in src/lang_tools/language/normalization.py
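The comparison in the Returns entry can be illustrated with a stdlib-only sketch. `_normalize` and `has_accent_sketch` are hypothetical stand-ins that take a plain mapping rather than a `Language`, and the `{"œ": "oe"}` map is an example value, not a documented preset.

```python
import unicodedata


def _normalize(text, normalization_map=None):
    # Hypothetical stand-in for normalize(): map first, NFD strip, lowercase.
    mapped = "".join((normalization_map or {}).get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", mapped)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def has_accent_sketch(text, normalization_map=None):
    # Comparing against text.lower() ignores pure case differences.
    return _normalize(text, normalization_map) != text.lower()


print(has_accent_sketch("Olá"))                   # True
print(has_accent_sketch("HELLO"))                 # False
print(has_accent_sketch("cœur"))                  # False: NFD leaves the ligature alone
print(has_accent_sketch("cœur", {"œ": "oe"}))     # True: the map rewrites it
```

The last two calls show why the optional map matters: without it, a ligature survives normalization unchanged and is not reported.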
normalize
Strip diacritics and lowercase a string.
Uses Unicode NFD decomposition followed by combining-mark removal. If a
Language is supplied, its normalization_map is applied first to handle
ligatures and other characters NFD cannot decompose.
Parameters:
- text (str) – Input string.
- language (Language | None, default: None) – Optional language whose normalization_map is applied first.
Returns:
- str – Lowercased, accent-stripped form of text.
Example
Stripping Portuguese diacritics::

    from lang_tools.language import normalize

    normalize("Ação")  # -> "acao"
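The map-then-NFD pipeline described for normalize can be approximated with Python's `unicodedata` module. This is a sketch under stated assumptions, not the package's source: `normalize_sketch` is a hypothetical name, it accepts a plain mapping instead of a `Language`, and the `{"œ": "oe"}` entry is an illustrative map value.

```python
import unicodedata
from typing import Mapping, Optional


def normalize_sketch(text: str, normalization_map: Optional[Mapping[str, str]] = None) -> str:
    if normalization_map:
        # Apply the explicit map first, so characters with no Unicode
        # decomposition (such as the ligature "œ") are rewritten up front.
        text = "".join(normalization_map.get(ch, ch) for ch in text)
    # NFD splits each composed character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, then lowercase the result.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(normalize_sketch("Ação"))               # acao
print(normalize_sketch("Cœur"))               # cœur
print(normalize_sketch("Cœur", {"œ": "oe"}))  # coeur
```

The second call shows the gap the explicit map fills: "œ" has no canonical decomposition, so NFD alone leaves it untouched.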