lang_tools.language

Language configuration and accent / normalization utilities.

This subpackage provides the canonical Language model (code, name, accented characters, normalization map, keyboard layout) and stateless helpers for diacritic stripping and accent inspection.

Public API

  • Language: per-language configuration model.
  • LANGUAGE_PRESETS: mapping of ISO 639-1 code to preset Language.
  • get_language: lookup helper that raises UnknownLanguageError on a miss.
  • UnknownLanguageError: raised when an unknown language code is requested.
  • normalize: strip diacritics and lowercase.
  • has_accent: boolean check for any accented character in text.
  • extract_accented_chars: list the accented characters present in text.

Modules:

  • language

    Language model and per-language presets.

  • normalization

    Accent / diacritic normalization helpers.

Classes:

  • Language

    Per-language configuration shared across exercises and ingestion.

  • UnknownLanguageError

    Raised when a language code is not present in LANGUAGE_PRESETS.

Functions:

  • extract_accented_chars

    List every accented or composed character present in text.

  • get_language

    Return the preset Language for the given ISO 639-1 code.

  • has_accent

    Return True if text contains any accented or composed character.

  • normalize

    Strip diacritics and lowercase a string.

Language

Bases: BaseModel

Per-language configuration shared across exercises and ingestion.

Attributes:

  • code (str) –

    ISO 639-1 code (e.g. "pt").

  • name (str) –

    English name (e.g. "Portuguese").

  • native_name (str) –

    Native spelling (e.g. "Português").

  • accented_chars (set[str]) –

    Set of accented/composed characters used by the language.

  • normalization_map (dict[str, str]) –

    Explicit per-character map from accented to base form (covers ligatures and special cases NFD cannot decompose).

  • keyboard_rows (list[list[str]]) –

    Rows of base letters for the on-screen keyboard layout hint (used by on-screen input widgets in wordle and diacritic-typing exercises).

  • accent_keys (list[str]) –

    Sorted list of accented characters; derived from accented_chars. Consumed by on-screen input widgets as the extra diacritic key row.

accent_keys property

accent_keys: list[str]

Sorted accented characters for the on-screen diacritic key row.
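
As a rough illustration of how accent_keys relates to accented_chars, the sketch below uses a plain dataclass as a stand-in for the real BaseModel-based class. The name LanguageSketch and the sample values are assumptions made for this example, not the library's presets, and keyboard_rows is left out for brevity. The page only says the list is "sorted"; the sketch assumes plain code-point order.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LanguageSketch:
    """Hypothetical stand-in for Language (the real class is a BaseModel)."""

    code: str
    name: str
    native_name: str
    accented_chars: set[str] = field(default_factory=set)
    normalization_map: dict[str, str] = field(default_factory=dict)

    @property
    def accent_keys(self) -> list[str]:
        # Derived on access; sorted() orders strings by code point.
        return sorted(self.accented_chars)


# Sample values chosen for illustration only.
pt = LanguageSketch(
    code="pt",
    name="Portuguese",
    native_name="Português",
    accented_chars={"ç", "ã", "á", "é"},
)
print(pt.accent_keys)
```

Because the property is computed from accented_chars, there is no second list to keep in sync when a preset gains or loses a character.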

UnknownLanguageError

UnknownLanguageError(code: str)

Bases: KeyError

Raised when a language code is not present in LANGUAGE_PRESETS.

Initialize with the offending code.

Parameters:

  • code (str) –

    The unknown ISO 639-1 language code.

Source code in src/lang_tools/language/language.py
def __init__(self, code: str) -> None:
    """Initialize with the offending code.

    Args:
        code: The unknown ISO 639-1 language code.
    """
    super().__init__(f"Unknown language code: {code!r}")
    self.code = code
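
Since the class derives from KeyError, existing handlers written for KeyError (or LookupError) also catch it. The snippet below repeats the class from the listing above so it runs on its own, and shows one standard KeyError behaviour worth knowing: str() on the exception returns the repr of the message, so the unquoted text is best read from args[0].

```python
class UnknownLanguageError(KeyError):
    """Copy of the class listed above, repeated so the snippet runs alone."""

    def __init__(self, code: str) -> None:
        super().__init__(f"Unknown language code: {code!r}")
        self.code = code


try:
    raise UnknownLanguageError("xx")
except KeyError as exc:  # a plain KeyError handler also catches it
    caught = exc

print(caught.code)     # xx
print(caught.args[0])  # Unknown language code: 'xx'
print(str(caught))     # "Unknown language code: 'xx'" (repr of the message)
```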

extract_accented_chars

extract_accented_chars(
    text: str, language: Language | None = None
) -> list[str]

List every accented or composed character present in text.

Parameters:

  • text (str) –

    Input string.

  • language (Language | None, default: None ) –

    Optional language whose normalization_map is applied to the per-character check.

Returns:

  • list[str]

    Characters from text whose normalized form differs from their lowercased form, preserving original order. Duplicates are kept (the caller can deduplicate).

Source code in src/lang_tools/language/normalization.py
def extract_accented_chars(text: str, language: Language | None = None) -> list[str]:
    """List every accented or composed character present in `text`.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` is applied to the
            per-character check.

    Returns:
        Characters from `text` whose normalized form differs from themselves,
        preserving original order. Duplicates are kept (caller can deduplicate).
    """
    return [ch for ch in text if normalize(ch, language) != ch.lower()]
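
One consequence of the per-character check is that the result depends on the Unicode normalization form of the input. The sketch below uses reduced copies of normalize and extract_accented_chars (without the language argument, an assumption made to keep it self-contained) and compares precomposed (NFC) input with decomposed (NFD) input.

```python
import unicodedata


def normalize(text: str) -> str:
    # Reduced copy of normalize() without the language argument.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def extract_accented_chars(text: str) -> list[str]:
    return [ch for ch in text if normalize(ch) != ch.lower()]


composed = unicodedata.normalize("NFC", "Ação é fácil")
decomposed = unicodedata.normalize("NFD", composed)

# Precomposed input yields the accented letters themselves.
print(extract_accented_chars(composed))  # ['ç', 'ã', 'é', 'á']

# Decomposed input yields only the bare combining marks.
print([f"U+{ord(ch):04X}" for ch in extract_accented_chars(decomposed)])
# ['U+0327', 'U+0303', 'U+0301', 'U+0301']
```

If the text may arrive in decomposed form, running unicodedata.normalize("NFC", text) before the call is one way to get whole letters back.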

get_language

get_language(code: str) -> Language

Return the preset Language for the given ISO 639-1 code.

Parameters:

  • code (str) –

    ISO 639-1 code (case-insensitive).

Returns:

  • Language

    The matching Language preset.

Raises:

  • UnknownLanguageError –

    If the code is not in LANGUAGE_PRESETS.

Source code in src/lang_tools/language/language.py
def get_language(code: str) -> Language:
    """Return the preset `Language` for the given ISO 639-1 code.

    Args:
        code: ISO 639-1 code (case-insensitive).

    Returns:
        The matching `Language` preset.

    Raises:
        UnknownLanguageError: If the code is not in `LANGUAGE_PRESETS`.
    """
    key = code.lower()
    if key not in LANGUAGE_PRESETS:
        raise UnknownLanguageError(code)
    return LANGUAGE_PRESETS[key]
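
The lookup lowercases the code but does nothing else to it, and the error keeps the caller's original input. A minimal sketch of that behaviour follows; the LANGUAGE_PRESETS contents here are a hypothetical stand-in (plain strings instead of Language instances), and the exception class is repeated so the snippet runs alone.

```python
class UnknownLanguageError(KeyError):
    def __init__(self, code: str) -> None:
        super().__init__(f"Unknown language code: {code!r}")
        self.code = code


# Hypothetical stand-in: the real mapping holds Language instances.
LANGUAGE_PRESETS: dict[str, str] = {"pt": "Portuguese", "fr": "French"}


def get_language(code: str) -> str:
    key = code.lower()
    if key not in LANGUAGE_PRESETS:
        raise UnknownLanguageError(code)
    return LANGUAGE_PRESETS[key]


print(get_language("PT"))  # Portuguese (upper-case input resolves to "pt")

try:
    get_language(" PT")  # whitespace is not stripped, so this misses
except UnknownLanguageError as exc:
    missed = exc.code
    print(repr(missed))  # ' PT' (the original input, not the lowercased key)
```

Callers that accept user-typed codes may want to strip whitespace before the lookup.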

has_accent

has_accent(
    text: str, language: Language | None = None
) -> bool

Return True if text contains any accented or composed character.

Parameters:

  • text (str) –

    Input string.

  • language (Language | None, default: None ) –

    Optional language whose normalization_map participates in the comparison (so ligatures count as accents).

Returns:

  • bool

    True when normalize(text, language) differs from text.lower().

Source code in src/lang_tools/language/normalization.py
def has_accent(text: str, language: Language | None = None) -> bool:
    """Return True if `text` contains any accented or composed character.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` participates in
            the comparison (so ligatures count as accents).

    Returns:
        True when `normalize(text)` differs from `text.lower()`.
    """
    return normalize(text, language) != text.lower()
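
The effect of the language argument is easiest to see with a ligature. In the sketch below, normalize and has_accent follow the same steps as the listings on this page, while the french object is a hypothetical stand-in built with SimpleNamespace that carries only a normalization_map; the real preset may contain different entries.

```python
import unicodedata
from types import SimpleNamespace


def normalize(text, language=None):
    # Same steps as normalize() in this module.
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def has_accent(text, language=None):
    return normalize(text, language) != text.lower()


# Hypothetical stand-in for a Language preset; only normalization_map is used.
french = SimpleNamespace(normalization_map={"œ": "oe"})

print(has_accent("cœur"))          # False: NFD has no decomposition for œ
print(has_accent("cœur", french))  # True: the map rewrites œ to "oe"
print(has_accent("café"))          # True: é loses its combining mark
```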

normalize

normalize(
    text: str, language: Language | None = None
) -> str

Strip diacritics and lowercase a string.

Uses Unicode NFD decomposition followed by combining-mark removal. If a Language is supplied, its normalization_map is applied first to handle ligatures and other characters NFD cannot decompose.

Parameters:

  • text (str) –

    Input string.

  • language (Language | None, default: None ) –

    Optional language whose normalization_map is applied first.

Returns:

  • str

    Lowercased, accent-stripped form of text.

Example

Stripping Portuguese diacritics:

    normalize("Ação")  # -> "acao"

Source code in src/lang_tools/language/normalization.py
def normalize(text: str, language: Language | None = None) -> str:
    """Strip diacritics and lowercase a string.

    Uses Unicode NFD decomposition followed by combining-mark removal. If a
    `Language` is supplied, its `normalization_map` is applied first to handle
    ligatures and other characters NFD cannot decompose.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` is applied first.

    Returns:
        Lowercased, accent-stripped form of ``text``.

    Example:
        Stripping Portuguese diacritics::

            normalize("A\u00e7\u00e3o")  # -> "acao"
    """
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()
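
To make the role of normalization_map concrete, the snippet below uses only the standard unicodedata module to show which characters NFD splits and which it leaves alone. It also shows that, because the map is applied with str.replace before lowercasing, its keys are case-sensitive. The single-entry map is a hypothetical example, not taken from a real preset.

```python
import unicodedata

# Characters with a canonical decomposition split into base + combining mark.
for ch in "\u00e7\u00e3\u00e9":  # ç, ã, é
    parts = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]
    print(ch, parts)

# Ligatures and stroked letters have none, so NFD returns them unchanged.
for ch in "\u0153\u00e6\u00f8\u00df":  # œ, æ, ø, ß
    assert unicodedata.normalize("NFD", ch) == ch

# Hypothetical map with a lowercase key only.
normalization_map = {"\u0153": "oe"}
text = "\u0152uvre"  # starts with upper-case Œ
for src, dst in normalization_map.items():
    text = text.replace(src, dst)
print(text.lower())  # œuvre: the upper-case ligature did not match the key
```

A map that should also cover capitalized input would therefore need entries for both cases.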