lang_tools.language ¶

Language configuration and accent / normalization utilities.

This subpackage provides the canonical Language model (code, name, accented characters, normalization map, keyboard layout) and stateless helpers for diacritic stripping and accent inspection.

Public API

Language: per-language configuration model.
LANGUAGE_PRESETS: mapping of ISO 639-1 code to preset Language.
get_language: lookup helper that raises UnknownLanguageError on a miss.
UnknownLanguageError: raised when an unknown language code is requested.
normalize: strip diacritics and lowercase.
has_accent: boolean check for any accented character in text.
extract_accented_chars: list accented characters present in text.

Modules:

language –

Language model and per-language presets.
normalization –

Accent / diacritic normalization helpers.

Classes:

Language –

Per-language configuration shared across exercises and ingestion.
UnknownLanguageError –

Raised when a language code is not present in LANGUAGE_PRESETS.

Functions:

extract_accented_chars –

List every accented or composed character present in text.
get_language –

Return the preset Language for the given ISO 639-1 code.
has_accent –

Return True if text contains any accented or composed character.
normalize –

Strip diacritics and lowercase a string.

Language ¶

Bases: BaseModel

Per-language configuration shared across exercises and ingestion.

Attributes:

code (str) –

ISO 639-1 code (e.g. "pt").
name (str) –

English name (e.g. "Portuguese").
native_name (str) –

Native spelling (e.g. "Português").
accented_chars (set[str]) –

Set of accented/composed characters used by the language.
normalization_map (dict[str, str]) –

Explicit per-character map from accented to base form (covers ligatures and special cases NFD cannot decompose).
keyboard_rows (list[list[str]]) –

Rows of base letters giving the on-screen keyboard layout hint (used by input widgets in the Wordle and diacritic-typing exercises).
accent_keys (list[str]) –

Sorted list of accented characters; derived from accented_chars. Consumed by on-screen input widgets as the extra diacritic key row.

accent_keys `property` ¶

accent_keys: list[str]

Sorted accented characters for the on-screen diacritic key row.
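
As a rough illustration of how accent_keys relates to accented_chars, the sketch below uses a frozen dataclass named LanguageSketch as a hypothetical stand-in. The real Language is a pydantic BaseModel whose implementation is not shown on this page. The sketch assumes the property is a plain sorted() over the set, which orders by Unicode code point rather than by locale collation, and the Portuguese character set is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LanguageSketch:
    """Hypothetical stand-in for Language; not the real pydantic model."""

    code: str
    name: str
    native_name: str
    accented_chars: frozenset = frozenset()
    normalization_map: dict = field(default_factory=dict)
    keyboard_rows: tuple = ()

    @property
    def accent_keys(self) -> list:
        # Derived on every access, so it cannot drift from accented_chars.
        # Assumption: plain sorted(), i.e. code-point order.
        return sorted(self.accented_chars)


# Illustrative values only; the shipped "pt" preset may differ.
pt = LanguageSketch(
    code="pt",
    name="Portuguese",
    native_name="Português",
    accented_chars=frozenset("áâãàçéêíóôõú"),
)

print(pt.accent_keys)
```

Deriving the key row from the set, rather than storing it separately, keeps a single source of truth for which characters the language treats as accented.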

UnknownLanguageError ¶

UnknownLanguageError(code: str)

Bases: KeyError

Raised when a language code is not present in LANGUAGE_PRESETS.

Initialize with the offending code.

Parameters:

code (str) –

The unknown ISO 639-1 language code.

Source code in src/lang_tools/language/language.py

def __init__(self, code: str) -> None:
    """Initialize with the offending code.

    Args:
        code: The unknown ISO 639-1 language code.
    """
    super().__init__(f"Unknown language code: {code!r}")
    self.code = code
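
Because the class derives from KeyError, two behaviours follow that are easy to overlook; the sketch below repeats the constructor shown above so it runs on its own. A plain `except KeyError` handler catches the error, and `str()` on a KeyError returns the repr of its single argument, so the message comes back wrapped in an extra pair of quotes. Reading `args[0]` or the `code` attribute avoids that.

```python
class UnknownLanguageError(KeyError):
    """Copy of the class shown above, repeated so the snippet is standalone."""

    def __init__(self, code: str) -> None:
        super().__init__(f"Unknown language code: {code!r}")
        self.code = code


try:
    raise UnknownLanguageError("xx")
except KeyError as exc:  # a generic KeyError handler also catches it
    caught = exc

print(caught.code)     # the offending code, exactly as passed in
print(caught.args[0])  # the message without extra quoting
print(str(caught))     # KeyError applies repr() to its single argument
```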

extract_accented_chars ¶

extract_accented_chars(
    text: str, language: Language | None = None
) -> list[str]

List every accented or composed character present in text.

Parameters:

text (str) –

Input string.
language (Language | None, default: None ) –

Optional language whose normalization_map is applied to the per-character check.

Returns:

list[str] –

Characters from text whose normalized form differs from their lowercased form, preserving original order. Duplicates are kept (the caller can deduplicate).

Source code in src/lang_tools/language/normalization.py

def extract_accented_chars(text: str, language: Language | None = None) -> list[str]:
    """List every accented or composed character present in `text`.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` is applied to the
            per-character check.

    Returns:
        Characters from `text` whose normalized form differs from their
        lowercased form, preserving original order. Duplicates are kept
        (caller can deduplicate).
    """
    return [ch for ch in text if normalize(ch, language) != ch.lower()]
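
A small sketch of the "duplicates are kept" behaviour and one way a caller might deduplicate while preserving first-seen order. To stay self-contained it re-implements only the language-free path of normalize and extract_accented_chars from the source on this page; the `dict.fromkeys` idiom is a suggestion, not part of the library.

```python
import unicodedata


def normalize(text: str) -> str:
    # Language-free path of normalize() as shown on this page.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def extract_accented_chars(text: str) -> list:
    # Same comparison as the source: normalized form vs. lowercased form.
    return [ch for ch in text if normalize(ch) != ch.lower()]


found = extract_accented_chars("Ação é ação")
print(found)  # repeated characters appear once per occurrence

# Deduplicate while keeping the order of first appearance.
unique_in_order = list(dict.fromkeys(found))
print(unique_in_order)
```

The uppercase "A" is not reported because the comparison is made against `ch.lower()`, so a case difference alone never counts as an accent.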

get_language ¶

get_language(code: str) -> Language

Return the preset Language for the given ISO 639-1 code.

Parameters:

code (str) –

ISO 639-1 code (case-insensitive).

Returns:

Language –

The matching Language preset.

Raises:

UnknownLanguageError –

If the code is not in LANGUAGE_PRESETS.

Source code in src/lang_tools/language/language.py

def get_language(code: str) -> Language:
    """Return the preset `Language` for the given ISO 639-1 code.

    Args:
        code: ISO 639-1 code (case-insensitive).

    Returns:
        The matching `Language` preset.

    Raises:
        UnknownLanguageError: If the code is not in `LANGUAGE_PRESETS`.
    """
    key = code.lower()
    if key not in LANGUAGE_PRESETS:
        raise UnknownLanguageError(code)
    return LANGUAGE_PRESETS[key]
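
The lookup lowercases the code before checking the mapping, but raises with the code as the caller passed it. The sketch below shows both effects using a hypothetical LANGUAGE_PRESETS that holds placeholder strings instead of real Language objects; `get_language_or_default` is an illustrative helper, not part of the package.

```python
class UnknownLanguageError(KeyError):
    def __init__(self, code: str) -> None:
        super().__init__(f"Unknown language code: {code!r}")
        self.code = code


# Hypothetical stand-in: the real mapping holds Language instances.
LANGUAGE_PRESETS = {"pt": "Portuguese preset", "fr": "French preset"}


def get_language(code: str):
    # Same logic as the source above.
    key = code.lower()
    if key not in LANGUAGE_PRESETS:
        raise UnknownLanguageError(code)
    return LANGUAGE_PRESETS[key]


def get_language_or_default(code: str, default: str = "pt"):
    """Hypothetical helper: fall back to a known code on a miss."""
    try:
        return get_language(code)
    except UnknownLanguageError:
        return get_language(default)


same = get_language("PT") == get_language("pt")  # case-insensitive lookup

try:
    get_language("XX")
except UnknownLanguageError as exc:
    reported = exc.code  # original casing, not the lowercased key

fallback = get_language_or_default("xx")
```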

has_accent ¶

has_accent(
    text: str, language: Language | None = None
) -> bool

Return True if text contains any accented or composed character.

Parameters:

text (str) –

Input string.
language (Language | None, default: None ) –

Optional language whose normalization_map participates in the comparison (so ligatures count as accents).

Returns:

bool –

True when normalize(text, language) differs from text.lower().

Source code in src/lang_tools/language/normalization.py

def has_accent(text: str, language: Language | None = None) -> bool:
    """Return True if `text` contains any accented or composed character.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` participates in
            the comparison (so ligatures count as accents).

    Returns:
        True when `normalize(text, language)` differs from `text.lower()`.
    """
    return normalize(text, language) != text.lower()
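
To make the remark about ligatures concrete, the sketch below repeats normalize and has_accent from this page and passes a minimal stand-in object in place of a Language. The name LanguageSketch and the two map entries are assumptions for illustration; the shipped French preset may contain different entries.

```python
import unicodedata
from dataclasses import dataclass, field


@dataclass
class LanguageSketch:
    """Hypothetical stand-in exposing only normalization_map."""

    normalization_map: dict = field(default_factory=dict)


def normalize(text, language=None):
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def has_accent(text, language=None):
    return normalize(text, language) != text.lower()


# Assumed entries, for illustration only.
french = LanguageSketch(normalization_map={"œ": "oe", "æ": "ae"})

plain_diacritic = has_accent("été")             # NFD handles this without a map
ligature_no_map = has_accent("cœur")            # "œ" has no NFD decomposition
ligature_with_map = has_accent("cœur", french)  # the map rewrites "œ" to "oe"
already_plain = has_accent("coeur", french)
```

Without the language argument the ligature goes undetected, so callers that care about ligatures would need to pass the relevant Language.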

normalize ¶

normalize(
    text: str, language: Language | None = None
) -> str

Strip diacritics and lowercase a string.

Uses Unicode NFD decomposition followed by combining-mark removal. If a Language is supplied, its normalization_map is applied first to handle ligatures and other characters NFD cannot decompose.

Parameters:

text (str) –

Input string.
language (Language | None, default: None ) –

Optional language whose normalization_map is applied first.

Returns:

str –

Lowercased, accent-stripped form of text.

Example

Stripping Portuguese diacritics:

normalize("Ação")  # -> "acao"

Source code in src/lang_tools/language/normalization.py

def normalize(text: str, language: Language | None = None) -> str:
    """Strip diacritics and lowercase a string.

    Uses Unicode NFD decomposition followed by combining-mark removal. If a
    `Language` is supplied, its `normalization_map` is applied first to handle
    ligatures and other characters NFD cannot decompose.

    Args:
        text: Input string.
        language: Optional language whose `normalization_map` is applied first.

    Returns:
        Lowercased, accent-stripped form of ``text``.

    Example:
        Stripping Portuguese diacritics::

            normalize("A\u00e7\u00e3o")  # -> "acao"
    """
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()
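
The steps above can be traced by hand. The sketch below repeats the function so it runs standalone, then shows what NFD does to a decomposable character, what happens to one with no canonical decomposition, and one consequence of applying the map before lowercasing. The `danish` object and its single lowercase entry are hypothetical; whether the shipped presets also list uppercase keys is not stated on this page.

```python
import unicodedata
from types import SimpleNamespace


def normalize(text, language=None):
    # Same steps as the source above.
    if language is not None:
        for src, dst in language.normalization_map.items():
            text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# NFD splits "ç" into a base letter plus a combining cedilla.
parts = unicodedata.normalize("NFD", "ç")
codepoints = [f"U+{ord(ch):04X}" for ch in parts]

# "ø" has no canonical decomposition, so NFD alone leaves it untouched.
unchanged = normalize("ø")

# Hypothetical map with a lowercase key only.
danish = SimpleNamespace(normalization_map={"ø": "o"})
lower_hit = normalize("øl", danish)

# The map runs before .lower(), so "Ø" does not match the "ø" key here.
upper_miss = normalize("Øl", danish)

print(codepoints, unchanged, lower_hit, upper_miss)
```

If uppercase input matters, one option under these assumptions is to include both cases as keys in the map.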
