lang_tools.words.ingestion

Word ingestion pipelines.

Public API

load_wiktionary_jsonl: parse a kaikki.org JSONL dump into Word objects.
WikiRecord, WikiSense: typed shapes for kaikki.org records.
load_csv: parse a brazilian-bites style CSV into Word objects.
load_static_list: parse a worldly-words style minimal list.
merge_words: merge two Word records that share the same (text, language).
deduplicate: collapse an iterable of Words into a unique-by-id list.

Modules:

csv_loader – CSV ingestion for brazilian-bites style vocabulary files.
dedup – Word merging and deduplication.
static_list – Static word-list ingestion (worldly-words style).
wiktionary – Wiktionary JSONL ingestion (kaikki.org dumps -> Word).

Classes:

WikiRecord – One line of a kaikki.org Wiktionary JSONL dump.
WikiSense – One sense entry inside a kaikki.org Wiktionary record.

Functions:

deduplicate – Collapse an iterable of Words by ID, merging duplicates.
load_csv – Yield Word objects parsed from a vocabulary CSV.
load_static_list – Yield Word objects from a minimal list of dicts.
load_wiktionary_jsonl – Yield Word objects parsed from a kaikki.org JSONL file.
merge_words – Return a new Word that combines fields from left and right.

WikiRecord

Bases: BaseModel

One line of a kaikki.org Wiktionary JSONL dump.

Attributes:

word (str) – The headword.
pos (str | None) – Part of speech.
senses (list[WikiSense]) – List of senses.
categories (list[dict]) – Top-level category tags.
form_of (list[dict]) – Lemma references when this entry is an inflected form.

WikiSense

Bases: BaseModel

One sense entry inside a kaikki.org Wiktionary record.

Attributes:

glosses (list[str]) – List of definition strings.
raw_glosses (list[str]) – Raw (uncleaned) gloss strings.
examples (list[dict]) – List of example dicts {"text": ..., "english": ...}.
categories (list[dict]) – Wiktionary category tags.
tags (list[str]) – Per-sense tag list (e.g. ["transitive"]).
topics (list[str]) – Topic tags.

deduplicate

deduplicate(words: Iterable[Word]) -> list[Word]

Collapse an iterable of Words by ID, merging duplicates.

Parameters:

words (Iterable[Word]) – Iterable of words; may contain duplicates by ID.

Returns:

list[Word] – Insertion-ordered list of distinct Word records.

Source code in src/lang_tools/words/ingestion/dedup.py

def deduplicate(words: Iterable[Word]) -> list[Word]:
    """Collapse an iterable of `Word`s by ID, merging duplicates.

    Args:
        words: Iterable of words; may contain duplicates by ID.

    Returns:
        Insertion-ordered list of distinct `Word` records.
    """
    out: dict[str, Word] = {}
    for word in words:
        if word.id in out:
            out[word.id] = merge_words(out[word.id], word)
        else:
            out[word.id] = word
    return list(out.values())
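
The merge-on-collision loop above can be tried in isolation. The sketch below is self-contained and does not import the library: `Entry` and `merge_entries` are hypothetical stand-ins for `Word` and `merge_words`, reduced to an ID and a tuple of sources, so only the first-seen ordering and the merge-on-repeat behaviour are illustrated.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for Word: an ID plus a tuple of sources."""

    id: str
    sources: tuple[str, ...] = ()


def merge_entries(left: Entry, right: Entry) -> Entry:
    """Hypothetical stand-in for merge_words: union of sources, left first."""
    extra = tuple(s for s in right.sources if s not in left.sources)
    return Entry(id=left.id, sources=left.sources + extra)


def deduplicate_entries(entries: Iterable[Entry]) -> list[Entry]:
    """Same loop shape as deduplicate: keep first-seen order, merge repeats."""
    out: dict[str, Entry] = {}
    for entry in entries:
        if entry.id in out:
            out[entry.id] = merge_entries(out[entry.id], entry)
        else:
            out[entry.id] = entry
    return list(out.values())


result = deduplicate_entries(
    [
        Entry("pt:cafe", ("csv",)),
        Entry("pt:mae", ("static_list",)),
        Entry("pt:cafe", ("wiktionary",)),
    ]
)
```

Because a dict preserves insertion order and reassigning an existing key does not move it, the merged record stays at the position where its ID first appeared.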

load_csv

load_csv(source: Path | IO[str]) -> Iterator[Word]

Yield Word objects parsed from a vocabulary CSV.

The CSV must contain at least text and language columns. Optional columns include:

part_of_speech, frequency (high/medium/low)
topics and secondary_topics (comma-separated)
translation_<lang> (one column per target language)
example_sentence + example_translation
false_friend_language, false_friend_word, false_friend_meaning, false_friend_similarity

Parameters:

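source (Path | IO[str]) – Path to the CSV file or an already-open text stream.

Yields:

Word – Parsed Word instances.

Raises:

CSVColumnsMissingError – If the header does not contain the required columns. Because load_csv is a generator, the check runs when the first item is requested, not when the function is called.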

Source code in src/lang_tools/words/ingestion/csv_loader.py

def load_csv(source: Path | IO[str]) -> Iterator[Word]:
    """Yield `Word` objects parsed from a vocabulary CSV.

    The CSV must contain at least ``text`` and ``language`` columns. Optional
    columns include:

    - ``part_of_speech``, ``frequency`` (``high``/``medium``/``low``)
    - ``topics`` and ``secondary_topics`` (comma-separated)
    - ``translation_<lang>`` (one column per target language)
    - ``example_sentence`` + ``example_translation``
    - ``false_friend_language``, ``false_friend_word``, ``false_friend_meaning``,
      ``false_friend_similarity``

    Args:
        source: Path to the CSV file or an already-open text stream.

    Yields:
        Parsed `Word` instances.

    Raises:
        CSVColumnsMissingError: If the header does not contain the required
            columns. Raised on first iteration, since this is a generator.
    """
    rows = _iter_rows(source)
    try:
        first = next(rows)
    except StopIteration:
        return
    missing = _CSV_REQUIRED - first.keys()
    if missing:
        raise CSVColumnsMissingError(missing)
    yield _row_to_word(first)
    for row in rows:
        yield _row_to_word(row)
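
One consequence of the generator shape above is that the column check is deferred until iteration starts. The following sketch reproduces that pattern with the standard `csv` module only; `REQUIRED`, `ColumnsMissing`, and `iter_checked_rows` are hypothetical names standing in for `_CSV_REQUIRED`, `CSVColumnsMissingError`, and the loader, and rows are yielded as plain dicts rather than `Word` objects.

```python
import csv
import io
from collections.abc import Iterator

REQUIRED = {"text", "language"}  # assumed to mirror _CSV_REQUIRED


class ColumnsMissing(ValueError):
    """Hypothetical stand-in for CSVColumnsMissingError."""


def iter_checked_rows(stream: io.StringIO) -> Iterator[dict[str, str]]:
    """Check required columns on the first row, then stream every row."""
    rows = csv.DictReader(stream)
    try:
        first = next(rows)
    except StopIteration:
        return  # empty input or header only: yield nothing
    missing = REQUIRED - set(first)
    if missing:
        raise ColumnsMissing(f"missing columns: {sorted(missing)}")
    yield first
    yield from rows


good = list(iter_checked_rows(io.StringIO("text,language\ncafé,pt\nmãe,pt\n")))

# Calling the generator function runs none of its body, so no error yet.
bad = iter_checked_rows(io.StringIO("text\ncafé\n"))
try:
    next(bad)  # the check runs here
except ColumnsMissing as exc:
    error_message = str(exc)
```

Callers that want to fail fast on a bad header can pull the first item eagerly (or wrap the call in `list(...)`) before handing the iterator on.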

load_static_list

load_static_list(entries: Iterable[dict]) -> Iterator[Word]

Yield Word objects from a minimal list of dicts.

Parameters:

entries (Iterable[dict]) – Iterable of dicts. Each dict must contain text and language; normalized is filled in automatically when absent.

Yields:

Word – Word instances tagged with sources=["static_list"].

Source code in src/lang_tools/words/ingestion/static_list.py

def load_static_list(entries: Iterable[dict]) -> Iterator[Word]:
    """Yield `Word` objects from a minimal list of dicts.

    Args:
        entries: Iterable of dicts. Each dict must contain ``text`` and
            ``language``; ``normalized`` is filled in automatically when absent.

    Yields:
        `Word` instances tagged with ``sources=["static_list"]``.
    """
    for entry in entries:
        yield Word(
            text=entry["text"],
            language=entry["language"],
            normalized=entry.get("normalized", ""),
            sources=["static_list"],
        )

load_wiktionary_jsonl

load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]

Yield Word objects parsed from a kaikki.org JSONL file.

Parameters:

source (Path | IO[str]) – Filesystem path or open text stream.
language (str) – ISO 639-1 code to stamp on every produced Word.
keep_pos (Iterable[str] | None, default: _DEFAULT_KEEP_POS ) – Allowed part-of-speech values, in lowercase (each record's pos is lowercased before comparison); pass None to keep all.
require_accent (bool, default: False ) – If True, only yield words with at least one diacritic.
skip_form_of (bool, default: True ) – If True, skip entries that are inflected-form pointers.

Yields:

Word – Word instances for records that pass every filter. Blank lines and lines that are not valid JSON are skipped.

Source code in src/lang_tools/words/ingestion/wiktionary.py

def load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]:
    """Yield `Word` objects parsed from a kaikki.org JSONL file.

    Args:
        source: Filesystem path or open text stream.
        language: ISO 639-1 code to stamp on every produced `Word`.
        keep_pos: Allowed part-of-speech values, in lowercase (each record's
            ``pos`` is lowercased before comparison); pass ``None`` to keep all.
        require_accent: If True, only yield words with at least one diacritic.
        skip_form_of: If True, skip entries that are inflected-form pointers.

    Yields:
        `Word` instances for records that pass every filter. Blank lines and
        lines that are not valid JSON are skipped.
    """
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in _iter_jsonl(source):
        line = raw_line.strip()
        if not line:
            continue
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        record = WikiRecord.model_validate(payload)
        if not record.word:
            continue
        if skip_form_of and record.form_of:
            continue
        if keep is not None and (record.pos or "").lower() not in keep:
            continue
        word = _record_to_word(record, language)
        if require_accent and not word.has_accent:
            continue
        yield word
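
The order of the filters above matters for what ends up in the output, so it may help to run them against a few hand-written lines. This is a sketch under stated assumptions, not the loader itself: `filter_records` is a hypothetical helper that works on plain dicts instead of `WikiRecord`, its default `keep_pos` is invented for the example (the real `_DEFAULT_KEEP_POS` is not shown here), and the accent test is borrowed from `merge_words` because the definition of `Word.has_accent` is not reproduced on this page.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator


def filter_records(
    lines: Iterable[str],
    *,
    keep_pos: Iterable[str] | None = ("noun", "verb"),  # assumed default
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[str]:
    """Yield headwords that pass the same sequence of filters as the loader."""
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # blank line
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed JSON is skipped, not raised
        word = record.get("word")
        if not word:
            continue
        if skip_form_of and record.get("form_of"):
            continue
        if keep is not None and (record.get("pos") or "").lower() not in keep:
            continue
        # Accent test borrowed from merge_words; Word.has_accent may differ.
        if require_accent and not any(c.isalpha() and not c.isascii() for c in word):
            continue
        yield word


dump = [
    '{"word": "café", "pos": "Noun"}',
    "",
    "{not json",
    '{"word": "cafés", "pos": "noun", "form_of": [{"word": "café"}]}',
    '{"word": "rapidamente", "pos": "adv"}',
    '{"word": "falar", "pos": "verb"}',
]

default_words = list(filter_records(dump))
accented_words = list(filter_records(dump, require_accent=True))
all_pos_words = list(filter_records(dump, keep_pos=None, skip_form_of=False))
```

Note that `"Noun"` passes the part-of-speech filter only because the record's value is lowercased before the membership test; an uppercase entry in `keep_pos` itself would never match.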

merge_words

merge_words(left: Word, right: Word) -> Word

Return a new Word that combines fields from left and right.

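Scalar fields (part_of_speech, frequency) prefer a non-empty value over an empty one; if both are set, left wins. Collection fields (topics, glosses, examples, false_friends, sources) are merged; translations are combined per key, with left winning on conflicts. The resulting text uses the left spelling when it contains a non-ASCII letter and the right spelling otherwise (falling back to left if right is empty), so that accent information is preserved when one side has it and the other does not.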

Parameters:

left (Word) – First record. Its values win on conflicts, and its text is kept when it contains a non-ASCII letter.
right (Word) – Second record.

Returns:

Word – A new merged Word instance. left and right are not mutated.

Raises:

ValueError – If the two records have different IDs.

Source code in src/lang_tools/words/ingestion/dedup.py

def merge_words(left: Word, right: Word) -> Word:
    """Return a new `Word` that combines fields from `left` and `right`.

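    Scalar fields (``part_of_speech``, ``frequency``) prefer a non-empty value
    over an empty one; if both are set, ``left`` wins. Collection fields
    (``topics``, ``glosses``, ``examples``, ``false_friends``, ``sources``) are
    merged; ``translations`` are combined per key, with ``left`` winning on
    conflicts. The resulting `text` uses the `left` spelling when it contains
    a non-ASCII letter and the `right` spelling otherwise (falling back to
    `left` if `right` is empty), so that accent information is preserved when
    one side has it and the other does not.

    Args:
        left: First record. Its values win on conflicts, and its `text` is
            kept when it contains a non-ASCII letter.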
        right: Second record.

    Returns:
        A new merged `Word` instance. `left` and `right` are not mutated.

    Raises:
        ValueError: If the two records have different IDs.
    """
    if left.id != right.id:
        msg = f"Cannot merge words with different IDs: {left.id} vs {right.id}"
        raise ValueError(msg)

    has_accent_left = any(c.isalpha() and not c.isascii() for c in left.text)
    text = left.text if has_accent_left else right.text
    if not text:
        text = left.text

    return Word(
        text=text,
        language=left.language,
        normalized=left.normalized or right.normalized,
        part_of_speech=left.part_of_speech or right.part_of_speech,
        frequency=left.frequency or right.frequency,
        translations={**right.translations, **left.translations},
        topics=_merge_lists(left.topics, right.topics),
        glosses=_merge_lists(left.glosses, right.glosses),
        examples=_merge_lists(left.examples, right.examples),
        false_friends=_merge_lists(left.false_friends, right.false_friends),
        sources=_merge_lists(left.sources, right.sources),
    )
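
Two details of the listing above are easy to misread: which spelling survives, and which side wins when both records translate into the same language. The sketch below isolates just those two expressions on plain strings and dicts; `pick_text` and `merge_translations` are hypothetical helper names introduced for illustration and do not exist in the module.

```python
def pick_text(left: str, right: str) -> str:
    """Mirror the text-selection branch of merge_words."""
    has_accent_left = any(c.isalpha() and not c.isascii() for c in left)
    text = left if has_accent_left else right
    return text or left  # fall back to left when right is empty


def merge_translations(left: dict[str, str], right: dict[str, str]) -> dict[str, str]:
    """Later unpacking wins, so left overrides right on shared keys."""
    return {**right, **left}


accent_on_right = pick_text("cafe", "café")
accent_on_left = pick_text("café", "cafe")
right_empty = pick_text("cafe", "")
translations = merge_translations({"en": "coffee"}, {"en": "cafe", "fr": "café"})
```

When neither spelling contains a non-ASCII letter, the right-hand text is the one returned, so the argument order can change the casing or spelling of the merged record even though both inputs share an ID.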