lang_tools.words.ingestion

Word ingestion pipelines.

Public API

  • load_wiktionary_jsonl: parse a kaikki.org JSONL dump into Word objects.
  • WikiRecord, WikiSense: typed shapes for kaikki.org records.
  • load_csv: parse a brazilian-bites style CSV into Word objects.
  • load_static_list: parse a worldly-words style minimal list.
  • merge_words: merge two Word records that share the same (text, language).
  • deduplicate: collapse an iterable of Words into a unique-by-id list.

Modules:

  • csv_loader

    CSV ingestion for brazilian-bites style vocabulary files.

  • dedup

    Word merging and deduplication.

  • static_list

    Static word-list ingestion (worldly-words style).

  • wiktionary

    Wiktionary JSONL ingestion (kaikki.org dumps -> Word).

Classes:

  • WikiRecord

    One line of a kaikki.org Wiktionary JSONL dump.

  • WikiSense

    One sense entry inside a kaikki.org Wiktionary record.

Functions:

  • deduplicate

    Collapse an iterable of Words by ID, merging duplicates.

  • load_csv

    Yield Word objects parsed from a vocabulary CSV.

  • load_static_list

    Yield Word objects from a minimal list of dicts.

  • load_wiktionary_jsonl

    Yield Word objects parsed from a kaikki.org JSONL file.

  • merge_words

    Return a new Word that combines fields from left and right.

WikiRecord

Bases: BaseModel

One line of a kaikki.org Wiktionary JSONL dump.

Attributes:

  • word (str) –

    The headword.

  • pos (str | None) –

    Part of speech.

  • senses (list[WikiSense]) –

    List of senses.

  • categories (list[dict]) –

    Top-level category tags.

  • form_of (list[dict]) –

    Lemma references when this entry is an inflected form.

WikiSense

Bases: BaseModel

One sense entry inside a kaikki.org Wiktionary record.

Attributes:

  • glosses (list[str]) –

    List of definition strings.

  • raw_glosses (list[str]) –

    Raw (uncleaned) gloss strings.

  • examples (list[dict]) –

    List of example dicts {"text": ..., "english": ...}.

  • categories (list[dict]) –

    Wiktionary category tags.

  • tags (list[str]) –

    Per-sense tag list (e.g. ["transitive"]).

  • topics (list[str]) –

    Topic tags.

deduplicate

deduplicate(words: Iterable[Word]) -> list[Word]

Collapse an iterable of Words by ID, merging duplicates.

Parameters:

  • words (Iterable[Word]) –

    Iterable of words; may contain duplicates by ID.

Returns:

  • list[Word]

    Insertion-ordered list of distinct Word records.

Source code in src/lang_tools/words/ingestion/dedup.py
def deduplicate(words: Iterable[Word]) -> list[Word]:
    """Collapse an iterable of `Word`s by ID, merging duplicates.

    Args:
        words: Iterable of words; may contain duplicates by ID.

    Returns:
        Insertion-ordered list of distinct `Word` records.
    """
    out: dict[str, Word] = {}
    for word in words:
        if word.id in out:
            out[word.id] = merge_words(out[word.id], word)
        else:
            out[word.id] = word
    return list(out.values())
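
The merge-on-collision loop above can be tried in isolation. The following is a minimal sketch, not the library code: `MiniWord`, `merge`, and `dedupe` are hypothetical stand-ins (the real `Word` model and `merge_words` carry more fields), and the sample IDs are made up. Only the dictionary-keyed pattern mirrors the listing.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MiniWord:
    """Hypothetical stand-in for Word; not the real lang_tools model."""

    id: str
    text: str
    sources: tuple[str, ...] = ()


def merge(left: MiniWord, right: MiniWord) -> MiniWord:
    # Simplified merge: keep left's text, append right's unseen sources.
    extra = tuple(s for s in right.sources if s not in left.sources)
    return MiniWord(id=left.id, text=left.text, sources=left.sources + extra)


def dedupe(words: Iterable[MiniWord]) -> list[MiniWord]:
    # A dict keeps first-insertion order, so the output order follows
    # the first time each ID was seen.
    out: dict[str, MiniWord] = {}
    for word in words:
        out[word.id] = merge(out[word.id], word) if word.id in out else word
    return list(out.values())


words = [
    MiniWord("pt:cafe", "café", ("csv",)),
    MiniWord("pt:pao", "pão", ("csv",)),
    MiniWord("pt:cafe", "café", ("wiktionary",)),
]
unique = dedupe(words)
print([w.id for w in unique])  # ['pt:cafe', 'pt:pao']
print(unique[0].sources)       # ('csv', 'wiktionary')
```

Because later duplicates are merged into the slot of the first occurrence, a record's position in the result does not change when more sources are added for it.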

load_csv

load_csv(source: Path | IO[str]) -> Iterator[Word]

Yield Word objects parsed from a vocabulary CSV.

The CSV must contain at least text and language columns. Optional columns include:

  • part_of_speech, frequency (high/medium/low)
  • topics and secondary_topics (comma-separated)
  • translation_<lang> (one column per target language)
  • example_sentence + example_translation
  • false_friend_language, false_friend_word, false_friend_meaning, false_friend_similarity

Parameters:

  • source (Path | IO[str]) –

    Path to the CSV file or an already-open text stream.

Yields:

  • Word

    Parsed Word instances.

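Raises:

  • CSVColumnsMissingError

    If the header does not contain the required columns. Because load_csv is a generator, the error surfaces on first iteration, not at call time.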

Source code in src/lang_tools/words/ingestion/csv_loader.py
def load_csv(source: Path | IO[str]) -> Iterator[Word]:
    """Yield `Word` objects parsed from a vocabulary CSV.

    The CSV must contain at least ``text`` and ``language`` columns. Optional
    columns include:

    - ``part_of_speech``, ``frequency`` (``high``/``medium``/``low``)
    - ``topics`` and ``secondary_topics`` (comma-separated)
    - ``translation_<lang>`` (one column per target language)
    - ``example_sentence`` + ``example_translation``
    - ``false_friend_language``, ``false_friend_word``, ``false_friend_meaning``,
      ``false_friend_similarity``

    Args:
        source: Path to the CSV file or an already-open text stream.

    Yields:
        Parsed `Word` instances.

    Raises:
        CSVColumnsMissingError: If the header does not contain the required columns.
    """
    rows = _iter_rows(source)
    try:
        first = next(rows)
    except StopIteration:
        return
    missing = _CSV_REQUIRED - first.keys()
    if missing:
        raise CSVColumnsMissingError(missing)
    yield _row_to_word(first)
    for row in rows:
        yield _row_to_word(row)
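
The header check above runs only once, on the first row, and only when the generator is first advanced. The sketch below illustrates that behavior with the standard `csv` module. It is an approximation under stated assumptions: `REQUIRED` is assumed to mirror `_CSV_REQUIRED`, `iter_checked_rows` is a hypothetical helper, and a plain `ValueError` stands in for `CSVColumnsMissingError`.

```python
import csv
import io

REQUIRED = {"text", "language"}  # assumption: mirrors _CSV_REQUIRED


def iter_checked_rows(stream):
    """Yield row dicts, validating required columns on the first row only."""
    rows = csv.DictReader(stream)
    try:
        first = next(rows)
    except StopIteration:
        return  # empty or header-only input: nothing to yield
    missing = REQUIRED - set(first)
    if missing:
        # Stand-in for CSVColumnsMissingError.
        raise ValueError(f"missing columns: {sorted(missing)}")
    yield first
    yield from rows


good = io.StringIO("text,language,translation_en\ncafé,pt,coffee\n")
rows = list(iter_checked_rows(good))

bad = io.StringIO("text,translation_en\ncafé,coffee\n")
lazy = iter_checked_rows(bad)  # no error yet: the body has not started
try:
    next(lazy)
    raised = False
except ValueError:
    raised = True

empty = list(iter_checked_rows(io.StringIO("")))
```

Callers that want the column error early can wrap the call in `list(...)` or advance the iterator once before starting downstream work.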

load_static_list

load_static_list(entries: Iterable[dict]) -> Iterator[Word]

Yield Word objects from a minimal list of dicts.

Parameters:

  • entries (Iterable[dict]) –

    Iterable of dicts. Each dict must contain text and language; normalized is filled in automatically when absent.

Yields:

  • Word

    Word instances tagged with sources=["static_list"].

Source code in src/lang_tools/words/ingestion/static_list.py
def load_static_list(entries: Iterable[dict]) -> Iterator[Word]:
    """Yield `Word` objects from a minimal list of dicts.

    Args:
        entries: Iterable of dicts. Each dict must contain ``text`` and
            ``language``; ``normalized`` is filled in automatically when absent.

    Yields:
        `Word` instances tagged with ``sources=["static_list"]``.
    """
    for entry in entries:
        yield Word(
            text=entry["text"],
            language=entry["language"],
            normalized=entry.get("normalized", ""),
            sources=["static_list"],
        )
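
To make the input shape concrete, here is a hedged sketch of the same loop over a hypothetical `MiniWord` dataclass. The normalization rule in `__post_init__` (lowercase with accents stripped) is an assumption for illustration only; the listing shows that an empty string is passed through, not how the real `Word` fills the field in.

```python
import unicodedata
from dataclasses import dataclass, field


def strip_accents(text: str) -> str:
    # Decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


@dataclass
class MiniWord:
    """Hypothetical stand-in for Word; the real normalization may differ."""

    text: str
    language: str
    normalized: str = ""
    sources: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.normalized:
            # Assumed rule: lowercase, accent-free form of the text.
            self.normalized = strip_accents(self.text).lower()


def load_entries(entries):
    for entry in entries:
        yield MiniWord(
            text=entry["text"],  # KeyError if the key is missing
            language=entry["language"],
            normalized=entry.get("normalized", ""),
            sources=["static_list"],
        )


words = list(
    load_entries(
        [
            {"text": "Pão", "language": "pt"},
            {"text": "hygge", "language": "da", "normalized": "hygge"},
        ]
    )
)
print(words[0].normalized)  # pao
```

Note that indexing with `entry["text"]` means an entry without the required keys fails with a `KeyError` at iteration time.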

load_wiktionary_jsonl

load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]

Yield Word objects parsed from a kaikki.org JSONL file. Blank lines and lines that fail JSON decoding are skipped silently.

Parameters:

  • source (Path | IO[str]) –

    Filesystem path or open text stream.

  • language (str) –

    ISO 639-1 code to stamp on every produced Word.

  • keep_pos (Iterable[str] | None, default: _DEFAULT_KEEP_POS ) –

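    Allowed part-of-speech values; pass None to keep all. Each record's part of speech is lowercased before the comparison, so the values should be lowercase.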

  • require_accent (bool, default: False ) –

    If True, only yield words with at least one diacritic.

  • skip_form_of (bool, default: True ) –

    If True, skip entries that are inflected-form pointers.

Yields:

  • Word

    Word instances for records that pass every filter.

Source code in src/lang_tools/words/ingestion/wiktionary.py
def load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]:
    """Yield `Word` objects parsed from a kaikki.org JSONL file.

    Args:
        source: Filesystem path or open text stream.
        language: ISO 639-1 code to stamp on every produced `Word`.
        keep_pos: Allowed part-of-speech values; pass ``None`` to keep all.
        require_accent: If True, only yield words with at least one diacritic.
        skip_form_of: If True, skip entries that are inflected-form pointers.

    Yields:
        `Word` instances for records that pass every filter.
    """
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in _iter_jsonl(source):
        line = raw_line.strip()
        if not line:
            continue
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        record = WikiRecord.model_validate(payload)
        if not record.word:
            continue
        if skip_form_of and record.form_of:
            continue
        if keep is not None and (record.pos or "").lower() not in keep:
            continue
        word = _record_to_word(record, language)
        if require_accent and not word.has_accent:
            continue
        yield word
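
The order of the filters above matters when reasoning about which records survive. The sketch below replays the same chain on plain dicts, with no Pydantic validation and no `Word` construction. The sample lines are invented for illustration, `iter_headwords` is a hypothetical helper, and `has_accent` is an assumed definition borrowed from the check in `merge_words`; the real `Word.has_accent` may differ.

```python
import io
import json

# Hypothetical sample lines shaped like the WikiRecord fields.
SAMPLE = io.StringIO(
    '{"word": "café", "pos": "noun", "senses": [{"glosses": ["coffee"]}]}\n'
    "\n"
    "not valid json\n"
    '{"word": "cafés", "pos": "noun", "form_of": [{"word": "café"}]}\n'
    '{"word": "de", "pos": "prep"}\n'
    '{"word": "gato", "pos": "Noun"}\n'
)


def has_accent(text: str) -> bool:
    # Assumed definition: any non-ASCII alphabetic character.
    return any(c.isalpha() and not c.isascii() for c in text)


def iter_headwords(stream, keep_pos=("noun",), require_accent=False, skip_form_of=True):
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            continue  # blank line
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line
        if not record.get("word"):
            continue
        if skip_form_of and record.get("form_of"):
            continue  # inflected-form pointer
        if keep is not None and (record.get("pos") or "").lower() not in keep:
            continue
        if require_accent and not has_accent(record["word"]):
            continue
        yield record["word"]


all_nouns = list(iter_headwords(SAMPLE))
SAMPLE.seek(0)
accented = list(iter_headwords(SAMPLE, require_accent=True))
SAMPLE.seek(0)
everything = list(iter_headwords(SAMPLE, keep_pos=None, skip_form_of=False))
```

With this sample, `all_nouns` is `["café", "gato"]` (the capitalized `"Noun"` still matches after lowercasing), `accented` is `["café"]`, and `everything` is `["café", "cafés", "de", "gato"]`.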

merge_words

merge_words(left: Word, right: Word) -> Word

Return a new Word that combines fields from left and right.

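Scalar fields (part_of_speech, frequency) prefer non-null over null; if both are set and differ, left wins. Collection fields (topics, glosses, examples, false_friends, sources) are merged; for translations, left takes precedence when both sides define the same key. The resulting text keeps the left spelling when it contains a non-ASCII letter (an accent); otherwise the right spelling is used, falling back to left if right is empty. This preserves accent information when one side has it and the other does not.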

Parameters:

  • left (Word) –

    First record (its text is preferred when it contains an accent).

  • right (Word) –

    Second record.

Returns:

  • Word

    A new merged Word instance. left and right are not mutated.

Raises:

  • ValueError

    If the two records have different IDs.

Source code in src/lang_tools/words/ingestion/dedup.py
def merge_words(left: Word, right: Word) -> Word:
    """Return a new `Word` that combines fields from `left` and `right`.

    Scalar fields (``part_of_speech``, ``frequency``) prefer non-null over
    null; if both differ, ``left`` wins. Collection fields (``translations``,
    ``topics``, ``glosses``, ``examples``, ``false_friends``, ``sources``) are
    merged; on shared ``translations`` keys, ``left`` wins. The resulting
    `text` keeps the `left` spelling when it contains a non-ASCII letter;
    otherwise the `right` spelling is used (falling back to `left` when
    `right` is empty), so accent information survives the merge.

    Args:
        left: First record (its `text` is preferred when accented).
        right: Second record.

    Returns:
        A new merged `Word` instance. `left` and `right` are not mutated.

    Raises:
        ValueError: If the two records have different IDs.
    """
    if left.id != right.id:
        msg = f"Cannot merge words with different IDs: {left.id} vs {right.id}"
        raise ValueError(msg)

    has_accent_left = any(c.isalpha() and not c.isascii() for c in left.text)
    text = left.text if has_accent_left else right.text
    if not text:
        text = left.text

    return Word(
        text=text,
        language=left.language,
        normalized=left.normalized or right.normalized,
        part_of_speech=left.part_of_speech or right.part_of_speech,
        frequency=left.frequency or right.frequency,
        translations={**right.translations, **left.translations},
        topics=_merge_lists(left.topics, right.topics),
        glosses=_merge_lists(left.glosses, right.glosses),
        examples=_merge_lists(left.examples, right.examples),
        false_friends=_merge_lists(left.false_friends, right.false_friends),
        sources=_merge_lists(left.sources, right.sources),
    )
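
Two details of the listing above are easy to misread: which spelling ends up in `text`, and which side wins on a shared translation key. The sketch below isolates both rules as standalone functions. `pick_text` and `merge_translations` are hypothetical names introduced here; the logic is copied from the listing.

```python
def pick_text(left_text: str, right_text: str) -> str:
    # Left wins only if it contains a non-ASCII letter; otherwise right,
    # with a fallback to left when right is empty.
    has_accent_left = any(c.isalpha() and not c.isascii() for c in left_text)
    text = left_text if has_accent_left else right_text
    return text or left_text


def merge_translations(left: dict, right: dict) -> dict:
    # Later unpacking wins, so left overrides right on shared keys.
    return {**right, **left}


print(pick_text("café", "cafe"))  # café
print(pick_text("cafe", "café"))  # café
print(pick_text("casa", "Casa"))  # Casa
print(pick_text("casa", ""))      # casa

merged = merge_translations(
    {"en": "coffee"},
    {"en": "coffee shop", "es": "café"},
)
print(merged["en"])  # coffee
```

The third call shows a consequence worth knowing: when neither spelling has an accent, the right spelling is the one that is kept.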