
lang_tools.words.ingestion.wiktionary

Wiktionary JSONL ingestion (kaikki.org dumps -> Word).

Each line of a kaikki.org JSONL file holds one record, modeled here as a WikiRecord. This module parses those records and maps the relevant fields onto the unified Word schema. The filtering options follow the suggestions in linux-box-cloudflare/scratch_space/vibes/10-language-overview/06-word-ingestion.md.

Classes:

  • WikiRecord

    One line of a kaikki.org Wiktionary JSONL dump.

  • WikiSense

    One sense entry inside a kaikki.org Wiktionary record.

Functions:

  • load_wiktionary_jsonl

    Yield Word objects parsed from a kaikki.org JSONL file.

WikiRecord

Bases: BaseModel

One line of a kaikki.org Wiktionary JSONL dump.

Attributes:

  • word (str) –

    The headword.

  • pos (str | None) –

    Part of speech.

  • senses (list[WikiSense]) –

    List of senses.

  • categories (list[dict]) –

    Top-level category tags.

  • form_of (list[dict]) –

    Lemma references when this entry is an inflected form.

WikiSense

Bases: BaseModel

One sense entry inside a kaikki.org Wiktionary record.

Attributes:

  • glosses (list[str]) –

    List of definition strings.

  • raw_glosses (list[str]) –

    Raw (uncleaned) gloss strings.

  • examples (list[dict]) –

    List of example dicts {"text": ..., "english": ...}.

  • categories (list[dict]) –

    Wiktionary category tags.

  • tags (list[str]) –

    Per-sense tag list (e.g. ["transitive"]).

  • topics (list[str]) –

    Topic tags.
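As a rough illustration of the record shape these two classes describe, the snippet below builds one JSONL line by hand and reads it back with the standard library. The sample values and the {"name": ...} layout of the category dicts are assumptions made up for this sketch; real dump lines carry many more keys.

```python
import json

# Hypothetical sample line, invented for illustration only.
sample_line = json.dumps(
    {
        "word": "canción",
        "pos": "noun",
        "senses": [
            {
                "glosses": ["song"],
                "raw_glosses": ["(music) song"],
                "examples": [
                    {"text": "Me gusta esta canción.", "english": "I like this song."}
                ],
                "categories": [{"name": "Spanish nouns"}],  # assumed dict layout
                "tags": ["feminine"],
                "topics": ["music"],
            }
        ],
        "categories": [{"name": "Spanish lemmas"}],  # assumed dict layout
        "form_of": [],  # empty: this entry is a lemma, not an inflected form
    },
    ensure_ascii=False,
)

# One line of the dump decodes to one dict with the WikiRecord fields.
record = json.loads(sample_line)
first_sense = record["senses"][0]
print(record["word"], record["pos"], first_sense["glosses"])
```

In the module itself the decoded dict is passed to WikiRecord.model_validate rather than accessed by key.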

load_wiktionary_jsonl

load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]

Yield Word objects parsed from a kaikki.org JSONL file.

Parameters:

  • source (Path | IO[str]) –

    Filesystem path or open text stream.

  • language (str) –

    ISO 639-1 code to stamp on every produced Word.

  • keep_pos (Iterable[str] | None, default: _DEFAULT_KEEP_POS ) –

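    Allowed part-of-speech values; pass None to keep all. Each record's pos is lowercased before the membership test, so the values given here should be lowercase.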

  • require_accent (bool, default: False ) –

    If True, only yield words with at least one diacritic.

  • skip_form_of (bool, default: True ) –

    If True, skip entries that are inflected-form pointers.

Yields:

  • Word

    Word instances for records that pass every filter.
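The require_accent filter relies on Word.has_accent, whose implementation is not shown on this page. One plausible way to test for "at least one diacritic", assuming Unicode NFD decomposition is acceptable, is sketched below; has_diacritic is a hypothetical helper name, not part of the package.

```python
import unicodedata


def has_diacritic(text: str) -> bool:
    """Return True if any character carries a nonspacing combining mark."""
    # NFD splits precomposed letters such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (Mark, nonspacing) covers accents, tildes, cedillas, etc.
    return any(unicodedata.category(ch) == "Mn" for ch in decomposed)


print(has_diacritic("café"), has_diacritic("cafe"))
```

Letters that have no canonical decomposition, such as "ø" or "ł", are not detected by this approach, so the real property may use a different rule.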

Source code in src/lang_tools/words/ingestion/wiktionary.py
def load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]:
    """Yield `Word` objects parsed from a kaikki.org JSONL file.

    Args:
        source: Filesystem path or open text stream.
        language: ISO 639-1 code to stamp on every produced `Word`.
        keep_pos: Allowed part-of-speech values; pass ``None`` to keep all.
        require_accent: If True, only yield words with at least one diacritic.
        skip_form_of: If True, skip entries that are inflected-form pointers.

    Yields:
        `Word` instances for records that pass every filter.
    """
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in _iter_jsonl(source):
        line = raw_line.strip()
        if not line:
            continue
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        record = WikiRecord.model_validate(payload)
        if not record.word:
            continue
        if skip_form_of and record.form_of:
            continue
        if keep is not None and (record.pos or "").lower() not in keep:
            continue
        word = _record_to_word(record, language)
        if require_accent and not word.has_accent:
            continue
        yield word
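The filter order in the loop above can be tried without the package installed by approximating it on plain dicts. This is a simplified stand-in, not the module's code: iter_kept_headwords and EXAMPLE_KEEP_POS are hypothetical names, the real value of _DEFAULT_KEEP_POS is not shown on this page, and the Word conversion and require_accent step are left out.

```python
from __future__ import annotations

import io
import json
from typing import IO, Iterable, Iterator

# Hypothetical allow-list; the real _DEFAULT_KEEP_POS is not documented here.
EXAMPLE_KEEP_POS = frozenset({"noun", "verb", "adj"})


def iter_kept_headwords(
    stream: IO[str],
    *,
    keep_pos: Iterable[str] | None = EXAMPLE_KEEP_POS,
    skip_form_of: bool = True,
) -> Iterator[str]:
    """Yield headwords from JSONL lines that pass the documented filters."""
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            continue  # blank lines are ignored
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines are skipped silently
        if not isinstance(payload, dict) or not payload.get("word"):
            continue  # no headword
        if skip_form_of and payload.get("form_of"):
            continue  # inflected-form pointer
        if keep is not None and (payload.get("pos") or "").lower() not in keep:
            continue  # part of speech not in the allow-list
        yield payload["word"]


# Invented sample input covering each filter.
lines = "\n".join(
    [
        json.dumps({"word": "casa", "pos": "noun"}),
        "",
        "{not valid json",
        json.dumps({"word": "casas", "pos": "noun", "form_of": [{"word": "casa"}]}),
        json.dumps({"word": "de", "pos": "prep"}),
        json.dumps({"word": "correr", "pos": "Verb"}),
    ]
)
print(list(iter_kept_headwords(io.StringIO(lines))))  # → ['casa', 'correr']
```

Passing keep_pos=None disables the part-of-speech filter, and skip_form_of=False lets inflected-form entries such as "casas" through.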