lang_tools.words.ingestion.wiktionary

Wiktionary JSONL ingestion (kaikki.org dumps -> Word).

Each line of a kaikki.org JSONL file is one WikiRecord. This module parses those records and maps the relevant fields onto the unified Word schema. The filtering options follow the suggestions in linux-box-cloudflare/scratch_space/vibes/10-language-overview/06-word-ingestion.md.

Classes:

WikiRecord –

One line of a kaikki.org Wiktionary JSONL dump.
WikiSense –

One sense entry inside a kaikki.org Wiktionary record.

Functions:

load_wiktionary_jsonl –

Yield Word objects parsed from a kaikki.org JSONL file.

WikiRecord

Bases: BaseModel

One line of a kaikki.org Wiktionary JSONL dump.

Attributes:

word (str) –

The headword.
pos (str | None) –

Part of speech.
senses (list[WikiSense]) –

List of senses.
categories (list[dict]) –

Top-level category tags.
form_of (list[dict]) –

Lemma references when this entry is an inflected form.
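
To make the field list concrete, here is a minimal sketch that parses one record line with the standard library only. The sample line is invented for illustration and is not taken from a real dump, where records carry many more keys. `is_form_of` is a hypothetical helper, not part of this module.

```python
import json

# Invented sample line; the keys follow the WikiRecord attributes listed above.
SAMPLE_LINE = (
    '{"word": "casas", "pos": "noun", '
    '"senses": [{"glosses": ["plural of casa"], "tags": ["plural"]}], '
    '"categories": [], '
    '"form_of": [{"word": "casa"}]}'
)


def is_form_of(payload: dict) -> bool:
    """Return True when the record carries at least one lemma reference."""
    return bool(payload.get("form_of"))


payload = json.loads(SAMPLE_LINE)
print(payload["word"], payload.get("pos"), is_form_of(payload))  # casas noun True
```

A record like this one is what the skip_form_of option of load_wiktionary_jsonl filters out by default.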

WikiSense

Bases: BaseModel

One sense entry inside a kaikki.org Wiktionary record.

Attributes:

glosses (list[str]) –

List of definition strings.
raw_glosses (list[str]) –

Raw (uncleaned) gloss strings.
examples (list[dict]) –

List of example dicts {"text": ..., "english": ...}.
categories (list[dict]) –

Wiktionary category tags.
tags (list[str]) –

Per-sense tag list (e.g. ["transitive"]).
topics (list[str]) –

Topic tags.
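
As an illustration of the sense layout, the sketch below pulls the first gloss and the example pairs out of a sense given as a plain dict. It uses no project code: `first_gloss` and `example_pairs` are hypothetical helpers, the sample data is invented, and treating missing keys as empty is an assumption, not documented behavior of WikiSense.

```python
from __future__ import annotations


def first_gloss(sense: dict) -> str | None:
    """Return the first definition string, or None if the sense has none."""
    glosses = sense.get("glosses") or []
    return glosses[0] if glosses else None


def example_pairs(sense: dict) -> list[tuple[str, str | None]]:
    """Collect (text, english) pairs, skipping examples without text."""
    pairs = []
    for example in sense.get("examples") or []:
        text = example.get("text")
        if text:
            pairs.append((text, example.get("english")))
    return pairs


# Invented sample sense using the documented keys.
sense = {
    "glosses": ["to eat"],
    "tags": ["transitive"],
    "examples": [{"text": "Como pan.", "english": "I eat bread."}],
}
print(first_gloss(sense))    # to eat
print(example_pairs(sense))  # [('Como pan.', 'I eat bread.')]
```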

load_wiktionary_jsonl

load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]

Yield Word objects parsed from a kaikki.org JSONL file.

Parameters:

source (Path | IO[str]) –

Filesystem path or open text stream.
language (str) –

ISO 639-1 code to stamp on every produced Word.
keep_pos (Iterable[str] | None, default: _DEFAULT_KEEP_POS ) –

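Allowed part-of-speech values, compared against the lowercased pos of each record, so they should be given in lowercase; pass None to keep all. When a filter is set, records without a pos are dropped.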
require_accent (bool, default: False ) –

If True, only yield words with at least one diacritic.
skip_form_of (bool, default: True ) –

If True, skip entries that are inflected-form pointers (records with a non-empty form_of).

Yields:

Word –

Word instances for records that pass every filter.
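
The require_accent filter relies on Word.has_accent, whose implementation is not shown on this page. One plausible way to test for a diacritic, offered as an assumption and not as the project's actual logic, is to decompose the string with Unicode NFD and look for combining marks. `has_diacritic` is a hypothetical name.

```python
import unicodedata


def has_diacritic(text: str) -> bool:
    """Return True if any character decomposes into a base plus a combining mark."""
    # NFD splits precomposed letters such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks.
    return any(unicodedata.combining(ch) for ch in decomposed)


print(has_diacritic("café"))  # True
print(has_diacritic("cafe"))  # False
```

Letters that have no canonical decomposition, such as "ø" or "ł", are not detected by this approach, so a real implementation might need a language-specific rule.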

Source code in src/lang_tools/words/ingestion/wiktionary.py

def load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]:
    """Yield `Word` objects parsed from a kaikki.org JSONL file.

    Args:
        source: Filesystem path or open text stream.
        language: ISO 639-1 code to stamp on every produced `Word`.
        keep_pos: Allowed part-of-speech values; pass ``None`` to keep all.
        require_accent: If True, only yield words with at least one diacritic.
        skip_form_of: If True, skip entries that are inflected-form pointers.

    Yields:
        `Word` instances for records that pass every filter.
    """
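    # Materialise the allow-list once; None disables the part-of-speech filter.
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in _iter_jsonl(source):
        line = raw_line.strip()
        if not line:
            continue  # blank line
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed JSON is skipped silently
        # Schema validation errors are not caught here and propagate to the caller.
        record = WikiRecord.model_validate(payload)
        if not record.word:
            continue
        if skip_form_of and record.form_of:
            continue  # inflected-form pointer
        # pos is lowercased before the lookup, so keep_pos entries must be
        # lowercase; records without a pos never match a non-None filter.
        if keep is not None and (record.pos or "").lower() not in keep:
            continue
        word = _record_to_word(record, language)
        if require_accent and not word.has_accent:
            continue
        yield word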
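
To see how the filters interact, the following self-contained sketch mirrors the control flow above on plain dicts instead of WikiRecord and Word. It is a simplified stand-in, not the module's code: `filter_records` and the sample lines are hypothetical, its keep_pos default of None is an assumption (the real default is _DEFAULT_KEEP_POS, whose value is not shown), and the accent filter is left out. It illustrates two consequences of the listing: malformed JSON lines are skipped silently, and keep_pos values are compared against the lowercased pos.

```python
from __future__ import annotations

import io
import json
from collections.abc import Iterable, Iterator


def filter_records(
    lines: Iterable[str],
    *,
    keep_pos: Iterable[str] | None = None,
    skip_form_of: bool = True,
) -> Iterator[dict]:
    """Yield parsed records that pass the blank, JSON, form_of, and pos checks."""
    keep = frozenset(keep_pos) if keep_pos is not None else None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skipped without an error
        if not isinstance(payload, dict) or not payload.get("word"):
            continue
        if skip_form_of and payload.get("form_of"):
            continue
        if keep is not None and (payload.get("pos") or "").lower() not in keep:
            continue
        yield payload


# Invented input: a mixed-case pos, a blank line, a broken line,
# an inflected form, a verb, and a record without pos.
SAMPLE = io.StringIO(
    '{"word": "casa", "pos": "Noun"}\n'
    "\n"
    "not json\n"
    '{"word": "casas", "pos": "noun", "form_of": [{"word": "casa"}]}\n'
    '{"word": "comer", "pos": "verb"}\n'
    '{"word": "hola"}\n'
)

kept = [r["word"] for r in filter_records(SAMPLE, keep_pos={"noun"})]
print(kept)  # ['casa']

# A capitalised filter value never matches, because pos is lowercased first.
SAMPLE.seek(0)
print([r["word"] for r in filter_records(SAMPLE, keep_pos={"Noun"})])  # []
```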