lang_tools.words.ingestion.wiktionary
¶
Wiktionary JSONL ingestion (kaikki.org dumps -> Word).
A line in a kaikki.org JSONL file is one WikiRecord. This module parses such
records and maps the relevant fields onto the unified Word schema. Filtering
options follow the suggestions in
linux-box-cloudflare/scratch_space/vibes/10-language-overview/06-word-ingestion.md.
Classes:
-
WikiRecord–One line of a kaikki.org Wiktionary JSONL dump.
-
WikiSense–One sense entry inside a kaikki.org Wiktionary record.
Functions:
-
load_wiktionary_jsonl–Yield
Wordobjects parsed from a kaikki.org JSONL file.
WikiRecord
¶
Bases: BaseModel
One line of a kaikki.org Wiktionary JSONL dump.
Attributes:
-
word(str) –The headword.
-
pos(str | None) –Part of speech.
-
senses(list[WikiSense]) –List of senses.
-
categories(list[dict]) –Top-level category tags.
-
form_of(list[dict]) –Lemma references when this entry is an inflected form.
WikiSense
¶
Bases: BaseModel
One sense entry inside a kaikki.org Wiktionary record.
Attributes:
-
glosses(list[str]) –List of definition strings.
-
raw_glosses(list[str]) –Raw (uncleaned) gloss strings.
-
examples(list[dict]) –List of example dicts
{"text": ..., "english": ...}. -
categories(list[dict]) –Wiktionary category tags.
-
tags(list[str]) –Per-sense tag list (e.g.
["transitive"]). -
topics(list[str]) –Topic tags.
load_wiktionary_jsonl
¶
load_wiktionary_jsonl(
source: Path | IO[str],
language: str,
*,
keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
require_accent: bool = False,
skip_form_of: bool = True,
) -> Iterator[Word]
Yield Word objects parsed from a kaikki.org JSONL file.
Parameters:
-
source(Path | IO[str]) –Filesystem path or open text stream.
-
language(str) –ISO 639-1 code to stamp on every produced
Word. -
keep_pos(Iterable[str] | None, default:_DEFAULT_KEEP_POS) –Allowed part-of-speech values; pass
Noneto keep all. -
require_accent(bool, default:False) –If True, only yield words with at least one diacritic.
-
skip_form_of(bool, default:True) –If True, skip entries that are inflected-form pointers.
Yields:
-
Word–Wordinstances for records that pass every filter.