lang_tools.words.ingestion
Word ingestion pipelines.
Public API
load_wiktionary_jsonl: parse a kaikki.org JSONL dump into Word objects.
WikiRecord, WikiSense: typed shapes for kaikki.org records.
load_csv: parse a brazilian-bites style CSV into Word objects.
load_static_list: parse a worldly-words style minimal list.
merge_words: merge two Word records that share the same (text, language).
deduplicate: collapse an iterable of Words into a unique-by-id list.
Modules:
- csv_loader – CSV ingestion for brazilian-bites style vocabulary files.
- dedup – Word merging and deduplication.
- static_list – Static word-list ingestion (worldly-words style).
- wiktionary – Wiktionary JSONL ingestion (kaikki.org dumps -> Word).
Classes:
- WikiRecord – One line of a kaikki.org Wiktionary JSONL dump.
- WikiSense – One sense entry inside a kaikki.org Wiktionary record.
Functions:
- deduplicate – Collapse an iterable of Words by ID, merging duplicates.
- load_csv – Yield Word objects parsed from a vocabulary CSV.
- load_static_list – Yield Word objects from a minimal list of dicts.
- load_wiktionary_jsonl – Yield Word objects parsed from a kaikki.org JSONL file.
- merge_words – Return a new Word that combines fields from left and right.
WikiRecord
Bases: BaseModel
One line of a kaikki.org Wiktionary JSONL dump.
Attributes:
- word (str) – The headword.
- pos (str | None) – Part of speech.
- senses (list[WikiSense]) – List of senses.
- categories (list[dict]) – Top-level category tags.
- form_of (list[dict]) – Lemma references when this entry is an inflected form.
WikiSense
Bases: BaseModel
One sense entry inside a kaikki.org Wiktionary record.
Attributes:
- glosses (list[str]) – List of definition strings.
- raw_glosses (list[str]) – Raw (uncleaned) gloss strings.
- examples (list[dict]) – List of example dicts {"text": ..., "english": ...}.
- categories (list[dict]) – Wiktionary category tags.
- tags (list[str]) – Per-sense tag list (e.g. ["transitive"]).
- topics (list[str]) – Topic tags.
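To make the two shapes concrete, here is a minimal sketch of how one JSONL line might map onto the attributes above. The sample record is invented for illustration, and it is parsed with the standard json module rather than with the actual WikiRecord and WikiSense models.

```python
import json

# Hypothetical sample line, invented for illustration; real kaikki.org
# records carry many more keys than the ones shown here.
sample_line = json.dumps({
    "word": "pão",
    "pos": "noun",
    "senses": [
        {
            "glosses": ["bread"],
            "raw_glosses": ["(food) bread"],
            "examples": [{"text": "Comprei pão.", "english": "I bought bread."}],
            "tags": ["masculine"],
            "topics": ["food"],
        }
    ],
}, ensure_ascii=False)

record = json.loads(sample_line)

# Top-level keys line up with the WikiRecord attributes.
headword = record["word"]
pos = record.get("pos")  # may be absent, hence str | None

# Each entry in "senses" lines up with one WikiSense.
first_sense = record["senses"][0]
glosses = first_sense.get("glosses", [])
example_texts = [ex["text"] for ex in first_sense.get("examples", [])]
```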
deduplicate
Collapse an iterable of Words by ID, merging duplicates.
Parameters:
Returns:
Source code in src/lang_tools/words/ingestion/dedup.py
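The collapse-by-ID behaviour can be sketched as follows. This is not the code in dedup.py: SketchWord, sketch_merge, and the "pt:pao" style IDs are hypothetical stand-ins, and keeping first-seen order is an assumption rather than a documented guarantee.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SketchWord:
    """Hypothetical stand-in for Word; only the fields this sketch needs."""
    id: str
    text: str
    sources: tuple = ()


def sketch_merge(left: SketchWord, right: SketchWord) -> SketchWord:
    """Toy merge: keep left's text, union the sources in first-seen order."""
    merged_sources = tuple(dict.fromkeys(left.sources + right.sources))
    return SketchWord(id=left.id, text=left.text, sources=merged_sources)


def sketch_deduplicate(words: Iterable[SketchWord], merge=sketch_merge) -> list:
    """Collapse words by ID, merging each duplicate into the first occurrence."""
    by_id = {}
    for word in words:
        seen = by_id.get(word.id)
        by_id[word.id] = word if seen is None else merge(seen, word)
    # dicts preserve insertion order, so the first-seen order is kept.
    return list(by_id.values())


unique = sketch_deduplicate([
    SketchWord("pt:pao", "pão", ("csv",)),
    SketchWord("pt:cafe", "café", ("csv",)),
    SketchWord("pt:pao", "pao", ("static_list",)),
])
```

Here the two "pt:pao" entries collapse into one record that keeps the first spelling and carries both sources.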
load_csv
Yield Word objects parsed from a vocabulary CSV.
The CSV must contain at least text and language columns. Optional columns include:
- part_of_speech, frequency (high/medium/low)
- topics and secondary_topics (comma-separated)
- translation_<lang> (one column per target language)
- example_sentence + example_translation
- false_friend_language, false_friend_word, false_friend_meaning, false_friend_similarity
Parameters:
Yields:
- Word – Parsed Word instances.
Raises:
- CSVColumnsMissingError – If the header does not contain the required columns.
Source code in src/lang_tools/words/ingestion/csv_loader.py
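The header check and the column conventions described above might look roughly like this. It is a sketch built on the standard csv module, not the code in csv_loader.py: it yields plain dicts instead of Word objects, handles only a few of the optional columns, and SketchColumnsMissingError is a hypothetical stand-in for CSVColumnsMissingError.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "language"}


class SketchColumnsMissingError(Exception):
    """Hypothetical stand-in for CSVColumnsMissingError."""


def sketch_load_csv(stream):
    """Yield one plain dict per CSV row, after checking the header."""
    reader = csv.DictReader(stream)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise SketchColumnsMissingError(f"missing columns: {sorted(missing)}")
    for row in reader:
        # Collect translation_<lang> columns into a {lang: translation} mapping.
        translations = {
            name[len("translation_"):]: value
            for name, value in row.items()
            if name and name.startswith("translation_") and value
        }
        # Comma-separated topics become a list of stripped, non-empty strings.
        raw_topics = row.get("topics") or ""
        topics = [t.strip() for t in raw_topics.split(",") if t.strip()]
        yield {
            "text": row["text"],
            "language": row["language"],
            "topics": topics,
            "translations": translations,
        }


sample = io.StringIO(
    "text,language,topics,translation_en\n"
    'pão,pt,"food, bakery",bread\n'
)
rows = list(sketch_load_csv(sample))
```

Because the function is a generator, the header check in this sketch only runs once iteration starts.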
load_static_list
Yield Word objects from a minimal list of dicts.
Parameters:
- entries (Iterable[dict]) – Iterable of dicts. Each dict must contain text and language; normalized is filled in automatically when absent.
Yields:
- Word – Word instances tagged with sources=["static_list"].
Source code in src/lang_tools/words/ingestion/static_list.py
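A minimal sketch of the fill-in behaviour, using plain dicts in place of Word objects. How the library actually computes normalized is not stated here; the rule below (lowercase, then strip combining accent marks) is an assumption made only for illustration, and both function names are hypothetical.

```python
import unicodedata


def sketch_normalize(text: str) -> str:
    """Assumed normalization: lowercase, then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sketch_load_static_list(entries):
    """Yield plain dicts standing in for Word objects."""
    for entry in entries:
        yield {
            "text": entry["text"],          # required key
            "language": entry["language"],  # required key
            # Use the caller's value when present, otherwise derive it.
            "normalized": entry.get("normalized") or sketch_normalize(entry["text"]),
            "sources": ["static_list"],
        }


words = list(sketch_load_static_list([
    {"text": "Saudade", "language": "pt"},
    {"text": "café", "language": "pt", "normalized": "cafe"},
]))
```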
load_wiktionary_jsonl
load_wiktionary_jsonl(
    source: Path | IO[str],
    language: str,
    *,
    keep_pos: Iterable[str] | None = _DEFAULT_KEEP_POS,
    require_accent: bool = False,
    skip_form_of: bool = True,
) -> Iterator[Word]
Yield Word objects parsed from a kaikki.org JSONL file.
Parameters:
- source (Path | IO[str]) – Filesystem path or open text stream.
- language (str) – ISO 639-1 code to stamp on every produced Word.
- keep_pos (Iterable[str] | None, default: _DEFAULT_KEEP_POS) – Allowed part-of-speech values; pass None to keep all.
- require_accent (bool, default: False) – If True, only yield words with at least one diacritic.
- skip_form_of (bool, default: True) – If True, skip entries that are inflected-form pointers.
Yields:
- Word – Word instances for records that pass every filter.
Source code in src/lang_tools/words/ingestion/wiktionary.py
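The three filters can be illustrated with a small stand-alone sketch. It is not the code in wiktionary.py: it accepts only an open text stream, yields tuples instead of Word objects, and SKETCH_KEEP_POS is an assumed placeholder because the real value of _DEFAULT_KEEP_POS is not shown on this page. The diacritic test (any combining mark after NFD decomposition) is likewise one possible interpretation.

```python
import io
import json
import unicodedata

# Assumed placeholder; the real _DEFAULT_KEEP_POS is not documented here.
SKETCH_KEEP_POS = ("noun", "verb", "adj")


def has_diacritic(text: str) -> bool:
    """True if any character carries a combining mark after NFD decomposition."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))


def sketch_load_jsonl(stream, language, *, keep_pos=SKETCH_KEEP_POS,
                      require_accent=False, skip_form_of=True):
    """Yield (text, language, pos) tuples for records that pass every filter."""
    allowed = None if keep_pos is None else set(keep_pos)
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if allowed is not None and record.get("pos") not in allowed:
            continue  # part of speech not in the allow-list
        if skip_form_of and record.get("form_of"):
            continue  # inflected-form pointer
        if require_accent and not has_diacritic(record["word"]):
            continue  # no diacritic in the headword
        yield (record["word"], language, record.get("pos"))


dump = io.StringIO("\n".join(json.dumps(r) for r in [
    {"word": "pão", "pos": "noun"},
    {"word": "casa", "pos": "noun"},
    {"word": "pães", "pos": "noun", "form_of": [{"word": "pão"}]},
    {"word": "olá", "pos": "intj"},
]))
kept = list(sketch_load_jsonl(dump, "pt", require_accent=True))
```

With require_accent=True, only the first record survives: "casa" has no diacritic, "pães" is a form-of pointer, and "olá" has a part of speech outside the assumed allow-list.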
merge_words
Return a new Word that combines fields from left and right.
Scalar fields (part_of_speech, frequency) prefer a non-null value over
null; if both sides are set and differ, left wins. Collection fields
(translations, topics, glosses, examples, false_friends, sources) are
merged. The resulting text keeps the left spelling so that accent
information is preserved when one side has it and the other does not.
Parameters:
Returns:
- Word – A new merged Word instance. left and right are not mutated.
Raises:
- ValueError – If the two records have different IDs.
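The merge rules above can be sketched with a frozen dataclass. SketchWord is a hypothetical stand-in that carries only a subset of the Word fields, the ID format is invented, and the order-preserving union is one assumed reading of "merged" for collection fields.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchWord:
    """Hypothetical stand-in for Word with a subset of its fields."""
    id: str
    text: str
    part_of_speech: Optional[str] = None
    frequency: Optional[str] = None
    topics: tuple = ()
    sources: tuple = ()


def _prefer(left_value, right_value):
    """Non-null beats null; when both are set, left wins."""
    return left_value if left_value is not None else right_value


def _union(left: tuple, right: tuple) -> tuple:
    """Order-preserving union of two tuples."""
    return tuple(dict.fromkeys(left + right))


def sketch_merge_words(left: SketchWord, right: SketchWord) -> SketchWord:
    if left.id != right.id:
        raise ValueError(f"cannot merge {left.id!r} with {right.id!r}")
    # replace() builds a new instance from left, so left.text is kept
    # and neither input is mutated.
    return replace(
        left,
        part_of_speech=_prefer(left.part_of_speech, right.part_of_speech),
        frequency=_prefer(left.frequency, right.frequency),
        topics=_union(left.topics, right.topics),
        sources=_union(left.sources, right.sources),
    )


a = SketchWord("pt:pao", "pão", frequency="high", topics=("food",), sources=("csv",))
b = SketchWord("pt:pao", "pao", part_of_speech="noun", frequency="low",
               topics=("food", "bakery"), sources=("wiktionary",))
merged = sketch_merge_words(a, b)
```

In this example the merged record takes part_of_speech from the right (left had none), keeps frequency and the accented spelling from the left, and unions topics and sources.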