lang_tools.words
Canonical word data model and ingestion pipelines.
Public API
- Word: unified word entity.
- Gloss, GlossExample, WordExample, FalseFriend: supporting types.
- FrequencyLevel: literal type for word frequency.
- word_id: deterministic ID for a (text, language) pair.
Modules:
- ingestion – Word ingestion pipelines.
- word – Canonical Word model and supporting types.
- word_id – Deterministic ID for a (text, language) pair.
Classes:
- FalseFriend – False-friend metadata pointing at a misleading cognate.
- Gloss – A single sense / definition for a word.
- GlossExample – Usage example attached to a Wiktionary-style sense.
- Word – Unified word entity covering vocab, dictionary, and game data sources.
- WordExample – Curated example sentence in the word's language.
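The page does not say how word_id derives its value. As a minimal sketch of what "deterministic ID for a (text, language) pair" can mean, the snippet below uses a name-based UUID. The function name `sketch_word_id`, the namespace string, and the choice of UUIDv5 are all assumptions made for illustration, not the package's actual scheme.

```python
import uuid

# Hypothetical namespace; the real package may use a different scheme entirely.
_SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "lang-tools.example")


def sketch_word_id(text: str, language: str) -> str:
    """Return a stable ID for a (text, language) pair (illustrative only)."""
    # The unit-separator character keeps the two fields from running together,
    # so ("ab", "c") and ("a", "bc") cannot produce the same hash input.
    return str(uuid.uuid5(_SKETCH_NAMESPACE, f"{language}\x1f{text}"))


# The same pair always maps to the same ID; changing the language changes it.
same = sketch_word_id("pain", "fr") == sketch_word_id("pain", "fr")
different = sketch_word_id("pain", "fr") != sketch_word_id("pain", "en")
```

The property that matters for ingestion is that re-importing the same word from a second source yields the same key, so records can be merged rather than duplicated.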
FalseFriend
Bases: BaseModel
False-friend metadata pointing at a misleading cognate.
Attributes:
- language (str) – ISO 639-1 code of the language the cognate exists in.
- similar_word (str) – The misleading cognate.
- similarity_score (float | None) – Optional visual / phonetic similarity score between 0.0 and 1.0.
- actual_meaning (str) – What the cognate actually means.
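To make the field semantics concrete, here is a stdlib-only stand-in that mirrors the four documented attributes. It is a sketch, not the real class: the actual FalseFriend is a pydantic BaseModel, the name `FalseFriendSketch` is hypothetical, and the range check on similarity_score is an assumption (the page documents the range but not whether it is enforced).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FalseFriendSketch:
    """Stdlib stand-in mirroring the documented FalseFriend fields."""

    language: str
    similar_word: str
    actual_meaning: str
    similarity_score: Optional[float] = None

    def __post_init__(self) -> None:
        # Assumption: scores outside 0.0-1.0 are rejected. The real model
        # may or may not validate this.
        score = self.similarity_score
        if score is not None and not 0.0 <= score <= 1.0:
            raise ValueError("similarity_score must be between 0.0 and 1.0")


# For the English word "embarrassed": Spanish "embarazada" looks similar
# but means "pregnant". The score is an arbitrary illustrative value.
entry = FalseFriendSketch(
    language="es",
    similar_word="embarazada",
    actual_meaning="pregnant",
    similarity_score=0.8,
)
```

Note that language and actual_meaning both describe the cognate, not the word the entry is attached to.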
Gloss
Bases: BaseModel
A single sense / definition for a word.
Attributes:
- text (str) – Definition text (usually English).
- examples (list[GlossExample]) – List of usage examples.
GlossExample
Word
Bases: BaseModel
Unified word entity covering vocab, dictionary, and game data sources.
Attributes:
- text (str) – Canonical form with accents preserved.
- language (str) – ISO 639-1 code.
- normalized (str) – Accent-stripped, lowercased form. Auto-derived from text when not supplied.
- part_of_speech (str | None) – Word class label ("noun", "verb", ...).
- frequency (FrequencyLevel | None) – Optional frequency tier.
- translations (dict[str, str]) – Mapping from target language code to translated text.
- topics (list[str]) – Free-form topic tags.
- glosses (list[Gloss]) – Wiktionary-style sense list.
- examples (list[WordExample]) – Curated example sentences.
- false_friends (list[FalseFriend]) – List of false-friend metadata entries.
- sources (list[str]) – Provenance tags ("wiktionary", "csv", "llm", ...).
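The normalized attribute is described as an accent-stripped, lowercased form derived from text. One common way to get that result with the standard library is Unicode decomposition followed by dropping combining marks. The helper name `sketch_normalize` is hypothetical, and the package may derive the value differently.

```python
import unicodedata


def sketch_normalize(text: str) -> str:
    """Strip accents and lowercase (an assumed reading of `normalized`)."""
    # NFD splits "é" into "e" plus a combining acute accent (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.lower()


normalized = sketch_normalize("Café")  # "cafe"
```

Letters such as "ø" or "ß" have no canonical decomposition, so they pass through unchanged under this approach. Whether the real model handles them specially is not stated on this page.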
accented_chars (property)
Accented characters present in text, in original order.
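As a rough illustration of what this property might compute, the sketch below collects every character whose decomposed form contains a combining mark, preserving order. The function name `sketch_accented_chars`, the list return type, and this definition of "accented" are assumptions. The page does not specify the property's return type or how it treats duplicates.

```python
import unicodedata


def sketch_accented_chars(text: str) -> list[str]:
    """Return characters of `text` that carry a diacritic, in order."""
    result = []
    # Compose first so each accented letter is a single character.
    for ch in unicodedata.normalize("NFC", text):
        # Assumption: "accented" means the NFD form has a combining mark.
        decomposed = unicodedata.normalize("NFD", ch)
        if any(unicodedata.combining(c) for c in decomposed):
            result.append(ch)
    return result


accents = sketch_accented_chars("crème brûlée")  # ["è", "û", "é"]
```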