lang_tools.words ¶

Canonical word data model and ingestion pipelines.

Public API

Word: unified word entity.
Gloss, GlossExample, WordExample, FalseFriend: supporting types.
FrequencyLevel: literal type for word frequency.
word_id: deterministic ID for a (text, language) pair.

Modules:

ingestion –

Word ingestion pipelines.
word –

Canonical Word model and supporting types.
word_id –

Deterministic ID for a (text, language) pair.

Classes:

FalseFriend –

False-friend metadata pointing at a misleading cognate.
Gloss –

A single sense / definition for a word.
GlossExample –

Usage example attached to a Wiktionary-style sense.
Word –

Unified word entity covering vocab, dictionary, and game data sources.
WordExample –

Curated example sentence in the word's language.

FalseFriend ¶

Bases: BaseModel

False-friend metadata pointing at a misleading cognate.

Attributes:

language (str) –

ISO 639-1 code of the language the cognate exists in.
similar_word (str) –

The misleading cognate.
similarity_score (float | None) –

Optional visual / phonetic similarity score, from 0.0 to 1.0.
actual_meaning (str) –

What the cognate actually means.
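
To make the field semantics concrete, here is a minimal sketch that uses a standard-library dataclass as a stand-in for the real model. The name FalseFriendSketch is hypothetical, the field order is changed so the optional score can default to None, and the range check is an assumption: the reference above does not say whether the 0.0 to 1.0 range is enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FalseFriendSketch:
    """Hypothetical stand-in mirroring the documented FalseFriend fields."""

    language: str          # language the cognate exists in
    similar_word: str      # the misleading cognate
    actual_meaning: str    # what the cognate actually means
    similarity_score: Optional[float] = None

    def __post_init__(self) -> None:
        # Assumption: reject scores outside the documented 0.0 to 1.0 range.
        score = self.similarity_score
        if score is not None and not 0.0 <= score <= 1.0:
            raise ValueError("similarity_score must be between 0.0 and 1.0")


# Spanish "embarazada" (pregnant) resembles English "embarrassed".
entry = FalseFriendSketch(
    language="en",
    similar_word="embarrassed",
    actual_meaning="ashamed or self-conscious",
    similarity_score=0.8,
)
```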

Gloss ¶

Bases: BaseModel

A single sense / definition for a word.

Attributes:

text (str) –

Definition text (usually English).
examples (list[GlossExample]) –

List of usage examples.

GlossExample ¶

Bases: BaseModel

Usage example attached to a Wiktionary-style sense.

Attributes:

text (str) –

Example sentence in the word's language.
translation (str | None) –

English translation (optional).
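
The nesting of Gloss and GlossExample can be sketched as follows. The dataclasses below are hypothetical stand-ins for the real models, and the empty-list default for examples is an assumption that the reference does not confirm.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GlossExampleSketch:
    """Hypothetical stand-in for GlossExample."""

    text: str                          # sentence in the word's language
    translation: Optional[str] = None  # English translation, if any


@dataclass
class GlossSketch:
    """Hypothetical stand-in for Gloss."""

    text: str  # definition text
    # Assumption: a gloss without examples holds an empty list.
    examples: list[GlossExampleSketch] = field(default_factory=list)


gloss = GlossSketch(
    text="house, dwelling",
    examples=[
        GlossExampleSketch(
            text="Mi casa es pequeña.",
            translation="My house is small.",
        )
    ],
)
```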

Word ¶

Bases: BaseModel

Unified word entity covering vocab, dictionary, and game data sources.

Attributes:

text (str) –

Canonical form with accents preserved.
language (str) –

ISO 639-1 code.
normalized (str) –

Accent-stripped, lowercased form. Auto-derived from text when not supplied.
part_of_speech (str | None) –

Word class label ("noun", "verb", ...).
frequency (FrequencyLevel | None) –

Optional frequency tier.
translations (dict[str, str]) –

Mapping from target language code to translated text.
topics (list[str]) –

Free-form topic tags.
glosses (list[Gloss]) –

Wiktionary-style sense list.
examples (list[WordExample]) –

Curated example sentences.
false_friends (list[FalseFriend]) –

List of false-friend metadata entries.
sources (list[str]) –

Provenance tags ("wiktionary", "csv", "llm", ...).
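
As an illustration of how normalized could be derived from text, the sketch below strips accents by Unicode decomposition and then lowercases. The function name normalize_sketch is hypothetical, and the library's actual derivation may differ, for example in how it treats characters that have no decomposition.

```python
import unicodedata


def normalize_sketch(text: str) -> str:
    """Return an accent-stripped, lowercased form of text (illustrative only)."""
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(normalize_sketch("Café"))  # cafe
print(normalize_sketch("niño"))  # nino
```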

accented_chars `property` ¶

accented_chars: list[str]

Accented characters present in text, in original order.

has_accent `property` ¶

has_accent: bool

True if text contains any accented character.
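
One possible way to compute accented_chars and has_accent is shown below. This is a sketch under the assumption that "accented" means "decomposes to include a combining mark under NFD"; the function names are hypothetical and the library may use a different definition.

```python
import unicodedata


def accented_chars_sketch(text: str) -> list[str]:
    """Characters of text that carry a combining mark, in original order."""
    return [
        ch
        for ch in text
        if any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))
    ]


def has_accent_sketch(text: str) -> bool:
    """True if text contains at least one accented character."""
    return bool(accented_chars_sketch(text))


print(accented_chars_sketch("él está"))  # ['é', 'á']
print(has_accent_sketch("casa"))         # False
```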

id `property` ¶

id: str

Deterministic ID derived from (text, language).
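
The reference does not state how the ID is computed. As a sketch of what "deterministic" means here, the hypothetical helper below hashes the (text, language) pair with SHA-256 and truncates the digest. The key layout, hash function, and ID length are all assumptions.

```python
import hashlib


def word_id_sketch(text: str, language: str) -> str:
    """Derive a stable ID from a (text, language) pair (illustrative only)."""
    # Assumption: join language and text with a separator, then hash the bytes.
    key = f"{language}:{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


# The same pair always yields the same ID; a different language yields another.
same = word_id_sketch("casa", "es") == word_id_sketch("casa", "es")
different = word_id_sketch("casa", "es") != word_id_sketch("casa", "it")
```

Because the ID depends only on its inputs, records for the same word coming from different sources can be matched without a shared counter or database sequence.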

length `property` ¶

length: int

Length of text in characters (used by the Wordle exercise).
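
A character count depends on the Unicode form of the string: a precomposed "é" is one code point, while "e" followed by a combining accent is two. The sketch below, with the hypothetical helper char_length_sketch, composes the text to NFC before counting. Whether the library normalizes text this way is not stated in the reference.

```python
import unicodedata


def char_length_sketch(text: str) -> int:
    """Count code points after composing text to NFC (illustrative only)."""
    return len(unicodedata.normalize("NFC", text))


composed = "caf\u00e9"     # precomposed é
decomposed = "cafe\u0301"  # e followed by a combining acute accent

print(len(composed), len(decomposed))  # 4 5
print(char_length_sketch(decomposed))  # 4
```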

WordExample ¶

Bases: BaseModel

Curated example sentence in the word's language.

Attributes:

sentence (str) –

Example sentence in the word's language.
translation (str | None) –

Translation in the user's language (optional).
