lang_tools.words.ingestion.dedup
Word merging and deduplication.
Two words with the same (text, language), and therefore the same Word.id,
should collapse to one record whose metadata is the union of both. Where the
two records disagree on a scalar field, a non-null value wins over a null
one, and the left record wins if both are set.
Functions:
- deduplicate – Collapse an iterable of Words by ID, merging duplicates.
- merge_words – Return a new Word that combines fields from left and right.
deduplicate
Collapse an iterable of Words by ID, merging duplicates.
Parameters:
Returns:
Source code in src/lang_tools/words/ingestion/dedup.py
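As a rough illustration of collapsing by ID, here is a minimal sketch. It does not reproduce the code in dedup.py: the Word class is a cut-down, hypothetical stand-in, simple_merge is a placeholder for merge_words, and both the list return type and the first-seen ordering are assumptions that this page does not confirm.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for the real Word; only a few fields are modelled."""
    id: str
    text: str
    translations: tuple[str, ...] = ()


def simple_merge(left: Word, right: Word) -> Word:
    # Placeholder for merge_words: keep the left spelling, union the translations.
    merged = tuple(dict.fromkeys(left.translations + right.translations))
    return Word(left.id, left.text, merged)


def deduplicate_sketch(
    words: Iterable[Word],
    merge: Callable[[Word, Word], Word] = simple_merge,
) -> list[Word]:
    # Keep one record per ID; fold every later duplicate into the earlier one.
    by_id: dict[str, Word] = {}
    for word in words:
        seen = by_id.get(word.id)
        by_id[word.id] = word if seen is None else merge(seen, word)
    # Dicts preserve insertion order, so results follow first appearance (an assumption).
    return list(by_id.values())


words = [
    Word("es:cafe", "café", ("coffee",)),
    Word("es:te", "té", ("tea",)),
    Word("es:cafe", "cafe", ("café (place)",)),
]
result = deduplicate_sketch(words)
```

Here the two "es:cafe" records fold into one that keeps the first spelling and carries both translations, while "es:te" passes through untouched.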
merge_words
Return a new Word that combines fields from left and right.
Scalar fields (part_of_speech, frequency) prefer non-null over
null; if both are non-null and differ, left wins. Collection fields (translations,
topics, glosses, examples, false_friends, sources) are
merged. The resulting text keeps the left spelling so that accent
information is preserved when one side has it and the other does not.
Parameters:
Returns:
- Word – A new merged Word instance. left and right are not mutated.
Raises:
- ValueError – If the two records have different IDs.
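The merge rules described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the Word dataclass is a hypothetical stand-in with only some of the documented fields, the helper names are invented, and the order-preserving union used for collection fields is a guess at what "merged" means.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in; the real Word has more fields than shown here."""
    id: str
    text: str
    language: str
    part_of_speech: str | None = None
    frequency: float | None = None
    translations: tuple[str, ...] = ()
    topics: tuple[str, ...] = ()


def _prefer(left, right):
    # Scalar rule: non-null beats null; when both are set, left wins.
    return left if left is not None else right


def _union(left, right):
    # Collection rule (assumed): order-preserving union without duplicates.
    return tuple(dict.fromkeys((*left, *right)))


def merge_words_sketch(left: Word, right: Word) -> Word:
    if left.id != right.id:
        raise ValueError(f"cannot merge different IDs: {left.id!r} != {right.id!r}")
    # replace() builds a new instance from left, so left.text is kept
    # and neither input is mutated.
    return replace(
        left,
        part_of_speech=_prefer(left.part_of_speech, right.part_of_speech),
        frequency=_prefer(left.frequency, right.frequency),
        translations=_union(left.translations, right.translations),
        topics=_union(left.topics, right.topics),
    )


a = Word(id="es:cafe", text="café", language="es", translations=("coffee",))
b = Word(id="es:cafe", text="cafe", language="es", part_of_speech="noun",
         translations=("coffee", "café (place)"))
merged = merge_words_sketch(a, b)
```

In this example merged keeps the accented spelling from a, picks up part_of_speech from b because a has none, and carries the union of both translation lists.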