lang_tools.words.word_id
Deterministic ID for a (text, language) pair.
Used as Word.id so the same word ingested from multiple sources collapses to
the same identifier. SHA-1 truncated to 16 hex chars (64 bits) is plenty for
collision avoidance within a per-language dictionary.
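As a rough check on the 64-bit claim, the birthday approximation gives the chance of at least one collision among n IDs. This is a minimal sketch, assuming the truncated hash behaves like a uniform random 64-bit value; `collision_probability` is a hypothetical helper, not part of `lang_tools`.

```python
import math


def collision_probability(n_words: int, bits: int = 64) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes IDs are uniformly distributed over 2**bits values.
    """
    exponent = n_words * (n_words - 1) / 2 ** (bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


p = collision_probability(1_000_000)
print(f"{p:.2e}")
```

Under that assumption, a dictionary of one million words has a collision probability of roughly 3 in 100 million.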
Functions:
- word_id – Return a deterministic 16-char ID for (text, language).
word_id
Return a deterministic 16-char ID for (text, language).
The text is normalized (accent-stripped, lowercased) before hashing so that
variants of the same word that differ only in case or accents collapse to the
same ID. Languages differ in how they treat ligatures, but this helper is
intentionally language-agnostic to avoid a circular import on the Language model.
Parameters:
Returns:
- str – 16-character lowercase hex string.
Example::

    word_id("Ação", "pt")  # -> "..." (16 hex chars)
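The behaviour described above (normalize, hash with SHA-1, truncate to 16 hex chars) could be sketched as follows. This is an illustration only, not the library's actual code: the name `word_id_sketch`, the use of NFD decomposition for accent stripping, the `language:text` key layout, and the UTF-8 encoding are all assumptions, so the IDs it produces may differ from those of the real `word_id`.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Strip combining accents and lowercase (assumed normalization)."""
    # NFD splits accented letters into base letter + combining mark,
    # and leaves ligatures untouched.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def word_id_sketch(text: str, language: str) -> str:
    """Return a 16-char lowercase hex ID for (text, language)."""
    # Key layout (language first, ':' separator) is an assumption.
    key = f"{language}:{normalize(text)}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


# Case and accent variants collapse to the same ID within one language.
print(word_id_sketch("Ação", "pt") == word_id_sketch("acao", "pt"))
```

Including the language in the hashed key keeps identical spellings in different languages from sharing an ID, while the normalization itself stays language-agnostic.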