# Semantic Deduplication

## Why Dedup Before Indexing

Web scrapes, email archives, and aggregated document sets routinely contain 5-30% near-duplicates. Symptoms when left unaddressed:

- Retrieval ranks N copies of the same document in the top-K, pushing out diverse results.
- Embedding fine-tuning over-weights redundant content.
- Storage and query cost scale with duplication factor.
- Evaluation metrics get inflated (one "correct" doc appears many times).

## Three Levels of Duplication

| Level | Method | Cost | Recall on paraphrases |
|---|---|---|---|
| Exact | SHA256 of normalized text | O(n) | 0…