# BM25 Tuning

## The Formula

$$
\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}
$$

- $f(t, d)$: frequency of term t in document d
- $|d|$: document length in tokens
- $\text{avgdl}$: average document length across the corpus
- $k_1$: term-frequency saturation (how fast extra occurrences stop helping)
- $b$: length normalization (how strongly longer documents are penalized)
- $\text{IDF}(t)$: inverse document frequency of t

## k1 and b Defaults

| Collection type | k1 | b | Why |
|---|---|---|---|
| Lucene default | 1.2 | 0.75 | Safe general-purpose |
| Short homogeneous docs (titles, tweets) | 1.0-1.2 | 0.3-0.5 | Length already similar; less penalization |
| Long heterogeneous docs (web, manuals) | 1.2-1.5 | 0.75…
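
To make the roles of k1 and b concrete, here is a minimal sketch of BM25 scoring in Python. It is not any library's implementation. The function names `idf` and `bm25_score`, the toy corpus, and the smoothed IDF variant are all assumptions chosen for illustration; the text above does not specify which IDF form is used.

```python
import math
from collections import Counter


def idf(term, corpus):
    """Smoothed IDF. This variant is an assumption; other forms exist."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc)
    # Length normalization: equals 1.0 when b == 0 or len(doc) == avgdl.
    length_norm = 1.0 - b + b * len(doc) / avgdl
    score = 0.0
    for term in query_terms:
        f = tf[term]
        if f == 0:
            continue  # terms absent from the document contribute nothing
        score += idf(term, corpus) * f * (k1 + 1.0) / (f + k1 * length_norm)
    return score


# Hypothetical toy corpus of pre-tokenized documents.
corpus = [
    ["bm25", "ranking", "function"],
    ["bm25", "bm25", "tuning", "guide", "for", "search", "ranking"],
    ["cooking", "pasta", "at", "home"],
]
query = ["bm25", "tuning"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

Setting `b=0.0` makes `length_norm` equal to 1 for every document, so length no longer affects the score. Raising `k1` lets repeated occurrences of a term keep adding to the score for longer before the contribution levels off.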