<P> where $N$ is the total number of documents in the collection, and $n(q_i)$ is the number of documents containing $q_i$. </P>
<P> There are several interpretations of IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model. </P>
<P> The above formula for IDF has a potentially major drawback for terms that appear in more than half of the documents in the corpus. The IDF of such a term is negative, so given two almost-identical documents, one containing the term and one not, the document without the term may receive the higher score. In other words, terms appearing in more than half of the corpus make a negative contribution to the final document score. This is often undesirable, so many real-world applications handle the IDF formula differently: </P>
<Ul>
<Li> Each summand can be given a floor of 0, to trim out common terms; </Li>
<Li> The IDF function can be given a floor of a constant $\epsilon$, so that common terms are not ignored entirely; </Li>
<Li> The IDF function can be replaced with a similarly shaped one that is non-negative, or strictly positive so that no term is ignored entirely. </Li>
</Ul>
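<P> The three workarounds listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the IDF formula referred to above has the common form $\ln\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$, the function names and the default value of <code>eps</code> are hypothetical, and the strictly positive replacement shown (adding 1 inside the logarithm) is only one possible choice of a "similarly shaped" function. </P>

```python
import math


def idf_classic(N, n):
    """Assumed form of the IDF above; negative whenever n > N / 2."""
    return math.log((N - n + 0.5) / (n + 0.5))


def idf_floor_zero(N, n):
    """Floor at 0. Since the term-frequency factor of a BM25 summand is
    non-negative, this has the same effect as flooring the summand at 0."""
    return max(0.0, idf_classic(N, n))


def idf_floor_eps(N, n, eps=0.01):
    """Floor at a small constant eps (hypothetical default) so that very
    common terms still contribute a little."""
    return max(eps, idf_classic(N, n))


def idf_positive(N, n):
    """One possible strictly positive replacement: add 1 inside the log."""
    return math.log(1.0 + (N - n + 0.5) / (n + 0.5))
```

<P> For example, with $N = 10$ and $n(q_i) = 8$, <code>idf_classic(10, 8)</code> is roughly -1.22, while <code>idf_floor_zero</code> returns 0, <code>idf_floor_eps</code> returns <code>eps</code>, and <code>idf_positive</code> returns a small positive value. </P>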

Differences between tf-idf based ranking and BM25 ranking