202602251131
Status: #reference
Tags:
State: #nascent

TF/IDF: Term Frequency/Inverse Document Frequency

TF/IDF is a simple but clever metric made to score the match between a document and a term. Like many traditional NLP methods, it cannot do much with the contextual meaning of a word, so it uses the next best thing: frequency. While some methods, like bag-of-words models extended with co-occurrence counts, go as far as counting which words appear together, TF/IDF does not go that far.

It aggregates two measures. The first is the frequency of the term within the document, which means if a term occurs once in a document, it has a TF of 1... Actually, not quite.
Indeed, the exact value of TF depends entirely on how you choose to compute it: common choices include the raw count, a binary presence flag, the count divided by the document length, a log-scaled count, or the augmented frequency (the count divided by the count of the document's most frequent term).
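A few of the standard TF weighting schemes, sketched in Python (these are textbook variants, not prescribed by the note):

```python
import math
from collections import Counter

def tf_variants(term, doc_tokens):
    """Compute several common TF weighting schemes for one term."""
    counts = Counter(doc_tokens)
    raw = counts[term]                        # raw count
    boolean = 1 if raw > 0 else 0             # term present or not
    relative = raw / len(doc_tokens)          # count over document length
    log_scaled = math.log(1 + raw)            # dampens large counts
    max_count = max(counts.values())
    augmented = 0.5 + 0.5 * raw / max_count   # guards against long-doc bias
    return {"raw": raw, "boolean": boolean, "relative": relative,
            "log": log_scaled, "augmented": augmented}

doc = "the cat sat on the mat the cat".split()
print(tf_variants("cat", doc))  # raw: 2, boolean: 1, relative: 0.25, ...
```

Each variant trades off sensitivity to repetition differently; the augmented form in particular was designed to keep long documents from dominating.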

IDF, or inverse document frequency, is the logarithm of the total number of documents divided by the number of documents in which the term appears: idf(t) = log(N / df(t)).
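That definition is a one-liner; a minimal sketch (assuming the term appears in at least one document, otherwise df is 0 and real implementations add smoothing such as log(N / (1 + df))):

```python
import math

def idf(term, documents):
    """idf(t) = log(N / df(t)): N total docs, df docs containing t."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n / df)

docs = [{"cat", "sat"}, {"dog", "ran"}, {"cat", "dog"}, {"fish"}]
print(idf("cat", docs))   # log(4 / 2): appears in half the corpus
print(idf("fish", docs))  # log(4 / 1): rarer, so a higher weight
```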

The first measure is a simple gauge of how "important" that term is in the document under consideration: the more common it is, the stronger the match between the document and that term. The second is a normalization constant of sorts which tells us how common the term is in general. Indeed, if a word appears 100 times in your document but also appears in basically every document, the fact that it occurs a lot in this one does not tell us much.

On the other hand, if a term that basically never occurs elsewhere shows up 10 times in this document, then even though the absolute count is smaller, this document is a much starker match.

Why the log, one might ask? We use the log to smooth the punishment: if a word occurs in 1,000 documents instead of, say, 10,000 documents, past a certain threshold they are both "common" terms. Without the smoothing applied by the logarithmic function, we'd lose the ability to match on common words. We take the inverse ratio (total count over document count) simply because the logarithm is negative for values under 1, and we want our scores to be positive.
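The smoothing effect is easy to see with concrete numbers (a 10,000-document corpus is an arbitrary choice for illustration):

```python
import math

N = 10_000
# Without the log, a term in 10 docs looks 100x "rarer" than one in 1,000.
print(N / 10, N / 1_000)                       # 1000.0 vs 10.0
# With the log, the gap shrinks to about 3x: both still register as
# informative, but common words are not crushed to irrelevance.
print(math.log(N / 10), math.log(N / 1_000))   # ~6.91 vs ~2.30
```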

TF/IDF is thus simply the product of those two measures. The higher it is, the stronger the match for this specific term.
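Putting the two pieces together, a sketch of scoring one (term, document) pair, using the raw count as TF:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one (term, document) pair: raw TF times log(N / df)."""
    tf = Counter(doc_tokens)[term]
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # term absent from the corpus entirely
    return tf * math.log(n / df)

corpus = ["the cat sat".split(),
          "the dog ran".split(),
          "the cat and the dog".split(),
          "a fish swam".split()]
doc = corpus[0]
print(tf_idf("cat", doc, corpus))  # 1 * log(4/2)
print(tf_idf("the", doc, corpus))  # 1 * log(4/3): common word, low score
```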

Since we can compute a TF/IDF for any term in the query, we can create vectors of size V, where V is the size of the vocabulary (how many keys the index has).
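Those per-term scores can be assembled into one vector per document; a minimal sketch (the helper name is my own, not from the note):

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Map each document to a TF/IDF vector of size V (vocabulary size)."""
    vocab = sorted({t for doc in corpus for t in doc})
    n = len(corpus)
    df = {t: sum(1 for d in corpus if t in d) for t in vocab}
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

corpus = ["the cat sat".split(), "the dog ran".split()]
vocab, vecs = tfidf_vectors(corpus)
print(vocab)    # ['cat', 'dog', 'ran', 'sat', 'the']
print(vecs[0])  # 'the' scores 0: it appears in every document
```

Comparing two such vectors (typically with cosine similarity) is the basis of vector-space retrieval.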

Note that TF/IDF is just one measure, and it has its own issues. For example, since the score is a plain product, the sensitivity of TF/IDF to the TF is entirely set by the IDF (and vice versa). Big number go brrr.
Furthermore, while remediation measures exist (like the augmented frequency for the TF), term frequency struggles to account for document length. A word which occurs 5 times in a 20-word document is likely more important than a word which occurs 20 times in a 100,000-word document, and raw term frequency cannot capture that. This is partially accounted for by the normalization in cosine similarity, but L2 normalization is much stronger and drastically punishes documents that are too long. For all those reasons, more advanced (and today standard) measures like BM25 were created, which not only fix the sensitivity issue but also come pre-baked with tunable normalization factors and saturation factors (meant to stop the score from growing once a given threshold is reached).
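For a sense of how those tunable factors fit in, here is a sketch of the Okapi BM25 scoring function: k1 controls term-frequency saturation, and b controls how strongly the score is normalized by document length relative to the corpus average (defaults of 1.5 and 0.75 are common conventions, not requirements):

```python
import math
from collections import Counter

def bm25(term, doc_tokens, corpus, k1=1.5, b=0.75):
    """Okapi BM25 for one (term, document) pair: a minimal sketch."""
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)
    # The "+ 0.5" smoothing and "+ 1" keep the idf finite and positive
    # even for terms that appear in nearly every document.
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
    tf = Counter(doc_tokens)[term]
    avg_len = sum(len(d) for d in corpus) / n
    # norm grows with document length, damped by b; tf saturates via k1.
    norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)

corpus = ["the cat sat".split(), "the dog ran".split(),
          "the cat and the dog".split()]
print(bm25("cat", corpus[0], corpus))
```

Unlike raw TF/IDF, repeating a term many times yields diminishing returns here, since the tf / (tf + norm) fraction asymptotically approaches 1.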

File: Vector-Space Models | Folder: 1. Cosmos | Last Modified: 11:31 AM - February 25, 2026