202602251131
Status: #reference
Tags:
State: #nascent
TF/IDF: Term Frequency/Inverse Document Frequency
TF/IDF is a simple but clever metric made to score the match between a document and a term. Like many traditional NLP methods, it is not able to do much with the contextual meaning of a word, so it uses the next best thing: frequency. While some methods (like word co-occurrence matrices) go as far as counting co-occurrences of words, TF/IDF does not go that far.
It aggregates two measures. The first is the frequency of the term within the document, which means that if a term occurs once in a document, it has a TF of 1... Actually, not quite.
Indeed, the exact value of TF is entirely dependent on how you choose to compute it:
- Absolute Counts: In this case, the explanation given above is correct
- Term Frequency: The most common choice; the ratio of the term's count to the total number of terms in the document.
- Boolean Frequency: 1 if term is in document, 0 otherwise.
- Logarithmic scaling: 1 + log(count in doc) (so that 100 occurrences is not twice as good as 50, nor a tenth as good as 1000)
- Augmented frequency: Divide the term's count by the count of the document's most frequent word (to avoid biasing too much towards longer documents)
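These variants can be sketched in a few lines of Python. The function names and the toy counts are mine, for illustration; they are not a standard API:

```python
import math

# Each variant maps a term's raw count (plus the document's full
# term counts) to a TF value.

def tf_absolute(count, counts):
    return count

def tf_relative(count, counts):
    # term's count over the total number of terms in the document
    return count / sum(counts.values())

def tf_boolean(count, counts):
    return 1 if count > 0 else 0

def tf_log(count, counts):
    # 1 + log(count); conventionally 0 when the term is absent
    return 1 + math.log(count) if count > 0 else 0

def tf_augmented(count, counts):
    # count divided by the count of the document's most frequent term
    return count / max(counts.values())

counts = {"cat": 3, "sat": 1, "mat": 1}
print(tf_relative(3, counts))   # 0.6
print(tf_augmented(1, counts))  # 1/3 ≈ 0.333
```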
IDF, or inverse document frequency, is the log of the ratio of the total number of documents to the number of documents in which the term appears.
The first measure is a simple measure of how "important" that term is in the document under consideration: the more common it is, the stronger the match between the document and that term. The second is a normalization constant of sorts which tells us how common the term is in general. Indeed, if a word appears 100 times in your document but it also appears in basically every other document, then the fact that it occurs a lot in this one does not tell us much.
On the other hand, if a term that basically never occurs elsewhere occurs 10 times in this document, then even though the absolute count is smaller, this document is a much starker match.
Why the log, one might ask. We use the log to soften the punishment: if a word occurs in 1,000 documents instead of, say, 10,000, past a certain threshold they are both "common" terms. Without the smoothing applied by the logarithm, we would lose the ability to match on common words. We take the inverse ratio simply because the logarithm is negative for values under 1, and we want our scores to be positive.
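A minimal sketch of the IDF as defined above (the function name and toy corpus are mine):

```python
import math

def idf(term, documents):
    # log of (total number of documents / documents containing the term);
    # real implementations usually add-one smooth to avoid division by zero
    n_containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n_containing)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "fish", "swam"]]
print(idf("the", docs))  # log(3/3) = 0.0 — appears everywhere, no signal
print(idf("cat", docs))  # log(3/1) ≈ 1.10 — rare, strong signal
```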
TF/IDF is thus simply the product of those two measures. The higher it is, the stronger the match for this specific term.
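Putting the two together, a minimal sketch of the product (I picked the relative TF variant here; the toy corpus is made up):

```python
import math

def tf_idf(term, doc, documents):
    tf = doc.count(term) / len(doc)              # relative term frequency
    df = sum(1 for d in documents if term in d)  # document frequency
    return tf * math.log(len(documents) / df)    # TF times IDF

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "ran"],
        ["the", "fish", "swam"]]
print(tf_idf("the", docs[0], docs))  # 0.0 — "the" is in every document
print(tf_idf("cat", docs[0], docs))  # > 0 — rare term with decent frequency
```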
Since we can compute a TF/IDF for any term in the query, we can create vectors of size equal to the vocabulary, one TF/IDF score per term, which is what vector-space models build on.
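One way to sketch those per-document vectors and compare them; the vocabulary construction, corpus, and cosine-similarity helper are all illustrative choices of mine:

```python
import math

def tfidf_vector(doc, documents, vocab):
    # one TF/IDF score per vocabulary term
    vec = []
    for term in vocab:
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in documents if term in d)
        vec.append(tf * math.log(len(documents) / df) if df else 0.0)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [["cat", "sat", "mat"], ["dog", "ran", "far"], ["cat", "ate", "fish"]]
vocab = sorted({t for d in docs for t in d})
vecs = [tfidf_vector(d, docs, vocab) for d in docs]
print(cosine(vecs[0], vecs[1]))  # 0.0 — no shared terms
print(cosine(vecs[0], vecs[2]))  # > 0 — both mention "cat"
```

Documents sharing rare terms end up closer together in this space than documents sharing only common ones.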
Note that TF/IDF is just one measure, and it has its own issues; for example, the sensitivity of the score to the TF is entirely scaled by the IDF (and vice versa). Big number go brrr.
Furthermore, while remediation measures exist to account for document length (like the augmented frequency for the TF), they only partially compensate, and longer documents still tend to be favored.
Relevant Links
| File | Folder | Last Modified |
|---|---|---|
| Vector-Space Models | 1. Cosmos | 11:31 AM - February 25, 2026 |