202602181709
Status: #reference
Tags: Information Retrieval
State: #nascent
Zipf's Law
Zipf's Law is the discrete form of the Pareto Distribution as applied in Information Retrieval. It is a power law which tells us that, roughly, the frequency of a term is inversely proportional to its rank in the frequency table.
The further down the ranking a term sits, not only is it less frequent than its partner ranked just before it, it is less frequent by a power, not by a fixed step.
This means if I have a word ranked k, its frequency should be roughly 1/k of the frequency of the top-ranked word.
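A minimal sketch of checking this on a toy corpus: if Zipf's law holds, frequency times rank should stay roughly constant down the ranking. The corpus here is an invented example, not real data.

```python
from collections import Counter

# Tiny invented corpus; a real collection would be far larger.
corpus = (
    "the cat sat on the mat and the dog sat on the log "
    "the cat and the dog are friends of the owner of the mat"
).split()

counts = Counter(corpus)
ranked = counts.most_common()

# Zipf's law predicts frequency * rank is roughly constant:
for rank, (term, freq) in enumerate(ranked[:5], start=1):
    print(rank, term, freq, rank * freq)
```

On a corpus this small the product fluctuates, but on large natural-language collections the relationship is strikingly stable.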
This is the discrete case of a phenomenon first observed in economics: in almost every country where Pareto studied it, the distribution of salaries was strongly concentrated towards 0, yet no matter how high in value you went (within reason), the density fell but never reached 0.
More importantly, the density stayed substantially higher than that of the equivalent Gaussian Distribution curve, which is derived from additive laws and processes.
Why does this matter for Information Retrieval? Because in practice the most important words tend to be the rarest, for exactly the reasons captured by Inverse Document Frequency (IDF) and its family of weightings.
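A minimal sketch of the standard IDF weighting, idf(t) = log(N / df(t)), on an invented three-document collection: a term present in every document gets weight 0, while the rarest terms get the highest weight.

```python
import math

# Hypothetical mini-collection of three documents, as sets of terms.
docs = [
    {"the", "quick", "fox"},
    {"the", "lazy", "dog"},
    {"the", "fox", "jumps"},
]
N = len(docs)

def idf(term):
    """Standard inverse document frequency: log(N / document frequency)."""
    df = sum(term in d for d in docs)
    return math.log(N / df)

print(idf("the"))    # in every document: weight 0.0
print(idf("jumps"))  # in one document: the highest weight
```

Real systems usually add smoothing (e.g. log(1 + N/df)) to avoid division by zero for unseen terms; this sketch omits it.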
Which means that for your dataset not to be dominated by meaningless words like "the", "of", and "at", which occur all the time, you need to account for them and probably cut them from your dataset. Each increase in the rank cutoff removes only a handful of vocabulary entries that must be stored in memory, but significantly drops the number and heft of the posting lists which are kept, because the top-ranked terms own the longest lists.
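This effect can be sketched with a toy inverted index: dropping just the top-ranked terms removes a small fraction of the vocabulary but a large fraction of the total postings. The documents and the cutoff of 2 are arbitrary choices for illustration.

```python
from collections import defaultdict

# Toy documents; real collections make the effect far more dramatic.
docs = [
    "the cat sat on the mat",
    "the dog lay on the rug",
    "a cat and a dog met at the park",
]

# Build posting lists: term -> list of (doc_id) occurrences.
postings = defaultdict(list)
for doc_id, text in enumerate(docs):
    for term in text.split():
        postings[term].append(doc_id)

# Rank terms by total occurrences and cut the most frequent ones.
by_freq = sorted(postings, key=lambda t: len(postings[t]), reverse=True)
stopwords = set(by_freq[:2])  # the cutoff threshold is arbitrary here

total = sum(len(p) for p in postings.values())
kept = sum(len(p) for t, p in postings.items() if t not in stopwords)
print(f"dropped {len(stopwords)} of {len(postings)} terms, "
      f"postings shrink from {total} to {kept}")
```

Even in this tiny example the two dropped terms account for a disproportionate share of all postings.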
Which leads to the second point: while the 200,000th term appears only a handful of times, such rare terms are often the most informative ones, and many of them are merely morphological variants of one another.
If we take the word compute:
- compute
- computing
- computer
- computes
These are likely all relevant to a query about computing, but they take four spots in the vocabulary. Reducing them all to one common form, compute or even comput, would increase recall by not needing to match any specific variant. This is called Stemming and is one of the core preprocessing steps in classical NLP; it is often compared to Lemmatization, which is another, more advanced way of reducing terms to common token units.
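A deliberately naive suffix-stripping sketch of the idea: the suffix list and length guard below are invented for this example; production systems use a real stemmer such as Porter or Snowball (available via NLTK).

```python
# Naive stemmer: strip the first matching suffix, but only if a
# reasonably long stem (>= 4 characters) remains. Not Porter's algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")

def stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ["compute", "computing", "computer", "computes"]:
    print(w, "->", stem(w))  # all four collapse to "comput"
```

All four variants map to the single vocabulary entry comput, so a query containing any one of them matches documents containing any other. The price is occasional over-stemming of unrelated words, which is why real stemmers carry many more rules and exceptions.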
Relevant Links
| File | Folder | Last Modified |
|---|---|---|