202602181535
Status: #reference
Tags: Information Retrieval, Linear Algebra
State: #awakened
Vector-Space Models
If Boolean Information Retrieval Models are the Kinro's of the world (rules are the rules), inflexible but sometimes quite useful in that inflexibility, then vector-space models are the nuanced brother.
The core issue that Vector-Space Models (VSMs) aim to solve is the disparity between a user's Information Need and the retrieved documents. More specifically, the failure of Boolean models to consider partial matches and to rank them.
It starts from the observation that, everything else being equal, a document containing 100 occurrences of the word "red" is likely a significantly better match than one containing only 5, but only marginally better than one containing 99.
This means that while in a Boolean model each document is more or less a bag of bits, each flipped to 1 if the document contains the corresponding term, in a VSM each document carries graded weights rather than flags.
How do they Work?
The core difference between the primitive Boolean Information Retrieval Models and the more modern (but dated at this point) Vector-Space model is that VSMs introduce ranking and scoring as first-class citizens.
The general flow is this:
- A query comes in; it can be boolean, but in general it is any string of tokens
- The query is converted into a vector of dimension $|V|$, where $|V|$ is the size of the vocabulary
- For the documents which contain at least one of the query terms, we compare the query vector to the relevant document vector
- Score each document based on the alignment between the query and document vectors, measured as Cosine Similarity
- Rank the documents in decreasing order of that alignment score
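The flow above can be sketched in a few lines of Python. The corpus, the query, and the use of raw term counts as weights are all illustrative assumptions here (a real system would use TF-IDF weights, covered below):

```python
# Minimal sketch of the VSM flow: vectorize, filter to documents sharing
# at least one query term, score by cosine similarity, rank descending.
from collections import Counter
import math

docs = {
    "d1": "red red blue",
    "d2": "red green green",
    "d3": "yellow yellow yellow",
}
query = "red green"

def to_vector(text, vocab):
    # Raw term counts as weights (a simplifying assumption; real systems use TF-IDF).
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in vocab]

def cosine(q, d):
    # Dot product normalized by the two vector lengths.
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

vocab = sorted({t for text in docs.values() for t in text.split()})
query_vec = to_vector(query, vocab)

# Only score documents containing at least one query term.
scores = {
    name: cosine(query_vec, to_vector(text, vocab))
    for name, text in docs.items()
    if set(text.split()) & set(query.split())
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With this toy corpus, `d2` outranks `d1` (it matches both query terms), and `d3` is never scored at all since it shares no terms with the query.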
The first big, obvious advantage of this method is that we never require a perfect match between the document and the query, which means we expect higher recall (relative to the Information Need), since partial matches will be retrieved. Consequently, as is often the case, we also expect lower precision, since documents containing only a few of the query's words may still be retrieved.
Now. The flow has been explained, but how do you actually do any of that? While we know we can easily compute the alignment between vectors by doing a dot product, and similarity is naught but a normalized version of that dot product, what about the rest?
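Spelled out, the normalized dot product mentioned here is exactly cosine similarity:

```latex
\cos(\vec{q}, \vec{d})
= \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
= \frac{\sum_{i=1}^{|V|} q_i \, d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2} \, \sqrt{\sum_{i=1}^{|V|} d_i^2}}
```

where $q_i$ and $d_i$ are the weights of term $i$ in the query and document vectors, and $|V|$ is the vocabulary size. Normalizing by length keeps long documents from winning on sheer volume.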
How do you compute the match between a document and a term? Cue TF-IDF, the (potential) value that lives in each of these vector components.
TF-IDF ~ Term Frequency-Inverse Document Frequency
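As a sketch of where this is headed: one common TF-IDF variant multiplies the raw term frequency by a log-scaled inverse document frequency. The exact variant and the toy corpus below are assumptions, not something this note has pinned down yet:

```python
# One common TF-IDF variant: raw term frequency times log-scaled IDF.
import math

corpus = [
    ["red", "red", "blue"],
    ["red", "green"],
    ["yellow"],
]

def tf(term, doc):
    # Term frequency: how often the term occurs in this document.
    return doc.count(term)

def idf(term, corpus):
    # Inverse document frequency: terms appearing in fewer documents score higher.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)
```

Note how "red", which appears in two of the three documents, gets a lower IDF than "yellow", which appears in only one; frequency within a document boosts the score, while ubiquity across the corpus dampens it.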
Relevant Links
| File | Folder | Last Modified |
|---|---|---|
| Boolean Information Retrieval Models | 1. Cosmos | 3:35 PM - February 18, 2026 |