202602181535
Status: #reference
Tags: Information Retrieval, Linear Algebra
State: #awakened

Vector-Space Models

If Boolean Information Retrieval Models are the Kinro's of the world (rules are the rules), inflexible but sometimes quite useful in that inflexibility, then vector-space models are the nuanced brother.

The core issue that Vector-Space Models (VSMs) aim to solve is the disparity between the Information Need of a user and the retrieved documents; more specifically, the failure of Boolean models to consider partial matches and to rank them.

It builds on the observation that a document containing 100 occurrences of the word "red" is likely a significantly better match (everything else being equal) than one containing only 5, but only marginally better than one containing 99.

In a Boolean model, each document is more or less a bag of bits: the i-th bit is flipped to 1 if the document contains the i-th term of the inverted index, and 0 otherwise. With vector-space models, we instead consider how close to a perfect match (1) a document is: each document becomes a vector whose i-th component is the degree to which the document matches the i-th term of the vocabulary. The query can still be Boolean in nature, but the transition to VSMs generally implies that queries are now treated as documents in their own right.
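The vector view above can be made concrete with a minimal sketch, assuming a toy three-word vocabulary and raw term counts as a stand-in for the degree-of-match weights (real systems use TF-IDF or similar; `to_vector` is a hypothetical helper name):

```python
from collections import Counter

vocabulary = ["red", "car", "fast"]  # assumed toy vocabulary, so V = 3

def to_vector(tokens):
    """Map a token list to a V-dimensional vector of raw term counts."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc = to_vector(["red", "red", "car"])
query = to_vector(["red", "fast"])  # the query is treated as a document too
print(doc)    # [2, 1, 0]
print(query)  # [1, 0, 1]
```

Note how the query goes through exactly the same vectorization as a document, which is what "queries are treated as documents in their own right" means in practice.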

How do they Work?

The core difference between the primitive Boolean Information Retrieval Models and the more modern (but dated at this point) Vector-Space model is that VSMs introduce ranking and scoring as first-class citizens.

The general flow is this:

  1. A query arrives; it can be Boolean, but it can be any string of tokens.
  2. The query is converted into a vector of dimension V, where V is the size of the vocabulary.
  3. For each document that contains at least one of the query terms, we compare the query vector to that document's vector.
  4. Score each document based on the alignment between the query and document vectors, measured as cosine similarity.
  5. Rank the documents in decreasing order of alignment.
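The five steps above can be sketched end to end, assuming a toy corpus, raw term counts as weights, and an invented helper name `retrieve`:

```python
import math
from collections import Counter

def vectorize(tokens, vocabulary):
    """Step 2: map a token list to a V-dimensional count vector."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

def cosine(u, v):
    """Step 4: alignment as normalized dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def retrieve(query_tokens, corpus):
    vocabulary = sorted({t for doc in corpus.values() for t in doc})
    q = vectorize(query_tokens, vocabulary)
    # Step 3: only score documents sharing at least one term with the query.
    candidates = {
        name: doc for name, doc in corpus.items()
        if set(doc) & set(query_tokens)
    }
    scores = {
        name: cosine(q, vectorize(doc, vocabulary))
        for name, doc in candidates.items()
    }
    # Step 5: rank by decreasing alignment.
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = {
    "d1": ["red", "red", "car"],
    "d2": ["blue", "car"],
    "d3": ["green", "tree"],
}
print(retrieve(["red", "car"], corpus))  # d1 outranks d2; d3 is never scored
```

This is a sketch, not an implementation: a real engine would use the inverted index to find candidates instead of scanning the corpus, and TF-IDF weights instead of raw counts.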

The first big, obvious advantage of this method is that we never require a perfect match between the document and the query, which means we expect higher recall (relative to the Information Need), since partial matches will be retrieved. Consequently, as is often the case, we also expect lower precision, since documents which don't contain even most of the words in the query might still be retrieved.

Now. The flow has been explained, but how do you actually do any of that? While we know we can easily compute the alignment between two vectors with a dot product, and that cosine similarity is naught but a normalized version of that dot product, what about the rest?
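The "normalized dot product" claim can be checked directly: cosine similarity is the dot product divided by the product of the two vectors' lengths, which is the same as the plain dot product of the unit-normalized vectors. A small sketch (vector values are arbitrary):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]

u, v = [3.0, 4.0], [4.0, 3.0]
cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
# Equivalently: the plain dot product of the unit-normalized vectors.
assert abs(cosine - dot(normalize(u), normalize(v))) < 1e-12
print(cosine)  # 24 / 25 = 0.96
```

The normalization is what makes the score length-independent, so a long document isn't favored just for repeating every term more often.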

How do we compute the match between a document and a term? Cue TF-IDF, the (potential) value that lives in each of these V cells, and our measure of a document's match with a given word.

TF-IDF ~ Term Frequency-Inverse Document Frequency
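As a preview, the textbook form of the weight multiplies the term's frequency in the document by the log of how rare the term is across the collection. A hedged sketch (this is the standard formulation; many variants exist, e.g. log-scaled tf or smoothed idf):

```python
import math

def tf_idf(term_count, docs_with_term, total_docs):
    """Textbook tf-idf: raw term frequency times inverse document frequency."""
    tf = term_count
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf

# "red" appears 5 times in a document; 100 of 10,000 documents contain it.
print(tf_idf(5, 100, 10_000))  # 5 * log10(100) = 10.0
```

The idf factor is what keeps ubiquitous words like "the" (which appear in nearly every document, so idf ≈ 0) from dominating the score.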
