202602181409
Status: #reference
Tags: Information Retrieval
State: #awakened

Boolean Information Retrieval Models

What are they?

Boolean models are the most summary kind of models one could imagine. The idea is to reply to a boolean question (yes/no).

That question is does the query match a document, more specifically does the document contain the query.

As you can do in Boolean Algebra, the concepts of AND, OR, etc. exist as well. Which means that in practice the scope of things that can be covered can be deceptively vast, as I can look for documents that contain "Red" AND "dog", or documents that contain "Blue" but NOT "berry", etc.

If they are so powerful, why do I call them basic?

Their main issue is their Manichean view of the world. If a document is a partial fit, it will not be retrieved. While in some contexts this might be desirable, often in information retrieval we do not care about "exact" matches so much as we do semantic or even partial matches. If there's no document that contains both red and dog, but I have documents that contain either red or dog, I might still want to see them.

This mismatch between the methodology and something we call the Information Need is the confusing reason why a system may do exactly what it's been coded to do, and still be a bad system respective to your goal.

Boolean models in light of that, tend to be fairly bad models (relative to information need!)

However, this doesn't make them useless, in fact their boolean nature and the ability to compose queries make them a foundation of many more advanced modern retrieval systems.

They are often used as a foundation to more nuanced Vector-Space Models which can meaningfully score the match between a query and a document, as opposed to Boolean models which can only return a set of matches.

How do they Work?

Like most Information Retrieval systems, they are dependent on indices more specifically, inverted indices to function. Otherwise, the complexity in searching the collection would be extremely prohibitive.

The inverted index is an index which uses the tokens as keys and use the documents as the values (as opposed to a Forward Index which does the opposite.)

In practice, it is often implemented as a HashMap which utilizes the tokens as keys to list of sorted document IDs (called Postings) containing them along with the associated counts.

While the counts can be foregone in a Boolean model since it really only cares about whether absolute term occurrence, it is often collected nonetheless since the expanse of traversing the collection is already being done anyways.

In practice.

Query hits of form: "term" BOOLEAN_CONNECTOR "term"
System retrieves the documents with non-zero count of term 1 (which can be easily done since the term points to the documents which contain it) and this posting list can be compared to another list.
If the BOOLEAN_CONNECTOR is a AND, the resulting set would be the intersection of the two sets. If it's a OR it'd be the union.

While I am using set terminology, this is much too inefficient in practice. Instead, we exploit the order of the postings (which are sorted) to find the resulting merged set.
The algorithms in question are explained in more details here:

The key point is that we use these sorted postings to gain $O (n + m)$ linear time complexity instead of the $O (m n)$ we'd have otherwise with naive set operations. Even still then, we can see here, a few gaping problems. While we already observed that boolean systems are deceptively vast, they are also deceptively complicated to use.

Indeed, for any collection of documents of non-trivial size, we will often struggle with a feast/famine dichotomy where a given query might be too vast and match many documents which are returned without ranking AND on the other side a query might be too precise and match 0 documents despite many partial matches existing.

Since the many documents returned are often irrelevant to our information need, we will observe a low Precision (1-false discovery proportion) if the query is too vast. On the other hand, if it is too specific, many documents matching parts of the query will be ignored, leading to staggeringly low Recall (Statistical Power).

For the negation which I had until now omitted, we can simply consider the set of documents compared to the full collection of documents. But this generally leads to the feast problem outlined before on steroid, which is the reason it and query of its ilk Pure Negative Queries are generally illegal.

Which means we only really use it to mean $A - B$ , not really as $\neg B$ .

This is to say nothing of the cardinality issue, in practice if we have 3 terms A,B,C and we have C is the shortest, if the terms are connected by a AND, it is more efficient to consider the shortest posting lists first to reduce the search space. This is not two dissimilar to query arrangement in Relational Database Management System where we may have many valid queries, but we favor those that minimize the size of the in-flight set we're operating on.

Relevant Links

File	Folder	Last Modified
Vector-Space Models	1. Cosmos	11:31 AM - February 25, 2026