202412121304
Status: #idea
Tags: Machine Learning, Clustering Methods, Unsupervised Learning
State: #awakened

Spectral Clustering

A graph-based method that turns the clustering problem into one of disconnecting a graph at minimal cost.
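To make "minimal cost" precise: for a split of the nodes into a set $S$ and its complement $\bar{S}$, the cost is the total weight of the edges the split severs,

$$\operatorname{cut}(S, \bar{S}) = \sum_{i \in S,\, j \in \bar{S}} w_{ij},$$

and objectives like RatioCut or Ncut normalize this by cluster size (or volume) so the minimizer isn't a trivial cut that just isolates a single node.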

  1. Form a similarity graph containing all the observations, using some similarity measure such as the Gaussian kernel or k-nearest neighbors to compute the edges.
  2. Then, from this graph, create a new graph in which observations that are too dissimilar do not share an edge, and those that are similar do (note that we don't care about an observation's similarity to itself, so there are no self-loops). We prune the graph and only retain significant connections.
  3. We compute the Laplacian matrix L of the graph, which has connections to Laplace's equation. It is given either by $L = A^T A$, where A is the incidence matrix of our graph, or equivalently by $L = D - B$, where D is the degree matrix with the degree of each node along the diagonal and B is the adjacency matrix. Note that this is the unnormalized version; there exist two others:
    • Normalized (symmetric): $L_{sym} = D^{-1/2} L D^{-1/2}$.
    • Normalized (random walk): $L_{rw} = I - D^{-1} B$.
    • Sadly, which one we pick will affect the results.
  4. Thanks to the properties of the Laplacian, we know for sure that 0 will be an eigenvalue (the all-ones vector is in its null space) and that the matrix is positive semi-definite, so all eigenvalues are non-negative.
  5. To split into two groups, we start from the smallest eigenvalues: we take the smallest eigenvalue excluding 0 and use its eigenvector (the Fiedler vector) to partition the nodes into two sets, e.g. by the sign of its entries (see the sketch after this list).
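A minimal NumPy sketch of steps 1–5, assuming a fully connected Gaussian-kernel graph (no pruning step) and a connected graph so that only one eigenvalue is zero; `sigma` is an illustrative bandwidth parameter, not something fixed by the method:

```python
import numpy as np

def fiedler_partition(X, sigma=1.0):
    """Two-way spectral split of observations X (n x d) via the Fiedler vector."""
    # Step 1: Gaussian-kernel similarity graph over all pairs of observations
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    B = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(B, 0.0)       # no self-loops: self-similarity is irrelevant

    # Step 3: unnormalized Laplacian L = D - B
    D = np.diag(B.sum(axis=1))
    L = D - B

    # Steps 4-5: L is symmetric PSD; eigh returns eigenvalues in ascending order.
    # The first is ~0, so column 1 is the Fiedler vector (assumes connected graph).
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]

    # Partition by the sign of the Fiedler vector's entries
    return fiedler >= 0
```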

If we want more than two clusters, we start the same way but then:

  1. We select the first k eigenvectors, starting from the smallest eigenvalues (still excluding the zero eigenvector).
  2. From these eigenvectors we form an n×k matrix where each column is an eigenvector; this is called the Spectral Embedding.
  3. We run a standard clustering algorithm on the rows of this matrix, clustering in the spectral space, typically using k-means.
  4. We assign each observation to the cluster obtained in the previous step (sketched below).
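A sketch of the k-way version, assuming the adjacency matrix B has already been built (e.g. as in the earlier sketch) and, as in the note, taking the first k eigenvectors past the zero one:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(B, k):
    """k-way spectral clustering from an adjacency matrix B (n x n)."""
    D = np.diag(B.sum(axis=1))
    L = D - B                                # unnormalized Laplacian

    # Eigenvectors sorted by ascending eigenvalue; skip the trivial zero one
    _, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, 1:k + 1]          # n x k spectral embedding

    # Cluster the rows of the embedding in spectral space with k-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```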

Once we have gone through the previous processing steps, the hope is that the clusters become linearly separable in the spectral embedding space (which is generally the case). As a result, k-means is a great tool to use in conjunction with spectral clustering.

When to use?

I haven't personally used it, but it seems to do especially well on non-convex data and data that is not linearly separable (I mean, if your data is linearly separable, just do k-means).
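As a concrete check of that claim, scikit-learn ships the whole pipeline as sklearn.cluster.SpectralClustering, and the classic two-moons dataset is non-convex; the parameter values here are just illustrative:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-convex and not linearly separable
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI: ", adjusted_rand_score(y, km))   # typically well below 1
print("spectral ARI:", adjusted_rand_score(y, sc))   # typically close to 1
```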