
Before deep learning, what did people use to embed images to feed into machine learning models? They computed image keypoints, using a lot of image priors (aka inductive bias). Keypoints are defined as “unique” patches within the image, which stay the same when the camera angle changes. Even before embedding, keypoints are used to map pixel points between images from different camera angles (which is important in 3D reconstruction, camera pose estimation, etc.). Now for embedding, the descriptors of these keypoints, which are numerical, can be aggregated (e.g., concatenated) into a latent vector of the image.

There are two popular types of keypoints: corners and blobs. This article covers classical algorithms to detect these two types, namely Harris Corner Detection and Laplacian of Gaussian, respectively. Finally, it will introduce SIFT, which makes use of these keypoints for matching and image representation.

Harris Corner Detector

The goal is to detect the pixel locations of all corners in the image. The key idea is to define corners as regions with large variation in intensity in all directions (while a flat region has no variation, and an edge has no variation along the edge direction).

In math, let’s consider a grayscale image \(I\), where \(I(x, y)\) returns the intensity of pixel \((x, y)\). We define a patch \(p\) via its window function \(w_p\):

\[w_p(x, y) = \begin{cases} 1 & \text{if } (x, y) \text{ is in the patch} \\ 0 & \text{otherwise} \end{cases}\]

Then we can quantify the “change in intensity” of \(p\) along a direction \((\Delta x, \Delta y)\) using the following function:

\[f_p(\Delta x, \Delta y) = \sum_{x_k, y_k} w_p(x_k, y_k) (I(x_k, y_k) - I(x_k + \Delta x, y_k + \Delta y))^2\]

Note that \(f_p\), despite looking global via the summation, is specific to patch \(p\) due to the window function \(w_p\) that suppresses out-of-window pixels. The point \((x_k, y_k)\) can be seen as one of the points within the window, where its directional change is being summed. Now patch \(p\) contains a corner iff \(f_p(\Delta x, \Delta y)\) is “large” for all \((\Delta x, \Delta y)\). Call this condition “(*)”.
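The function \(f_p\) can be sketched directly in NumPy. Below is a toy example (the image, window, and shifts are all made up for illustration): a window placed over the corner of a bright square responds to shifts in every direction.

```python
import numpy as np

def f_p(I, window, dx, dy):
    """Sum of squared intensity differences over a patch when the
    content is shifted by (dx, dy). `window` is a boolean mask of
    the same shape as I, playing the role of w_p."""
    # shifted[y, x] == I[y + dy, x + dx] (with wrap-around at borders)
    shifted = np.roll(np.roll(I, -dy, axis=0), -dx, axis=1)
    diff = (I - shifted) ** 2
    return float((diff * window).sum())

# Toy image: a bright square whose top-left corner is at (4, 4).
I = np.zeros((8, 8))
I[4:, 4:] = 1.0
w = np.zeros((8, 8), dtype=bool)
w[3:6, 3:6] = True  # window around the corner

# A corner patch responds strongly to shifts in all directions.
print(f_p(I, w, 1, 0), f_p(I, w, 0, 1), f_p(I, w, 1, 1))
```

An edge patch, by contrast, would give a near-zero value for the shift direction along the edge, which is exactly what condition (*) rules out.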

Instead of trying out many \((\Delta x, \Delta y)\), we have an analytical solution. To check (*), we only need to check whether the minimum value of \(f_p(\Delta x, \Delta y)\) over all unit-length directions is large. How do we find this minimum? First, using a Taylor approximation, we can estimate \(f_p(\Delta x, \Delta y)\):

\[f_p(\Delta x, \Delta y) \approx \begin{bmatrix} \Delta x & \Delta y \end{bmatrix} M_p \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}\]

\(M_p \in \mathbb{R}^{2 \times 2}\) is independent of \(\Delta x, \Delta y\), and only depends on \(w_p\) and the image gradients \(I_x\) and \(I_y\):

\[M_p = \sum_{x_k, y_k} w_p(x_k, y_k) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}\]

Therefore, the minimum of \(f_p\) over unit-length directions is the smallest eigenvalue of \(M_p\), attained at the corresponding eigenvector (see how Data Representations knowledge comes in handy!). This can be solved in an instant. We are done. Note that Harris can still be seen as an image filter, just a lot more complicated than a linear filter.
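Here is a minimal sketch of this pipeline, assuming a box window and the smallest-eigenvalue criterion described above (the window size and the toy image are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def min_eig_response(I, win=3):
    """Smallest-eigenvalue corner response (the minimum of f_p over
    unit directions), computed for every pixel with a box window."""
    Iy, Ix = np.gradient(I.astype(float))   # image gradients
    # Entries of M_p, averaged over the window via a box filter.
    A = uniform_filter(Ix * Ix, win)        # ~ sum w * Ix^2
    B = uniform_filter(Ix * Iy, win)        # ~ sum w * Ix Iy
    C = uniform_filter(Iy * Iy, win)        # ~ sum w * Iy^2
    # Closed-form smallest eigenvalue of the 2x2 matrix [[A, B], [B, C]].
    tr, det = A + C, A * C - B * B
    return tr / 2 - np.sqrt(np.maximum((tr / 2) ** 2 - det, 0))

# A white square on black: the response peaks near its four corners
# and vanishes on flat regions and along the edges.
I = np.zeros((32, 32)); I[8:24, 8:24] = 1.0
R = min_eig_response(I)
y, x = np.unravel_index(R.argmax(), R.shape)
print((y, x))  # near one of the square's corners
```

Strictly speaking, using the smallest eigenvalue directly is the Shi–Tomasi variant; the original Harris paper scores with \(\det M_p - k\,(\mathrm{tr}\, M_p)^2\) to avoid the square root, but the idea is the same.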

Non-maximum suppression (NMS). One actual corner can generate many nearby patches satisfying (*). We deduplicate them by NMS. This algorithm is typically seen in deduplicating bounding boxes. For corner patches, we can just remove any patch whose \(f\) value is smaller than that of any of its neighbors.
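For a dense response map, NMS can be one line with a sliding maximum filter (a minimal sketch; the neighborhood size is an arbitrary choice):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def nms(response, size=5):
    """Keep only pixels that are the maximum of their size x size
    neighborhood (and have a positive response)."""
    local_max = maximum_filter(response, size=size)
    return (response == local_max) & (response > 0)
```

Each surviving pixel is then reported as one corner.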

Laplacian of Gaussian for Blob Detection

Harris is good, but it is sensitive to scale: if you zoom into a corner far enough, you will see an edge. By keeping a fixed-size patch, the Harris corner detector is scale-variant.

Note that to find keypoints, we don’t have to stick to corners. We can find blobs instead. A blob is a region where pixels share similar intensity and differ significantly from the surrounding area. It turns out blobs can be found in a scale-invariant manner, even with noisy data.

First, let’s look at finding edges in 1D noisy data. From a signal \(f\), we first apply a Gaussian filter to get \(h \star f\), which suppresses noise. Then we apply the derivative filter to highlight the edges: \(\frac{\partial}{\partial x}(h \star f)\). Note that the filter form of the derivative is \(\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}\). Conveniently, convolution is associative, so we can collapse these two operations into one single kernel called the “Derivative of Gaussian”: \(\frac{\partial}{\partial x}(h \star f) = (\frac{\partial h}{\partial x}) \star f\).
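This associativity is easy to verify numerically. The sketch below (a made-up noisy step signal; kernel sizes are arbitrary) compares smooth-then-differentiate against a single pass with the combined kernel:

```python
import numpy as np

def gaussian(sigma, radius):
    """A normalized 1D Gaussian kernel."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

# Noisy step signal: an "edge" at index 20.
rng = np.random.default_rng(0)
f = np.r_[np.zeros(20), np.ones(20)] + 0.05 * rng.standard_normal(40)

h = gaussian(2.0, 8)
d = np.array([-1, 0, 1])  # derivative filter

# Smooth, then differentiate ...
a = np.convolve(np.convolve(f, h, mode="same"), d, mode="same")
# ... versus one pass with the Derivative-of-Gaussian kernel.
dog = np.convolve(h, d, mode="full")
b = np.convolve(f, dog, mode="same")
# The two agree exactly away from the array boundaries.
```

The response \(b\) has a single strong peak near index 20, so thresholding its magnitude localizes the edge despite the noise.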

Now, blobs are just homogeneous regions surrounded by an edge. To find those blobs, we use the 2D Laplacian filter: \(\nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}\). In filter form, it is simply:

\[\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}\]

Now, to find blobs without knowing the optimal scale, you search over many downscaling factors of the image, and apply the “Laplacian of Gaussian” (LoG) filter to each downscaled image. The key insight is that, at the optimal scale, the response from the (scale-normalized) LoG filter is the highest. Therefore, LoG tells us both the locations of the blobs and the scales at which the blobs are detected. A good scale-invariant blob is one whose response peaks strongly in this scale search. Those are also features of the image.
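The scale search can be sketched with SciPy. This toy example varies the Gaussian’s \(\sigma\) instead of downscaling the image (equivalent in spirit), and uses \(\sigma^2\) normalization so responses are comparable across scales; the disk radius and \(\sigma\) grid are made up:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

# A bright disk of radius 6 centered at (32, 32): a blob.
yy, xx = np.mgrid[:64, :64]
I = ((yy - 32) ** 2 + (xx - 32) ** 2 <= 6 ** 2).astype(float)

# Scale-normalized LoG response at the blob center, over sigmas.
# (Negated: LoG is negative at the center of a bright blob.)
sigmas = np.arange(1.0, 10.0, 0.5)
responses = [s**2 * -gaussian_laplace(I, s)[32, 32] for s in sigmas]
best = float(sigmas[int(np.argmax(responses))])
print(best)  # the characteristic scale, roughly radius / sqrt(2)
```

The \(\sigma\) that maximizes the response recovers the blob’s size, which is exactly the scale information a scale-invariant detector needs.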

SIFT (Scale Invariant Feature Transform)

Now, SIFT makes keypoints useful. A big application of keypoints is to enable matching of pixel points across different views. Therefore, after detecting the keypoint locations, we need to represent each keypoint as a vector, so that it can be compared in the embedding space with other keypoints from possibly other viewpoints.

SIFT does so by the following 2 steps:

  1. Detect all the cross-scale keypoints. For efficiency, it uses “Difference of Gaussians” (DoG), which is an approximation of LoG. And instead of changing the filter for different scales, it downsamples the image to different scales1. Keypoints are maxima and minima of the DoG images across space and scale.
  2. Compute descriptors.
    1. Each descriptor is computed using the 16x16 patch around the keypoint. For each pixel in the patch, it first computes the image gradients, which is a 2D vector.
    2. Then, the 16x16 patch is grouped into 4x4 cells. Each cell contains 16 gradient vectors. For each cell, a histogram of 8 orientations is computed.
    3. To be rotation-invariant, all orientations are rotated by the patch’s dominant orientation before binning.

Now, 16 cells, each with 8 orientation magnitudes, result in a 128-dimensional vector. This is the SIFT descriptor of the keypoint. These descriptors can be used to match keypoints across views!
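The descriptor computation above can be sketched in NumPy. This is a simplified version: it skips real SIFT’s Gaussian weighting, rotation normalization, and trilinear interpolation, but it produces the same 4×4 cells × 8-bin, 128-dimensional layout:

```python
import numpy as np

def sift_like_descriptor(patch):
    """128-d descriptor from a 16x16 patch: 4x4 cells, each with an
    8-bin orientation histogram weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                 # gradient magnitude
    ori = np.arctan2(gy, gx)               # gradient orientation in (-pi, pi]
    bins = ((ori + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    desc = []
    for cy in range(4):                    # 4x4 grid of 4x4-pixel cells
        for cx in range(4):
            cell = np.s_[4*cy:4*cy+4, 4*cx:4*cx+4]
            hist = np.bincount(bins[cell].ravel(),
                               weights=mag[cell].ravel(), minlength=8)
            desc.append(hist)
    desc = np.concatenate(desc)            # 16 cells x 8 bins = 128 dims
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc

patch = np.random.default_rng(0).random((16, 16))
d = sift_like_descriptor(patch)
print(d.shape)  # (128,)
```

Matching then reduces to nearest-neighbor search between descriptor vectors (often with Lowe’s ratio test to reject ambiguous matches).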

Personally, I first came across SIFT in I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions (Wang and Wen, NAACL’15). This is one of the first works on computational meme processing that I could find. They studied the problem of automatically generating the fill text (which they call “meme descriptions”) for meme templates. Via Pyramid of Histograms of Visual Words (PHOW), SIFT was used to embed images for text generation.

Further reads

  1. This is more efficient than trying multiple scales on the original image! Think of \(\sum_{k}1/{2^k}\) vs. \(\sum_{k}1\). 
