Course Review: Data Representations
Given a dataset, I will heat up the whole GPU. (Anonymous vibe-mathematician)
I took Data Representations by Prof. Haim Schweitzer last semester. Based on its description on the UTD website, this course is about “useful representations of data for efficient manipulation and visualization”, with special emphasis on “the performance of these techniques on big data”.
After taking it, I think it is a pretty useful course. In this review, I will first summarize what I learned across the two main topics of the course, namely feature engineering and clustering algorithms. Then I will make some remarks about the course-taking experience.

A majestic sunset at a parking lot on campus
Feature engineering (80%)
Depending on the context, this topic can be framed either as engineering good features or as finding a good projection of the data to a lower-dimensional space. The former framing, concerning “features”, is geared toward building predictive models for future data. Meanwhile, the latter focuses more on helping humans understand the current data, mostly because we can comprehend at most 3 dimensions (which is very low).
The feature engineering problem deals with a dataset, typically represented as a matrix $A$. Each row of $A$ is an example (or instance/record), while each column is a feature. In supervised settings, the last column(s) can be reserved for the labels (i.e., the target features for prediction).
Because we are dealing with matrices, this subject refers a lot to tools from linear algebra. First, the techniques heavily utilize eigenvectors ($v$) and eigenvalues ($\lambda$):
\[Av = \lambda v\]along with various properties of eigenvectors, such as the fact that, for a symmetric matrix $A$, the unit vectors maximizing/minimizing $v^TAv$ are eigenvectors. Then, it is important to know the popular matrix decomposition algorithms. There is Singular Value Decomposition, which represents $A=U\Sigma V^T$. There is Eigenvalue Decomposition, which decomposes $A = V\Lambda V^{-1}$, where $\Lambda$ is the diagonal matrix of eigenvalues. And there is QR Decomposition, which represents $A=QR$.
Then, all feature engineering use cases can be classified along two dimensions:
- Is the data labeled? If yes, the problem is supervised. Otherwise, it is unsupervised.
- Do we want novel features or a subset of existing ones? If novel, the problem is called feature extraction. Otherwise, it is feature selection.
This results in four settings for our problems:
Unsupervised feature extraction
In this setting, we are only given the feature values (across all examples), and need to compose a few meaningful features out of them. This is very useful to visualize a new unlabeled dataset for exploration purposes.
PCA is the most popular method here. It has four steps:
- Centering: $A \leftarrow A - \mu$
- Covariance matrix: $C = \frac{1}{n}A^TA$
- Eigenvalue decomposition: $C=VDV^{-1}$
- Projection: $A' = AV_k$, where $V_k$ is the matrix of the top $k$ eigenvectors (i.e., the ones with the largest eigenvalues).
In one sentence, PCA finds the top $k$ eigenvectors of the covariance matrix, and uses them as the projection basis.
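As a sketch, the four steps above map directly to a few lines of NumPy (the function name `pca` is mine, and the 1/n covariance scaling follows the notes above; this is an illustration, not a reference implementation):

```python
import numpy as np

def pca(A, k):
    """Project the rows of A onto the top-k principal components.

    A is an (n_examples, n_features) array; k is the target dimension.
    """
    # Step 1: centering
    A = A - A.mean(axis=0)
    # Step 2: covariance matrix (with 1/n scaling)
    C = (A.T @ A) / A.shape[0]
    # Step 3: eigenvalue decomposition (C is symmetric, so eigh applies)
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 4: keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    V_k = eigvecs[:, order]
    return A @ V_k
```

A nice property to verify: the variance of the projected data along the first component equals the largest eigenvalue, so the components come out in decreasing order of variance.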
MDS is another method for unsupervised feature extraction. It is used when we are not given the features, but only the distances between examples. It finds a projection of the data to a low-dimensional space such that the distances between examples in the projected space are as close as possible to the original distances. It has four steps:
- Compute the distance matrix $D$ of the data.
- Double centering: $B = -\frac{1}{2}HD^{(2)}H$, where $D^{(2)}$ is the matrix of squared distances and $H=I-\frac{1}{n}11^T$ is the centering matrix.
- Eigenvalue decomposition: $B=V\Lambda V^{-1}$.
- Projection: $A' = V_k\Lambda_k^{1/2}$, where $\Lambda_k^{1/2}$ is the diagonal matrix of the square roots of the top $k$ eigenvalues.
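The steps above can be sketched in NumPy (I use `Lambda` for the eigenvalue matrix to avoid clashing with the distance matrix $D$; the function name `classical_mds` is my own):

```python
import numpy as np

def classical_mds(D, k):
    """Classical MDS from an (n, n) matrix of pairwise Euclidean distances.

    Returns an (n, k) embedding whose pairwise distances approximate D.
    """
    n = D.shape[0]
    # Double centering applied to the *squared* distances
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (D ** 2) @ H
    # Eigenvalue decomposition (B is symmetric)
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    # Projection: V_k * sqrt(Lambda_k); clip tiny negatives from rounding
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
```

When $D$ comes from genuinely Euclidean data and $k$ matches the original dimension, this recovers the original distances exactly (up to rotation and reflection).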
Unsupervised feature selection
Moving on to feature selection, we no longer need to compose new features. Instead, we just want to select a subset of the existing features. This is good for interpretability. For example, the features extracted by PCA are linear combinations of the original features, which can be hard to understand. In contrast, feature selection is more interpretable, because we are just selecting a subset of the original features.
A prerequisite technique to know is the reservoir method for online sampling. In its simple setting, given a stream of data, we want to sample $k$ examples from it such that each example has the same probability of being selected. A naive algorithm is to store all the examples in memory and then randomly select $k$ of them. However, this is not memory-efficient for big data, because we need to store all the examples in memory. The reservoir method starts with an empty max-heap. For each new example from the stream, it assigns the example a random weight between 0 and 1 and adds it to the heap. If the heap size exceeds $k$, it removes the example with the largest weight from the heap. This way, we only need to store $k$ examples in memory at any time (hence “reservoir”), and each example has the same probability of being selected.
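A minimal sketch of this heap-based reservoir, assuming the stream is any Python iterable (the function name is mine; Python's `heapq` is a min-heap, so negating the weights simulates the max-heap):

```python
import heapq
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream using random weights.

    Each item gets a random weight in [0, 1); we keep the k items with
    the smallest weights, evicting the largest via a (negated) heap.
    """
    heap = []  # entries are (-weight, item): the root has the largest weight
    for item in stream:
        w = random.random()
        heapq.heappush(heap, (-w, item))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the item with the largest weight
    return [item for _, item in heap]
```

Because the weights are i.i.d. uniform, every item is equally likely to end up among the $k$ smallest, which is exactly the uniformity guarantee claimed above.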
What does this technique have to do with feature selection? It can be used to solve the “column subset selection problem”, which is to select a subset of $k$ features from a dataset with $n$ features, such that the selected features are as informative as possible. To do so, first note that the “informativeness” of a feature can be measured by the norm of its corresponding column in the data matrix. Then, we can use the reservoir method to sample $k$ features from the dataset, where the weight of each feature is proportional to the norm of its corresponding column in the data matrix. This way, we are more likely to select informative features.
Note that, if the input data is a matrix storable in memory, we can simply compute the norm of each column, and select the top $k$ features with the largest norms. However, if the data is too large to fit in memory, we must use techniques like the reservoir method to do online sampling, which is memory-efficient for big data.
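The weighted variant can be sketched with one standard weighted-reservoir scheme (known as A-Res): each column's key is $u^{1/w}$ for a uniform $u$, and we keep the $k$ largest keys, so larger-norm columns are more likely to survive. The function name and the assumption that all norms are strictly positive are mine:

```python
import heapq
import random

def weighted_column_sample(columns, norms, k):
    """Weighted reservoir sampling (A-Res scheme) over column indices.

    `columns` is an iterable of column indices, `norms` their (positive)
    importance weights. Keeps the k columns with the largest keys.
    """
    heap = []  # min-heap of (key, column): the root has the smallest key
    for col, w in zip(columns, norms):
        key = random.random() ** (1.0 / w)  # larger w -> key closer to 1
        heapq.heappush(heap, (key, col))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the column with the smallest key
    return [col for _, col in heap]
```

With this scheme, a column whose norm dwarfs the others is selected almost every time, which is the “more likely to select informative features” behavior described above.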
Supervised feature extraction
Moving on to the supervised setting, we are given the feature values and the categorical labels, and need to compose a few meaningful features out of them. This is useful for building predictive models for future data. It is also useful for answering questions like “what is a perspective of the data that best separates the examples of different classes?”
This is a very familiar classification problem in machine learning, and there are a lot of techniques to solve it: decision trees, Naive Bayes, logistic regression, support vector machines, neural networks, etc. In this course, we limit ourselves to linear techniques, which are more interpretable and easier to analyze. And the technique we learned is called “Scatter”.
Different from PCA which finds the projection that maximizes the variance of the projected data, Scatter finds the projection that separates the examples of different classes as much as possible. To do so, it defines two matrices: the “between-class scatter matrix” $S_B$ and the “within-class scatter matrix” $S_W$. The between-class scatter matrix measures, well, the scatter of the class means around the overall mean, while the within-class scatter matrix measures the scatter of the examples around their respective class means. Then, Scatter finds the projection that maximizes the ratio of the between-class scatter to the within-class scatter, which is equivalent to finding the eigenvectors of $S_W^{-1}S_B$.
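A sketch of the Scatter computation, assuming categorical labels and an invertible $S_W$ (the name `scatter_projection` and the class-size weighting of $S_B$ are my choices for the illustration):

```python
import numpy as np

def scatter_projection(A, labels, k):
    """Find k directions maximizing between- vs within-class scatter.

    A is (n_examples, n_features); labels is an array of class labels.
    Returns a (n_features, k) projection matrix.
    """
    overall_mean = A.mean(axis=0)
    d = A.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        A_c = A[labels == c]
        mean_c = A_c.mean(axis=0)
        diff = (mean_c - overall_mean)[:, None]
        S_B += len(A_c) * (diff @ diff.T)  # scatter of class means
        centered = A_c - mean_c
        S_W += centered.T @ centered       # scatter within each class
    # Eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(-eigvals.real)[:k]
    return eigvecs[:, order].real
```

For two well-separated classes, the top direction lines up with the axis that separates the class means, which is exactly the “ratio of between to within scatter” intuition.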
Supervised feature selection
What if we are not allowed to compose new features, but only select a subset of the existing features? This is the problem of supervised feature selection. We discussed two methods to solve this problem: forward addition/backward elimination, and feature ranking with correlations.
Forward addition and backward elimination are two greedy algorithms for feature selection. These methods require a way to evaluate the performance of a subset of features, such as training a predictive model on those features and evaluating its performance on a validation set. Forward addition starts with an empty set of features, and iteratively adds the feature that improves the model performance the most until no more improvement can be made. Backward elimination starts with the full set of features, and iteratively removes the feature that degrades the model performance the least until any further removal would degrade performance too much.
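Forward addition can be sketched generically, assuming a caller-supplied `score` function that returns higher values for better feature subsets (e.g., validation accuracy of a model trained on that subset); both names here are hypothetical:

```python
def forward_addition(features, score, tol=0.0):
    """Greedy forward selection over a list of feature identifiers.

    `score(subset)` evaluates a subset; selection stops when no single
    addition improves the score by more than `tol`.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Try each remaining feature and keep the most helpful addition
        gains = [(score(selected + [f]), f) for f in remaining]
        new_best, f = max(gains)
        if new_best <= best + tol:
            break  # no feature improves the score: stop
        selected.append(f)
        remaining.remove(f)
        best = new_best
    return selected
```

Backward elimination is the mirror image: start from `list(features)` and repeatedly drop the feature whose removal hurts the score least.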
Feature ranking with correlations is a simpler method that does not require training a predictive model. It ranks the features based on their correlation with the labels (e.g., using Pearson correlation), and selects the top $k$ features with the highest correlation. This method is much faster than forward addition/backward elimination, but it may not capture the interactions between features, which can be important for predictive performance.
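A sketch of correlation-based ranking with NumPy, assuming numeric labels and non-constant feature columns (the function name is mine):

```python
import numpy as np

def rank_by_correlation(A, y, k):
    """Return the indices of the k features most correlated with y.

    A is (n_examples, n_features); y is the label vector. Features are
    ranked by the absolute value of their Pearson correlation with y.
    """
    A = A - A.mean(axis=0)
    y = y - y.mean()
    # Pearson correlation of each (centered) column with the centered y
    corr = (A.T @ y) / (np.linalg.norm(A, axis=0) * np.linalg.norm(y))
    return np.argsort(-np.abs(corr))[:k]
```

This is a single matrix-vector product, which is why it scales so much better than the greedy wrapper methods.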
Clustering (20%)
The last portion of the course is about clustering algorithms, which are used to group similar examples together. Similar to unsupervised feature engineering, it is useful for finding patterns in the data. While feature engineering asks “what is a good projection of the data?”, clustering asks “what are the good groups of examples in the data?”.
We discussed two versions of the clustering problem: clustering tabular data and clustering nodes on a weighted graph.
For the former, we discussed the K-means algorithm and its variants: basic K-means, K-means++, and K-means parallel. All of these algorithms require specifying the number of clusters to find (which is a bit weird because we typically don’t know much about the data). Basic K-means starts with $k$ random centroids, assigns each example to the nearest centroid, moves each centroid to the mean of its newly assigned examples, and repeats until convergence (which is provably guaranteed, though only to a local optimum). K-means++ improves the initialization step: after the first centroid is picked uniformly at random, each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the initial centroids out and leads to better convergence and better clustering results. K-means parallel is a variant that parallelizes this initialization, designed for big data.
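The K-means++ initialization step can be sketched as follows (the function name is mine, and the subsequent Lloyd iterations of K-means are omitted):

```python
import numpy as np

def kmeans_pp_init(A, k, rng):
    """K-means++ initialization: first centroid uniform, each next one
    drawn with probability proportional to the squared distance to its
    nearest already-chosen centroid.
    """
    n = A.shape[0]
    centroids = [A[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest current centroid
        d2 = np.min(
            ((A[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        probs = d2 / d2.sum()
        centroids.append(A[rng.choice(n, p=probs)])
    return np.array(centroids)
```

On data with far-apart clusters, the squared-distance weighting makes it overwhelmingly likely that each cluster receives one initial centroid, which is the whole point of the scheme.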
For the latter, we discussed spectral clustering, which is a technique for clustering nodes on a weighted graph. It works by first computing the Laplacian matrix of the graph, which is defined as $L = D - W$, where $D$ is the degree matrix (a diagonal matrix where each diagonal element is the degree of the corresponding node) and $W$ is the adjacency matrix (a matrix where each element represents the weight of the edge between two nodes). Then, it computes the eigenvectors of the Laplacian matrix, and uses them as features for clustering the nodes using a standard clustering algorithm like K-means.
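A sketch of the spectral embedding step, assuming a symmetric weight matrix `W` with zero diagonal (the final clustering of the rows, e.g. with K-means, is omitted):

```python
import numpy as np

def spectral_embedding(W, k):
    """Embed graph nodes via the bottom-k eigenvectors of L = D - W.

    W is the (n, n) symmetric weight matrix; the returned (n, k) rows
    can be fed to a standard clustering algorithm like K-means. (It is
    common to skip the trivial constant eigenvector; this sketch keeps
    all k for simplicity.)
    """
    D = np.diag(W.sum(axis=1))  # degree matrix
    L = D - W                   # (unnormalized) graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvecs[:, :k]
```

A useful sanity check: on a graph with two disconnected cliques, the bottom-2 eigenvectors are constant within each clique, so the embedded rows collapse into exactly two distinct points, one per cluster.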
Fun fact: the spectral clustering algorithm is named “spectral” because it uses the eigenvalues and eigenvectors of the Laplacian matrix, which are also known as the “spectrum” of the graph. This naming takes inspiration from the physical sciences such as astronomy and chemistry, where only a few pieces of information about a physical system (e.g., the light emitted by a star or the energy levels of an atom) are analyzed to understand its properties. In spectral clustering, we only analyze a few eigenvalues and eigenvectors of the Laplacian matrix to understand the structure of the graph and cluster its nodes.
Remarks
I was not super enthusiastic to take the course originally. The topics sounded too familiar with linear algebra and machine learning, which I have taken in the past. However, because I wanted to finish my Masters coursework by my fifth semester at UTD, I had to take something. And this is the most relevant to my research.
The lectures, on the surface, were very boring. It took TREMENDOUS willpower to follow the professor’s speech, which was low and hard to hear. His English was good, but soft and too fast, sometimes conveying a sense of diffidence to me. However, I did see the professor’s effort to convey the ideas as best he could.
Organization-wise, the course is a little messy, with handouts all over the place, deadlines silently set on the course website, and topics poorly motivated. However, doing some review helped give me the big picture (i.e., the taxonomy). I finally realized the professor had said all these things in class, but somehow I did not get the picture that he wanted to give.
Fortunately, it turns out I learned a lot. Before the course, I didn’t have a good understanding of PCA – I just knew it was a way to project data to 2D. Now I know it is the optimal linear projection given a certain loss function. Before, I didn’t appreciate the class of linear methods in machine learning, because there are a lot of non-linear relationships in real data, and if I want performant models, I’d better spend time on non-linear models (such as multilayer perceptrons). However, linearity gives interpretability, which non-linear models rarely have. By focusing on linear models, it feels like doing a digital detox and going back to the analog world. And finally, before the course, I wouldn’t have known the taxonomy of different (linear) feature engineering methods. Now I have a pretty neat one: supervised/unsupervised, and extraction/selection. Among data science practitioners, “data representations” is somewhat of a cliche topic and an overloaded term with multiple interpretations. This course equipped me with some rigor when discussing the topic. I am thankful for that.
About Professor Schweitzer: during the lectures, he appeared strict, serious, but stressed out. He could easily blush in a lecture while taking time to figure out the technical details. But whenever he had online lectures, he looked very calm and in charge. Given his age, he is probably someone’s grandpa. From that perspective, he is super sharp – isn’t it cool to have a grandpa who can mathematically prove why MDS is an optimal linear embedding method, and tell you why spectral clustering was named as such? Perhaps I was too irritated with his teaching method, and the irritation from all the students may have caused some stress for him.
I’ll end with a memorable interaction with Professor Schweitzer. On the final exam day, I finished my work quickly and submitted it within an hour, which was 1-2 hours before time was up. When I got out of the exam room, I saw the professor and said hi to him. He approached me and said “I was trying to find you in class but couldn’t… I wanted to give you a compliment. You did very well in this class in terms of participation [i.e., giving ideas?]. You also did well in my AI class [the semester before] as well. I don’t know how you did in this exam though. But it was a pleasure to have you as a student in my class.” And he said goodbye to me with an extremely warm smile that I never saw on him.