A General Guide to Machine Learning (Part Three Final)
Practice: Dealing with Imbalanced Data
Sometimes when you're trying to teach a program to tell things apart, it doesn't have a lot of examples of a specific thing. When you use something like SVM, you can tell it how bad it is when it gets something wrong, which matters because there are often some mistakes in the training data.
SVM can also learn to tell things apart better when you tell it that some things are more important than others. For example, if you're teaching a program to identify different types of animals, you can say that spotting a rare animal matters more than spotting a common one. You do this by giving each type of animal a 'weight'.
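For instance, scikit-learn's SVM lets you pass a weight per class. Here's a minimal sketch; the 10:1 imbalance and the 5x weight on the rare class are just illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy dataset where class 1 is rare (roughly a 90/10 split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Mistakes on the rare class (label 1) are penalized 5x more than on class 0.
clf = SVC(class_weight={0: 1, 1: 5})
clf.fit(X, y)
print(clf.score(X, y))
```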
But if the algorithm can't use weights like this, you can try oversampling. You just make more copies of the examples you care about so the program gets better at recognizing them. So, if you want it to be better at spotting rare animals, you show it the same pictures of those rare animals many more times.
The opposite trick is undersampling. Sometimes one thing has so many examples that it drowns out everything else, so you take away some examples of that thing from the training data.
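Here's a rough sketch of random oversampling with plain NumPy (the 90/10 split is made up); undersampling would work the same way but would drop a random subset of the majority class instead:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)          # toy imbalanced labels

# Duplicate minority examples (label 1) until the classes are balanced.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=80, replace=True)   # sample with replacement
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
print(np.bincount(y_balanced))             # [90 90]
```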
Combining Models
The three typical ways to combine models are 1) averaging, 2) majority vote and 3) stacking.
When we want a computer program to make predictions, we don't have to settle for just one model; we can use a bunch of them. Then, we take all the predictions from these models and mix them together.
To check if this is better than just one model, we try it on some new data and see how well it does. It's like testing a new recipe to see if it tastes better than the old one.
Imagine you have a few friends, and you all want to pick a movie to watch. Each friend suggests a movie. To decide which movie to watch, you can just go with the one that most of your friends like, which is the majority.
In programs, it's similar. We have different models that suggest different answers. We use the 'majority vote' method to choose the answer that most models agree on.
But sometimes, instead of just picking one answer, we create a 'meta-model.' This meta-model looks at what all the other models say and uses that information to give us a final answer (this is stacking).
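Here's a small scikit-learn sketch of both ideas; the three base models are just illustrative choices, and switching to "soft" voting would average their predicted probabilities instead of counting votes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
base_models = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Majority vote across the three base models.
majority = VotingClassifier(estimators=base_models, voting="hard").fit(X, y)

# Stacking: a logistic-regression meta-model learns from the base models' outputs.
stacked = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression()).fit(X, y)

print(majority.score(X, y), stacked.score(X, y))
```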
The good thing about using multiple models together is that the combination is often more accurate than any single model, especially when the models make different kinds of mistakes. That's why it helps to mix models that are good at different stuff, like one that's good at spotting shapes and another that's good at guessing numbers; combining those different skills usually makes the program work better.
Regularization
Imagine you're practicing for a big race with your friends. But each time you practice, you randomly tell some friends not to run that day. It's like a surprise test for them. This randomness helps you get better at running because you can't rely on the same friends all the time.
In programs, we have this thing called 'dropout.' It's like randomly turning off some parts of the network during training so the program can't lean on any single part too much. How many parts we turn off is a setting we control, and we figure out the best amount through practice.
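Here's a minimal PyTorch sketch; the layer sizes and the 0.5 dropout rate are just assumptions you would tune in practice:

```python
import torch
import torch.nn as nn

# A small feed-forward network with dropout between its layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(64, 2),
)

model.train()            # dropout is active in training mode
x = torch.randn(8, 20)   # a toy batch of 8 examples
print(model(x).shape)    # torch.Size([8, 2])

model.eval()             # dropout is turned off at inference time
```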
Let’s say you're baking cookies, and you want them to be perfect every time. One thing you can do is make sure all your ingredients are just the right size. That's like what 'batch normalization' does for computer programs. It makes sure that all the numbers in our program are just the right size before we use them.
This helps our program learn faster and keeps training more stable; it can even act as a mild form of regularization.
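A minimal PyTorch sketch of batch normalization sitting between two layers (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Batch normalization standardizes each layer's inputs, per feature,
# across the mini-batch before the next layer uses them.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalize the 64 activations across the batch
    nn.ReLU(),
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # a toy batch of 32 examples
print(model(x).shape)     # torch.Size([32, 2])
```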
Another thing we use is 'data augmentation.' It's like taking a photo and changing it a bit in lots of ways, like zooming in, flipping it, or tweaking the colors. This makes our program better at understanding all kinds of pictures.
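Here's a rough sketch using torchvision transforms; the specific crops, flips, and color tweaks are just example choices:

```python
from PIL import Image
from torchvision import transforms

# Each training image gets randomly cropped, flipped, and color-jittered,
# so the model sees a slightly different picture every time.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random zoom and crop
    transforms.RandomHorizontalFlip(),       # random mirror image
    transforms.ColorJitter(brightness=0.2),  # small lighting changes
    transforms.ToTensor(),
])

img = Image.new("RGB", (256, 256))           # stand-in for a real training image
augmented = augment(img)
print(augmented.shape)                       # torch.Size([3, 224, 224])
```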
Transfer Learning
Imagine you're really good at solving one type of puzzle. Now, you want to get good at solving another kind of puzzle.
You can use what you've learned from the old puzzle to help you with the new one. In machine learning terms, we have models that have already been trained on one big set of data and have gotten really good at it.
With transfer learning, we don't have to start from scratch. We can use what the model already knows and adapt it to the new puzzle.
Transfer learning offers an efficient alternative by leveraging an existing labeled dataset, thereby bypassing the costly and time-consuming process of manual annotation.
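As a rough sketch (assuming a recent version of torchvision), you can load a network pretrained on ImageNet, freeze it, and train only a new output layer; the 5 classes here are a made-up example:

```python
import torch.nn as nn
from torchvision import models

# Reuse a network pretrained on ImageNet and give it a new "head"
# for a new task with, say, 5 classes.
model = models.resnet18(weights="DEFAULT")

for param in model.parameters():
    param.requires_grad = False                    # keep what the model already knows

model.fc = nn.Linear(model.fc.in_features, 5)      # only this new layer gets trained
```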
Density Estimation
Density estimation involves the task of creating a model for the probability density function of an unknown probability distribution. This concept finds application in different scenarios.
It's important to note that models can also be nonparametric, as demonstrated in kernel regression. This same nonparametric approach can be effectively applied to density estimation, offering an alternative way to capture the underlying distribution without relying on specific parametric assumptions.
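Here's a minimal sketch of nonparametric density estimation with scikit-learn's kernel density estimator; the Gaussian kernel and the 0.5 bandwidth are just illustrative choices:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(500, 1))   # toy 1-D data

# Fit a kernel density estimate; the bandwidth is a hyperparameter to tune.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(samples)

log_density = kde.score_samples(np.array([[0.0], [3.0]]))
print(np.exp(log_density))   # estimated density near the mean vs. out in the tail
```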
Clustering
Clustering is the task of assigning labels to examples in an unlabeled dataset. Given the absence of labels, determining the optimality of the learned model becomes more complex compared to supervised learning.
There are a lot of clustering algorithms, and selecting the most suitable one for your dataset can be challenging.
K-Means
The k-means clustering algorithm operates in a specific manner. Initially, you specify the number of clusters, denoted as "k." Then, you randomly place k feature vectors, referred to as centroids, within the feature space.
Next, the algorithm computes the distance between each example and each centroid, typically employing a distance metric like the Euclidean distance. Each example is then assigned to the closest centroid, effectively assigning a centroid ID as a label to each example.
Then, for each centroid, the algorithm calculates the average feature vector of the examples assigned to it. These new average vectors become the updated positions of the centroids. The process then repeats: the distances from every example to every centroid are recalculated, the assignments are updated, and the centroids shift again, until they stop moving. The final model is essentially the set of centroid positions together with a record of which centroid each example belongs to.
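Here's a from-scratch sketch of that loop on toy 2-D data (k = 3 is assumed, and empty clusters aren't handled):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))            # toy data
k = 3                                    # number of clusters (an assumption)
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids

for _ in range(100):
    # Euclidean distance from every example to every centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)    # assign each example to its closest centroid
    # Move each centroid to the average of the examples assigned to it.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):   # stop once the centroids no longer move
        break
    centroids = new_centroids

print(centroids)   # final centroid positions; `labels` records each example's cluster
```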
It's worth noting that the initial placement of centroids can significantly impact their final positions, which means that running k-means twice on the same dataset may yield two distinct models.
The choice of the hyperparameter k is a decision that falls on the data analyst. Various techniques exist for determining the optimal value of k, but it's important to note that none of these techniques are proven to be optimal.
Most of these methods entail a certain degree of judgment. This often involves assessing specific metrics or visually inspecting cluster assignments to make an "educated guess" about the appropriate value of k. The selection of k requires a degree of empirical experimentation and domain knowledge to arrive at a suitable clustering configuration for the given dataset.
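One common heuristic is the "elbow" method: run k-means for several values of k and look for the point where the total within-cluster distance stops dropping sharply. A rough sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 4 true blobs, so the "elbow" should appear around k = 4.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))   # inspect where the curve bends
```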
Outlier Detection
Outlier detection involves identifying instances in a dataset that significantly deviate from the typical examples found within that dataset.
A common approach uses an autoencoder trained on the dataset. To predict whether an example is an outlier, we ask the trained autoencoder to reconstruct it through its bottleneck layer. Outliers are expected to be hard for the model to reconstruct accurately, and that high reconstruction error serves as the signal for their detection.
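Here's a minimal PyTorch sketch of the mechanics; in practice you would first train the autoencoder on normal data, and the layer sizes and error threshold here are just assumptions:

```python
import torch
import torch.nn as nn

# A tiny autoencoder with a narrow bottleneck (20 -> 4 -> 20 features).
autoencoder = nn.Sequential(
    nn.Linear(20, 4),    # encoder compresses to a 4-dimensional bottleneck
    nn.ReLU(),
    nn.Linear(4, 20),    # decoder reconstructs the original 20 features
)

x = torch.randn(5, 20)                        # toy batch (untrained model, mechanics only)
reconstruction = autoencoder(x)
error = ((x - reconstruction) ** 2).mean(dim=1)   # per-example reconstruction error
is_outlier = error > 1.0                      # threshold chosen from validation data in practice
print(error, is_outlier)
```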
With one-class classification, the model essentially sorts input examples into two categories: those belonging to the primary class and those flagged as outliers. Both approaches offer effective strategies for outlier detection, helping to identify data points that deviate significantly from the norm.
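As a rough sketch, scikit-learn's one-class SVM can be fit on mostly-normal data and then asked to label new points as inliers (+1) or outliers (-1); the `nu` value here is just an illustrative choice:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_data = rng.normal(size=(200, 2))      # toy "normal" examples

# `nu` roughly bounds the expected fraction of outliers.
detector = OneClassSVM(kernel="rbf", nu=0.05).fit(normal_data)

# A point near the data vs. a point far away from it.
print(detector.predict(np.array([[0.0, 0.0], [6.0, 6.0]])))   # e.g. [ 1 -1 ]
```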
[End]