A General Guide to Machine Learning (Part Two)
Gradient Descent
Gradient descent is sensitive to the choice of the learning rate alpha. It is also slow for large datasets. There are, however, improvements to this algorithm. One of them is minibatch stochastic gradient descent (SGD), which speeds up the computation by approximating the gradient using smaller subsets of the training data. SGD itself comes in several variants. Adagrad scales alpha for each parameter according to the history of gradients. Momentum accelerates SGD by orienting the descent in the relevant direction and reducing oscillations. The most frequently used variants are RMSProp and Adam.
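As a minimal sketch of the idea, here is minibatch SGD applied to linear regression with a squared-error loss; the function name, gradient formulas for this particular loss, and hyperparameter defaults are illustrative choices, not a canonical implementation:

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.01, batch_size=32, epochs=100):
    """Minibatch SGD for linear regression with MSE loss (illustrative sketch)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        indices = np.random.permutation(n)     # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = Xb @ w + b - yb                    # prediction error on the batch
            grad_w = 2 * Xb.T @ error / len(batch)     # gradient estimated on the batch only
            grad_b = 2 * error.mean()
            w -= alpha * grad_w                        # step against the gradient
            b -= alpha * grad_b
    return w, b
```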
Most of the time, you don't need to implement machine learning algorithms yourself, and you wouldn't implement gradient descent either. Instead, you would use libraries, most of which are open source. A library is a collection of algorithms and supporting tools implemented with an emphasis on stability and efficiency. The library most frequently used in machine learning is scikit-learn.
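To make this concrete, here is what a typical scikit-learn workflow looks like; the synthetic dataset and the choice of model are just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# synthetic data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)            # training is a single method call
print(model.score(X_test, y_test))     # mean accuracy on the holdout set
```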
Some algorithms can accept categorical features. For example, if you have a feature "color" that can take the values "blue", "green", or "purple", you can keep this feature as is. Others, such as SVM, linear regression, and kNN, expect numerical values for all features. All algorithms implemented in scikit-learn expect numerical features.
Models such as Support Vector Machines (SVM) and k-Nearest Neighbors (kNN) provide the predicted class label for a given feature vector. Models like logistic regression and decision trees can produce a score ranging from 0 to 1. This score can be interpreted either as a measure of the model's confidence in its prediction or as the estimated probability that the input instance belongs to a specific class.
Certain algorithms, such as decision tree learning, Support Vector Machines (SVM), and k-Nearest Neighbors (kNN), can address both classification and regression tasks. Other algorithms are specialized: each excels at either classification or regression but cannot handle both types of problems.
Feature Engineering
The problem of transforming raw data into a dataset is called feature engineering.
For example, consider processing user interaction logs within a computer system. Feature engineering here means constructing features that encapsulate details about the user along with pertinent statistics extracted from the log data. For instance, each user might be described by a feature denoting their subscription cost, alongside other features capturing metrics such as daily, weekly, and yearly connection frequencies.
Certain algorithms mainly operate with numerical feature vectors. When your dataset contains categorical features like "colors" or "days of the week," a common strategy is to convert these categorical attributes into multiple binary features. This transformation allows the algorithms to process and utilize categorical information within the dataset.
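A common implementation of this idea is one-hot encoding. A small sketch using scikit-learn's OneHotEncoder, with illustrative color values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# "color" as a categorical feature; the values are illustrative
colors = np.array([["blue"], ["green"], ["purple"], ["green"]])

encoder = OneHotEncoder()
binary_features = encoder.fit_transform(colors).toarray()
print(encoder.categories_)   # the category list learned from the data
print(binary_features)       # one binary column per category
```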
Binning
Binning is used in data preprocessing to convert a continuous feature into several binary features, referred to as bins or buckets. This transformation is usually based on predefined value ranges. For instance, instead of representing age as a single continuous feature, an analyst might put age values into distinct bins, such as grouping ages from 0 to 5 years into one bin, ages from 6 to 10 years into a second bin, and so forth.
Thoughtfully crafted binning can assist machine learning algorithms in operating effectively with a reduced number of examples. This occurs because binning provides a hint to the learning algorithm: when a feature falls within a particular range, the precise value of that feature becomes less crucial. This allows the algorithm to generalize patterns more efficiently and learn with fewer data points.
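A minimal sketch of binning with NumPy, assuming illustrative age ranges of 0-5, 6-10, and 11-15:

```python
import numpy as np

ages = np.array([2, 7, 13, 4, 9])
bin_edges = [0, 6, 11, 16]                     # ranges 0-5, 6-10, 11-15
bin_index = np.digitize(ages, bin_edges) - 1   # which bin each age falls into
print(bin_index)                               # [0 1 2 0 1]

# one binary feature per bin
binary = np.eye(len(bin_edges) - 1)[bin_index]
```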
Normalization
Normalization is used to transform the original range of values associated with a numerical feature into a standard range, often within the intervals [-1,1] or [0,1]. For example, consider a feature with a range spanning from 350 to 1450. By subtracting 350 from each value within that feature and then dividing the result by 1100, you can effectively normalize those values, bringing them into the standardized range of [0,1].
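In code, this min-max normalization is a one-liner; the values below are illustrative:

```python
import numpy as np

x = np.array([350.0, 700.0, 1450.0])             # feature values in [350, 1450]
x_norm = (x - x.min()) / (x.max() - x.min())     # subtract 350, divide by 1100
print(x_norm)                                    # [0.0, 0.318..., 1.0]
```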
It's best that our input data falls within a reasonably compact range. This will help mitigate issues that can arise when computers handle extremely small or exceptionally large numbers, commonly referred to as numerical overflow.
Standardization
Standardization is a process where feature values are rescaled to exhibit the characteristics of a standard normal distribution: the feature's mean is subtracted from each value and the result is divided by the feature's standard deviation, yielding zero mean and unit variance (a code sketch follows the rule of thumb below).
Rule of thumb:
Unsupervised learning algorithms tend to benefit from standardization more than from normalization
Standardization is preferable for a feature whose values are distributed close to a normal distribution
Standardization is also preferable for a feature that occasionally takes extremely high or low values (outliers)
In all other cases, normalization is preferable
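Both transformations are available in scikit-learn. A minimal sketch, reusing the illustrative feature values from the normalization example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[350.0], [700.0], [1450.0]])         # illustrative feature column

standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance
normalized = MinMaxScaler().fit_transform(X)       # rescaled into [0, 1]
```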
Missing Features
Approaches for dealing with missing features include:
Removing examples with missing features from the dataset
Using a learning algorithm that can deal with missing features
Using a data imputation technique (sketched below)
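One simple imputation technique replaces each missing value with the mean of its feature column. A minimal sketch using scikit-learn's SimpleImputer, with illustrative data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# replace each missing value with the mean of its feature column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)   # [[1. 2.] [4. 3.] [7. 2.5]]
```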
Selection
Choosing an algorithm can be difficult. If you have time, you can try all of them, but if your time is limited, focus on:
Explainability
In-memory vs. out-of-memory
Number of features and examples
Categorical vs. numerical features
Nonlinearity of the data
Training speed
Prediction speed
Underfitting and Overfitting
When a model performs poorly on the training data, making mistakes frequently, it is described as having a high bias or being prone to underfitting.
There can be many reasons for underfitting; the most important are:
The model is too simple for your data
The features aren't informative enough
The solution is to try a more complex model or engineer features with higher predictive power.
Overfitting is another issue a model may encounter. An overfitting model predicts the training data very well but performs poorly on at least one of the two holdout sets (validation or test).
There can be many reasons for overfitting; the most important are:
Your model is too complex for your data
You have too many features but a small number of training examples
Another term used to describe the issue of overfitting is "high variance." This concept originates from statistics and refers to the model's susceptibility to errors caused by minor variations in the training dataset. Basically, if the data were collected in a slightly different manner, the learning process would yield a different model. This is precisely why an overfitting model tends to perform inadequately on test data, as the test and training datasets are independently sampled from the overall dataset.
There are several solutions to this:
Try a simpler model
Reduce the dimensionality of examples in the dataset
Add more training data
Regularize the model
Regularization is the most common approach to preventing overfitting.
Regularization
Regularization is a technique designed to compel the learning algorithm to construct a simpler model. In practice, this may result in a slight increase in bias, but it substantially reduces variance.
The two popular forms of regularization are L1 and L2. In order to construct a regularized model, we adjust the objective function by introducing a penalty term. This penalty term increases as the model's complexity rises, encouraging the model to be simpler and more generalizable.
L1 regularization tends to yield a sparse model, meaning that most of its parameters are set to zero. L1 regularization accomplishes feature selection by determining which features are crucial for making predictions and which can be removed. This is best when the goal is to enhance model explainability, as it simplifies the model by highlighting the most influential features while removing less important ones.
If your goal is to maximize the performance of the model, L2 is better.
L1 and L2 can be combined in what's known as elastic net regularization, with L1 and L2 regularization as specific instances. In the literature, you'll come across the terms "ridge regularization" for L2 regularization and "lasso" for L1 regularization.
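All three penalties are available for linear regression in scikit-learn. A minimal sketch on synthetic data, with an illustrative penalty strength, showing how L1 drives most coefficients to exactly zero while L2 merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)   # only the first feature matters

print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: most coefficients exactly zero
print(Ridge(alpha=0.1).fit(X, y).coef_)   # L2: small but nonzero coefficients
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mixes both penalties
```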
Model Performance
Machine learning experts rely on various formal metrics to evaluate the performance of models. A well-fitted regression model should yield predicted values that closely align with the observed data values. Sometimes when there are no informative features, you can use the mean model, which predicts the average of the labels in the training data. It's essential that the regression model outperforms the mean model.
To check this, we calculate the mean squared error (MSE) for both the training and test datasets. If the model's MSE on the test data is substantially higher than its MSE on the training data, this is a sign of overfitting. To fix it, you can apply regularization techniques or fine-tune hyperparameters. What constitutes a "substantially higher" MSE depends on the specific problem and should be determined collaboratively between the data analyst and the decision maker or product owner who initiated the model development.
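MSE itself is just the average squared prediction error. A small sketch that also shows the mean-model baseline mentioned above (the numbers are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    return np.mean((y_true - y_pred) ** 2)

# the mean model always predicts the average of the training labels
y_train = np.array([3.0, 5.0, 7.0])
baseline = np.full_like(y_train, y_train.mean())   # every prediction is 5.0
print(mse(y_train, baseline))   # 2.666...; a useful model should beat this
```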
The most widely used metrics are:
The area under the ROC curve
Precision/recall
Cost-sensitive accuracy
Accuracy
Confusion matrix
Neural Networks
A neural network is a network of interconnected units, which are grouped into one or more layers. These units are represented as either circles or rectangles. An incoming arrow signifies an input to a unit and indicates its source and an outgoing arrow illustrates the output of a unit.
The output of a rectangular unit is determined by the mathematical operation written inside the rectangle, while circular units pass their input directly to the output without any internal processing.
Each rectangular unit performs a sequence of operations: all incoming inputs are joined into an input vector, the unit applies a linear transformation to that vector, and finally an activation function is applied to the result of the linear transformation, yielding a real-valued output.
In a feed-forward neural network (FFNN), the output value generated by a unit within one layer serves as an input for each of the units in the subsequent layer.
In a multilayer perceptron, there is an interconnection pattern where all the outputs from one layer are linked to every input of the subsequent layer. This architectural arrangement is referred to as "fully-connected." Inside a neural network, you may encounter layers designated as fully connected layers.
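A minimal sketch of one such unit and of a fully connected layer in NumPy; the dimensions, the random weights, and the choice of tanh as the activation are illustrative assumptions:

```python
import numpy as np

def unit(x, w, b):
    """One rectangular unit: a linear transformation followed by an activation."""
    return np.tanh(w @ x + b)   # tanh is one common choice of activation

# a tiny fully connected layer: every input feeds every unit
x = np.array([0.5, -1.0, 2.0])      # input vector
W = np.random.randn(4, 3)           # 4 units, each with 3 incoming weights
b = np.zeros(4)
layer_output = np.tanh(W @ x + b)   # these outputs feed the next layer in an FFNN
```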
Deep Learning
Deep learning is the training of neural networks with more than two non-output layers. As the number of layers increases, training such networks becomes more challenging. Two obstacles you may encounter are the exploding gradient problem and the vanishing gradient problem. Both arise when gradient descent is used to train the network's parameters.
In neural network training, the commonly employed algorithm for computing the parameter updates is known as backpropagation. Backpropagation computes gradients within neural networks using the chain rule. During gradient descent, each of the network's parameters receives an update proportional to the partial derivative of the cost function with respect to that parameter.
Modern implementations of neural network learning algorithms have made it possible to train deep neural networks comprising hundreds of layers. This is due to various enhancements, including activation functions like ReLU, LSTM, and other gated units, the application of techniques such as skip connections, and the adoption of advanced adaptations of the gradient descent algorithm.
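As a sketch of ReLU and of how backpropagation applies the chain rule, here is a forward and a backward pass through a tiny two-layer network; the shapes, the squared-error loss, and the dummy target are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)   # ReLU passes positive values, zeroes out the rest

# forward pass through a two-layer network
x = np.random.randn(3)
W1, W2 = np.random.randn(4, 3), np.random.randn(1, 4)
h = relu(W1 @ x)                  # hidden layer
y_pred = W2 @ h                   # output layer
loss = 0.5 * (y_pred - 1.0) ** 2  # squared error against a dummy target of 1.0

# backward pass: the chain rule propagates the error layer by layer
grad_y = y_pred - 1.0                          # dL/dy
grad_W2 = np.outer(grad_y, h)                  # dL/dW2
grad_h = W2.T @ grad_y                         # dL/dh
grad_W1 = np.outer(grad_h * (W1 @ x > 0), x)   # ReLU's gradient is 0 or 1
```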
Recurrent neural networks
Recurrent neural networks (RNNs) find applications in tasks involving sequences, such as labeling, classification, or generation. In this context, a sequence is treated as a matrix, where each row represents a feature vector. To train RNN models, you would use backpropagation through time.
Classifying a sequence involves predicting a single class for the entire sequence. Generating a sequence means producing another sequence, which may differ in length, but remains relevant to the input sequence.
RNNs are great for text processing because sentences and texts comprise sequences of words, punctuation marks, or even characters.
Beyond the basic Recurrent Neural Networks (RNNs), there are several extensions and variations. This includes bi-directional RNNs, which consider the past and future context when processing sequences, RNNs with attention mechanisms that focus on relevant parts of the input sequence, and sequence-to-sequence RNN models.
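A minimal sketch of one step of a basic RNN cell in NumPy; the dimensions and the tanh activation are illustrative assumptions, and real implementations add batching and training via backpropagation through time:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One step of a basic RNN: the new state mixes the input with the old state."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

# process a sequence of 5 feature vectors of dimension 3
seq = np.random.randn(5, 3)
Wx, Wh, b = np.random.randn(4, 3), np.random.randn(4, 4), np.zeros(4)
h = np.zeros(4)                        # initial state
for x_t in seq:
    h = rnn_step(x_t, h, Wx, Wh, b)    # the final h summarizes the whole sequence
```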
One-Class Classification
One-class classification is the task of identifying objects belonging to a specific class when the training set exclusively contains examples from that class. This is different from the conventional classification problem, which aims to differentiate between multiple classes with training data encompassing objects from all those classes.
An example is in securing a computer network, where the goal is to classify network traffic as normal. In this scenario, there may be few instances of network traffic during an attack, but plenty of examples of regular, non-malicious traffic.
Several algorithms are designed for one-class learning, such as one-class Gaussian, one-class k-means, one-class kNN (k-Nearest Neighbors), and one-class SVM (Support Vector Machine). For instance, the one-class Gaussian approach models the training data as if it follows a Gaussian distribution; examples that are too unlikely under that distribution are flagged as outliers.
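A minimal sketch of one-class classification with scikit-learn's OneClassSVM; the synthetic "normal traffic" data and the nu value are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# train only on examples of the "normal" class
normal_traffic = np.random.randn(200, 2)
model = OneClassSVM(nu=0.05).fit(normal_traffic)

# +1 means "looks like the training class", -1 flags an outlier
print(model.predict(np.array([[0.1, -0.2], [8.0, 8.0]])))   # likely [ 1 -1]
```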
Multi-Label Classification
In certain situations, it's necessary to assign more than one label to describe an example in a dataset, which is called multi-label classification. For example, when describing an image, we might want to assign multiple labels like "trees," "mountain," and "road" simultaneously. If the number of possible label values is substantial, but they share a similar nature (e.g., tags), we can transform each labeled example into several instances, each with the same feature vector but only one label.
Algorithms that support multiclass problems, such as decision trees, logistic regression, and neural networks, can be applied to multi-label classification tasks by adapting their strategies to accommodate multiple labels for each input example.
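One common adaptation is to train one binary classifier per label. A minimal sketch using scikit-learn's MultiOutputClassifier on an illustrative random dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

X = np.random.randn(100, 4)
# one binary column per label, e.g. "trees", "mountain", "road"
Y = np.random.randint(0, 2, size=(100, 3))

# fits one logistic regression per label behind a single interface
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:2]))   # a row of 0/1 decisions per example
```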
Ensemble Learning
Ensemble learning focuses on training numerous low-accuracy models and combining their predictions. These less accurate models are typically produced by weak learners: algorithms that train and predict fast but cannot learn complex models.
A common choice for a weak learner is a decision tree learning algorithm. The idea in ensemble learning is that if these trees are diverse and each slightly outperforms random guessing, then by combining the predictions from a bunch of such trees, we can attain a high level of accuracy.
Two ensemble learning techniques are boosting and bagging.
Boosting and Bagging
Boosting involves a process that begins with the original training data. Multiple models are generated using a weak learner, with each new model designed to address the errors made by its predecessors. The final ensemble model is a combination of these individual weak models.
Bagging consists of creating numerous copies of the training data, each slightly different from the others. The weak learner is then applied to each of these modified datasets, yielding multiple weak models, which are combined to produce the final ensemble prediction. An effective machine learning algorithm based on bagging is the random forest.
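Both techniques are available in scikit-learn. A minimal sketch on synthetic data, with default or illustrative hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagged = RandomForestClassifier(n_estimators=100).fit(X, y)   # bagging of trees
boosted = GradientBoostingClassifier().fit(X, y)  # each tree fixes its predecessors' errors
```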
Semi-Supervised Learning
In semi-supervised learning (SSL), a small portion of the dataset is labeled, while the majority of examples remain unlabeled. The objective of SSL is to leverage the pool of unlabeled data to enhance model performance without the need for additional labeled examples.
One well-known SSL approach is referred to as "self-learning." In self-learning, we initiate the process by employing a learning algorithm to construct an initial model using the labeled examples. We utilize this model to make predictions for all the unlabeled examples, labeling them based on the model's output.
If the confidence score of the model's prediction for an unlabeled example surpasses a predetermined threshold, we add this newly labeled example to the training set. We then retrain the model and iterate through this process until a predefined stopping criterion is met. This stopping criterion could involve halting the process if the model's accuracy fails to improve over a specified number of iterations.
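scikit-learn provides a version of this loop in SelfTrainingClassifier, which marks unlabeled examples with -1. A minimal sketch with synthetic data and an illustrative confidence threshold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1          # pretend only the first 50 examples are labeled

# pseudo-labels unlabeled points whose predicted probability exceeds the threshold
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y_partial)
```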
[End of Part Two]