2.5 Loss Function
In machine learning (ML), a loss function is used to measure model performance by calculating the deviation of a model’s predictions from the correct, “ground truth” predictions. Optimizing a model entails adjusting model parameters to minimize the output of some loss function.
Suppose we have the following data points for a simple linear model:
Actual values: [3, 4, 5]
Predicted values: [2.5, 4.5, 5.2]
We can calculate the loss using a simple mean squared error (MSE) loss function:
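Working this example out by hand:

\[\text{MSE} = \frac{(3 - 2.5)^2 + (4 - 4.5)^2 + (5 - 5.2)^2}{3} = \frac{0.25 + 0.25 + 0.04}{3} = 0.18\]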
2.5.1 MSE Loss
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the true value.
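In symbols, for \(N\) predicted values \(\hat{y}_i\) and corresponding true values \(y_i\):

\[\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2\]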
We can look at torch.nn.MSELoss to see how PyTorch defines this computation:
CLASS torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')
Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input \(x\) and target \(y\).
The unreduced (i.e. with reduction set to 'none') loss can be described as:

\[\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = (x_n - y_n)^2,\]

where \(N\) is the batch size. If reduction is not 'none' (default 'mean'), then:

\[\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean'}; \\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'}. \end{cases}\]

\(x\) and \(y\) are tensors of arbitrary shapes with a total of \(N\) elements each. The mean operation still operates over all the elements, and divides by \(N\). The division by \(N\) can be avoided if one sets reduction = 'sum'.
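The following sketch shows how the three reduction modes behave, reusing the example values from the start of this section (the variable names here are just for illustration):

import torch

pred = torch.tensor([2.5, 4.5, 5.2])
target = torch.tensor([3.0, 4.0, 5.0])

# 'none' keeps the per-element squared errors
print(torch.nn.MSELoss(reduction='none')(pred, target))  # tensor([0.2500, 0.2500, 0.0400])
# 'mean' averages them (the default)
print(torch.nn.MSELoss(reduction='mean')(pred, target))  # tensor(0.1800)
# 'sum' adds them without dividing by N
print(torch.nn.MSELoss(reduction='sum')(pred, target))   # tensor(0.5400)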
File mse_loss.py defines dervi_mse(y_pred, y_true), which computes the gradient of the MSE loss with respect to the predicted values. Differentiating the MSE formula above with respect to \(\hat{y}_i\) gives \(\frac{\partial \text{MSE}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)\), which is exactly what the function returns:

import numpy

def dervi_mse(y_pred, y_true):
    # gradient of the MSE loss with respect to y_pred
    return 2 * (y_pred - y_true) / len(y_true)

y_true = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = numpy.array([1.2, 1.8, 3.5, 4.1, 5.3])

# forward pass: average of the squared errors
average_loss = ((y_pred - y_true) ** 2).mean()
print(average_loss)

# backward pass: gradient of the loss with respect to each prediction
dl_dy = dervi_mse(y_pred, y_true)
print(dl_dy)
0.086
[ 0.08 -0.08  0.2   0.04  0.12]
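As a sanity check, the first component of the gradient is \(\frac{2}{5}(1.2 - 1.0) = 0.08\), matching the printed array.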
In PyTorch, we can use the following code to compute the MSE loss and its gradient:
import torch

# wrap the numpy arrays as tensors; only the predictions need gradients
y_true = torch.tensor(y_true, requires_grad=False)
y_pred = torch.tensor(y_pred, requires_grad=True)

# forward pass: compute the MSE loss
average_loss = torch.nn.MSELoss()(y_pred, y_true)
print(average_loss.item())

# backward pass: autograd fills in y_pred.grad
average_loss.backward()
print(y_pred.grad)
0.086
tensor([ 0.0800, -0.0800,  0.2000,  0.0400,  0.1200], dtype=torch.float64)
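The autograd result matches the hand-written gradient. One detail to keep in mind: PyTorch accumulates gradients in .grad across backward() calls rather than overwriting them, so in a training loop the stored gradient must be reset between iterations. A minimal sketch continuing from the tensors above:

# build the loss again (a fresh graph) and backpropagate a second time
average_loss = torch.nn.MSELoss()(y_pred, y_true)
average_loss.backward()
print(y_pred.grad)  # about twice the previous values: .grad accumulates

# clear the stored gradient before the next iteration
y_pred.grad.zero_()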
2.5.2 Cross Entropy Loss
torch.nn.CrossEntropyLoss computes the cross entropy loss between input logits and target.
CLASS torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)
It is useful when training a classification problem with \(C\) classes. If provided, the optional argument weight
should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general). input has to be a Tensor of size \((C)\) for unbatched input, \((minibatch, C)\) or \((minibatch, C, d_1, d_2, \cdots, d_K)\) with \(K \ge 1\) for the K-dimensional case. The last form is useful for higher-dimensional inputs, such as computing cross entropy loss per-pixel for 2D images.
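With class-index targets and the default settings (no weight, label_smoothing=0.0), the loss for the \(n\)-th sample reduces to the negative log-softmax of the target class:

\[\ell_n = -\log \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n, c})}\]

and with the default reduction='mean' these per-sample losses are averaged over the batch.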
# Suppose we have 3 classes
num_classes = 3
# Predicted scores (logits), not probabilities
# Shape: (batch_size, num_classes)
y_pred = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
# Ground truth labels (as class indices)
# Shape: (batch_size,)
y_true = torch.tensor([0, 1])
criterion = torch.nn.CrossEntropyLoss()
loss = criterion(y_pred, y_true)
print(f"CrossEntropyLoss: {loss.item():.4f}")
CrossEntropyLoss: 0.3185
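We can verify this by hand. For the first sample, the probability assigned to class 0 is \(e^{2.0} / (e^{2.0} + e^{1.0} + e^{0.1}) \approx 0.659\), giving a loss of \(-\ln 0.659 \approx 0.417\); for the second, class 1 gets \(e^{2.5} / (e^{0.5} + e^{2.5} + e^{0.3}) \approx 0.802\), giving \(-\ln 0.802 \approx 0.220\). Their mean is \((0.417 + 0.220) / 2 \approx 0.3185\), matching the printed value.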