Interactive exploration of the most commonly used loss function in machine learning
When predictions are wrong with high confidence, loss increases dramatically. This encourages models to be cautious when uncertain.
Because the loss is logarithmic, it approaches 0 for correct predictions (probability near 1) and grows without bound for wrong predictions (probability near 0).
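The logarithmic shape is easy to see in a few lines of Python (a minimal sketch; the helper name `cross_entropy` and the `eps` floor are my own):

```python
import math

def cross_entropy(p, eps=1e-12):
    """Loss when the true class is predicted with probability p."""
    # eps floor keeps log() finite if p has underflowed to 0.
    return -math.log(max(p, eps))

# Loss is tiny near p = 1 and explodes as p approaches 0.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p={p}: loss={cross_entropy(p):.4f}")
```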
The gradient indicates the direction of loss change: a negative gradient means that increasing the predicted probability reduces the loss (when y=1).
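For binary cross-entropy L = -[y log p + (1-y) log(1-p)], the gradient with respect to p is -y/p + (1-y)/(1-p). A small sketch (function name is my own):

```python
def ce_grad_wrt_prob(p, y):
    # dL/dp for L = -[y*log(p) + (1-y)*log(1-p)]
    return -y / p + (1 - y) / (1 - p)

# With y=1 the gradient is negative: raising p lowers the loss.
g = ce_grad_wrt_prob(0.3, y=1)  # -1/0.3, about -3.33
```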
Softmax converts logits into a probability distribution that sums to 1; the exponential ensures all outputs are positive.
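A minimal softmax in pure Python (the max-subtraction is a standard stability step and does not change the result, since softmax is invariant to shifting all logits):

```python
import math

def softmax(logits):
    # Subtract the max so exp() cannot overflow; the output is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # all positive, sums to 1
```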
Even similar logit values can produce very different probabilities after softmax. Only the differences between logits matter, not their absolute values: adding the same constant to every logit leaves the output unchanged.
The temperature parameter controls the output 'sharpness': high temperature makes the distribution more uniform, low temperature makes it sharper.
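Temperature scaling simply divides the logits by T before the softmax. A sketch (function name is my own):

```python
import math

def softmax_t(logits, T=1.0):
    # Divide logits by the temperature T, then apply a stable softmax.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_t(logits, T=0.5)  # low T: more mass on the top logit
flat = softmax_t(logits, T=5.0)   # high T: closer to uniform
```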
Compare cross-entropy loss with mean squared error (MSE) in classification tasks
| Feature | Cross-Entropy Loss | Mean Squared Error (MSE) |
|---|---|---|
| Gradient for Wrong Predictions | Large gradient, fast correction | Small gradient, slow convergence |
| Gradient for Correct Predictions | Small gradient, stable convergence | Gradient also shrinks, further damped by sigmoid saturation |
| Convexity | Convex in the logits with sigmoid/softmax | Non-convex with sigmoid/softmax (convex only for linear outputs) |
| Probabilistic Interpretation | Maximum likelihood estimation | Least squares method |
| Best Use Case | Classification tasks | Regression tasks |
Cross-entropy measures the difference between two probability distributions: H(p, q) = -Σ p(x) log q(x). In classification, it represents the 'distance' between the true and predicted distributions, and minimizing it is equivalent to maximizing the likelihood.
Cross-entropy decomposes as H(p, q) = KL(p ‖ q) + H(p). Since the true distribution's entropy H(p) is constant with respect to the model, minimizing cross-entropy is equivalent to minimizing the KL divergence.
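The decomposition can be checked numerically (a minimal sketch; function names are my own):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # model's prediction
# H(p, q) = KL(p || q) + H(p), exactly
diff = cross_entropy(p, q) - (kl(p, q) + entropy(p))
```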
MSE corresponds to an assumption of Gaussian-distributed errors, which suits regression. For classification, cross-entropy provides a stronger gradient signal, especially when predictions are wrong, so the model corrects itself faster.
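The gradient difference is concrete with a sigmoid output p = σ(z): cross-entropy's gradient with respect to the logit is p - y, while MSE's is (p - y)·p·(1 - p), which vanishes when the sigmoid saturates. A sketch comparing the two on a confidently wrong prediction:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def ce_grad(z, y):
    # d/dz of binary cross-entropy composed with sigmoid: p - y
    return sigmoid(z) - y

def mse_grad(z, y):
    # d/dz of 0.5*(p - y)^2 composed with sigmoid: (p - y) * p * (1 - p)
    p = sigmoid(z)
    return (p - y) * p * (1 - p)

# Confidently wrong: true label y=1 but logit z=-5 (p ~ 0.007).
# Cross-entropy still pushes hard; MSE's gradient has nearly vanished.
g_ce = ce_grad(-5.0, 1.0)
g_mse = mse_grad(-5.0, 1.0)
```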
Taking the log of a probability that has underflowed to 0 yields -inf (and exponentiating a large logit overflows). Implementations avoid both by computing log-softmax directly with the log-sum-exp trick: log Σ exp(x_i) = m + log Σ exp(x_i - m), where m = max x_i.
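A stable log-softmax using the log-sum-exp trick can be sketched as follows (pure Python; the function name is my own):

```python
import math

def log_softmax(logits):
    # Subtract the max so exp() never overflows, and return log-probabilities
    # directly instead of taking log of possibly-underflowed probabilities.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

# Works even with extreme logits, where naive softmax-then-log would
# overflow on exp(1000) and produce log(0) = -inf for the smallest entry.
lp = log_softmax([1000.0, 0.0, -1000.0])
```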
Label smoothing replaces hard labels (0, 1) with soft labels (e.g., 0.1, 0.9) to prevent overconfidence and improve generalization.
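The usual formulation mixes the one-hot target with a uniform distribution over the K classes. A sketch (function name and the smoothing factor `eps` are my own choices):

```python
def smooth_labels(one_hot, eps=0.1):
    # y_smooth = (1 - eps) * y + eps / K: the true class keeps most of the
    # mass, the rest is spread uniformly over all K classes.
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

smoothed = smooth_labels([0, 0, 1], eps=0.1)  # true class drops below 1.0
```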
For imbalanced datasets, use weighted cross-entropy to give more weight to minority classes.
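Weighted cross-entropy simply scales each example's loss by its true class's weight. A sketch with hypothetical inverse-frequency weights for a 90/10 class split (the function name and weights are illustrative):

```python
import math

def weighted_ce(p_true, weight):
    # Per-example loss scaled by the weight of the example's true class.
    return -weight * math.log(p_true)

# Hypothetical imbalance: class 1 is the rare class, so it gets a
# larger weight (e.g. inverse frequency for a 90/10 split).
weights = {0: 1.0, 1: 9.0}
loss_majority = weighted_ce(0.8, weights[0])
loss_minority = weighted_ce(0.8, weights[1])  # same confidence, 9x the loss
```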
Use sigmoid for binary classification, softmax for multi-class. Ensure final layer activation matches loss function.
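The two pairings can be sketched side by side (a minimal version; function names are my own):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Binary classification: one logit -> sigmoid -> binary cross-entropy.
def bce(z, y):
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Multi-class: K logits -> softmax -> categorical cross-entropy.
def cce(logits, true_idx):
    return -math.log(softmax(logits)[true_idx])
```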