Cross-Entropy Loss Visualization

Interactive exploration of the most commonly used loss function in machine learning

Binary Cross-Entropy Formula

L(y, ŷ) = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)]

y: True label (0 or 1)
ŷ: Predicted probability (0 to 1)
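As a minimal sketch, the binary cross-entropy L = -[y log ŷ + (1-y) log(1-ŷ)] can be computed directly (the helper name `bce` is my own):

```python
import math

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one example; eps keeps log() away from 0."""
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A mildly confident correct prediction vs. a confidently wrong one:
print(bce(1, 0.7))   # ≈ 0.3567 (the demo's loss value)
print(bce(1, 0.01))  # ≈ 4.6052
```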

Interactive Demo


Loss Value: 0.3567
Gradient: -0.4762
Confidence: 70%

Correct prediction: the model predicts the positive class, and the true label is also positive.

Loss Curve

[Chart: loss vs. predicted probability, with curves for y = 1 (true label positive) and y = 0 (true label negative), and the current point marked]

Key Insights

Confidence Punishment

When predictions are wrong with high confidence, loss increases dramatically. This encourages models to be cautious when uncertain.

Logarithmic Scale

Because of the logarithm, the loss approaches 0 for correct predictions (probability near 1) and grows without bound for wrong predictions (probability near 0).

Gradient Interpretation

The gradient indicates the direction of loss change. A negative gradient means that increasing the predicted probability reduces the loss (when y = 1).
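As a sketch of the analytic gradient with respect to ŷ (the helper name `bce_grad` is mine; the demo may compute its gradient with respect to a different variable, such as the logit):

```python
def bce_grad(y, y_hat):
    """dL/dŷ for binary cross-entropy: -y/ŷ + (1-y)/(1-ŷ)."""
    return -y / y_hat + (1 - y) / (1 - y_hat)

# For y = 1 the gradient is -1/ŷ: always negative, so raising ŷ lowers the
# loss, and the pull weakens as ŷ approaches the correct answer.
print(bce_grad(1, 0.7))   # ≈ -1.4286
print(bce_grad(1, 0.99))  # ≈ -1.0101
```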

Categorical Cross-Entropy Formula

L = -Σᵢ yᵢ·log(ŷᵢ)

yᵢ: True class (one-hot encoded)
ŷᵢ: Predicted probability (softmax output)
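A minimal sketch of the categorical form, assuming plain lists of labels and probabilities (the helper name `categorical_ce` is my own):

```python
import math

def categorical_ce(y_onehot, y_hat, eps=1e-12):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, y_hat))

# With the true class predicted at 70.5% probability, the loss is -log(0.705):
print(categorical_ce([1, 0, 0], [0.705, 0.260, 0.035]))  # ≈ 0.3496
```

With a one-hot label, only the true class's term survives the sum, so the loss reduces to -log of the probability assigned to the correct class.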

Softmax Demo (3-Class Classification)

Input Logits

2.0, 1.0, -1.0

Softmax Output Probabilities

Select True Class

Cross-Entropy Loss: 0.3265
Predicted Class: Class A
Confidence: 70.5%

Softmax Formula

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
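The softmax computation can be sketched directly; subtracting the maximum logit before exponentiating is the standard stability step and does not change the result:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit first."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# The demo's logits (2.0, 1.0, -1.0) give the 70.5% confidence shown above:
print([round(p, 3) for p in softmax([2.0, 1.0, -1.0])])  # [0.705, 0.259, 0.035]
```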

Probability Distribution Comparison

Key Insights

Softmax Normalization

Softmax converts logits into a probability distribution that sums to 1. The exponential function ensures all outputs are positive.

Logit Difference Effect

Softmax depends only on the differences between logits, not their absolute values: adding the same constant to every logit leaves the output unchanged, while even small differences between logits can produce markedly different probabilities.

Temperature Effect

The temperature parameter T (dividing the logits before softmax) controls the output 'sharpness': a high temperature makes the distribution more uniform, a low temperature makes it more peaked.
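A sketch of temperature scaling, assuming the common convention of dividing logits by T before softmax (the helper name `softmax_t` is mine):

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

# Lower T sharpens the distribution; higher T flattens it toward uniform.
for T in (0.5, 1.0, 5.0):
    print(T, [round(p, 3) for p in softmax_t([2.0, 1.0, -1.0], T)])
```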

Loss Function Comparison

Compare cross-entropy loss with mean squared error (MSE) in classification tasks

Comparison Demo


Cross-Entropy Loss: 0.3567
Gradient: -0.4762

Mean Squared Error (MSE): 0.0900
Gradient: -0.6000
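The gradient gap between the two losses for a confidently wrong prediction can be checked in a few lines (both gradients taken with respect to ŷ):

```python
import math

y, y_hat = 1, 0.01  # a confidently wrong prediction

ce, ce_grad = -math.log(y_hat), -1 / y_hat         # cross-entropy, dL/dŷ
mse, mse_grad = (y_hat - y) ** 2, 2 * (y_hat - y)  # MSE, dL/dŷ

print(ce, ce_grad)    # ≈ 4.61 with gradient ≈ -100: a huge corrective push
print(mse, mse_grad)  # ≈ 0.98 with gradient ≈ -1.98: a much weaker signal
```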

Loss Curves Comparison (y=1)

Pros and Cons

Feature | Cross-Entropy Loss | Mean Squared Error (MSE)
Gradient for wrong predictions | Large gradient, fast correction | Small gradient, slow convergence
Gradient for correct predictions | Small gradient, stable convergence | Non-zero gradient, may overshoot
Convexity | Convex in the logits with sigmoid/softmax | Convex in ŷ, but non-convex when composed with sigmoid
Probabilistic interpretation | Maximum likelihood estimation | Least squares method
Best use case | Classification tasks | Regression tasks

Theoretical Background

Information Theory Perspective

Cross-entropy measures the difference between two probability distributions. In classification, it acts as a 'distance' between the true and predicted distributions, and minimizing it is equivalent to maximizing the likelihood of the training labels.

KL Divergence Relation

Cross-entropy decomposes as H(p, q) = H(p) + KL(p || q). Since the entropy H(p) of the true distribution is constant with respect to the model, minimizing cross-entropy is equivalent to minimizing the KL divergence.
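The identity H(p, q) = H(p) + KL(p || q) can be verified numerically (the helper names are my own):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# H(p, q) = H(p) + KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```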

Why Not MSE for Classification?

MSE assumes Gaussian-distributed errors, which suits regression. For classification, cross-entropy provides a stronger gradient signal when predictions are wrong, enabling faster correction; MSE composed with a sigmoid saturates in exactly that regime, so its gradient vanishes where correction is needed most.

Practical Tips

Numerical Stability

Computing log(0) directly yields negative infinity, and exponentiating large logits overflows. Implementations typically use the log-sum-exp trick (subtracting the maximum logit before exponentiating) to avoid both issues.
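A sketch of log-softmax using the log-sum-exp trick; a naive exp() would overflow on logits this large:

```python
import math

def log_softmax(logits):
    """log softmax(z_i) = z_i - (m + log Σ exp(z_j - m)), with m = max(z)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

# math.exp(1000.0) overflows; the shifted version is safe.
print(log_softmax([1000.0, 999.0, 998.0]))  # ≈ [-0.408, -1.408, -2.408]
```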

Label Smoothing

Replace hard labels (0,1) with soft labels (e.g., 0.1, 0.9) to prevent overconfidence and improve generalization.
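A minimal sketch of uniform label smoothing, using the standard target y(1 - ε) + ε/K for K classes (the helper name is mine):

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: target = y * (1 - eps) + eps / K for K classes."""
    k = len(one_hot)
    return [y * (1 - eps) + eps / k for y in one_hot]

# The hard target 1 becomes 1 - eps + eps/K; zeros become eps/K.
print(smooth_labels([1, 0, 0]))  # ≈ [0.933, 0.033, 0.033]
```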

Class Imbalance

For imbalanced datasets, use weighted cross-entropy to give more weight to minority classes.
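A sketch of weighted binary cross-entropy, assuming a multiplier on the positive-class term (the `pos_weight` name follows PyTorch's BCEWithLogitsLoss convention, but this helper is my own):

```python
import math

def weighted_bce(y, y_hat, pos_weight, eps=1e-12):
    """Binary cross-entropy with extra weight on the positive class,
    useful when positive examples are rare."""
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(pos_weight * y * math.log(y_hat)
             + (1 - y) * math.log(1 - y_hat))

# With pos_weight=5, missing a positive costs 5x more than a false alarm.
print(weighted_bce(1, 0.7, 5.0))  # ≈ 5 * 0.3567 ≈ 1.7834
print(weighted_bce(0, 0.3, 5.0))  # ≈ 0.3567
```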

Activation Function Choice

Use sigmoid for binary classification, softmax for multi-class. Ensure final layer activation matches loss function.