Multi-layer Perceptron - The Foundation of Deep Learning
Visualize data flowing through input, hidden, and output layers
FFNN: Information flows only forward (Input -> Hidden -> Output), no loops or cycles
See how each layer transforms data: Linear + Nonlinear Activation
Without activation: y = W2(W1x) = (W2W1)x remains linear! Cannot learn complex patterns.
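A quick numerical check of the collapse, using tiny illustrative 2x2 matrices (the specific values are arbitrary):

```python
# Sketch: composing two linear layers with no activation in between
# collapses into a single linear map: W2(W1 x) == (W2 W1) x.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[0.5, -1.0], [2.0, 0.0]]
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))  # apply layer 1, then layer 2
collapsed = matvec(matmul(W2, W1), x)   # apply the single merged matrix

# both are [3.5, -6.0]: the "deep" network is just one linear layer
assert all(abs(a - b) < 1e-9 for a, b in zip(two_layers, collapsed))
```

However many linear layers you stack, the composition is one matrix, so a nonlinear activation between layers is what gives depth its power.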
Compare different activation functions and their gradients
| Function | Pros | Issues |
|---|---|---|
| Sigmoid | Smooth, bounded in (0, 1) | Vanishing gradients; not zero-centered |
| Tanh | Smooth, zero-centered | Vanishing gradients |
| ReLU | Simple, fast; no saturation for x > 0 | Dead neurons |
| Leaky ReLU | No dead neurons | Alpha needs tuning |
| GELU | Smooth; standard in Transformers | More expensive to compute |
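The table above can be made concrete by writing out each activation and its derivative; the vanishing-gradient issue then shows up directly in the numbers (GELU here uses the common tanh approximation):

```python
import math

# Sketch: activations from the table and their derivatives.

def sigmoid(x):   return 1.0 / (1.0 + math.exp(-x))
def d_sigmoid(x): s = sigmoid(x); return s * (1.0 - s)

def d_tanh(x):    return 1.0 - math.tanh(x) ** 2

def relu(x):      return max(0.0, x)
def d_relu(x):    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):   return x if x > 0 else alpha * x
def d_leaky_relu(x, alpha=0.01): return 1.0 if x > 0 else alpha

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# Far from zero, sigmoid's gradient vanishes; ReLU's stays 1 for x > 0.
print(d_sigmoid(10.0))  # ~4.5e-05 -> vanishing
print(d_relu(10.0))     # 1.0
```

Deep stacks multiply these derivatives together (chain rule), so activations whose gradients shrink toward zero starve early layers of signal.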
An MLP can approximate any continuous function on compact subsets
Universal Approximation Theorem: a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of R^n to arbitrary accuracy
Layer 1: cuts the input space into half-space regions. Deeper layers: compose these cuts into complex decision boundaries.
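A minimal sketch of this region-cutting idea: a one-hidden-layer ReLU network that represents |x| exactly, since |x| = ReLU(x) + ReLU(-x). The two hidden units each "cut" the line at x = 0, and the output layer composes the pieces:

```python
# Sketch: a 2-neuron hidden layer representing |x| exactly.
# Each ReLU unit is active on one side of x = 0 (one linear region each);
# summing them composes a nonlinear function from linear pieces.

def relu(z):
    return max(0.0, z)

def tiny_mlp(x):
    h1 = relu(1.0 * x)    # hidden unit 1: weight +1, bias 0
    h2 = relu(-1.0 * x)   # hidden unit 2: weight -1, bias 0
    return 1.0 * h1 + 1.0 * h2  # output weights [1, 1]

for x in (-3.0, -0.5, 0.0, 2.0):
    assert tiny_mlp(x) == abs(x)
```

Wider and deeper networks follow the same recipe with many more cuts, which is the intuition behind universal approximation.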
Watch gradients flow backward through the network
Gradient descent update: W <- W - lr * dL/dW
Each gradient is computed by multiplying the local gradient by the upstream gradient (chain rule)
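The backward pass above can be traced by hand on a tiny scalar network (the weights, input, and target here are arbitrary example values):

```python
# Sketch: one backward pass through y = w2 * relu(w1 * x),
# loss L = (y - t)^2. Each parameter's gradient is the
# local gradient times the upstream gradient (chain rule).

w1, w2, lr = 0.5, -1.0, 0.1
x, t = 2.0, 1.0

# forward pass
a = w1 * x               # pre-activation: 1.0
h = max(0.0, a)          # ReLU: 1.0
y = w2 * h               # output: -1.0
L = (y - t) ** 2         # loss: 4.0

# backward pass: upstream gradient flows output -> input
dL_dy = 2.0 * (y - t)                     # -4.0
dL_dw2 = dL_dy * h                        # local grad of y wrt w2 is h
dL_dh = dL_dy * w2
dL_da = dL_dh * (1.0 if a > 0 else 0.0)   # ReLU gates the gradient
dL_dw1 = dL_da * x

# update: W <- W - lr * dL/dW
w1 -= lr * dL_dw1
w2 -= lr * dL_dw2
```

Note how the ReLU acts as a gate in the backward pass: if a <= 0, the gradient through that unit is zeroed, which is exactly the "dead neuron" failure mode.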
Understanding why every Transformer block contains an MLP/FFN
MLP = Multiple compositions of "Linear Transform + Nonlinear Activation" = Universal Function Approximator
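A sketch of the position-wise FFN found in a Transformer block: expand to a wider hidden dimension, apply a nonlinearity, project back. The dimensions (d_model=4, d_ff=8) and random weights are illustrative, not taken from any real model:

```python
import math
import random

# Sketch: the MLP/FFN sublayer of a Transformer block,
# applied independently to each token's vector.

def gelu(z):  # tanh approximation of GELU
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (z + 0.044715 * z ** 3)))

def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def ffn(x, W1, b1, W2, b2):
    h = [gelu(v) for v in linear(W1, b1, x)]  # expand: d_model -> d_ff
    return linear(W2, b2, h)                  # project back: d_ff -> d_model

random.seed(0)
d_model, d_ff = 4, 8  # real models typically use d_ff = 4 * d_model
W1 = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
W2 = [[random.gauss(0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
b1, b2 = [0.0] * d_ff, [0.0] * d_model

out = ffn([1.0, -0.5, 0.25, 2.0], W1, b1, W2, b2)
assert len(out) == d_model  # same shape in and out, ready for the residual
```

This is exactly the "Linear + Nonlinear Activation" composition from above; attention mixes information across tokens, while this FFN transforms each token's representation.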