Feedforward Neural Network / MLP

Multi-layer Perceptron - The Foundation of Deep Learning

From Rosenblatt's Perceptron (1958) to Modern Transformers - The Universal Function Approximator

Network Structure Visualization

Visualize data flowing through input, hidden, and output layers

Forward Pass: x -> [Hidden] -> ... -> y

Network Configuration

Animation

Key Concept

FFNN: Information flows only forward (Input -> Hidden -> Output), no loops or cycles

Layer Transformation Demo

See how each layer transforms data: Linear + Nonlinear Activation

z = Wa + b: Linear Transformation
a = sigma(z): Nonlinear Activation
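The two steps above can be sketched in a few lines of NumPy; the shapes and values here are illustrative assumptions, not taken from the demo itself:

```python
import numpy as np

# One layer's transformation: z = W a + b (linear), then a' = sigma(z) (nonlinear).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))        # 2 inputs -> 3 hidden units (assumed sizes)
b = np.zeros(3)
a = np.array([1.0, -0.5])          # activations from the previous layer

z = W @ a + b                      # linear transformation
a_next = 1.0 / (1.0 + np.exp(-z))  # nonlinear activation (sigmoid)

print(a_next.shape)  # (3,)
```

Every value in `a_next` lies in (0, 1), since sigmoid squashes each component of z independently.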

Weights and Bias

Activation Function

Why Nonlinearity?

Without activation: y = W2(W1x) = (W2W1)x remains linear! Cannot learn complex patterns.
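The collapse can be verified numerically; this is a minimal sketch with assumed layer sizes:

```python
import numpy as np

# Two linear layers with no activation collapse into one linear map:
# W2(W1 x) == (W2 W1) x for every x.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x)     # "deep" network without activations
one_layer = (W2 @ W1) @ x     # equivalent single linear layer

print(np.allclose(two_layer, one_layer))  # True
```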

Activation Function Gallery

Compare different activation functions and their gradients

Function Details

Formula: f(x) = 1/(1+e^(-x))
Range: (0, 1)
Gradient: f'(x) = f(x)(1-f(x))
Advantages: Smooth, differentiable
Issues: Vanishing gradient
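The formula and gradient identity above translate directly to code; this sketch also shows the vanishing-gradient issue numerically:

```python
import numpy as np

def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)); output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """f'(x) = f(x) * (1 - f(x)) -- the identity quoted above."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
# The gradient peaks at 0.25 (at x = 0) and vanishes for large |x|,
# which is exactly the vanishing-gradient problem.
print(sigmoid_grad(x))
```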

Evolution Table

Function   | Pros                 | Issues
Sigmoid    | Smooth               | Vanishing gradient
Tanh       | Zero-centered        | Vanishing gradient
ReLU       | Simple, fast         | Dead neurons
Leaky ReLU | No dead neurons      | Alpha must be tuned
GELU       | Smooth, theoretically motivated | More complex to compute
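The functions in the table can be sketched directly; the GELU below uses the common tanh approximation, which is an assumption on my part (exact GELU uses the Gaussian CDF):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small negative slope that must be tuned
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    # tanh approximation of GELU (as used in many implementations)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), leaky_relu(x), np.tanh(x), gelu(x))
```

Note how ReLU outputs exactly zero for negative inputs (the source of "dead neurons"), while Leaky ReLU keeps a small gradient there.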

Universal Approximation Theorem

An MLP can approximate any continuous function on compact subsets

Target Function

Theorem (Cybenko, 1989)

A feedforward network with a single hidden layer containing a finite number of sigmoidal units can approximate any continuous function on compact subsets of R^n to arbitrary accuracy

Geometric Intuition

Layer 1: Cuts space into regions. Deeper layers: Form complex decision boundaries through composition.
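A hand-built example makes the intuition concrete: two steep sigmoid units (one hidden layer, output weights +1 and -1) form a "bump", and sums of such bumps can approximate any continuous function. The steepness k and the bump edges here are illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

k = 50.0  # steepness: larger k -> sharper bump edges

def bump(x, lo, hi):
    # Difference of two shifted sigmoids: ~1 on [lo, hi], ~0 outside.
    return sigmoid(k * (x - lo)) - sigmoid(k * (x - hi))

x = np.linspace(0, 1, 101)
y = bump(x, 0.3, 0.7)   # a one-hidden-layer net with 2 sigmoid units
print(y[50], y[0])      # near 1 inside the interval, near 0 outside
```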

Backpropagation Visualization

Watch gradients flow backward through the network

Chain Rule: dL/dW = dL/da * da/dz * dz/dW

Simulation

Gradient Flow Status

Click buttons to start

Parameter Update

W <- W - lr * dL/dW

Each gradient is computed by multiplying local gradient with upstream gradient (chain rule)
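A minimal sketch of backpropagation for a one-hidden-layer network with squared-error loss, applying the chain rule dL/dW = dL/da * da/dz * dz/dW and the update W <- W - lr * dL/dW. All shapes and values are illustrative assumptions; bias gradients are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 1))            # single input with 3 features
y = np.array([[1.0]])                  # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = np.tanh(z1)
y_hat = W2 @ a1 + b2
loss = 0.5 * float((y_hat - y) ** 2)

# Backward pass: each local gradient multiplies the upstream gradient
dy = y_hat - y                         # dL/dy_hat
dW2 = dy @ a1.T                        # dL/dW2
da1 = W2.T @ dy                        # upstream gradient for hidden layer
dz1 = da1 * (1.0 - a1 ** 2)            # tanh'(z) = 1 - tanh(z)^2
dW1 = dz1 @ x.T                        # dL/dW1

# Gradient-descent update
lr = 0.01
W1 -= lr * dW1
W2 -= lr * dW2
```

Re-running the forward pass after the update yields a smaller loss, confirming the gradients point downhill.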

Transformer MLP Block

Understanding why every Transformer block contains an MLP/FFN

View Mode

Why MLP in Transformer?

1 Attention: Token interaction (global mixing)
2 MLP: Per-token feature refinement (local depth)
3 Complementary: Together they enable rich representations

Experimental Evidence

  • Remove MLP -> Severe performance drop
  • Remove 30-50% of attention heads -> minor impact
  • MLP provides critical nonlinear capacity
FFN(x) = GELU(xW1 + b1)W2 + b2
Typical sizes: d_model -> 4*d_model -> d_model
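A sketch of this per-token FFN with the typical 4x expansion; d_model, the weight scale, and the tanh-approximate GELU are assumptions for illustration:

```python
import numpy as np

d_model, d_ff = 8, 32   # d_ff = 4 * d_model (typical expansion)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(5, d_model))   # 5 tokens; the FFN sees each independently
out = gelu(x @ W1 + b1) @ W2 + b2   # FFN(x) = GELU(xW1 + b1)W2 + b2
print(out.shape)  # (5, 8)
```

Unlike attention, each token's output depends only on that token's input row, which is why the FFN provides per-token depth rather than token mixing.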

Network Design

Small Data: 2-4 layers
Big Data: 5-20 layers
Width Rule: 2-10x input dim

Initialization Methods

ReLU Family -> He/Kaiming Init
Sigmoid/Tanh -> Xavier/Glorot Init
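Both rules above reduce to choosing the standard deviation of the initial weights; this sketch uses the common Gaussian variants (layer sizes are illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    # He/Kaiming: std = sqrt(2 / fan_in), for ReLU-family activations
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_init(fan_in, fan_out, rng):
    # Xavier/Glorot: std = sqrt(2 / (fan_in + fan_out)), for sigmoid/tanh
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

rng = np.random.default_rng(4)
W_relu = he_init(256, 128, rng)
W_tanh = xavier_init(256, 128, rng)
print(W_relu.std(), W_tanh.std())   # close to the target standard deviations
```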

Regularization

  • L2 Weight Decay (prevent large weights)
  • Dropout (random deactivation)
  • BatchNorm / LayerNorm (stable training)
  • Early Stopping (prevent overfitting)
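Two of the regularizers listed can be sketched in a few lines; the dropout rate and decay strength below are illustrative assumptions:

```python
import numpy as np

def dropout(a, p, rng, training=True):
    # Inverted dropout: zero units with probability p during training,
    # rescale by 1/(1-p) so the expected activation is unchanged.
    if not training:
        return a
    mask = (rng.random(a.shape) >= p) / (1.0 - p)
    return a * mask

def l2_penalty(weights, lam):
    # L2 weight decay term added to the loss: (lam/2) * sum ||W||^2
    return 0.5 * lam * sum(float(np.sum(W ** 2)) for W in weights)

rng = np.random.default_rng(5)
a = np.ones(1000)
print(dropout(a, 0.5, rng).mean())   # close to 1.0 in expectation
```

At inference time `training=False` makes dropout the identity, so no rescaling is needed.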

Real-World Applications

1. House Price Prediction: regression on tabular features
2. Fraud Detection: binary classification on transaction data
3. Medical Diagnosis: multi-class classification on patient data

Historical Timeline

1958 Rosenblatt: Perceptron
1986 Rumelhart/Hinton: Backpropagation
2011 ReLU Revolution
2017 Transformer + GELU
2018 Turing Award: Hinton, LeCun, Bengio

MLP Limitations

  • High parameter count for high-dimensional inputs
  • No built-in structure exploitation (unlike CNN/RNN)
  • Prone to overfitting on small datasets
  • Fixed input size requirement

One-Line Summary

MLP = Multiple compositions of "Linear Transform + Nonlinear Activation" = Universal Function Approximator