Perceptron/Neuron - The Fundamental Unit of Deep Learning

An interactive visualization of the perceptron, activation functions, and neural network fundamentals

Frank Rosenblatt, 1958 - The atomic structure of neural networks

Perceptron Basic Form

Adjust weights and bias to see how the perceptron computes its output

Scalar Form: y = f(∑ᵢ wᵢxᵢ + b)
Vector Form: y = f(wᵀx + b)

Inputs (x)

Weights (w)

Bias (b)

Activation Function

Computation

Weighted Sum (z): 0.55
Output (y): 0.63
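The computation above can be sketched in a few lines. The input, weight, and bias values below are hypothetical examples chosen so that the weighted sum matches the demo value z = 0.55; with a sigmoid activation the output comes out to about 0.63, as shown:

```python
import math

# Hypothetical example values chosen so that z matches the demo: z = 0.55
x = [0.5, 1.0]   # inputs
w = [0.5, 0.3]   # weights
b = 0.0          # bias

# Weighted sum: z = sum(w_i * x_i) + b
z = sum(wi * xi for wi, xi in zip(w, x)) + b

# Sigmoid activation: y = 1 / (1 + e^(-z))
y = 1.0 / (1.0 + math.exp(-z))

print(round(z, 2), round(y, 2))  # 0.55 0.63
```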

Why Activation Functions Are Necessary

Without activation functions, neural networks remain linear regardless of depth

Key Insight

Linear Composition of Linear = Linear

If f(x) = a₁x + b₁ and g(x) = a₂x + b₂, then f(g(x)) = (a₁a₂)x + (a₁b₂ + b₁) — still a single linear function
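A quick numerical check that composing two linear (affine) maps collapses into one. The coefficients here are arbitrary example values:

```python
# Two linear (affine) maps: f(x) = a1*x + b1, g(x) = a2*x + b2
a1, b1 = 2.0, 1.0
a2, b2 = 3.0, -4.0

f = lambda x: a1 * x + b1
g = lambda x: a2 * x + b2

# Their composition collapses to a single linear map:
# f(g(x)) = a1*(a2*x + b2) + b1 = (a1*a2)*x + (a1*b2 + b1)
a, b = a1 * a2, a1 * b2 + b1

for x in [-1.0, 0.0, 2.5]:
    # Stacking linear layers adds no expressive power
    assert f(g(x)) == a * x + b
```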

Three Core Purposes

1 Introduce nonlinearity for complex pattern learning
2 Control numerical range of outputs
3 Provide differentiability for backpropagation

Evolution History

1958 Step Function (Rosenblatt)
1980s Sigmoid/Tanh
2011 ReLU (Revolution)
2017+ Swish/GELU

Activation Function Gallery

Compare different activation functions and their derivatives

Function Details

Formula: f(z) = 1/(1 + e^(-z))
Range: (0, 1)
Derivative: f'(z) = f(z)(1-f(z))
Advantages: Smooth, differentiable, probabilistic interpretation
Disadvantages: Gradient vanishing, non-zero centered
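The derivative identity f'(z) = f(z)(1 − f(z)) is easy to verify directly, and it also shows why gradient vanishing happens: the gradient peaks at only 0.25 (at z = 0) and decays rapidly for large |z|:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # f'(z) = f(z) * (1 - f(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Peak gradient is only 0.25, reached at z = 0
print(sigmoid_prime(0.0))   # 0.25
# For large |z| the gradient is nearly zero -> gradient vanishing
print(sigmoid_prime(10.0))  # ~4.5e-05
```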

Real-time Calculator

Gradient Flow Visualization

See how gradients propagate through different activation functions

Backpropagation Formula

∂L/∂wᵢ = ∂L/∂y · f'(z) · xᵢ

If f'(z) ≈ 0, the gradient vanishes!

Gradient Stability Comparison

Function   Gradient at large |z|   Gradient at z = 0
Sigmoid    ≈ 0 (vanishing)         0.25
Tanh       ≈ 0 (vanishing)         1.0
ReLU       1 (z > 0), 0 (z < 0)    0 or 1 (convention-dependent)
Swish      smooth, non-zero        0.5
GELU       smooth, non-zero        0.5
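The z = 0 entries in this comparison follow directly from the derivative formulas. A quick verification, using Swish(z) = z·σ(z) and GELU(z) = z·Φ(z) with Φ the standard normal CDF:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Derivatives at z = 0, matching the comparison table
d_sigmoid = sigmoid(0.0) * (1 - sigmoid(0.0))      # 0.25
d_tanh = 1.0 - math.tanh(0.0) ** 2                 # 1.0
# Swish'(z) = sigmoid(z) + z*sigmoid(z)*(1 - sigmoid(z)); at z=0 the z-term drops
d_swish = sigmoid(0.0) + 0.0                       # 0.5
# GELU'(z) = Phi(z) + z*phi(z); at z=0 only Phi(0) = 0.5 remains
d_gelu = 0.5 * (1.0 + math.erf(0.0 / math.sqrt(2.0)))  # 0.5

# For large |z|, sigmoid/tanh gradients vanish while ReLU stays at 1
z = 20.0
assert sigmoid(z) * (1 - sigmoid(z)) < 1e-8
assert 1.0 - math.tanh(z) ** 2 < 1e-8
assert (1.0 if z > 0 else 0.0) == 1.0  # ReLU'(z) = 1 for z > 0
```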

Expressiveness

ReLU family: Piecewise linear approximation
GELU/Swish: Smooth nonlinear approximation

From Single Neuron to Deep Networks

Compare linear-only networks vs networks with nonlinear activations

Multi-layer Composition

h^(l) = f(W^(l) · h^(l-1) + b^(l))
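This layer recurrence can be sketched as a forward pass. The weights below are hand-picked illustration values for a hypothetical 2-3-1 network, not a trained model:

```python
def relu(z):
    return [max(0.0, v) for v in z]

def layer(W, h, b, f):
    # One layer: h_out = f(W · h + b), one row of W per output neuron
    z = [sum(wij * hj for wij, hj in zip(row, h)) + bi
         for row, bi in zip(W, b)]
    return f(z)

# Hypothetical 2-3-1 network (weights chosen for illustration only)
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b1 = [0.0, 0.1, 0.0]
W2 = [[1.0, 2.0, 1.0]]
b2 = [-0.5]

h0 = [0.3, 0.7]                       # input x = h^(0)
h1 = layer(W1, h0, b1, relu)          # hidden layer: relu(W1·x + b1)
h2 = layer(W2, h1, b2, lambda z: z)   # linear output layer
```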

Network Type

Number of Layers

Target Function

Universal Approximation Theorem

A feedforward network with at least one hidden layer and a nonlinear activation can approximate any continuous function on a compact subset of R^n to arbitrary precision, given enough hidden units
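A tiny concrete instance of this idea: a one-hidden-layer ReLU network with just two hidden units represents |x| exactly, since |x| = ReLU(x) + ReLU(−x):

```python
def relu(z):
    return max(0.0, z)

def abs_net(x):
    # Hidden layer: weights [1, -1], biases 0; output weights [1, 1], bias 0
    h1 = relu(1.0 * x)
    h2 = relu(-1.0 * x)
    return 1.0 * h1 + 1.0 * h2

# The network reproduces |x| exactly for every input
for x in [-2.0, -0.5, 0.0, 3.0]:
    assert abs_net(x) == abs(x)
```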

Hidden Layers

Default: ReLU
Transformer: GELU / Swish

Output Layers

Task Activation
Binary Classification Sigmoid
Multi-class Classification Softmax
Regression Linear (none)
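The three output activations from this table can be sketched as follows; softmax subtracts the maximum logit before exponentiating, a standard trick for numerical stability:

```python
import math

def sigmoid(z):
    # Binary classification: squashes a logit to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Multi-class classification: logits -> probabilities summing to 1
    m = max(zs)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def identity(z):
    # Regression: raw real-valued output, no activation
    return z

p = softmax([2.0, 1.0, 0.1])
```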

Initialization Matching

ReLU He Initialization
Tanh/Sigmoid Xavier Initialization
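A minimal sketch of the two matching schemes, using the common normal-distribution variants (He: std = √(2/fan_in); Xavier/Glorot: std = √(2/(fan_in + fan_out))). The layer sizes are example values:

```python
import math
import random

def he_std(fan_in):
    # He initialization, matched to ReLU: std = sqrt(2 / fan_in)
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot initialization, matched to tanh/sigmoid:
    # std = sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(rows, cols, std, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

W_relu = init_weights(256, 512, he_std(512))            # for a ReLU layer
W_tanh = init_weights(256, 512, xavier_std(512, 256))   # for a tanh layer
```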

Combination Techniques

  • Activation + BatchNorm for stable training
  • Residual Connection + ReLU/GELU for deep networks
  • LayerNorm + GELU for Transformers
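The LayerNorm + GELU pairing from the last bullet can be sketched as the activation pipeline of a Transformer feedforward sublayer (weights omitted; gamma = 1 and beta = 0 for brevity, and the input vector is an example value):

```python
import math

def layernorm(h, eps=1e-5):
    # Normalize across the feature dimension (gamma = 1, beta = 0 for brevity)
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Activation pipeline of a Transformer feedforward sublayer:
# GELU(LayerNorm(h) @ W1 + b1) @ W2 + b2  (linear maps omitted here)
h = [0.2, -1.3, 0.8, 0.5]
normed = layernorm(h)
activated = [gelu(v) for v in normed]
```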

Conceptual Understanding

Weights

Learn "what to look at"

Bias

Learn "threshold"

Activation

Learn "how to respond"

Neuron = Learnable feature transformer with nonlinear gating

One-Line Summary

The perceptron is the atomic structure of neural networks

Activation functions determine whether networks can learn complex patterns

ReLU made deep learning truly trainable

GELU/Swish make large models more stable and powerful