#| echo: false
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
import scipy.stats as stats
from scipy.linalg import solve, qr, svd, inv
from scipy.optimize import minimize, minimize_scalar
from scipy.special import expit as sigmoid  # Numerically stable sigmoid
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'lines.linewidth': 2.5,
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'legend.fontsize': 11,
    'xtick.labelsize': 11,
    'ytick.labelsize': 11,
    'figure.dpi': 100,
    'axes.grid': True,
    'grid.alpha': 0.3,
    'grid.linewidth': 1
})

np.random.seed(42)

Deep Learning

How Neural Networks Learn

Learning to Classify

What is Machine Learning?

Learning = Task + Performance Measure + Experience

Herbert Simon (1983)

"Learning is any process by which a system improves performance from experience."

Framework

\[ \text{Learning System} = (\mathcal{T}, \mathcal{P}, \mathcal{E}) \]

Task \(\mathcal{T}\): What to accomplish
Performance \(\mathcal{P}\): How to measure success
Experience \(\mathcal{E}\): Data to learn from

Learning occurs when:

\[ \mathcal{P}_{\text{after}}(\mathcal{T}, \mathcal{E}) > \mathcal{P}_{\text{before}}(\mathcal{T}) \]

Example: Email Spam Filter

Task (\(\mathcal{T}\)): Classify emails as spam/not spam
Performance (\(\mathcal{P}\)): % correctly classified
Experience (\(\mathcal{E}\)): Database of labeled emails

Example: Self-Driving Car

\(\mathcal{T}\): Navigate roads safely
\(\mathcal{P}\): Miles without intervention
\(\mathcal{E}\): Hours of human driving data
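Returning to the spam-filter example, the condition \(\mathcal{P}_{\text{after}} > \mathcal{P}_{\text{before}}\) can be sketched in a few lines. Everything here is an illustrative assumption rather than a real filter: the "emails" are synthetic two-feature points, `sample_emails` is a hypothetical helper, and the learner is a simple nearest-centroid rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_emails(n):
    """Hypothetical stand-in for emails: 2 numeric features, label 1 = spam."""
    y = np.arange(n) % 2                            # alternate labels: balanced classes
    X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]  # spam features are shifted
    return X, y

def performance(predict, X, y):
    """P: fraction of emails classified correctly."""
    return float(np.mean(predict(X) == y))

X_train, y_train = sample_emails(200)    # E: labeled experience
X_test, y_test = sample_emails(2000)     # fresh emails used only to measure P

def predict_before(X):
    """No experience yet: always answer 'not spam'."""
    return np.zeros(len(X), dtype=int)

# Learning step: summarize each class by its mean feature vector.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict_after(X):
    """Assign each email to the nearest class centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

p_before = performance(predict_before, X_test, y_test)
p_after = performance(predict_after, X_test, y_test)
print(f"P_before = {p_before:.3f}, P_after = {p_after:.3f}")
```

The task and the performance measure stay fixed; only the experience changes, which is what makes the before/after comparison meaningful.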

Generalization is the Goal of Machine Learning

We do not care about performance on the dataset we already have
We do care about performance on similar, unlabeled data we have not seen
Accuracy/generalization trade-off (related to the bias-variance trade-off):
- Optimizing training accuracy to the extreme reduces the capability to generalize
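A minimal sketch of the trade-off above, assuming synthetic data with 20% flipped labels and a hand-written k-nearest-neighbor rule (`make_data` and `knn_predict` are hypothetical helpers, not part of the course code). With k = 1 the model memorizes the training set, so its training accuracy is perfect while its accuracy on fresh data is lower.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Synthetic task: the clean label depends on x0 + x1; 20% of labels are flipped."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = rng.random(n) < 0.2
    return X, np.where(flip, 1 - y, y)

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training points (k odd avoids ties)."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

X_train, y_train = make_data(200)
X_test, y_test = make_data(1000)

results = {}
for k in (1, 15):
    train_acc = float(np.mean(knn_predict(X_train, y_train, X_train, k) == y_train))
    test_acc = float(np.mean(knn_predict(X_train, y_train, X_test, k) == y_test))
    results[k] = (train_acc, test_acc)
    print(f"k={k:2d}  train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}")
```

A larger k typically gives up some training accuracy in exchange for smoother predictions; how much that helps on test data depends on the noise level and sample size.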

Machine Learning Inverts Traditional Programming

#| echo: false
"""
Diagram contrasting traditional programming (rules + data → output) with machine learning (data + expected output → learned program).
"""

#| fig-align: center

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import FancyBboxPatch

fig, axes = plt.subplots(1, 3, figsize=(15, 6))

# Traditional Programming
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Rules (hand-written logic supplied by the programmer)
rect1 = FancyBboxPatch((1, 6), 2, 1.5, boxstyle="round,pad=0.1",
                       facecolor='#E3F2FD', edgecolor='#1976D2', linewidth=2)
ax.add_patch(rect1)
ax.text(2, 6.75, 'Rules', ha='center', fontsize=11, fontweight='bold')

# Data (the second input)
rect2 = FancyBboxPatch((1, 3), 2, 1.5, boxstyle="round,pad=0.1",
                       facecolor='#FFF9C4', edgecolor='#F57C00', linewidth=2)
ax.add_patch(rect2)
ax.text(2, 3.75, 'Data', ha='center', fontsize=11, fontweight='bold')

# Process box
rect3 = FancyBboxPatch((4.5, 4), 2, 2, boxstyle="round,pad=0.1",
                       facecolor='#F5F5F5', edgecolor='#616161', linewidth=2)
ax.add_patch(rect3)
ax.text(5.5, 5, 'Traditional\nProgram', ha='center', fontsize=10, fontweight='bold')

# Output
rect4 = FancyBboxPatch((7.5, 4.25), 2, 1.5, boxstyle="round,pad=0.1",
                       facecolor='#E8F5E9', edgecolor='#4CAF50', linewidth=2)
ax.add_patch(rect4)
ax.text(8.5, 5, 'Output', ha='center', fontsize=11, fontweight='bold')

# Arrows
ax.arrow(3, 6.75, 1.3, -1.5, head_width=0.15, head_length=0.1, fc='black')
ax.arrow(3, 3.75, 1.3, 0.5, head_width=0.15, head_length=0.1, fc='black')
ax.arrow(6.5, 5, 0.9, 0, head_width=0.15, head_length=0.1, fc='black')

ax.set_title('Traditional Programming', fontsize=13, fontweight='bold', color='#424242')
ax.text(5, 1, 'if (temp > 30):\n    return "hot"\nelse:\n    return "cold"', 
        ha='center', fontsize=9, family='monospace', style='italic')

# Machine Learning
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Input
rect1 = FancyBboxPatch((1, 6), 2, 1.5, boxstyle="round,pad=0.1",
                       facecolor='#FFF9C4', edgecolor='#F57C00', linewidth=2)
ax.add_patch(rect1)
ax.text(2, 6.75, 'Data', ha='center', fontsize=11, fontweight='bold')

# Expected Output
rect2 = FancyBboxPatch((1, 3), 2, 1.5, boxstyle="round,pad=0.1",
                       facecolor='#E8F5E9', edgecolor='#4CAF50', linewidth=2)
ax.add_patch(rect2)
ax.text(2, 3.75, 'Expected\nOutput', ha='center', fontsize=10, fontweight='bold')

# ML box
rect3 = FancyBboxPatch((4.5, 4), 2, 2, boxstyle="round,pad=0.1",
                       facecolor='#FFE0B2', edgecolor='#FF6F00', linewidth=2)
ax.add_patch(rect3)
ax.text(5.5, 5, 'Machine\nLearning', ha='center', fontsize=10, fontweight='bold')

# Program
rect4 = FancyBboxPatch((7.5, 4.25), 2, 1.5, boxstyle="round,pad=0.1",
                       facecolor='#E3F2FD', edgecolor='#1976D2', linewidth=2)
ax.add_patch(rect4)
ax.text(8.5, 5, 'Program', ha='center', fontsize=11, fontweight='bold')

# Arrows
ax.arrow(3, 6.75, 1.3, -1.5, head_width=0.15, head_length=0.1, fc='black')
ax.arrow(3, 3.75, 1.3, 0.5, head_width=0.15, head_length=0.1, fc='black')
ax.arrow(6.5, 5, 0.9, 0, head_width=0.15, head_length=0.1, fc='black')

ax.set_title('Machine Learning', fontsize=13, fontweight='bold', color='#8B0000')
ax.text(5, 1, 'Learns from 1000s\nof examples', 
        ha='center', fontsize=9, style='italic')

# The Result
ax = axes[2]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Examples at top
examples = [
    "Recognize faces",
    "Translate languages",
    "Drive cars",
    "Diagnose diseases",
    "Predict markets"
]

y_start = 8
for i, example in enumerate(examples):
    rect = FancyBboxPatch((1, y_start - i*1.3), 8, 0.9, boxstyle="round,pad=0.05",
                          facecolor='#F3E5F5', edgecolor='#7B1FA2', linewidth=1,
                          alpha=0.7 - i*0.1)
    ax.add_patch(rect)
    ax.text(5, y_start - i*1.3 + 0.45, example, ha='center', fontsize=10)

ax.set_title('Tasks Impossible to Program Explicitly', fontsize=13, fontweight='bold', color='#7B1FA2')

plt.suptitle('The Paradigm Shift: From Rules to Learning', fontsize=16, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()

Theory-Driven vs Data-Driven Approaches

Classical: Theory-Driven

Modern: Data-Driven

Model Complexity: When to Stop Adding Parameters

George Box (1976)

"All models are wrong, but some are useful"

"Since all models are wrong the scientist cannot obtain a 'correct' one by excessive elaboration"

Box's warning: More parameters ≠ better science

MNIST Classification: Accuracy vs Complexity

Nearest neighbor: 3% error, \(\mathcal{O}(nd)\) inference (brute-force search over \(n\) stored examples of dimension \(d\))
Linear classifier: 8% error, \(\mathcal{O}(d)\) inference
2-layer network: 2% error, 50K parameters
ConvNet (LeNet-5): 0.8% error, 60K parameters
ResNet-50: 0.2% error, 25M parameters

Question: Is 0.8% → 0.2% worth growing from 60K to 25M parameters?
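The parameter counts in the list above can be sanity-checked with simple arithmetic. The sketch below assumes a fully connected 784 → 64 → 10 network for the "2-layer" entry; the hidden width of 64 is an assumption chosen because it lands near 50K, not a figure from the list.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Linear classifier on 28x28 = 784 pixels, 10 classes
linear = dense_params(784, 10)

# Assumed 2-layer network: 784 -> 64 -> 10 (hidden width is hypothetical)
hidden = 64
two_layer = dense_params(784, hidden) + dense_params(hidden, 10)

# Growth in parameter count from LeNet-5 (~60K) to ResNet-50 (~25M)
ratio = 25_000_000 / 60_000

print(f"linear classifier: {linear} parameters")    # 7850
print(f"2-layer network:   {two_layer} parameters") # 50890
print(f"ResNet-50 / LeNet-5: about {ratio:.0f}x")   # about 417x
```

Seen this way, the last step in the list buys a fraction of a percentage point of error at the cost of roughly 400 times more parameters.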

Worrying Selectively

"It is inappropriate to be concerned about mice when there are tigers abroad" (Box, 1976)

Start simple
Add complexity purposefully
Validate empirically

#| echo: false
"""
Log-log plot showing error rate and training time versus model complexity, with a highlighted "sweet spot" region where performance gains balance computational cost.
"""

#| fig-align: center

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))

# Illustrative values loosely inspired by reported model performance (not measured data)
model_complexity = np.array([1, 10, 100, 1000, 10000, 100000])
error_rate = np.array([12, 8, 3, 1.5, 0.8, 0.7])
training_time = np.array([0.1, 1, 10, 60, 600, 3600])  # seconds

ax.loglog(model_complexity, error_rate, 'o-', linewidth=2, markersize=8, color='#C62828', label='Error Rate (%)')
ax.loglog(model_complexity, training_time/60, 's--', linewidth=2, markersize=7, color='#1976D2', label='Training Time (min)')

# Mark the sweet spot
ax.axvspan(100, 1000, alpha=0.2, color='green')
ax.text(300, 10, 'Ideal?', fontsize=11, fontweight='bold', ha='center', color='#2E7D32')

ax.set_xlabel('Model Parameters', fontsize=11)
ax.set_ylabel('Error / Time', fontsize=11)
ax.set_title('The Complexity-Performance Trade-off', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Course Structure: Statistical Foundations to Neural Networks

#| echo: false
"""
Course structure diagram showing three columns: statistical models (yellow), data-driven methods (blue), and neural networks (green), illustrating how the course bridges classical and modern machine learning approaches.
"""

#| fig-align: center

import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(14, 9))
ax.set_xlim(0, 12)
ax.set_ylim(0, 9)
ax.axis('off')

# Column headers
ax.text(2, 8.3, 'statistical models', fontsize=15, color='#1976D2', fontweight='bold', ha='center')
ax.text(6, 8.3, 'data driven', fontsize=15, color='#1976D2', fontweight='bold', ha='center')

# Define grid parameters
col_width = 3.2
col_gap = 0.8
row_gap = 0.15
start_x = 0.4
start_y = 7.5

# Column 1 (Yellow) - Statistical Models
yellow_specs = [
    (1.3, ['MMSE Estimation', 'Linear/Affine MMSE Est.', 'FIR Wiener filtering']),
    (1.2, ['Bayesian decision theory', 'Hard decisions', 'soft decisions (APP)']),
    (0.8, ['ML/MAP parameter', 'estimation']),
    (0.9, ['Karhunen-Loeve expansion', 'sufficient statistics'])
]

y_pos = start_y
for height, texts in yellow_specs:
    rect = patches.FancyBboxPatch((start_x, y_pos - height), col_width, height,
                                  boxstyle="round,pad=0.02",
                                  facecolor='#FFF9C4', edgecolor='#F57C00', linewidth=1.5)
    ax.add_patch(rect)
    text_spacing = height / (len(texts) + 0.5)
    for i, text in enumerate(texts):
        ax.text(start_x + col_width/2, y_pos - text_spacing*(i+0.7), text, 
               ha='center', va='center', fontsize=10.5)
    y_pos -= (height + row_gap)

# Column 2 (Blue) - Data-Driven
blue_specs = [
    (1.3, ['general regression', 'linear LS regression', 'stochastic gradient and', 'batches']),
    (1.2, ['Classification from data', 'linear classifier', 'logistic regression', '(perceptron)']),
    (0.6, ['regularization']),
    (0.9, ['PCA', 'feature design'])
]

x_pos = start_x + col_width + col_gap
y_pos = start_y
for height, texts in blue_specs:
    rect = patches.FancyBboxPatch((x_pos, y_pos - height), col_width, height,
                                  boxstyle="round,pad=0.02",
                                  facecolor='#E3F2FD', edgecolor='#0288D1', linewidth=1.5)
    ax.add_patch(rect)
    text_spacing = height / (len(texts) + 0.5)
    for i, text in enumerate(texts):
        ax.text(x_pos + col_width/2, y_pos - text_spacing*(i+0.7), text,
               ha='center', va='center', fontsize=10.5)
    y_pos -= (height + row_gap)

# Column 3 (Green) - Neural Networks (spans multiple rows)
x_pos = start_x + 2*(col_width + col_gap)
nn_height = 5.0
nn_y = start_y
rect = patches.FancyBboxPatch((x_pos, nn_y - nn_height), col_width, nn_height,
                              boxstyle="round,pad=0.02",
                              facecolor='#E8F5E9', edgecolor='#4CAF50', linewidth=1.5)
ax.add_patch(rect)
ax.text(x_pos + col_width/2, nn_y - 1, 'neural networks', ha='center', fontsize=11.5, fontweight='bold')
ax.text(x_pos + col_width/2, nn_y - 1.8, 'for regression and', ha='center', fontsize=10.5)
ax.text(x_pos + col_width/2, nn_y - 2.4, 'classification', ha='center', fontsize=10.5)
ax.text(x_pos + col_width/2, nn_y - 3.8, 'learning with SGD', ha='center', fontsize=10.5)

# Bottom bar (Green) - Working with data (spans columns 2 and 3)
bottom_x = start_x + col_width + col_gap
bottom_width = 2*col_width + col_gap
bottom_y = 1.5
rect = patches.FancyBboxPatch((bottom_x, bottom_y), bottom_width, 0.7,
                              boxstyle="round,pad=0.02",
                              facecolor='#E8F5E9', edgecolor='#4CAF50', linewidth=1.5)
ax.add_patch(rect)
ax.text(bottom_x + bottom_width/2, bottom_y + 0.35, 'working with data', 
       ha='center', fontsize=11, fontweight='bold')

ellipse_x = start_x + col_width + col_gap/2
ellipse_y = start_y - 1.4
ellipse = patches.Ellipse((ellipse_x, ellipse_y), 3.2, 0.7, 
                         facecolor='#E0E0E0', edgecolor='#616161', 
                         linewidth=1.5, alpha=0.9, zorder=10)
ax.add_patch(ellipse)
ax.text(ellipse_x, ellipse_y, 'GD, SGD, LMS', 
       ha='center', va='center', fontsize=10.5, fontweight='bold', zorder=11)

plt.tight_layout()
plt.show()

Semester Progression: MMSE to Convolutional Networks

#| echo: false
"""
Six-panel overview of course topics: MMSE regression with fitted line, logistic regression decision boundary, MLP architecture diagram, PyTorch training loss curves, CNN layer progression, and an RNN unrolled over five time steps (week 13).
"""

#| fig-align: center

# Working directory is now lecture/01, so lib is a direct subdirectory
import sys
if 'lib' not in sys.path:
    sys.path.append('lib')
from plotting_utils import draw_neural_network, plot_loss_curves

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Course Overview', fontsize=16, fontweight='bold')

# Week 3-4: MMSE/Regression
ax = axes[0, 0]
np.random.seed(42)
x = np.linspace(0, 10, 100)
y_true = 2 * x + 1
y_obs = y_true + np.random.normal(0, 2, 100)
ax.scatter(x[::5], y_obs[::5], alpha=0.5, label='Observations', s=30, color='#1976D2')
ax.plot(x, y_true, 'r-', linewidth=2, label='MMSE Estimate')
ax.set_title('Weeks 3-4: MMSE/Regression', fontweight='bold')
ax.set_xlabel('Input')
ax.set_ylabel('Output')
ax.legend()
ax.grid(True, alpha=0.3)

# Week 5: Logistic Regression
ax = axes[0, 1]
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, 
                          n_informative=2, n_clusters_per_class=1, random_state=42)
ax.scatter(X[y==0, 0], X[y==0, 1], c='#1976D2', alpha=0.5, label='Class 0', s=30)
ax.scatter(X[y==1, 0], X[y==1, 1], c='#C62828', alpha=0.5, label='Class 1', s=30)
x_line = np.linspace(-3, 3, 100)
ax.plot(x_line, -x_line + 0.5, '--', color='#2E7D32', linewidth=2, label='Decision Boundary')
ax.set_title('Week 5: Logistic Regression', fontweight='bold')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.legend()
ax.grid(True, alpha=0.3)

# Week 5-6: MLP
ax = axes[0, 2]
draw_neural_network(ax, [3, 4, 4, 2])
ax.set_title('Weeks 5-6: Multilayer Perceptron', fontweight='bold')

# Week 8-9: PyTorch Training
ax = axes[1, 0]
epochs = np.linspace(0, 50, 50)
train_loss = 2 * np.exp(-epochs/10) + 0.1 + np.random.normal(0, 0.05, 50)
val_loss = 2 * np.exp(-epochs/12) + 0.15 + np.random.normal(0, 0.08, 50)
plot_loss_curves(ax, train_loss, val_loss)
ax.set_title('Weeks 8-9: PyTorch Training', fontweight='bold')

# Week 10-11: CNN
ax = axes[1, 1]
from matplotlib.patches import Rectangle, FancyBboxPatch

# CNN architecture visualization
layers_info = [
    {'x': 0, 'w': 1.2, 'h': 1.2, 'color': '#E3F2FD', 'label': 'Input\n28×28'},
    {'x': 2, 'w': 1.0, 'h': 1.0, 'color': '#E8F5E9', 'label': 'Conv\n24×24×8'},
    {'x': 3.5, 'w': 0.8, 'h': 0.8, 'color': '#FFF3E0', 'label': 'Pool\n12×12×8'},
    {'x': 4.8, 'w': 0.6, 'h': 0.6, 'color': '#FCE4EC', 'label': 'FC\n128'},
    {'x': 5.8, 'w': 0.3, 'h': 0.3, 'label': '10'}
]

for i, layer in enumerate(layers_info):
    y_center = 0.5
    rect = FancyBboxPatch((layer['x'], y_center - layer['h']/2), layer['w'], layer['h'],
                          boxstyle="round,pad=0.02", 
                          facecolor=layer.get('color', '#F5F5F5'),
                          edgecolor='black', linewidth=1.5)
    ax.add_patch(rect)
    ax.text(layer['x'] + layer['w']/2, y_center, layer['label'], 
           ha='center', va='center', fontsize=9)
    
    if i < len(layers_info) - 1:
        ax.arrow(layer['x'] + layer['w'], y_center, 
                layers_info[i+1]['x'] - layer['x'] - layer['w'] - 0.1, 0,
                head_width=0.05, head_length=0.05, fc='black')

ax.set_xlim(-0.5, 7)
ax.set_ylim(-0.5, 1.5)
ax.axis('off')
ax.set_title('Weeks 10-11: CNN Architecture', fontweight='bold')

# Week 13: RNN/Embeddings
ax = axes[1, 2]
time_steps = 5
for t in range(time_steps):
    rect = FancyBboxPatch((t*1.5, 0), 0.8, 0.8, boxstyle="round,pad=0.02",
                          facecolor='#E1F5FE', edgecolor='#0288D1', linewidth=2)
    ax.add_patch(rect)
    ax.text(t*1.5 + 0.4, 0.4, f'$h_{{{t}}}$', ha='center', va='center', fontsize=12)
    if t < time_steps - 1:
        ax.arrow(t*1.5 + 0.8, 0.4, 0.6, 0, head_width=0.05, 
                head_length=0.05, fc='black')
ax.set_title('Week 13: RNN/Sequential Models', fontweight='bold')
ax.set_xlim(-0.5, 7)
ax.set_ylim(-0.2, 1)
ax.axis('off')

plt.tight_layout()
plt.show()

Outline

Foundations

Learning Framework

Task, performance, experience
Generalization as the goal

Hypothesis Classes

Linear models and their limits
Bias-variance tradeoff

Data

Quality vs quantity
Representation and dimensionality

Learning Paradigms

Supervised, unsupervised, reinforcement
Self-supervised methods

Neural Networks

Architecture

Perceptron to deep networks
Universal approximation
Width vs depth

Optimization

Loss landscapes
SGD and variants

Generalization

The mystery of why networks work

Practice

Environment Setup

PyTorch Demo

Fashion-MNIST classifier

Linear Models Fail on Nonlinear Boundaries

#| echo: false
"""
Logistic regression succeeds on linearly separable data (left) but fails on the two-moons dataset (right), demonstrating that linear models cannot learn curved decision boundaries.
"""

#| fig-align: center

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# LEFT: Linearly separable data (success case)
np.random.seed(42)
X_linear, y_linear = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                         n_informative=2, n_clusters_per_class=1, 
                                         class_sep=2.0, random_state=42)
X_lin_train, X_lin_test, y_lin_train, y_lin_test = train_test_split(
    X_linear, y_linear, test_size=0.3, random_state=42
)

# Fit linear classifier to linearly separable data
linear_clf_good = LogisticRegression(random_state=42)
linear_clf_good.fit(X_lin_train, y_lin_train)
lin_accuracy = linear_clf_good.score(X_lin_test, y_lin_test)

# Plot linearly separable case
ax1.scatter(X_lin_train[y_lin_train==0, 0], X_lin_train[y_lin_train==0, 1], 
           c='#1976D2', alpha=0.6, label='Class 0', s=50)
ax1.scatter(X_lin_train[y_lin_train==1, 0], X_lin_train[y_lin_train==1, 1], 
           c='#C62828', alpha=0.6, label='Class 1', s=50)

# Add decision boundary for linear case
h = .02
x_min, x_max = X_linear[:, 0].min() - 0.5, X_linear[:, 0].max() + 0.5
y_min, y_max = X_linear[:, 1].min() - 0.5, X_linear[:, 1].max() + 0.5
xx1, yy1 = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z_good = linear_clf_good.predict_proba(np.c_[xx1.ravel(), yy1.ravel()])[:, 1]
Z_good = Z_good.reshape(xx1.shape)

ax1.contourf(xx1, yy1, Z_good, levels=[0, 0.5, 1], colors=['#E3F2FD', '#FFEBEE'], alpha=0.4)
ax1.contour(xx1, yy1, Z_good, levels=[0.5], colors='#2E7D32', linewidths=2)

ax1.set_title(f'Linearly Separable: {lin_accuracy:.1%} Accuracy', fontweight='bold', fontsize=14)
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.legend(loc='upper right')
ax1.grid(True, alpha=0.3)

# RIGHT: Two-moons dataset (failure case)
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=42)
X_moon_train, X_moon_test, y_moon_train, y_moon_test = train_test_split(
    X_moons, y_moons, test_size=0.3, random_state=42
)

# Fit linear classifier to two-moons
linear_clf_bad = LogisticRegression(random_state=42)
linear_clf_bad.fit(X_moon_train, y_moon_train)
moon_accuracy = linear_clf_bad.score(X_moon_test, y_moon_test)

# Plot two-moons case
ax2.scatter(X_moon_train[y_moon_train==0, 0], X_moon_train[y_moon_train==0, 1], 
           c='#1976D2', alpha=0.6, label='Class 0', s=50)
ax2.scatter(X_moon_train[y_moon_train==1, 0], X_moon_train[y_moon_train==1, 1], 
           c='#C62828', alpha=0.6, label='Class 1', s=50)

# Add decision boundary for two-moons
x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
xx2, yy2 = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z_bad = linear_clf_bad.predict_proba(np.c_[xx2.ravel(), yy2.ravel()])[:, 1]
Z_bad = Z_bad.reshape(xx2.shape)

ax2.contourf(xx2, yy2, Z_bad, levels=[0, 0.5, 1], colors=['#E3F2FD', '#FFEBEE'], alpha=0.4)
ax2.contour(xx2, yy2, Z_bad, levels=[0.5], colors='#C62828', linewidths=2, linestyles='--')

ax2.set_title(f'Two-Moons Dataset: {moon_accuracy:.1%} Accuracy', fontweight='bold', fontsize=14)
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.legend(loc='upper right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Two-Moons Dataset

Tests whether a model can learn curved decision boundaries. The dataset consists of two interleaving half-circles that no straight line can separate.

Neural Networks Learn Nonlinear Decision Boundaries

#| echo: true
#| code-fold: true

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10), 
    max_iter=1000, 
    random_state=42
)
mlp.fit(X_train, y_train)

print(f"Training accuracy: {mlp.score(X_train, y_train):.3f}")
print(f"Test accuracy: {mlp.score(X_test, y_test):.3f}")

#| echo: false
"""
Training data scatter plot alongside the neural network's learned nonlinear decision boundary, showing how the model separates two classes on test data.
"""

#| fig-align: center

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Training data
ax1.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], 
           c='#1976D2', alpha=0.6, label='Class 0', s=50)
ax1.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], 
           c='#C62828', alpha=0.6, label='Class 1', s=50)
ax1.set_title('Training Data', fontweight='bold')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Decision boundary
h = .02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = mlp.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

ax2.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.8)
ax2.scatter(X_test[y_test==0, 0], X_test[y_test==0, 1], c='#1976D2', 
           edgecolor='white', s=60, label='Test Class 0', linewidths=2)
ax2.scatter(X_test[y_test==1, 0], X_test[y_test==1, 1], c='#C62828', 
           edgecolor='white', s=60, label='Test Class 1', linewidths=2)
ax2.set_title('Learned Decision Boundary', fontweight='bold')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Learning Fundamentals

---

Minimize Expected Risk Using Only Finite Samples

Given

Training data: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\)
Hypothesis class: \(\mathcal{H}\)
Loss function: \(\mathcal{L}\)

Goal

Find \(h^* \in \mathcal{H}\) that minimizes:

\[ \mathbb{E}_{(\mathbf{x},y) \sim P}[\mathcal{L}(h(\mathbf{x}), y)] \]

But we only have access to:

\[ \frac{1}{N}\sum_{i=1}^N \mathcal{L}(h(\mathbf{x}_i), y_i) \]

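Generalization Gap

The difference between the expected risk (error on unseen data) and the empirical risk (error on the observed samples)

Goal: minimize error on unseen data using only observed samples

This gap defines the machine learning problem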
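The two quantities above can be compared numerically when the data distribution is known. This is a sketch under stated assumptions: \(P\) is taken to be \(y = 2x + 1\) plus Gaussian noise, \(\mathcal{H}\) is the set of lines, \(\mathcal{L}\) is squared error, and the expected risk is approximated with a large fresh sample. The helper names `sample` and `risk` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n pairs from an assumed P: y = 2x + 1 + noise with std 2."""
    x = rng.uniform(0, 10, size=n)
    y = 2 * x + 1 + rng.normal(0, 2, size=n)
    return x, y

def risk(w, b, x, y):
    """Average squared loss of the hypothesis h(x) = w*x + b on (x, y)."""
    return float(np.mean((w * x + b - y) ** 2))

# Empirical risk minimization over lines, using only N = 20 samples
x_train, y_train = sample(20)
w, b = np.polyfit(x_train, y_train, deg=1)   # least-squares slope and intercept

empirical_risk = risk(w, b, x_train, y_train)

# Monte Carlo estimate of the expected risk from a large fresh sample
x_new, y_new = sample(100_000)
expected_risk_est = risk(w, b, x_new, y_new)

print(f"h(x) = {w:.2f} x + {b:.2f}")
print(f"empirical risk (N=20):    {empirical_risk:.3f}")
print(f"expected risk (estimate): {expected_risk_est:.3f}")
print(f"generalization gap:       {expected_risk_est - empirical_risk:.3f}")
```

Because the fit minimizes the training loss, the empirical risk of the fitted line is never larger than that of the true line on the same 20 points; the expected risk carries no such guarantee, which is why the two numbers differ.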

Three Paradigms: Supervised, Unsupervised, Reinforcement

#| echo: false
"""
Three-panel comparison of learning paradigms: supervised (input-output pairs training a model), unsupervised (unlabeled data finding structure), and reinforcement learning (agent-environment interaction loop).
"""

#| fig-align: center

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(16, 5.5))

# Supervised Learning
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Data pairs section
data_box = patches.FancyBboxPatch((0.5, 2), 4, 6, boxstyle="round,pad=0.05",
                                  edgecolor='#E0E0E0', facecolor='#FAFAFA', linewidth=1.5, alpha=0.3)
ax.add_patch(data_box)
ax.text(2.5, 8.3, 'Training Data', ha='center', fontsize=10, style='italic')

# Input-output pairs
for i in range(3):
    y_pos = 6.5 - i*1.8
    # Input
    rect_in = patches.FancyBboxPatch((1, y_pos), 1.2, 0.8, boxstyle="round,pad=0.03",
                                     edgecolor='#1976D2', facecolor='#E3F2FD', linewidth=2)
    ax.add_patch(rect_in)
    ax.text(1.6, y_pos+0.4, f'$\\mathbf{{x}}_{{{i+1}}}$', ha='center', va='center', fontsize=11)
    
    # Arrow
    ax.arrow(2.25, y_pos+0.4, 0.5, 0, head_width=0.12, head_length=0.08, fc='#424242', ec='#424242')
    
    # Output
    rect_out = patches.FancyBboxPatch((2.8, y_pos), 1.2, 0.8, boxstyle="round,pad=0.03",
                                      edgecolor='#388E3C', facecolor='#E8F5E9', linewidth=2)
    ax.add_patch(rect_out)
    ax.text(3.4, y_pos+0.4, f'$y_{{{i+1}}}$', ha='center', va='center', fontsize=11)

# Model
rect_model = patches.FancyBboxPatch((5.5, 3.5), 2.5, 3, boxstyle="round,pad=0.08",
                                    edgecolor='#D32F2F', facecolor='#FFEBEE', linewidth=2.5)
ax.add_patch(rect_model)
ax.text(6.75, 5, 'Model\n$f(\\mathbf{x}; \\theta)$', ha='center', va='center', fontsize=11, fontweight='bold')

# Training arrow
arrow_patch = patches.FancyArrowPatch((4.5, 5), (5.4, 5), mutation_scale=25, 
                                      color='#D32F2F', linewidth=2.5, arrowstyle='->')
ax.add_patch(arrow_patch)
ax.text(4.95, 5.4, 'Learn', fontsize=10, fontweight='bold', ha='center')

ax.set_title('Supervised Learning', fontsize=13, fontweight='bold', pad=15)
ax.text(5, 1.2, 'Learns from labeled examples', ha='center', fontsize=9, style='italic', color='#616161')

# Unsupervised Learning
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Just inputs section
data_box = patches.FancyBboxPatch((0.5, 2), 2.5, 6, boxstyle="round,pad=0.05",
                                  edgecolor='#E0E0E0', facecolor='#FAFAFA', linewidth=1.5, alpha=0.3)
ax.add_patch(data_box)
ax.text(1.75, 8.3, 'Unlabeled Data', ha='center', fontsize=10, style='italic')

# Just inputs
for i in range(4):
    y_pos = 7 - i*1.3
    rect_in = patches.FancyBboxPatch((1, y_pos), 1.5, 0.8, boxstyle="round,pad=0.03",
                                     edgecolor='#1976D2', facecolor='#E3F2FD', linewidth=2)
    ax.add_patch(rect_in)
    ax.text(1.75, y_pos+0.4, f'$\\mathbf{{x}}_{{{i+1}}}$', ha='center', va='center', fontsize=11)

# Model finding structure
rect_model = patches.FancyBboxPatch((4, 3.5), 2.5, 3, boxstyle="round,pad=0.08",
                                    edgecolor='#7B1FA2', facecolor='#F3E5F5', linewidth=2.5)
ax.add_patch(rect_model)
ax.text(5.25, 5, 'Discover\nPatterns', ha='center', va='center', fontsize=11, fontweight='bold')

# Clusters output
cluster_box = patches.FancyBboxPatch((7.5, 3), 2, 4, boxstyle="round,pad=0.05",
                                     edgecolor='#E0E0E0', facecolor='white', linewidth=1.5, alpha=0.8)
ax.add_patch(cluster_box)
for i, (color, y) in enumerate(zip(['#FF6F00', '#00ACC1', '#FFD600'], [6, 5, 4])):
    circle = patches.Circle((8.5, y), 0.35, color=color, alpha=0.8, ec='white', linewidth=1)
    ax.add_patch(circle)
    ax.text(8.5, y, f'C{i+1}', ha='center', va='center', fontsize=9, fontweight='bold', color='white')

arrow_patch = patches.FancyArrowPatch((3, 5), (3.9, 5), mutation_scale=25,
                                      color='#7B1FA2', linewidth=2.5, arrowstyle='->')
ax.add_patch(arrow_patch)
arrow_patch2 = patches.FancyArrowPatch((6.6, 5), (7.4, 5), mutation_scale=25,
                                       color='#7B1FA2', linewidth=2.5, arrowstyle='->')
ax.add_patch(arrow_patch2)

ax.set_title('Unsupervised Learning', fontsize=13, fontweight='bold', pad=15)
ax.text(5, 1.2, 'Discovers structure without labels', ha='center', fontsize=9, style='italic', color='#616161')

# Reinforcement Learning
ax = axes[2]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Agent
agent_circle = patches.Circle((2.5, 5), 1, facecolor='#FF5252', edgecolor='#B71C1C', linewidth=2.5)
ax.add_patch(agent_circle)
ax.text(2.5, 5, 'Agent\n$\\pi(a|s)$', ha='center', va='center', fontsize=11, fontweight='bold', color='white')

# Environment
env_box = patches.FancyBboxPatch((5.5, 3), 3.5, 4, boxstyle="round,pad=0.08",
                                 edgecolor='#00ACC1', facecolor='#E0F7FA', linewidth=2.5)
ax.add_patch(env_box)
ax.text(7.25, 5, 'Environment', ha='center', va='center', fontsize=12, fontweight='bold')

# Interaction arrows with better positioning
# Action
arrow1 = patches.FancyArrowPatch((3.4, 5.5), (5.4, 5.5), mutation_scale=20,
                                color='#1565C0', linewidth=2, arrowstyle='->')
ax.add_patch(arrow1)
ax.text(4.4, 5.9, 'Action $a_t$', fontsize=10, ha='center')

# State
arrow2 = patches.FancyArrowPatch((5.4, 4.8), (3.4, 4.8), mutation_scale=20,
                                color='#2E7D32', linewidth=2, arrowstyle='->')
ax.add_patch(arrow2)
ax.text(4.4, 4.4, 'State $s_t$', fontsize=10, ha='center')

# Reward
arrow3 = patches.FancyArrowPatch((5.4, 4), (3.4, 4), mutation_scale=20,
                                color='#F57C00', linewidth=2, arrowstyle='->', linestyle='dashed')
ax.add_patch(arrow3)
ax.text(4.4, 3.6, 'Reward $r_t$', fontsize=10, ha='center')

ax.set_title('Reinforcement Learning', fontsize=13, fontweight='bold', pad=15)
ax.text(5, 1.2, 'Learns through trial and feedback', ha='center', fontsize=9, style='italic', color='#616161')

plt.tight_layout()
plt.show()

Modern methods combine paradigms: GPT-4 uses self-supervised pre-training on text (often described as unsupervised), supervised fine-tuning on tasks, and reinforcement learning from human feedback (RLHF).
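The difference between the first two paradigms in the figure can be sketched on one synthetic dataset: a supervised rule that uses the labels, and a clustering rule that never sees them. The data, the nearest-centroid rule, and the plain 2-means loop are all illustrative assumptions. Reinforcement learning is left out because it needs an environment to interact with.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs (synthetic, for illustration only)
y = np.arange(200) % 2
X = rng.normal(scale=0.5, size=(200, 2)) + 6.0 * y[:, None]

# Supervised: the labels y are used to compute one centroid per class
sup_centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

# Unsupervised: 2-means (Lloyd's algorithm) looks only at X
centers = X[:2].copy()                 # initialize from the first two points
for _ in range(20):
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)      # nearest-center assignment
    centers = np.stack([X[assign == c].mean(axis=0) for c in (0, 1)])

# Cluster indices are arbitrary, so compare up to a swap of the two names
agreement = max(np.mean(assign == y), np.mean(assign != y))

print("supervised centroids:\n", sup_centroids)
print("cluster centers:\n", centers)
print(f"agreement between clusters and labels: {agreement:.3f}")
```

On data this cleanly separated the clusters recover the classes without labels; on overlapping data the two approaches generally diverge, since clustering has no way to know which structure the task cares about.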

Example Task: "2s" Detector

MNIST: Input and Output Representations

Input Space

Output Space

Same Data, Multiple Representations

Representation Determines Learnability

Example: The Data Domain

The choice of how to represent input is very important

Converting Pattern to Binary Vector

Binary Representation

Key Insight

Linear Classifier on Binary Representation

Representation: Binary vectors, length \(d = 49\)

Hypothesize mapping data to label using linear classifier:

Definition: Linear Function

Linear vs Nonlinear Hypothesis Classes

Perceptron: Linear Combination + Nonlinearity

Mathematical Model

Activation Functions Add Nonlinearity

Why Nonlinearity Matters

Closed-Form vs Iterative Optimization

Explicit (Closed-form)

Iterative (Gradient-based)

Gradient Descent Visualization

Iterative Optimization Principle

The Bias-Variance Decomposition

Expected Prediction Error

Bias-Variance in Practice: Polynomial Fitting

EE 541 Core Principles

Theory

Implementation

Listen to the Data

Data Quality Dominates Quantity

Clive Humby (2006)

Illustrative Example: Data Refinement Impact

Representation Transforms Problem Difficulty

High Dimensions Break Geometric Intuition

The Curse of Dimensionality

Label Noise Degrades Performance More Than Limited Data

Amazon Resume Screening: Training on Biased Data

Data Augmentation: Synthetic Diversity from Limited Samples

Standard Augmentations

Mathematical View

ML Learning Paradigms

Three Paradigms: Supervised, Unsupervised, Reinforcement

Supervised Learning: Labeled Data to Function Mapping

Problem Formulation

Core Tasks