HW 5 — Guide
Logistic Regression and the MNIST Dataset
#| label: setup #| include: false import numpy as np import matplotlib.pyplot as plt import h5py
Binary Classification with Logistic Regression
Logistic regression extends linear models to classification by applying a nonlinear transformation to the output. For logistic regression specifically, this transformation maps the linear predictor to conditional probabilities \(P(Y=1|X)\) between 0 and 1, providing a probabilistic framework for making binary decisions.
The Sigmoid Function as a Nonlinear Detector
The logistic (sigmoid) function transforms any real-valued input into the range (0,1):
This transformation functions as a soft threshold detector. The function's gradient provides critical information about sensitivity to input changes:
The gradient reaches its maximum value of 0.25 at z=0 (where \(\sigma(z)=0.5\)), making the function most sensitive precisely at the decision boundary. This mathematical property is fundamental to the learning process in logistic regression.
#| label: sigmoid-visualization #| fig-cap: "Sigmoid function with derivative and tangent line" def sigmoid(z): """Numerically stable implementation of sigmoid function""" return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z))) def sigmoid_derivative(z): """Derivative of the sigmoid function""" s = sigmoid(z) return s * (1 - s) # Visualize sigmoid and its derivative z = np.linspace(-6, 6, 100) sig = sigmoid(z) dsig = sigmoid_derivative(z) # Plot sigmoid and derivative plt.figure(figsize=(10, 5)) plt.plot(z, sig, 'b-', linewidth=2, label='Sigmoid function') plt.plot(z, dsig, 'r-', linewidth=2, label='Derivative') # Illustrate tangent line at a specific point point_z = 2 point_s = sigmoid(point_z) point_d = sigmoid_derivative(point_z) # Mark the point plt.plot(point_z, point_s, 'ko', markersize=8) # Draw tangent line x_tangent = np.array([point_z - 2, point_z + 2]) y_tangent = point_s + point_d * (x_tangent - point_z) plt.plot(x_tangent, y_tangent, 'g--', linewidth=2, label=f'Tangent at z={point_z}, slope={point_d:.3f}') plt.grid(True) plt.legend() plt.title('Sigmoid Function and Its Derivative') plt.xlabel('z') plt.ylabel('σ(z)') plt.axhline(y=0, color='k', linestyle='-', alpha=0.3) plt.axhline(y=1, color='k', linestyle='-', alpha=0.3) plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
The forward propagation in logistic regression combines the linear model with the sigmoid function:
#| label: forward-propagation def predict_proba(X, w, b): """ Forward propagation for logistic regression Parameters: X: input features matrix (n_samples, n_features) w: weight vector (n_features,) b: bias term (scalar) Returns: Predicted probabilities for class 1 """ z = X @ w + b return sigmoid(z)
Binary Cross-Entropy Loss
Binary cross-entropy loss has foundations in information theory and maximum likelihood estimation:
This loss function comes directly from maximum likelihood estimation. When modeling p(x) as the probability that y=1 given model parameters w, the likelihood of our dataset is:
Taking the negative logarithm yields the binary cross-entropy loss. This direct derivation from probability theory makes binary cross-entropy the mathematically optimal loss function for logistic regression.
#| label: binary-cross-entropy def binary_cross_entropy(y_true, y_pred): """ Compute binary cross-entropy loss with numerical stability Parameters: y_true: True binary labels (0 or 1) y_pred: Predicted probabilities for class 1 Returns: Average binary cross-entropy loss """ epsilon = 1e-15 # Prevent log(0) y_pred = np.clip(y_pred, epsilon, 1 - epsilon) return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
Regularization
Regularization imposes constraints on the system response (weights) to prevent high-gain solutions that would amplify noise in the input data.
L2 Regularization (Ridge)
L2 regularization applies a quadratic penalty on weights, analogous to energy constraints in signal processing. It produces a smoother decision boundary by limiting the energy in the weight vector. From a Bayesian perspective, this represents a Gaussian prior on the weights.
L1 Regularization (Lasso)
L1 regularization promotes sparsity in the weight vector, similar to compressed sensing in signal reconstruction. It enforces a constraint that drives many weights to precisely zero, effectively performing feature selection by identifying the most salient pixels for detecting the target digit.
L1 vs L2 Regularization: Geometric Interpretation
#| label: l1-l2-visualization #| fig-cap: "Geometric comparison of L1 and L2 regularization constraints" from matplotlib.patches import Circle # Create geometric visualization of L1 vs L2 regularization fig, ax = plt.subplots(1, 2, figsize=(12, 5)) # Set up centered coordinates x = np.linspace(-2, 2, 100) y = np.linspace(-2, 2, 100) X, Y = np.meshgrid(x, y) # True optimal point shifted from origin to show tradeoff # Modified to be not at 45 degrees from origin for better visualization optimal_point = np.array([1.2, 0.6]) # Intentionally asymmetric # Standard loss function (elliptical to avoid 45-degree symmetry) Z_loss = 2*(X-optimal_point[0])**2 + (Y-optimal_point[1])**2 # Contour levels for loss function loss_levels = np.array([0.2, 0.5, 1.0, 2.0, 3.0]) # Regularization constraint values l2_constraint = 0.8 # Radius of circle for L2 l1_constraint = 0.8 # Radius of diamond for L1 # L2 regularization plot cs0 = ax[0].contour(X, Y, Z_loss, loss_levels, colors='blue', alpha=0.8, linestyles='-') l2_circle = Circle((0, 0), l2_constraint, fill=False, color='red', linestyle='-', linewidth=2) ax[0].add_patch(l2_circle) # L1 regularization plot cs1 = ax[1].contour(X, Y, Z_loss, loss_levels, colors='blue', alpha=0.8, linestyles='-') # L1 diamond shape l1_points = np.array([ [l1_constraint, 0], [0, l1_constraint], [-l1_constraint, 0], [0, -l1_constraint], [l1_constraint, 0] ]) ax[1].plot(l1_points[:,0], l1_points[:,1], 'r-', linewidth=2) # Mark the true optimal point (standard loss only) for a in ax: a.plot(optimal_point[0], optimal_point[1], 'b*', markersize=10, label='Loss optimum') # Analytically determine the constrained optimal points # For L2: Point on circle closest to optimal point (projection) l2_direction = optimal_point / np.linalg.norm(optimal_point) # Unit vector toward optimal l2_optimal = l2_constraint * l2_direction # Scale to lie on circle # For L1: Point on diamond closest to optimal point # Due to the asymmetric loss, the optimal point will be on the x-axis l1_optimal = np.array([l1_constraint, 0]) # Sparse solution with w₂=0 # Plot optimal points ax[0].plot(l2_optimal[0], l2_optimal[1], 'ko', markersize=8, label='Regularized optimum') ax[1].plot(l1_optimal[0], l1_optimal[1], 'ko', markersize=8, label='Regularized optimum') # Calculate loss at optimal points l2_loss_value = 2*(l2_optimal[0]-optimal_point[0])**2 + (l2_optimal[1]-optimal_point[1])**2 l1_loss_value = 2*(l1_optimal[0]-optimal_point[0])**2 + (l1_optimal[1]-optimal_point[1])**2 # Add custom contours for the exact loss values at optimal points ax[0].contour(X, Y, Z_loss, [l2_loss_value], colors='blue', linewidths=2) ax[1].contour(X, Y, Z_loss, [l1_loss_value], colors='blue', linewidths=2) # Add contour labels plt.clabel(cs0, inline=1, fontsize=8, fmt='%.1f') plt.clabel(cs1, inline=1, fontsize=8, fmt='%.1f') # Labels and formatting ax[0].set_title('L2 Regularization') ax[0].text(-1.8, 1.7, 'Blue: Loss contours', color='blue', fontsize=9) ax[0].text(-1.8, 1.4, 'Red: L2 constraint (‖w‖₂ ≤ c)', color='red', fontsize=9) ax[0].text(0.9, 0.3, 'Both weights\nnon-zero', color='black', fontsize=9) ax[0].legend(loc='upper right', fontsize=9) ax[1].set_title('L1 Regularization') ax[1].text(-1.8, 1.7, 'Blue: Loss contours', color='blue', fontsize=9) ax[1].text(-1.8, 1.4, 'Red: L1 constraint (‖w‖₁ ≤ c)', color='red', fontsize=9) ax[1].text(l1_constraint + 0.1, -0.2, 'w₂=0\n(Sparse solution)', color='black', ha='center', fontsize=9) # Draw tangent lines at optimal points to emphasize tangency def get_tangent_points(center, radius, point): # Get two points on tangent line through optimal point dx, dy = point[0] - center[0], point[1] - center[1] norm = np.sqrt(dx**2 + dy**2) nx, ny = -dy/norm, dx/norm # Normal vector return [[point[0] - nx*0.5, point[1] - ny*0.5], [point[0] + nx*0.5, point[1] + ny*0.5]] # Add tangent line for L2 tangent_l2 = get_tangent_points([0, 0], l2_constraint, l2_optimal) ax[0].plot([tangent_l2[0][0], tangent_l2[1][0]], [tangent_l2[0][1], tangent_l2[1][1]], 'k--', linewidth=1, alpha=0.7) # Set equal axes limits for better comparison for a in ax: a.set_xlim(-2, 2) a.set_ylim(-2, 2) a.grid(True, linestyle='--', alpha=0.5) a.set_xlabel('w₁') a.set_ylabel('w₂') a.axhline(y=0, color='k', linestyle='-', alpha=0.3) a.axvline(x=0, color='k', linestyle='-', alpha=0.3) a.set_aspect('equal') # Equal aspect ratio plt.tight_layout() plt.suptitle('L1 vs L2 Regularization: Geometric Interpretation', y=1.05)
The figure illustrates the geometric difference between L1 and L2 regularization constraints:
-
Left (L2): The circular constraint tangentially intersects the loss contour at a point where both weights are non-zero. The smooth boundary of the L2 constraint rarely coincides with coordinate axes.
-
Right (L1): The diamond-shaped constraint touches the loss contour precisely at a vertex on the w₁ axis (w₂=0). The corners of the L1 constraint align with coordinate axes, promoting sparse solutions.
For MNIST classification, L1 regularization yields sparse solutions where many pixel weights equal zero, effectively performing feature selection. L2 regularization shrinks all weights but rarely nullifies them completely.
Regularization as Bayesian Priors
From a Bayesian perspective, regularization terms correspond directly to prior distributions on model weights. This connection provides additional insight into why different regularization methods produce different behaviors.
In Bayesian inference, we seek to estimate the posterior distribution of weights given the data:
Where \(p(y|w)\) is the likelihood of observing the data given the weights, and \(p(w)\) is the prior probability of the weights. Taking the negative logarithm of the posterior (and dropping the constant denominator), we get:
For logistic regression with a binary cross-entropy loss function, the negative log-likelihood term is:
The regularization term arises from our prior assumptions about the weights. Consider two common prior distributions:
- Gaussian (Normal) Prior: When we assume weights are normally distributed around zero:
Taking the negative logarithm:
Where $\lambda = 1/\sigma^2$. This gives us the L2 regularization term.
- Laplace Prior: When we assume weights follow a Laplace distribution:
Taking the negative logarithm:
Where $\lambda = 1/b$. This gives us the L1 regularization term.
The critical difference lies in the shape of these distributions near zero. The Laplace distribution has a sharp peak at zero, placing higher probability density on exactly zero values than the Gaussian distribution does. This mathematical property explains why L1 regularization promotes sparsity while L2 does not.
In MNIST digit classification, using L1 regularization effectively encodes a prior belief that most pixel weights should be exactly zero, with only a few important pixels contributing to the classification decision.
Computing the Gradient with Regularization
#| label: regularization-gradient def compute_gradient(X, y_true, y_pred, w, lambda_reg=0, reg_type=None): """ Compute gradient for binary logistic regression with regularization Parameters: X: Input features matrix (n_samples, n_features) y_true: True binary labels (0 or 1) y_pred: Predicted probabilities w: Current weights lambda_reg: Regularization strength reg_type: Type of regularization ('l1', 'l2', or None) Returns: Gradient of the loss with respect to weights """ n_samples = X.shape[0] grad_loss = X.T @ (y_pred - y_true) / n_samples if reg_type == 'l2' and lambda_reg > 0: grad_reg = lambda_reg * w elif reg_type == 'l1' and lambda_reg > 0: grad_reg = lambda_reg * np.sign(w) else: grad_reg = 0 return grad_loss + grad_reg
The gradient computation incorporates the regularization term's derivative. For L2 regularization, this is simply \(\lambda w\), while for L1 regularization, it's \(\lambda \cdot \text{sign}(w)\). This difference in gradients explains why L1 regularization can drive weights exactly to zero, while L2 regularization only approaches zero asymptotically.
Learning Rate Selection Strategy
The learning rate determines the step size in gradient descent. The weight update follows:
For high-dimensional inputs like MNIST (784 features), smaller learning rates typically provide better stability. Common approaches include:
- Fixed step size starting with a conservative value (e.g., 0.01)
- Line search to adaptively find the optimal step size at each iteration
- Scheduled decay that reduces the learning rate as training progresses
#| label: gradient-descent def gradient_descent(X, y, learning_rate=0.01, max_iter=1000, lambda_reg=0, reg_type=None, tol=1e-6): """ Simple gradient descent implementation for logistic regression Parameters: X: Input features matrix (n_samples, n_features) y: True binary labels (0 or 1) learning_rate: Step size for gradient updates max_iter: Maximum number of iterations lambda_reg: Regularization strength reg_type: Type of regularization ('l1', 'l2', or None) tol: Convergence tolerance (minimum change in loss) Returns: w: Learned weights b: Learned bias losses: History of loss values during training """ n_features = X.shape[1] # Initialize weights and bias w = np.zeros(n_features) b = 0 losses = [] for iteration in range(max_iter): # Forward pass y_pred = predict_proba(X, w, b) # Compute loss loss = binary_cross_entropy(y, y_pred) if reg_type == 'l2': loss += lambda_reg * 0.5 * np.sum(w**2) elif reg_type == 'l1': loss += lambda_reg * np.sum(np.abs(w)) losses.append(loss) # Check convergence if iteration > 0 and abs(losses[-1] - losses[-2]) < tol: break # Compute gradients grad_w = compute_gradient(X, y, y_pred, w, lambda_reg, reg_type) grad_b = np.mean(y_pred - y) # Update parameters w -= learning_rate * grad_w b -= learning_rate * grad_b return w, b, losses
Convergence Criteria
Establishing proper convergence criteria prevents both premature termination and unnecessary computation. Consider monitoring:
- Absolute loss threshold: Stop when loss falls below a small value
- Relative improvement: Stop when improvements between iterations become minimal
- Validation performance: Stop when validation performance plateaus or degrades
Weight Interpretation as Filter Response
The learned weight vector w in logistic regression represents a spatial filter matched to the target pattern (digit "2"). Areas with positive weights increase the activation when pixel values are high, while negative weights suppress the response.
Visualizing this weight vector as a 28×28 image typically reveals a pattern resembling the target digit, providing insight into the features the model has learned to identify.
#| label: weight-visualization #| eval: false def visualize_weights(w): """ Visualize the learned weights as an image Parameters: w: Weight vector of length 784 (for 28x28 MNIST images) """ plt.figure(figsize=(6, 6)) plt.imshow(w.reshape(28, 28), cmap='RdBu_r') plt.colorbar() plt.title('Weight Vector Visualization') plt.show()
Evaluation Beyond Accuracy
The sensitivity-specificity tradeoff provides an important perspective for binary classifiers:
- Accuracy: Overall correctness (potentially misleading with imbalanced classes)
- Precision: Proportion of true positives among positive predictions
- Recall: Proportion of true positives that are correctly identified
- F1 score: Harmonic mean of precision and recall
These metrics connect to detection theory concepts of probability of detection and false alarm rates in signal processing.
#| label: evaluation-metrics def evaluate_binary_classifier(y_true, y_pred_prob, threshold=0.5): """ Evaluate binary classifier performance metrics Parameters: y_true: True binary labels (0 or 1) y_pred_prob: Predicted probabilities for class 1 threshold: Classification threshold (default: 0.5) Returns: Dictionary of evaluation metrics """ y_pred = (y_pred_prob >= threshold).astype(int) # Compute metrics accuracy = np.mean(y_pred == y_true) true_pos = np.sum((y_pred == 1) & (y_true == 1)) false_pos = np.sum((y_pred == 1) & (y_true == 0)) false_neg = np.sum((y_pred == 0) & (y_true == 1)) true_neg = np.sum((y_pred == 0) & (y_true == 0)) precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else 0 recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) > 0 else 0 f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0 return { 'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1_score': f1 }
Learning Curves Analysis
Learning curves plot the model's performance on training and validation sets against the number of iterations. These curves help diagnose:
- Underfitting: Both training and validation errors remain high
- Overfitting: Training error continues to decrease while validation error increases
- Optimal stopping point: Where validation error reaches its minimum
#| label: learning-curves #| eval: false def plot_learning_curves(train_losses, val_losses): """ Plot learning curves to analyze training progress Parameters: train_losses: List of loss values on training set val_losses: List of loss values on validation set """ plt.figure(figsize=(10, 6)) plt.plot(train_losses, label='Training Loss') plt.plot(val_losses, label='Validation Loss') plt.xlabel('Iteration') plt.ylabel('Loss') plt.title('Learning Curves') plt.grid(True) plt.legend() plt.show()
Saving Model Parameters
Save your trained weights and bias as specified:
#| label: save-model #| eval: false def save_model(weights, bias, outFile): """ Save trained model parameters to HDF5 file Parameters: weights: Trained weight vector (length 784) bias: Trained bias term (scalar) outFile: Output file path """ with h5py.File(outFile, 'w') as hf: hf.create_dataset('w', data=np.asarray(weights)) hf.create_dataset('b', data=np.asarray(bias))
The weights should be a 784-length vector and the bias a scalar value.