HW 8 — Guide

#| echo: false
import numpy as np
import matplotlib.pyplot as plt

PyTorch for Deep Learning

What is PyTorch?

PyTorch is an open-source machine learning framework developed by Facebook's AI Research lab (FAIR). It evolved from Torch, a scientific computing framework built in Lua. PyTorch reimplemented Torch's core functionality in Python while adding automatic differentiation capabilities.

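The fundamental distinction between PyTorch and frameworks like early TensorFlow lies in how they build the computational graph. TensorFlow 1.x used a static graph ("define-and-run"): the graph was declared up front and then executed separately. PyTorch uses a dynamic computational graph ("define-by-run"), where the graph is constructed on the fly during execution. Because the model is ordinary Python code as it runs, standard debuggers, print statements, and control flow work inside it, which makes debugging and developing complex models more flexible.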
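As a minimal sketch of define-by-run, the hypothetical function `piecewise` below (an illustrative name, not part of PyTorch) uses an ordinary Python `if` to choose which operations run. Autograd records only the branch that actually executes, so the gradient depends on the input.

```python
import torch

def piecewise(x):
    # Plain Python control flow: only the executed branch is recorded
    if x.item() > 0:
        return x ** 2      # derivative: 2x
    return -3 * x          # derivative: -3

x_pos = torch.tensor(2.0, requires_grad=True)
x_neg = torch.tensor(-1.0, requires_grad=True)

piecewise(x_pos).backward()
piecewise(x_neg).backward()

print(x_pos.grad)  # gradient of x**2 evaluated at x = 2
print(x_neg.grad)  # gradient of -3x evaluated at x = -1
```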

PyTorch uses a hybrid architecture: the frontend is Python for ease of use and rapid development, while the computational backend is implemented in C++ and CUDA for performance. This architecture provides:

Python's flexibility and ecosystem integration
C++'s execution speed for computation-intensive operations
CUDA's parallel computing capabilities for GPU acceleration

The design focuses on:

Tensor computation with strong GPU acceleration
Automatic differentiation for building and training neural networks
Deep neural network APIs built on a tape-based autograd system

This hybrid approach resolves the apparent paradox of implementing computationally intensive tasks in Python. While Python itself is relatively slow, the actual numeric computations in PyTorch are performed by optimized C++/CUDA code with minimal Python overhead.

GPU Acceleration

Modern deep learning relies heavily on GPU computing to accelerate matrix operations. PyTorch provides built-in support for NVIDIA GPUs through CUDA and for Apple Silicon hardware through Metal Performance Shaders (MPS).

import torch

# Prefer an NVIDIA GPU (CUDA), then Apple Silicon (MPS), then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

GPU acceleration can provide speedups on the order of 10-100x over CPU-only training for large neural networks, depending on the model, batch size, and hardware. The speedup comes from parallelizing specific operations:

Matrix multiplications in linear layers see massive parallelization benefits
Convolutional operations are highly optimized for GPU execution
Batch processing allows parallel handling of multiple samples

Not all operations benefit equally: recurrent neural networks (RNNs) have sequential dependencies that limit parallelization, and reinforcement learning algorithms with sequential decision processes may see less dramatic speedups. Modern architectures like Transformers were specifically designed to maximize GPU parallelization potential.

To use GPU acceleration effectively:

Move the model to the GPU
```
model = model.to(device)
```

Move input data to the GPU before forward passes

inputs, labels = inputs.to(device), labels.to(device)

Retrieve CPU tensors when needed (e.g., for numpy operations or visualization)
```
cpu_tensor = gpu_tensor.cpu()
```
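The three steps above can be sketched end to end as follows. The layer sizes and the random batch are arbitrary placeholders chosen for illustration, and the code falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

# Pick the best available device, falling back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = nn.Linear(8, 2).to(device)         # 1. move the model
inputs = torch.randn(4, 8).to(device)      # 2. move the input batch

with torch.no_grad():
    gpu_tensor = model(inputs)             # the output lives on `device`

cpu_array = gpu_tensor.cpu().numpy()       # 3. bring the result back for NumPy
print(cpu_array.shape)
```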

Performance considerations when using GPUs:

Data transfer overhead: Moving data between CPU and GPU is relatively slow
Batch size: Larger batch sizes utilize GPU parallelism better, but may cause memory issues
Mixed precision: Using half-precision (FP16) can significantly accelerate training on modern GPUs
Memory management: Large models or datasets may require techniques like gradient checkpointing or model parallelism

Tensors: PyTorch's Core Data Structure

Tensor Basics

Tensors are multi-dimensional arrays similar to NumPy's ndarray, but with added GPU support and automatic differentiation capabilities:

import numpy as np
import torch

# Create a tensor directly
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

# Create a tensor from a NumPy array (the tensor shares memory with the array)
np_array = np.array([[1, 2], [3, 4]])
x = torch.from_numpy(np_array)

# Create special tensors
zeros = torch.zeros(2, 3)               # Tensor of zeros
ones = torch.ones(2, 3)                 # Tensor of ones
rand = torch.rand(2, 3)                 # Random uniform [0, 1)
randn = torch.randn(2, 3)               # Random normal (mean=0, var=1)
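One detail worth seeing in action: `torch.from_numpy` shares memory with the source array, while `torch.tensor` makes a copy. The short sketch below, with illustrative variable names, modifies the array afterwards to show the difference.

```python
import numpy as np
import torch

np_array = np.array([[1, 2], [3, 4]])
shared = torch.from_numpy(np_array)   # shares the array's memory
copied = torch.tensor(np_array)       # independent copy

np_array[0, 0] = 99                   # modify the NumPy array in place

print(shared[0, 0].item())            # reflects the change
print(copied[0, 0].item())            # keeps the original value
```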

Tensor Operations

PyTorch provides extensive operations for tensor manipulation:

# Basic arithmetic
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
c = a + b                         # Element-wise addition
d = a * b                         # Element-wise multiplication
e = torch.matmul(a, b)            # Dot product

# Reshaping
f = torch.randn(4, 4)
g = f.view(16)                    # Reshape to 1D tensor
h = f.view(-1, 8)                 # -1 dimension is inferred
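A caveat on reshaping: `view` never copies data, so it only works when the requested shape is compatible with the tensor's memory layout. The sketch below uses a transposed tensor, which is non-contiguous, to show `view` failing where `reshape` succeeds by copying.

```python
import torch

f = torch.arange(12).reshape(3, 4)
t = f.t()                      # transpose: same storage, non-contiguous layout
print(t.is_contiguous())

try:
    t.view(12)                 # view requires a compatible memory layout
    view_ok = True
except RuntimeError:
    view_ok = False

flat = t.reshape(12)           # reshape falls back to a copy when needed
print(view_ok, flat.tolist())
```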

Automatic Differentiation

The key feature distinguishing tensors from NumPy arrays is automatic differentiation through the autograd system:

x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1
y.backward()                      # Compute gradient dy/dx
print(x.grad)                     # dy/dx = 2x + 3 = 7 at x=2

This system enables the efficient computation of gradients for optimizing neural networks.
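Gradients accumulate in `.grad` across calls to `backward()`, which is why training loops reset them on every step. A small sketch using the same function as above:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Two backward passes without zeroing: the gradients add up
for _ in range(2):
    y = x**2 + 3*x + 1
    y.backward()
accumulated = x.grad.clone()

# Reset the gradient, then run a single backward pass
x.grad.zero_()
y = x**2 + 3*x + 1
y.backward()
fresh = x.grad.clone()

print(accumulated, fresh)
```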

Dataset Handling in PyTorch

PyTorch provides a standardized way to work with datasets through the Dataset and DataLoader classes.

Built-in Datasets

PyTorch's torchvision module includes many popular computer vision datasets:

import torch
import torchvision
import torchvision.transforms as transforms

# Define transformations
transform = transforms.Compose([
    transforms.ToTensor(),  # Convert images to tensors
    transforms.Normalize((0.5,), (0.5,))  # Normalize with mean and std
])

# Load MNIST dataset
train_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Create a DataLoader
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True
)

The transforms module allows for data preprocessing and augmentation:

# More complex transformation pipeline
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # Randomly flip images horizontally
    transforms.RandomRotation(10),      # Randomly rotate up to 10 degrees
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

DataLoader Features

The DataLoader class provides several important features:

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=32,        # Number of samples per batch
    shuffle=True,         # Shuffle the data
    num_workers=4,        # Number of worker subprocesses for data loading
    pin_memory=True       # Page-locked host memory for faster transfers to CUDA
)

Batching: Groups samples into batches for efficient processing
Shuffling: Randomizes the order of samples in each epoch
Parallelism: Loads data using multiple worker processes
Memory pinning: Optimizes memory transfers to CUDA devices
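For data that does not come from torchvision, a custom `Dataset` only needs `__len__` and `__getitem__`. The sketch below uses a hypothetical `ToyDataset` filled with synthetic values (both the class and its data are illustrative assumptions) and shows how the DataLoader batches it.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Synthetic features with a binary label (illustration only)."""

    def __init__(self, n_samples=100, n_features=8):
        g = torch.Generator().manual_seed(0)
        self.x = torch.randn(n_samples, n_features, generator=g)
        self.y = (self.x.sum(dim=1) > 0).long()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

loader = DataLoader(ToyDataset(), batch_size=32, shuffle=True, num_workers=0)

# 100 samples with batch_size=32: three full batches and one partial batch
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(batch_shapes)
```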

Building Neural Networks

Understanding nn.Module

The foundation of neural network models in PyTorch is the nn.Module class. All network architectures inherit from this base class:

import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

Key aspects of nn.Module:

Initialization: The __init__ method defines the layers and components
Forward pass: The forward method defines how data flows through the layers
Parameter tracking: All parameters (weights and biases) are automatically tracked
Module nesting: Modules can contain other modules for hierarchical designs
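Parameter tracking can be inspected directly. The sketch below defines a small hypothetical network (`SmallNet`, with sizes chosen to match an MNIST-style 784-128-10 layout) and lists what `nn.Module` registered automatically.

```python
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(self.fc1(x).relu())

model = SmallNet()

# Layers assigned as attributes are registered, along with their parameters
names = [name for name, _ in model.named_parameters()]
total = sum(p.numel() for p in model.parameters())

print(names)
print(total)  # 784*128 + 128 + 128*10 + 10
```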

nn.Sequential vs Custom nn.Module

nn.Sequential provides a container for a linear sequence of layers:

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

While nn.Sequential is concise, custom nn.Module subclasses offer several advantages:

Complex data flow: Support for skip connections, multiple inputs/outputs, etc.
Conditional computation: Dynamic behavior based on input or state
Reusable components: Define custom building blocks that can be reused
Programmatic creation: Create layers based on parameters or loops
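As an illustration of complex data flow, here is a minimal sketch of a skip connection, which `nn.Sequential` cannot express because the input is reused after the inner layers. The `ResidualBlock` name and its layer sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.fc2(self.relu(self.fc1(x)))
        return self.relu(out + x)   # skip connection: add the input back

block = ResidualBlock(16)
x = torch.randn(4, 16)
out = block(x)
print(out.shape)
```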

Example of programmatic layer creation with nn.Module:

class MLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size):
        super(MLP, self).__init__()
        
        # Create layers programmatically
        self.layers = nn.ModuleList()
        all_sizes = [input_size] + hidden_sizes + [output_size]
        
        for i in range(len(all_sizes) - 1):
            self.layers.append(nn.Linear(all_sizes[i], all_sizes[i+1]))
            if i < len(all_sizes) - 2:  # No activation after the last layer
                self.layers.append(nn.ReLU())
    
    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten
        for layer in self.layers:
            x = layer(x)
        return x

Layer Types

PyTorch provides various layer types for neural network construction:

# Linear (fully connected) layer
linear = nn.Linear(in_features, out_features, bias=True)

# Common activation functions
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()

# Batch normalization
batch_norm = nn.BatchNorm1d(num_features)

Parameters and Training

Loss Functions

Loss functions measure the error between predictions and ground truth. PyTorch provides many common loss functions:

# Binary classification
bce_loss = nn.BCELoss()                        # Binary Cross Entropy (requires sigmoid activation)
bce_with_logits = nn.BCEWithLogitsLoss()       # Combines sigmoid + BCELoss

# Multi-class classification
ce_loss = nn.CrossEntropyLoss()                # Combines LogSoftmax + NLLLoss
nll_loss = nn.NLLLoss()                        # Negative Log Likelihood (requires log-softmax activations)

# Regression
mse_loss = nn.MSELoss()                        # Mean Squared Error
l1_loss = nn.L1Loss()                          # Mean Absolute Error
smooth_l1 = nn.SmoothL1Loss()                  # Smooth L1 (matches Huber loss at the default beta=1)

CrossEntropyLoss and BCEWithLogitsLoss combine an activation function with the loss calculation. When using these, do not apply softmax or sigmoid to your model's output:

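import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

# Correct: pass raw logits to CrossEntropyLoss
outputs = model(inputs)  # Raw logits
loss = criterion(outputs, labels)

# Incorrect: softmax is applied here and again inside the loss
outputs = model(inputs)  # Raw logits
outputs = F.softmax(outputs, dim=1)  # Unnecessary softmax
loss = criterion(outputs, labels)    # Runs without error, but the loss and gradients are wrong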

The integrated approach improves numerical stability: the loss is computed in log space with the log-sum-exp trick instead of evaluating exp(x) directly, which overflows for large x.
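The effect of the extra softmax can be checked on a single hand-picked example. In the sketch below, the logits and label are arbitrary illustrative values; the loss on raw logits matches the explicit LogSoftmax + NLLLoss computation, while the double-softmax version gives a different number.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[4.0, 0.0, 0.0]])   # confident, correct prediction
labels = torch.tensor([0])

correct = criterion(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
double_softmax = criterion(F.softmax(logits, dim=1), labels)

print(correct.item(), manual.item(), double_softmax.item())
```

Applying softmax first squashes the logits into [0, 1], so the loss sees a much less confident prediction than the model actually made.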

Optimizers

Optimizers update model parameters based on gradients. The two most widely used in practice are:

import torch.optim as optim

# SGD (Stochastic Gradient Descent)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (Adaptive Moment Estimation)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

Optimizer selection guidelines:

SGD: Often preferred for simpler models or when generalization is crucial
Adam: Faster convergence for deep networks and complex tasks

Other optimizers have specific use cases:

RMSprop: Effective for recurrent neural networks
AdamW: Improved weight decay implementation compared to Adam
LBFGS: Quasi-Newton method that approximates second-order curvature information; best suited to small problems trained with full-batch updates
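To make the update rule concrete, the sketch below runs one plain SGD step (no momentum) on a toy loss, where the result can be checked by hand as w - lr * grad. The tensor values are arbitrary.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # gradient is 2w = [2, -4]
loss.backward()
optimizer.step()        # w <- w - 0.1 * grad

print(w.detach())
```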

Weight Initialization

Proper weight initialization is crucial for neural network training. PyTorch provides initialization functions in the nn.init module:

def init_weights(m):
    if isinstance(m, nn.Linear):
        # Kaiming/He initialization (good for ReLU)
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

# Apply to all layers
model.apply(init_weights)

Common initialization methods:

Xavier/Glorot: Suitable for tanh or sigmoid activations
Kaiming/He: Better for ReLU activations
Orthogonal: Helpful for recurrent networks

PyTorch uses a variant of Kaiming initialization by default for convolutional and linear layers.

The trailing underscore in kaiming_normal_() indicates an in-place operation, a PyTorch convention for functions that modify tensors directly rather than returning new ones. This approach is memory-efficient because it avoids allocating new memory for large tensors, particularly important during initialization of models with millions of parameters.
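As a quick sanity check, Kaiming normal initialization in fan_in mode with a ReLU gain should give weights with a standard deviation close to sqrt(2 / fan_in). The sketch below measures this on a layer whose sizes are chosen arbitrarily, and confirms that the initializer wrote into the existing weight tensor.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(1024, 512)          # fan_in = 1024
ptr_before = layer.weight.data_ptr()

nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

expected_std = math.sqrt(2.0 / 1024)
measured_std = layer.weight.std().item()
in_place = layer.weight.data_ptr() == ptr_before   # same memory, no new tensor

print(expected_std, measured_std, in_place)
```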

Model Training and Evaluation

Training Loop

A complete training loop in PyTorch follows this pattern:

def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, device):
    train_losses = []
    val_losses = []
    
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        running_loss = 0.0
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            # Zero the parameter gradients
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
        
        epoch_train_loss = running_loss / len(train_loader)
        train_losses.append(epoch_train_loss)
        
        # Validation phase
        model.eval()
        running_loss = 0.0
        
        with torch.no_grad():  # Disable gradient computation
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                running_loss += loss.item()
        
        epoch_val_loss = running_loss / len(val_loader)
        val_losses.append(epoch_val_loss)
        
        print(f'Epoch {epoch+1}/{num_epochs} | '
              f'Train Loss: {epoch_train_loss:.4f} | '
              f'Val Loss: {epoch_val_loss:.4f}')
    
    return train_losses, val_losses

Key components of the training loop:

Set the model to training mode with model.train()
Zero gradients before the forward pass with optimizer.zero_grad()
Compute the loss and perform backpropagation with loss.backward()
Update parameters with optimizer.step()
Set the model to evaluation mode with model.eval() during validation
Disable gradient calculation with torch.no_grad() during validation
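The training steps listed above can be exercised on a tiny problem that trains in well under a second. The sketch below fits a single linear layer to synthetic data generated from y = 3x + 1; the data, learning rate, and epoch count are illustrative assumptions, not recommendations.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: y = 3x + 1 plus a little noise
x = torch.randn(256, 1)
y = 3 * x + 1 + 0.1 * torch.randn(256, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):
    model.train()
    running_loss = 0.0
    for inputs, targets in loader:
        optimizer.zero_grad()                     # clear old gradients
        loss = criterion(model(inputs), targets)  # forward pass
        loss.backward()                           # backpropagation
        optimizer.step()                          # parameter update
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(loader))

print(epoch_losses[0], epoch_losses[-1])
print(model.weight.item(), model.bias.item())     # should approach 3 and 1
```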

Learning Rate Scheduling

Learning rate schedulers adjust the learning rate during training:

# Step learning rate scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Reduce learning rate on plateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

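# In the training loop, call exactly one of the two step() forms
for epoch in range(num_epochs):
    # Train for one epoch
    train(...)
    
    # Update the learning rate once per epoch, after the optimizer updates
    scheduler.step()            # For StepLR
    # scheduler.step(val_loss)  # For ReduceLROnPlateau: pass the monitored metric instead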
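To see what StepLR does to the learning rate, the sketch below records the rate at the start of each epoch. It uses a dummy parameter and a short `step_size` so the decay is visible within five epochs; these values are illustrative only.

```python
import torch

param = torch.zeros(1, requires_grad=True)       # stand-in for model parameters
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(5):
    lrs.append(optimizer.param_groups[0]["lr"])  # rate used during this epoch
    optimizer.step()                             # placeholder for a training epoch
    scheduler.step()                             # multiply lr by gamma every 2 epochs

print(lrs)
```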

Model Evaluation

To evaluate model performance, calculate metrics like accuracy, precision, recall, or F1-score:

def evaluate_model(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0
    
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            predicted = outputs.argmax(dim=1)  # Index of the highest logit per sample
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    accuracy = 100 * correct / total
    return accuracy
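The function above reports accuracy only. For the other metrics mentioned, a minimal sketch for the binary case follows; `binary_metrics` is a hypothetical helper written for this guide, and the example predictions are made up.

```python
import torch

def binary_metrics(predicted, labels):
    """Precision, recall, and F1 for 0/1 predictions (positive class = 1)."""
    tp = ((predicted == 1) & (labels == 1)).sum().item()
    fp = ((predicted == 1) & (labels == 0)).sum().item()
    fn = ((predicted == 0) & (labels == 1)).sum().item()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

predicted = torch.tensor([1, 0, 1, 1, 0, 1])
labels = torch.tensor([1, 0, 0, 1, 1, 1])

# 3 true positives, 1 false positive, 1 false negative
precision, recall, f1 = binary_metrics(predicted, labels)
print(precision, recall, f1)
```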

Visualizing and Analyzing Results

Learning Curves

Learning curves help identify overfitting or underfitting:

import matplotlib.pyplot as plt

def plot_learning_curves(train_losses, val_losses, train_accs=None, val_accs=None):
    epochs = range(1, len(train_losses) + 1)
    
    fig, ax1 = plt.subplots(figsize=(10, 6))
    
    # Plot losses
    color = 'tab:blue'
    ax1.set_xlabel('Epochs')
    ax1.set_ylabel('Loss', color=color)
    ax1.plot(epochs, train_losses, color=color, label='Training Loss')
    ax1.plot(epochs, val_losses, color='tab:orange', label='Validation Loss')
    ax1.tick_params(axis='y', labelcolor=color)
    
    # Plot accuracies if provided
    if train_accs and val_accs:
        ax2 = ax1.twinx()
        color = 'tab:red'
        ax2.set_ylabel('Accuracy (%)', color=color)
        ax2.plot(epochs, train_accs, color=color, linestyle='--', label='Training Acc')
        ax2.plot(epochs, val_accs, color='tab:green', linestyle='--', label='Validation Acc')
        ax2.tick_params(axis='y', labelcolor=color)
    
    fig.tight_layout()
    fig.legend(loc='upper right', bbox_to_anchor=(1,1), bbox_transform=ax1.transAxes)
    plt.title('Training and Validation Metrics')
    plt.show()

Saving and Loading Models

Saving and loading models in PyTorch is essential for preserving training progress and deploying models.

Basic Model Saving

The simplest way to save a model is to save its state dictionary:

# Save model state dictionary
torch.save(model.state_dict(), 'model.pth')

# Load model state dictionary
model = MyModel()  # Create an instance of the model
model.load_state_dict(torch.load('model.pth'))
model.eval()  # Set to evaluation mode

Comprehensive Checkpointing

For more comprehensive checkpointing that allows resuming training:

def save_checkpoint(model, optimizer, epoch, scheduler, best_accuracy, filepath):
    """Save model checkpoint with all training state."""
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
        'best_accuracy': best_accuracy
    }
    torch.save(checkpoint, filepath)
    print(f"Checkpoint saved at {filepath}")

def load_checkpoint(model, optimizer, scheduler, filepath):
    """Load model checkpoint with all training state."""
    checkpoint = torch.load(filepath)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    # The saved scheduler state is None when no scheduler was used at save time
    if scheduler and checkpoint.get('scheduler_state_dict') is not None:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    epoch = checkpoint['epoch']
    best_accuracy = checkpoint.get('best_accuracy', 0)
    print(f"Checkpoint loaded from {filepath} (epoch {epoch})")
    return epoch, best_accuracy
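A quick way to gain confidence in checkpointing code is a round trip: save, load into a freshly initialized model, and compare the weights. The sketch below does this in a temporary directory with a small stand-in model; the file name and layer sizes are arbitrary.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save({
        "epoch": 3,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

    restored = nn.Linear(4, 2)                         # fresh random weights
    checkpoint = torch.load(path, map_location="cpu")  # load onto CPU regardless of origin
    restored.load_state_dict(checkpoint["model_state_dict"])

same_weights = torch.equal(model.weight, restored.weight)
print(checkpoint["epoch"], same_weights)
```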

Including Variables in Filenames

Including timestamp and performance metrics in filenames helps organize checkpoints:

import datetime

def get_checkpoint_filename(model_name, epoch, accuracy=None):
    """Generate a checkpoint filename with timestamp and metrics."""
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    if accuracy is not None:
        return f"{model_name}_{timestamp}_epoch{epoch}_acc{accuracy:.2f}.pth"
    else:
        return f"{model_name}_{timestamp}_epoch{epoch}.pth"

# Usage in training loop
for epoch in range(start_epoch, num_epochs):
    # Training code...
    
    # Save checkpoint periodically
    if (epoch + 1) % 10 == 0:
        filepath = get_checkpoint_filename("resnet18", epoch, val_accuracy)
        save_checkpoint(model, optimizer, epoch, scheduler, best_accuracy, filepath)

This approach organizes checkpoints with relevant information for easy identification.

TorchScript for Deployment

TorchScript is a way to serialize and optimize PyTorch models for production deployment:

# Convert to TorchScript using tracing
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")

# Or using scripting (preferred for models with control flow)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

print("Model saved for deployment")

# Loading a TorchScript model
loaded_model = torch.jit.load("model_scripted.pt")

TorchScript offers several advantages for deployment:

Language independence: Run in C++ environments without Python
Optimization: Optimize the model for inference performance
Portability: Deploy to various platforms, including mobile devices
Graph-level optimizations: Fuse operations for better performance

TorchScript models can be:

Used in production environments where Python is not available
Integrated into larger applications written in C++
Deployed on resource-constrained devices
Run with optimized inference performance

Inside PyTorch: Exploring the Source Code

Understanding PyTorch's internal implementation provides deeper insights into how it achieves both high performance and flexibility.

Finding Source Code

For installed packages, find the source file that defines a class with:

import torch
import torch.nn as nn
import inspect

# Find the source file of a class
print(inspect.getsourcefile(nn.Linear))

Linear Layer Implementation

Let's examine the key components of the nn.Linear implementation:

# Simplified version of nn.Linear's core functionality
def linear_forward(input, weight, bias=None):
    output = input.matmul(weight.t())
    if bias is not None:
        output += bias
    return output

The actual PyTorch implementation includes additional optimizations and special cases, but the core operation is a matrix multiplication followed by a bias addition. In nn.Linear itself, forward delegates to torch.nn.functional.linear, which dispatches to optimized C++/CUDA kernels that perform the actual computation.
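The simplified function can be checked against the real layer. The sketch below repeats it (with an out-of-place bias addition) and compares its output to `nn.Linear` on random input; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

def linear_forward(input, weight, bias=None):
    # weight has shape (out_features, in_features), hence the transpose
    output = input.matmul(weight.t())
    if bias is not None:
        output = output + bias
    return output

torch.manual_seed(0)
layer = nn.Linear(5, 3)
x = torch.randn(4, 5)

matches = torch.allclose(linear_forward(x, layer.weight, layer.bias), layer(x), atol=1e-6)
print(matches)
```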

Loss Function Implementation

The CrossEntropyLoss combines log softmax and negative log-likelihood:

import torch.nn.functional as F

# Simplified version of CrossEntropyLoss's core functionality
def cross_entropy_loss(input, target, weight=None, reduction='mean'):
    log_probs = F.log_softmax(input, dim=1)  # Log-probabilities over the class dimension
    loss = F.nll_loss(log_probs, target, weight=weight, reduction=reduction)
    return loss

The combined implementation avoids numerical instability issues that could arise from separate softmax and log operations. For large values, direct computation of softmax can lead to overflow, while the combined approach uses log-sum-exp tricks to maintain numerical stability.
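The stability issue is easy to reproduce with one extreme logit. In the sketch below, exponentiating directly overflows, taking the log of a softmax output yields negative infinity, and `log_softmax` stays finite. The last lines redo the log-sum-exp trick by hand as an illustration; this is not PyTorch's actual kernel.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0]])

raw_exp = logits.exp()                            # exp(1000) overflows to inf
naive = torch.log(torch.softmax(logits, dim=1))   # second entry: log(0) = -inf
stable = F.log_softmax(logits, dim=1)             # finite everywhere

# Log-sum-exp by hand: subtract the row maximum before exponentiating
shifted = logits - logits.max(dim=1, keepdim=True).values
manual = shifted - shifted.exp().sum(dim=1, keepdim=True).log()

print(raw_exp, naive, stable, manual)
```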

Autograd Implementation

The automatic differentiation system is built around the concept of a computational graph:

Forward Pass: Tensors flow through operations, recording the computation history
Backward Pass: Gradients are computed by applying the chain rule, flowing backward through the graph

Each operation in PyTorch implements both a forward function and a backward function that defines how gradients propagate. The C++ backend implements these operations efficiently while the Python frontend provides the user interface.
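The forward/backward pairing can be seen by writing a custom operation with `torch.autograd.Function`. The `Square` class below is a toy example written for this guide; built-in operations follow the same pattern but are implemented in C++.

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # keep what backward will need
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output    # chain rule: d(x^2)/dx times upstream gradient

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.backward()

print(x.grad)
```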

PyTorch Conventions and Patterns

PyTorch follows several conventions that are helpful to understand:

Naming Conventions

In-place operations: Functions ending with an underscore (tensor.add_()) modify the tensor in place instead of returning a new one
Parameter classes: nn.Parameter wraps a tensor as a learnable parameter, and containers whose names start with Parameter (nn.ParameterList, nn.ParameterDict) hold collections of them
Module hooks: Methods with hook in the name (for example register_forward_hook) attach callbacks that intercept forward or backward passes
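A short sketch of these conventions in use. The callback name `save_output_shape` and the `captured` dictionary are illustrative choices, not PyTorch names.

```python
import torch
import torch.nn as nn

# In-place vs out-of-place
a = torch.ones(3)
ptr_before = a.data_ptr()
b = a.add(1)                    # returns a new tensor, a is unchanged
a.add_(1)                       # modifies a in its existing memory
same_storage = a.data_ptr() == ptr_before

# nn.Parameter marks a tensor as learnable
weight = nn.Parameter(torch.zeros(2, 2))

# A forward hook observes a layer's output without changing the layer
layer = nn.Linear(4, 2)
captured = {}

def save_output_shape(module, inputs, output):
    captured["shape"] = tuple(output.shape)

handle = layer.register_forward_hook(save_output_shape)
layer(torch.randn(3, 4))
handle.remove()                 # detach the hook when done

print(same_storage, weight.requires_grad, captured)
```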

Tensor Dimension Ordering

PyTorch typically follows this dimension ordering convention:

Batch dimension first (N)
For images: [N, C, H, W] (batch, channels, height, width)
For sequences: [N, L, F] (batch, sequence length, features) when batch_first=True; recurrent layers such as nn.LSTM default to [L, N, F]
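Data loaded from other libraries is often channels-last, so a `permute` is needed before passing it to a convolution. The sketch below uses zero-filled placeholder tensors and arbitrary layer sizes to show the expected shapes for images and for a batch-first sequence.

```python
import torch
import torch.nn as nn

# Images stored as [N, H, W, C] must be rearranged to [N, C, H, W]
images_nhwc = torch.zeros(8, 32, 32, 3)
images_nchw = images_nhwc.permute(0, 3, 1, 2)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
conv_out = conv(images_nchw)

# Sequences as [N, L, F] require batch_first=True on recurrent layers
sequences = torch.zeros(8, 5, 10)
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
lstm_out, _ = lstm(sequences)

print(conv_out.shape, lstm_out.shape)
```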

The .detach() Method

The detach() method disconnects a tensor from the computation graph:

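# Create a tensor requiring gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2

# Detach y from the computation graph
z = y.detach()
print(z.requires_grad)  # False: z is not tracked by autograd

# Operations on z are not recorded, so they cannot contribute to x.grad
w = z * 3
# Calling w.sum().backward() here would raise a RuntimeError, because w has no grad_fn

# Gradients still flow through the original, attached tensor
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.])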

This is useful when you want to use a tensor's values without tracking its computational history. Common use cases include:

Preventing gradient flow through certain parts of a network
Using intermediate results for visualization or logging without affecting gradients
Converting tensors to NumPy arrays for interoperability with other libraries
Implementing algorithms that require stopping gradient propagation, like GANs or certain reinforcement learning techniques
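The NumPy use case is the one most people meet first: calling `.numpy()` on a tensor that requires gradients raises an error, and `detach()` is the fix. A minimal sketch:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2

try:
    y.numpy()                  # not allowed while y is tracked by autograd
    direct_ok = True
except RuntimeError:
    direct_ok = False

values = y.detach().numpy()    # detach first, then convert
print(direct_ok, values)
```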