HW 8 — Guide
PyTorch for Deep Learning
What is PyTorch?
PyTorch is an open-source machine learning framework developed by Facebook's AI Research lab (FAIR). It evolved from Torch, a scientific computing framework built in Lua. PyTorch reimplemented Torch's core functionality in Python while adding automatic differentiation capabilities.
The fundamental distinction between PyTorch and frameworks like early TensorFlow lies in their computational graph approach. PyTorch uses a dynamic computational graph ("define-by-run"), where the graph is constructed on-the-fly during execution. This offers greater flexibility for debugging and developing complex models.
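To make "define-by-run" concrete, here is a minimal sketch in which ordinary Python control flow decides which operations are recorded for each call. The function name `piecewise` and the input values are made up for illustration.

```python
import torch

def piecewise(x: torch.Tensor) -> torch.Tensor:
    # Plain Python branching: only the branch that runs is recorded in the graph.
    if x.item() > 0:
        return x ** 2      # graph for this call: y = x^2
    return -3 * x          # graph for this call: y = -3x

x_pos = torch.tensor(2.0, requires_grad=True)
x_neg = torch.tensor(-2.0, requires_grad=True)

piecewise(x_pos).backward()
piecewise(x_neg).backward()

print(x_pos.grad)  # derivative of x^2 evaluated at x = 2
print(x_neg.grad)  # derivative of -3x
```

Because the graph is rebuilt on every call, the two inputs are differentiated through different operations without any special graph-construction API.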
PyTorch uses a hybrid architecture: the frontend is Python for ease of use and rapid development, while the computational backend is implemented in C++ and CUDA for performance. This architecture provides:
- Python's flexibility and ecosystem integration
- C++'s execution speed for computation-intensive operations
- CUDA's parallel computing capabilities for GPU acceleration
The design focuses on:
- Tensor computation with strong GPU acceleration
- Automatic differentiation for building and training neural networks
- Deep neural network APIs built on a tape-based autograd system
This hybrid approach resolves the apparent paradox of implementing computationally intensive tasks in Python. While Python itself is relatively slow, the actual numeric computations in PyTorch are performed by optimized C++/CUDA code with minimal Python overhead.
GPU Acceleration
Modern deep learning relies heavily on GPU computing to accelerate matrix operations. PyTorch provides built-in support for NVIDIA GPUs through CUDA and for Apple Silicon hardware through Metal Performance Shaders (MPS).
    import torch

    # Prefer CUDA, then Apple Silicon (MPS), then fall back to the CPU
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    print(f"Using device: {device}")
GPU acceleration can make training large neural networks far faster than CPU-only training. Speedups of 10-100x are typical, although the actual gain depends on the model, the batch size, and the hardware. This speedup comes from parallelizing specific operations:
- Matrix multiplications in linear layers see massive parallelization benefits
- Convolutional operations are highly optimized for GPU execution
- Batch processing allows parallel handling of multiple samples
Not all operations benefit equally: recurrent neural networks (RNNs) have sequential dependencies that limit parallelization, and reinforcement learning algorithms with sequential decision processes may see less dramatic speedups. Architectures such as Transformers replace recurrence with attention, which processes all sequence positions in parallel and therefore maps well onto GPUs.
To use GPU acceleration effectively:
- Move the model to the GPU:
      model = model.to(device)
- Move input data to the GPU before forward passes:
      inputs, labels = inputs.to(device), labels.to(device)
- Retrieve CPU tensors when needed (e.g., for NumPy operations or visualization):
      cpu_tensor = gpu_tensor.cpu()
Performance considerations when using GPUs:
- Data transfer overhead: Moving data between CPU and GPU is relatively slow
- Batch size: Larger batch sizes utilize GPU parallelism better, but may cause memory issues
- Mixed precision: Using half-precision (FP16) can significantly accelerate training on modern GPUs
- Memory management: Large models or datasets may require techniques like gradient checkpointing or model parallelism
Tensors: PyTorch's Core Data Structure
Tensor Basics
Tensors are multi-dimensional arrays similar to NumPy's ndarray, but with added GPU support and automatic differentiation capabilities:
    import numpy as np
    import torch

    # Create a tensor directly
    x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

    # Create a tensor from a NumPy array
    np_array = np.array([[1, 2], [3, 4]])
    x = torch.from_numpy(np_array)

    # Create special tensors
    zeros = torch.zeros(2, 3)  # Tensor of zeros
    ones = torch.ones(2, 3)    # Tensor of ones
    rand = torch.rand(2, 3)    # Random uniform [0, 1)
    randn = torch.randn(2, 3)  # Random normal (mean=0, var=1)
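One detail worth knowing about the NumPy bridge is that `torch.from_numpy` shares memory with the source array, whereas `torch.tensor` copies the data. The sketch below illustrates the difference; the explicit `dtype=np.int64` is an assumption added so the resulting dtype is the same on every platform.

```python
import numpy as np
import torch

np_array = np.array([[1, 2], [3, 4]], dtype=np.int64)

shared = torch.from_numpy(np_array)   # shares memory with np_array
copied = torch.tensor(np_array)       # copies the data

np_array[0, 0] = 99                   # modify the NumPy array in place

print(shared[0, 0].item())  # reflects the change made through NumPy
print(copied[0, 0].item())  # keeps the original value
print(shared.dtype)         # dtype is inherited from the NumPy array
```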
Tensor Operations
PyTorch provides extensive operations for tensor manipulation:
    import torch

    # Basic arithmetic
    a = torch.tensor([1, 2, 3])
    b = torch.tensor([4, 5, 6])
    c = a + b               # Element-wise addition
    d = a * b               # Element-wise multiplication
    e = torch.matmul(a, b)  # Dot product (for two 1-D tensors)

    # Reshaping
    f = torch.randn(4, 4)
    g = f.view(16)     # Reshape to 1D tensor
    h = f.view(-1, 8)  # -1 dimension is inferred
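A caveat for the reshaping calls above: `view` only works when the requested shape is compatible with the tensor's memory layout. The following sketch, using a transposed tensor as an assumed example, shows `view` failing and `reshape` (or `contiguous()` followed by `view`) succeeding.

```python
import torch

f = torch.arange(16).view(4, 4)
ft = f.t()                      # transpose: same storage, different strides

print(ft.is_contiguous())       # the transposed tensor is not contiguous

try:
    ft.view(16)                 # view cannot reinterpret this layout
except RuntimeError as err:
    print("view failed:", type(err).__name__)

flat = ft.reshape(16)                  # reshape copies when a view is impossible
also_flat = ft.contiguous().view(16)   # explicit copy, then view
```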
Automatic Differentiation
The key feature distinguishing tensors from NumPy arrays is automatic differentiation through the autograd system:
    import torch

    x = torch.tensor([2.0], requires_grad=True)
    y = x**2 + 3*x + 1
    y.backward()   # Compute gradient dy/dx
    print(x.grad)  # dy/dx = 2x + 3 = 7 at x=2
This system enables the efficient computation of gradients for optimizing neural networks.
Dataset Handling in PyTorch
PyTorch provides a standardized way to work with datasets through the Dataset and DataLoader classes.
Built-in Datasets
PyTorch's torchvision module includes many popular computer vision datasets:
    import torch
    import torchvision
    import torchvision.transforms as transforms

    # Define transformations
    transform = transforms.Compose([
        transforms.ToTensor(),                # Convert images to tensors
        transforms.Normalize((0.5,), (0.5,))  # Normalize with mean and std
    ])

    # Load MNIST dataset
    train_dataset = torchvision.datasets.MNIST(
        root='./data',
        train=True,
        download=True,
        transform=transform
    )

    # Create a DataLoader
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=64,
        shuffle=True
    )
The transforms module allows for data preprocessing and augmentation:
    import torchvision.transforms as transforms

    # More complex transformation pipeline
    transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),  # Randomly flip images horizontally
        transforms.RandomRotation(10),      # Randomly rotate up to 10 degrees
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,))
    ])
DataLoader Features
The DataLoader class provides several important features:
    # dataset is any Dataset instance, such as train_dataset above
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=32,    # Number of samples per batch
        shuffle=True,     # Reshuffle the data every epoch
        num_workers=4,    # Number of worker processes used for loading
        pin_memory=True   # Pin host memory for faster CPU-to-GPU transfers
    )
- Batching: Groups samples into batches for efficient processing
- Shuffling: Randomizes the order of samples in each epoch
- Parallelism: Loads data using multiple worker processes
- Memory pinning: Optimizes memory transfers to CUDA devices
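For data that is not covered by a built-in dataset, the `Dataset` interface can be implemented directly. The sketch below uses a hypothetical toy dataset, `SquaresDataset`, to show the two methods a map-style dataset needs and how `DataLoader` batches it.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Hypothetical toy dataset: pairs (x, x^2) for x = 0..n-1."""

    def __init__(self, n: int):
        self.x = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # shape [n, 1]
        self.y = self.x ** 2

    def __len__(self) -> int:
        return self.x.size(0)

    def __getitem__(self, idx: int):
        return self.x[idx], self.y[idx]

dataset = SquaresDataset(10)
loader = DataLoader(dataset, batch_size=4, shuffle=False)  # single-process loading

batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(batch_shapes)  # the last batch is smaller because 10 is not a multiple of 4
```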
Building Neural Networks
Understanding nn.Module
The foundation of neural network models in PyTorch is the nn.Module class. All network architectures inherit from this base class:
    import torch.nn as nn

    class SimpleNN(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super().__init__()
            self.flatten = nn.Flatten()
            self.fc1 = nn.Linear(input_size, hidden_size)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(hidden_size, output_size)

        def forward(self, x):
            x = self.flatten(x)
            x = self.fc1(x)
            x = self.relu(x)
            x = self.fc2(x)
            return x
Key aspects of nn.Module:
- Initialization: The __init__ method defines the layers and components
- Forward pass: The forward method defines how data flows through the layers
- Parameter tracking: All parameters (weights and biases) are automatically tracked
- Module nesting: Modules can contain other modules for hierarchical designs
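Parameter tracking can be observed directly: layers assigned as attributes in `__init__` are registered, and their weights show up in `named_parameters()`. This is a small sketch with a hypothetical two-layer module named `TinyNet`.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 3)   # 4*3 weights + 3 biases = 15 parameters
        self.fc2 = nn.Linear(3, 2)   # 3*2 weights + 2 biases = 8 parameters

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet()

names = [name for name, _ in model.named_parameters()]
total = sum(p.numel() for p in model.parameters())

print(names)
print(total)
```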
nn.Sequential vs Custom nn.Module
nn.Sequential provides a container for a linear sequence of layers:
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 10)
    )
While nn.Sequential is concise, custom nn.Module subclasses offer several advantages:
- Complex data flow: Support for skip connections, multiple inputs/outputs, etc.
- Conditional computation: Dynamic behavior based on input or state
- Reusable components: Define custom building blocks that can be reused
- Programmatic creation: Create layers based on parameters or loops
Example of programmatic layer creation with nn.Module:
    class MLP(nn.Module):
        def __init__(self, input_size, hidden_sizes, output_size):
            super().__init__()

            # Create layers programmatically
            self.layers = nn.ModuleList()
            all_sizes = [input_size] + hidden_sizes + [output_size]

            for i in range(len(all_sizes) - 1):
                self.layers.append(nn.Linear(all_sizes[i], all_sizes[i+1]))
                if i < len(all_sizes) - 2:  # No activation after the last layer
                    self.layers.append(nn.ReLU())

        def forward(self, x):
            x = x.view(x.size(0), -1)  # Flatten
            for layer in self.layers:
                x = layer(x)
            return x
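The "complex data flow" advantage can be sketched with a skip connection, where the input is added back to the output of two layers. A plain `nn.Sequential` cannot express this. The class name `ResidualBlock` and its layer sizes are illustrative assumptions, not part of the assignment.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hypothetical block computing relu(x + F(x))."""

    def __init__(self, features: int):
        super().__init__()
        self.fc1 = nn.Linear(features, features)
        self.fc2 = nn.Linear(features, features)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x                      # keep the input for the skip connection
        out = self.relu(self.fc1(x))
        out = self.fc2(out)
        return self.relu(out + residual)  # add the input back before the activation

block = ResidualBlock(8)
x = torch.randn(5, 8)
y = block(x)
print(y.shape)
```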
Layer Types
PyTorch provides various layer types for neural network construction:
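    import torch.nn as nn

    # Example sizes (placeholders; choose values that match your data)
    in_features, out_features, num_features = 784, 128, 128

    # Linear (fully connected) layer
    linear = nn.Linear(in_features, out_features, bias=True)

    # Common activation functions
    relu = nn.ReLU()
    leaky_relu = nn.LeakyReLU(negative_slope=0.01)
    sigmoid = nn.Sigmoid()
    tanh = nn.Tanh()

    # Batch normalization
    batch_norm = nn.BatchNorm1d(num_features)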
Parameters and Training
Loss Functions
Loss functions measure the error between predictions and ground truth. PyTorch provides many common loss functions:
    import torch.nn as nn

    # Binary classification
    bce_loss = nn.BCELoss()                   # Binary Cross Entropy (requires sigmoid activation)
    bce_with_logits = nn.BCEWithLogitsLoss()  # Combines sigmoid + BCELoss

    # Multi-class classification
    ce_loss = nn.CrossEntropyLoss()  # Combines LogSoftmax + NLLLoss
    nll_loss = nn.NLLLoss()          # Negative Log Likelihood (requires log-softmax activations)

    # Regression
    mse_loss = nn.MSELoss()         # Mean Squared Error
    l1_loss = nn.L1Loss()           # Mean Absolute Error
    smooth_l1 = nn.SmoothL1Loss()   # Smooth L1 (matches Huber loss at the default beta=1)
CrossEntropyLoss and BCEWithLogitsLoss combine an activation function with the loss calculation. When using these, do not apply softmax or sigmoid to your model's output:
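    import torch.nn as nn
    import torch.nn.functional as F

    criterion = nn.CrossEntropyLoss()

    # Correct: pass raw logits to CrossEntropyLoss
    outputs = model(inputs)            # Raw logits
    loss = criterion(outputs, labels)

    # NOT recommended: softmax ends up being applied twice
    outputs = model(inputs)            # Raw logits
    probs = F.softmax(outputs, dim=1)  # Unnecessary softmax
    loss = criterion(probs, labels)    # Incorrect: the loss expects logits, not probabilities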
The integrated approach improves numerical stability by avoiding operations like exp(x) for large x values.
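To see why the extra softmax matters, the following sketch compares the two calls on a single made-up sample. The logits and target are arbitrary illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

logits = torch.tensor([[2.0, 0.0]])   # made-up logits for one sample
target = torch.tensor([0])            # true class is 0

loss_correct = criterion(logits, target)                    # logits in, as intended
loss_double = criterion(F.softmax(logits, dim=1), target)   # softmax applied twice

print(loss_correct.item())
print(loss_double.item())
```

The second value is larger even though the prediction is the same, because squashing the logits into [0, 1] before the loss flattens the differences between classes.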
Optimizers
Optimizers update model parameters based on gradients. The two most widely used in practice are:
    import torch.optim as optim

    # SGD (Stochastic Gradient Descent)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Adam (Adaptive Moment Estimation)
    optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
Optimizer selection guidelines:
- SGD: Often preferred for simpler models or when generalization is crucial
- Adam: Faster convergence for deep networks and complex tasks
Other optimizers have specific use cases:
- RMSprop: Effective for recurrent neural networks
- AdamW: Decouples weight decay from the gradient-based update, which fixes how Adam handles L2 regularization
- LBFGS: Quasi-Newton method that approximates second-order (curvature) information; useful for smaller datasets
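What an optimizer step does can be checked by hand for the simplest case. This sketch assumes plain SGD without momentum and a toy loss, and compares `optimizer.step()` against the manual update w - lr * grad.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)        # plain SGD, no momentum

loss = (w ** 2).sum()                     # toy loss; its gradient is 2 * w
loss.backward()

expected = w.detach() - 0.1 * w.grad      # manual update, computed before stepping
optimizer.step()                          # the optimizer applies the same rule in place

print(w.detach())
print(expected)
```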
Weight Initialization
Proper weight initialization is crucial for neural network training. PyTorch provides initialization functions in the nn.init module:
    def init_weights(m):
        if isinstance(m, nn.Linear):
            # Kaiming/He initialization (good for ReLU)
            nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)

    # Apply to all layers
    model.apply(init_weights)
Common initialization methods:
- Xavier/Glorot: Suitable for tanh or sigmoid activations
- Kaiming/He: Better for ReLU activations
- Orthogonal: Helpful for recurrent networks
PyTorch uses a variant of Kaiming initialization by default for convolutional and linear layers.
The trailing underscore in kaiming_normal_() indicates an in-place operation—a PyTorch convention for functions that modify tensors directly rather than returning new ones. This approach is memory-efficient because it avoids allocating new memory for large tensors, particularly important during initialization of models with millions of parameters.
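The difference between in-place and out-of-place operations can be observed through the address of a tensor's underlying storage. This is a minimal sketch using `add_` and `add` as stand-ins for any such pair.

```python
import torch

t = torch.zeros(3)
ptr_before = t.data_ptr()      # address of the underlying storage

t.add_(1)                      # in-place: same storage, values updated
same_storage = t.data_ptr() == ptr_before

u = t.add(1)                   # out-of-place: the result lives in a new tensor
new_storage = u.data_ptr() != t.data_ptr()

print(t, u, same_storage, new_storage)
```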
Model Training and Evaluation
Training Loop
A complete training loop in PyTorch follows this pattern:
    import torch

    def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, device):
        train_losses = []
        val_losses = []

        for epoch in range(num_epochs):
            # Training phase
            model.train()
            running_loss = 0.0

            for inputs, labels in train_loader:
                inputs, labels = inputs.to(device), labels.to(device)

                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass
                outputs = model(inputs)
                loss = criterion(outputs, labels)

                # Backward pass and optimize
                loss.backward()
                optimizer.step()

                running_loss += loss.item()

            # Average loss per batch over the epoch
            epoch_train_loss = running_loss / len(train_loader)
            train_losses.append(epoch_train_loss)

            # Validation phase
            model.eval()
            running_loss = 0.0

            with torch.no_grad():  # Disable gradient computation
                for inputs, labels in val_loader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    running_loss += loss.item()

            epoch_val_loss = running_loss / len(val_loader)
            val_losses.append(epoch_val_loss)

            print(f'Epoch {epoch+1}/{num_epochs} | '
                  f'Train Loss: {epoch_train_loss:.4f} | '
                  f'Val Loss: {epoch_val_loss:.4f}')

        return train_losses, val_losses
Key components of the training loop:
- Set the model to training mode with model.train()
- Zero gradients before the forward pass with optimizer.zero_grad()
- Compute the loss and perform backpropagation with loss.backward()
- Update parameters with optimizer.step()
- Set the model to evaluation mode with model.eval() during validation
- Disable gradient calculation with torch.no_grad() during validation
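The reason gradients have to be zeroed on every iteration is that `backward()` adds to any existing `.grad` value rather than overwriting it. A minimal sketch with a single scalar parameter:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()          # d(w^2)/dw = 2w = 4

(w ** 2).backward()            # no reset in between
accumulated = w.grad.item()    # 4 + 4 = 8: gradients were summed

w.grad.zero_()                 # manual reset of this one gradient
(w ** 2).backward()
after_reset = w.grad.item()    # back to 4

print(first, accumulated, after_reset)
```

In a training loop, `optimizer.zero_grad()` performs this reset for every parameter the optimizer manages.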
Learning Rate Scheduling
Learning rate schedulers adjust the learning rate during training:
    from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau

    # Option 1: multiply the learning rate by gamma every step_size epochs
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

    # Option 2: reduce the learning rate when the monitored metric stops improving
    # scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

    # In the training loop
    for epoch in range(num_epochs):
        # Train for one epoch (placeholder call)
        train(...)

        # Update the learning rate once per epoch
        scheduler.step()            # For StepLR
        # scheduler.step(val_loss)  # For ReduceLROnPlateau
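The effect of `StepLR` can be traced without a real model. The sketch below uses a dummy parameter and assumed settings (`step_size=2`, `gamma=0.1`) and records the learning rate used in each epoch.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

param = torch.zeros(1, requires_grad=True)       # dummy parameter
optimizer = optim.SGD([param], lr=0.1)
scheduler = StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used this epoch
    optimizer.step()                             # stands in for a real training epoch
    scheduler.step()                             # advance the schedule once per epoch

print(lrs)  # the rate drops by a factor of 10 after every 2 epochs
```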
Model Evaluation
To evaluate model performance, calculate metrics like accuracy, precision, recall, or F1-score:
    import torch

    def evaluate_model(model, test_loader, device):
        model.eval()
        correct = 0
        total = 0

        with torch.no_grad():
            for inputs, labels in test_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                predicted = outputs.argmax(dim=1)  # Index of the highest logit per sample
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        return accuracy
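The function above reports accuracy only. Precision, recall, and F1 can be derived from the same predicted and true labels. The sketch below covers the binary case with a hypothetical helper `binary_prf` and made-up label tensors.

```python
import torch

def binary_prf(predicted: torch.Tensor, labels: torch.Tensor):
    """Precision, recall, and F1 for class 1, given tensors of 0/1 values."""
    tp = ((predicted == 1) & (labels == 1)).sum().item()  # true positives
    fp = ((predicted == 1) & (labels == 0)).sum().item()  # false positives
    fn = ((predicted == 0) & (labels == 1)).sum().item()  # false negatives

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

predicted = torch.tensor([1, 1, 0, 1, 0, 0])
labels = torch.tensor([1, 0, 0, 1, 1, 0])

precision, recall, f1 = binary_prf(predicted, labels)
print(precision, recall, f1)
```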
Visualizing and Analyzing Results
Learning Curves
Learning curves help identify overfitting or underfitting:
    import matplotlib.pyplot as plt

    def plot_learning_curves(train_losses, val_losses, train_accs=None, val_accs=None):
        epochs = range(1, len(train_losses) + 1)

        fig, ax1 = plt.subplots(figsize=(10, 6))

        # Plot losses
        color = 'tab:blue'
        ax1.set_xlabel('Epochs')
        ax1.set_ylabel('Loss', color=color)
        ax1.plot(epochs, train_losses, color=color, label='Training Loss')
        ax1.plot(epochs, val_losses, color='tab:orange', label='Validation Loss')
        ax1.tick_params(axis='y', labelcolor=color)

        # Plot accuracies on a second y-axis if provided
        if train_accs and val_accs:
            ax2 = ax1.twinx()
            color = 'tab:red'
            ax2.set_ylabel('Accuracy (%)', color=color)
            ax2.plot(epochs, train_accs, color=color, linestyle='--', label='Training Acc')
            ax2.plot(epochs, val_accs, color='tab:green', linestyle='--', label='Validation Acc')
            ax2.tick_params(axis='y', labelcolor=color)

        fig.tight_layout()
        fig.legend(loc='upper right', bbox_to_anchor=(1, 1), bbox_transform=ax1.transAxes)
        plt.title('Training and Validation Metrics')
        plt.show()
Saving and Loading Models
Saving and loading models in PyTorch is essential for preserving training progress and deploying models.
Basic Model Saving
The simplest way to save a model is to save its state dictionary:
    # Save model state dictionary
    torch.save(model.state_dict(), 'model.pth')

    # Load model state dictionary
    model = MyModel()  # Create an instance of the model (MyModel is a placeholder class name)
    model.load_state_dict(torch.load('model.pth'))
    model.eval()  # Set to evaluation mode
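A save and load round trip can be verified without touching the disk. This sketch assumes a small `nn.Linear` model and uses an in-memory buffer in place of the 'model.pth' file.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(3, 2)

buffer = io.BytesIO()                     # in-memory stand-in for 'model.pth'
torch.save(model.state_dict(), buffer)
buffer.seek(0)                            # rewind before reading

restored = nn.Linear(3, 2)                # same architecture, fresh random weights
restored.load_state_dict(torch.load(buffer))
restored.eval()

x = torch.randn(4, 3)
same = torch.allclose(model(x), restored(x))
print(same)
```

The state dictionary holds only tensors keyed by parameter name, so the receiving model must be constructed with the same architecture before loading.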
Comprehensive Checkpointing
For more comprehensive checkpointing that allows resuming training:
    import torch

    def save_checkpoint(model, optimizer, epoch, scheduler, best_accuracy, filepath):
        """Save model checkpoint with all training state."""
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
            'best_accuracy': best_accuracy
        }
        torch.save(checkpoint, filepath)
        print(f"Checkpoint saved at {filepath}")

    def load_checkpoint(model, optimizer, scheduler, filepath):
        """Load model checkpoint with all training state."""
        checkpoint = torch.load(filepath)

        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

        # The key is always present but holds None when no scheduler was saved
        scheduler_state = checkpoint.get('scheduler_state_dict')
        if scheduler and scheduler_state is not None:
            scheduler.load_state_dict(scheduler_state)

        epoch = checkpoint['epoch']
        best_accuracy = checkpoint.get('best_accuracy', 0)

        print(f"Checkpoint loaded from {filepath} (epoch {epoch})")
        return epoch, best_accuracy
Including Variables in Filenames
Including timestamp and performance metrics in filenames helps organize checkpoints:
    import datetime

    def get_checkpoint_filename(model_name, epoch, accuracy=None):
        """Generate a checkpoint filename with timestamp and metrics."""
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        if accuracy is not None:
            return f"{model_name}_{timestamp}_epoch{epoch}_acc{accuracy:.2f}.pth"
        else:
            return f"{model_name}_{timestamp}_epoch{epoch}.pth"

    # Usage in training loop
    for epoch in range(start_epoch, num_epochs):
        # Training code...

        # Save checkpoint periodically
        if (epoch + 1) % 10 == 0:
            filepath = get_checkpoint_filename("resnet18", epoch, val_accuracy)
            save_checkpoint(model, optimizer, epoch, scheduler, best_accuracy, filepath)
This approach organizes checkpoints with relevant information for easy identification.
TorchScript for Deployment
TorchScript is a way to serialize and optimize PyTorch models for production deployment:
    # Convert to TorchScript using tracing
    example_input = torch.rand(1, 3, 224, 224)
    traced_model = torch.jit.trace(model, example_input)
    traced_model.save("model_traced.pt")

    # Or using scripting (preferred for models with control flow)
    scripted_model = torch.jit.script(model)
    scripted_model.save("model_scripted.pt")
    print("Model saved for deployment")

    # Loading a TorchScript model
    loaded_model = torch.jit.load("model_scripted.pt")
TorchScript offers several advantages for deployment:
- Language independence: Run in C++ environments without Python
- Optimization: Optimize the model for inference performance
- Portability: Deploy to various platforms, including mobile devices
- Graph-level optimizations: Fuse operations for better performance
TorchScript models can be:
- Used in production environments where Python is not available
- Integrated into larger applications written in C++
- Deployed on resource-constrained devices
- Run with optimized inference performance
Inside PyTorch: Exploring the Source Code
Understanding PyTorch's internal implementation provides deeper insights into how it achieves both high performance and flexibility.
Finding Source Code
For installed packages, find the source file that defines a class with:
    import inspect
    import torch
    import torch.nn as nn

    # Find the source file of a class
    print(inspect.getsourcefile(nn.Linear))
Linear Layer Implementation
Let's examine the key components of the nn.Linear implementation:
    # Simplified version of nn.Linear's core functionality
    def linear_forward(input, weight, bias=None):
        output = input.matmul(weight.t())
        if bias is not None:
            output += bias
        return output
The actual PyTorch implementation includes additional optimizations and special cases, but the core operation is a simple matrix multiplication followed by a bias addition. However, this Python-like code ultimately calls optimized C++/CUDA implementations that perform the actual computation.
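The simplified formula can be checked against the real layer. A short sketch with arbitrary sizes, reusing the layer's own weight and bias:

```python
import torch
import torch.nn as nn

layer = nn.Linear(5, 3)
x = torch.randn(4, 5)

# weight has shape [out_features, in_features], hence the transpose
manual = x.matmul(layer.weight.t()) + layer.bias
builtin = layer(x)

print(torch.allclose(manual, builtin, atol=1e-6))
```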
Loss Function Implementation
The CrossEntropyLoss combines log softmax and negative log-likelihood:
    import torch.nn.functional as F

    # Simplified version of CrossEntropyLoss's core functionality
    def cross_entropy_loss(input, target, weight=None, reduction='mean'):
        log_softmax = F.log_softmax(input, dim=1)  # Log-probabilities over the class dimension
        loss = F.nll_loss(log_softmax, target, weight=weight, reduction=reduction)
        return loss
The combined implementation avoids numerical instability issues that could arise from separate softmax and log operations. For large values, direct computation of softmax can lead to overflow, while the combined approach uses log-sum-exp tricks to maintain numerical stability.
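The overflow problem is easy to reproduce. The sketch below uses deliberately extreme, made-up logits and compares a naive exponentiate-then-log computation with `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0]])   # deliberately extreme logits
target = torch.tensor([1])

# Naive route: exponentiate, normalize, then take the log
exp = torch.exp(logits)                  # exp(1000) overflows to inf in float32
naive_probs = exp / exp.sum(dim=1, keepdim=True)
naive_loss = -torch.log(naive_probs[0, target[0]])

# Combined route: log-softmax and NLL computed together
stable_loss = F.cross_entropy(logits, target)

print(naive_loss.item())    # not a finite number
print(stable_loss.item())   # finite, close to 1000
```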
Autograd Implementation
The automatic differentiation system is built around the concept of a computational graph:
- Forward Pass: Tensors flow through operations, recording the computation history
- Backward Pass: Gradients are computed by applying the chain rule, flowing backward through the graph
Each operation in PyTorch implements both a forward function and a backward function that defines how gradients propagate. The C++ backend implements these operations efficiently while the Python frontend provides the user interface.
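The forward/backward pairing is exposed to users through `torch.autograd.Function`. As an illustration, here is a hypothetical `Square` operation with a hand-written backward rule; built-in operations follow the same pattern in the C++ backend.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical op y = x^2 with an explicit backward rule."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # keep x for the backward pass
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x    # chain rule: dL/dx = dL/dy * dy/dx

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()

print(x.grad)  # 2 * x for each element
```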
PyTorch Conventions and Patterns
PyTorch follows several conventions that are helpful to understand:
Naming Conventions
- In-place operations: Functions ending with an underscore (tensor.add_()) modify the tensor in place
- Parameter classes: Classes starting with Parameter represent learnable parameters
- Module hooks: Functions with hook in the name are used for intercepting forward/backward passes
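As an example of the hook convention, `register_forward_hook` attaches a callback that runs after a module's forward pass. The sketch below assumes a small made-up model and a hypothetical callback named `save_output` that captures an intermediate activation.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
captured = {}

def save_output(module, inputs, output):
    # Runs after the module's forward pass; store a detached copy of its output
    captured["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_output)  # hook the ReLU layer
model(torch.randn(3, 4))
handle.remove()                                       # unregister when done

print(captured["hidden"].shape)
```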
Tensor Dimension Ordering
PyTorch typically follows this dimension ordering convention:
- Batch dimension first (N)
- For images: [N, C, H, W] (batch, channels, height, width)
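- For sequences: [N, L, F] (batch, sequence length, features) when batch_first=True; note that nn.RNN, nn.LSTM, and nn.GRU default to batch_first=False, which expects [L, N, F]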
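Image data loaded through other libraries often arrives channels-last. A minimal sketch of converting such a batch to the [N, C, H, W] layout with `permute`, using an assumed batch of random 32x32 RGB images:

```python
import torch

nhwc = torch.randn(8, 32, 32, 3)      # [N, H, W, C], as image libraries often provide
nchw = nhwc.permute(0, 3, 1, 2)       # reorder dimensions to [N, C, H, W]

print(nchw.shape)
print(nchw.is_contiguous())           # permute returns a view with new strides

nchw = nchw.contiguous()              # copy into the standard memory layout if needed
```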
The .detach() Method
The detach() method disconnects a tensor from the computation graph:
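    import torch

    # Create a tensor requiring gradients
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = x * 2

    # Detach y from the computation graph
    z = y.detach()
    print(z.requires_grad)  # False: z shares y's values but has no history

    # Operations on z are not tracked, so no gradient can flow back to x
    # through z (calling backward() on a result of z would raise an error)
    w = z * 3

    # Gradients still flow through the original, attached tensor
    y.sum().backward()
    print(x.grad)  # tensor([2., 2., 2.])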
This is useful when you want to use a tensor's values without tracking its computational history. Common use cases include:
- Preventing gradient flow through certain parts of a network
- Using intermediate results for visualization or logging without affecting gradients
- Converting tensors to NumPy arrays for interoperability with other libraries
- Implementing algorithms that require stopping gradient propagation, like GANs or certain reinforcement learning techniques
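The NumPy use case from the list above can be sketched as follows: calling `numpy()` on a tensor that is still attached to the graph raises an error, and detaching it first resolves that.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2

try:
    y.numpy()                      # y is part of the graph, so this is refused
except RuntimeError as err:
    print("numpy() failed:", err)

values = y.detach().cpu().numpy()  # detach first; .cpu() is a no-op for CPU tensors
print(values)
```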
{{< include hw08-q01.md >}} {{< include hw08-q02.md >}} {{< include hw08-q03.md >}}