2.3 PyTorch Basics
Learn the PyTorch Basics!
Created Date: 2025-05-10
Most machine learning workflows involve working with data, creating models, optimizing model parameters, and saving the trained models. This tutorial introduces you to a complete ML workflow implemented in PyTorch, with links to learn more about each of these concepts.
We’ll use the MNIST dataset to train a neural network that predicts if an input image belongs to one of the ten digit classes (0 through 9).
This tutorial assumes a basic familiarity with Python and Deep Learning concepts.
If you’re familiar with other deep learning frameworks, check out the section Quickstart first to quickly familiarize yourself with PyTorch’s API.
If you’re new to deep learning frameworks, head right into the section Tensors of our step-by-step guide.
2.3.1 Quickstart
This section runs through the API for common tasks in machine learning. The file quick_start.py records the full process; refer to the links in each section to dive deeper.
2.3.1.1 Working With Data
PyTorch has two primitives to work with data: torch.utils.data.DataLoader and torch.utils.data.Dataset. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset.
PyTorch offers domain-specific libraries such as TorchText, TorchVision, and TorchAudio, all of which include datasets. For this tutorial, we will be using a TorchVision dataset.
Every TorchVision Dataset includes two arguments, transform and target_transform, to modify the samples and the labels respectively.
import torchvision
from torchvision import datasets

# Download the MNIST training data.
training_data = datasets.MNIST(
    root='data',
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)

# Download the MNIST test data.
test_data = datasets.MNIST(
    root='data',
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)
We pass the Dataset as an argument to DataLoader. This wraps an iterable over our dataset, and supports automatic batching, sampling, shuffling and multiprocess data loading. Here we define a batch size of 64, i.e. each element in the dataloader iterable will return a batch of 64 features and labels.
from torch.utils.data import DataLoader

batch_size = 64

# Create data loaders over the training and test datasets.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break
Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
2.3.1.2 Creating Models
To define a neural network in PyTorch, we create a class that inherits from nn.Module. We define the layers of the network in the __init__ function and specify how data will pass through the network in the forward function. To accelerate operations in the neural network, we move it to an accelerator such as CUDA, MPS, MTIA, or XPU. If the current accelerator is available, we use it; otherwise, we use the CPU.
import torch

device = torch.accelerator.current_accelerator().type \
    if torch.accelerator.is_available() else 'cpu'
print(f'Using {device} device')
class NeuralNetwork(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = torch.nn.Flatten()
        self.linear_relu_stack = torch.nn.Sequential(
            torch.nn.Linear(28 * 28, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)
Using mps device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
2.3.1.3 Optimizing the Model Parameters
To train a model, we need a loss function and an optimizer.
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
In a single training loop, the model makes predictions on the training dataset (fed to it in batches), and backpropagates the prediction error to adjust the model’s parameters.
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error.
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation.
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
We also check the model’s performance against the test dataset to ensure it is learning.
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
The training process is conducted over several iterations (epochs). During each epoch, the model learns parameters to make better predictions. We print the model’s accuracy and loss at each epoch; we’d like to see the accuracy increase and the loss decrease with every epoch.
epochs = 20
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")
Epoch 1
-------------------------------
loss: 2.304244 [   64/60000]
loss: 2.297499 [ 6464/60000]
loss: 2.292170 [12864/60000]
loss: 2.286326 [19264/60000]
loss: 2.280987 [25664/60000]
loss: 2.273621 [32064/60000]
loss: 2.264372 [38464/60000]
loss: 2.281450 [44864/60000]
loss: 2.262312 [51264/60000]
loss: 2.248378 [57664/60000]
Test Error:
 Accuracy: 44.1%, Avg loss: 2.254247

...

Epoch 19
-------------------------------
loss: 0.454976 [   64/60000]
loss: 0.356519 [ 6464/60000]
loss: 0.364573 [12864/60000]
loss: 0.436727 [19264/60000]
loss: 0.348171 [25664/60000]
loss: 0.429552 [32064/60000]
loss: 0.311450 [38464/60000]
loss: 0.496438 [44864/60000]
loss: 0.452633 [51264/60000]
loss: 0.487647 [57664/60000]
Test Error:
 Accuracy: 89.2%, Avg loss: 0.392691

Epoch 20
-------------------------------
loss: 0.439239 [   64/60000]
loss: 0.345591 [ 6464/60000]
loss: 0.349240 [12864/60000]
loss: 0.428157 [19264/60000]
loss: 0.335903 [25664/60000]
loss: 0.421431 [32064/60000]
loss: 0.300410 [38464/60000]
loss: 0.485814 [44864/60000]
loss: 0.441437 [51264/60000]
loss: 0.481266 [57664/60000]
Test Error:
 Accuracy: 89.4%, Avg loss: 0.382875

Done!
2.3.2 Tensors
Tensors are specialized data structures that are very similar to arrays and matrices. In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.
Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and NumPy arrays can often share the same underlying memory, eliminating the need to copy data (see Bridge with NumPy). Tensors are also optimized for automatic differentiation (we’ll see more about that later in the Autograd section). If you’re familiar with ndarrays, you’ll be right at home with the Tensor API. If not, follow along!
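For example, a CPU tensor created with torch.from_numpy shares memory with its source array, so an in-place update to one is visible in the other. A minimal sketch (the variable names are illustrative):
import numpy
import torch

# A CPU tensor created from a NumPy array shares the same memory.
n = numpy.ones(5)
t = torch.from_numpy(n)

# An in-place update to the tensor is reflected in the NumPy array.
t.add_(1)
print(t)  # tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
print(n)  # [2. 2. 2. 2. 2.]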
2.3.2.1 Initializing a Tensor
Tensors can be initialized in various ways. Take a look at the following examples from the file tensor_demo.py.
Directly from Data
Tensors can be created directly from data. The data type is automatically inferred.
import torch
import numpy

# Create a tensor directly from data; the data type is inferred.
data = [[1, 2], [3, 4]]
x_data = torch.tensor(data)
assert x_data.shape == (2, 2)
From a NumPy array
Tensors can be created from NumPy arrays (and vice versa - see Bridge with NumPy).
# from a numpy array
np_array = numpy.array(data)
x_np = torch.from_numpy(np_array)
assert x_np.shape == (2, 2)
From Another Tensor
The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden.
x_ones = torch.ones_like(x_data)
print(f"Ones Tensor: \n {x_ones} \n")
x_rand = torch.rand_like(x_data, dtype=torch.float)
print(f"Random Tensor: \n {x_rand} \n")
Ones Tensor:
 tensor([[1, 1],
        [1, 1]])

Random Tensor:
 tensor([[0.2572, 0.9217],
        [0.7783, 0.2022]])
With Random or Constant Values
shape is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor.
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)
print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")
Random Tensor:
 tensor([[0.9961, 0.3413, 0.4354],
        [0.4710, 0.9064, 0.8762]])

Ones Tensor:
 tensor([[1., 1., 1.],
        [1., 1., 1.]])

Zeros Tensor:
 tensor([[0., 0., 0.],
        [0., 0., 0.]])
2.3.2.2 Attributes of a Tensor
Tensor attributes describe a tensor’s shape, datatype, and the device on which it is stored.
tensor = torch.rand(3,4)
print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")
Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu
2.3.2.3 Operations on Tensors
Over 1200 tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing, indexing, slicing), sampling, and more are comprehensively described in the PyTorch documentation.
Accelerator
Each of these operations can be run on the CPU as well as on an accelerator such as CUDA, MPS, MTIA, or XPU. By default, tensors are created on the CPU. We need to explicitly move tensors to the accelerator using the .to method (after checking for accelerator availability). Keep in mind that copying large tensors across devices can be expensive in terms of time and memory!
# We move our tensor to the current accelerator if available
if torch.accelerator.is_available():
tensor = tensor.to(torch.accelerator.current_accelerator())
print(f"Device tensor is stored on: {tensor.device}")
Device tensor is stored on: cuda:0
Standard NumPy-like Indexing and Slicing
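Tensors support the same indexing and slicing syntax as NumPy ndarrays. A minimal sketch:
tensor = torch.ones(4, 4)
print(f"First row: {tensor[0]}")
print(f"First column: {tensor[:, 0]}")
print(f"Last column: {tensor[..., -1]}")

# Assign to a whole column in place, just as with an ndarray.
tensor[:, 1] = 0
print(tensor)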
2.3.3 Datasets and DataLoaders
Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that subclass torch.utils.data.Dataset and implement functions specific to the particular data. They can be used to prototype and benchmark your model. You can find them here: Image Datasets, Text Datasets, and Audio Datasets.
2.3.3.1 Loading a Dataset
2.3.3.2 Iterating and Visualizing the Dataset
2.3.3.3 Creating a Custom Dataset
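A custom Dataset class must implement three functions: __init__, __len__, and __getitem__. The sketch below is a minimal, hypothetical example of an image dataset whose labels are assumed to live in a CSV file of filename,label rows; the file layout and names are assumptions for illustration only.
import os
import pandas as pd
from torch.utils.data import Dataset
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        # The annotations file is assumed to be a CSV of "filename,label" rows.
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        # DataLoader uses this to know how many samples the dataset holds.
        return len(self.img_labels)

    def __getitem__(self, idx):
        # Load the image for the given index and look up its label.
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label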
2.3.4 Transforms
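As noted in the Quickstart, every TorchVision Dataset accepts a transform to modify the features and a target_transform to modify the labels. The snippet below is a minimal sketch; the one-hot Lambda target_transform is an illustrative choice rather than a requirement for MNIST.
import torch
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

ds = datasets.MNIST(
    root='data',
    train=True,
    download=True,
    # Convert the PIL image into a float tensor scaled to [0, 1].
    transform=ToTensor(),
    # Turn the integer label into a one-hot encoded float tensor.
    target_transform=Lambda(
        lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)
    ),
)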
2.3.5 Build the Neural Model
2.3.6 Automatic Differentiation
When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.
To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.
Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:
import torch

x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
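To adjust w and b, we need the gradients of the loss with respect to them. Calling backward() on the loss runs backpropagation through the recorded computational graph, after which the gradients are available in the .grad attributes of the leaf tensors. Continuing the snippet above:
# Run backpropagation and read the accumulated gradients.
loss.backward()
print(w.grad)
print(b.grad)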