PyTorch - 1 Data Parallel


My ML journey started in 2016, building neural networks layer by layer with TensorFlow. Keras already existed, but I wanted to understand the details of TensorFlow, so I didn't really adopt Keras until years later.

PyTorch became popular a couple of years later, mainly due to its dynamic computational graphs and strong distributed computing support. Here is a recap of PyTorch.


Basic Training

This doc from PyTorch gives a really good step-by-step introduction to training with PyTorch: it starts by computing gradients manually, moves on to torch.autograd, and ends with the optim package for optimization.
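For example, here is a minimal sketch of the middle step, letting torch.autograd compute a gradient instead of deriving it by hand (the tensor and the "loss" are made up purely for illustration):

import torch

# A tensor that requires gradients, and a tiny "loss" built from it.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()

# torch.autograd fills in x.grad with d(loss)/dx = 2 * x.
loss.backward()
print(x.grad)   # tensor([2., 4., 6.])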

NN building

So here is a quick summary. First, define the network with torch.nn:

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the basic network components in __init__.
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
    # Always override forward to define the network's computation.
    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()
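As a quick sanity check (a minimal sketch; the batch size of 4 is arbitrary), run a dummy batch through the model and confirm the logits come out with shape [batch, 10]:

X = torch.rand(4, 28, 28)   # fake batch of 4 "images"
logits = model(X)
print(logits.shape)         # torch.Size([4, 10])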

Loss function

Then, define the training loop with a loss function from the nn package. Pay attention that zero_grad is called on the model and torch.no_grad is used before updating the parameters.

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(2000):
    # xx and y are the training inputs and targets (not shown here).
    y_pred = model(xx)

    loss = loss_fn(y_pred, y)
    
    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

Optimizations

Now let's add the optimizer. It provides zero_grad, which clears the gradients, and step, which updates all the parameters automatically. Together with loss.backward(), these are the three key steps of a training loop.

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. The first argument to the RMSprop constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-3
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
for t in range(2000):
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
  
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

Data Parallel

DP (DataParallel) is an older approach: it runs single-process, multi-threaded, and only works on a single machine (with multiple GPUs). It is recommended to replace it with DDP (DistributedDataParallel), which will be introduced later. This is the doc for the introduction.
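The snippet below assumes a toy Model along the lines of the one in that doc: it just wraps a Linear layer and prints the per-GPU batch shape inside forward, with input_size=5 and output_size=2 chosen to match the output shown further down (a minimal sketch, not the exact tutorial code):

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
input_size, output_size = 5, 2

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.fc(x)
        # Each replica prints the shape of the shard it received.
        print("\tIn Model: input size", x.size(), "output size", out.size())
        return out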

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)
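Feeding a batch of 30 rows (random data here; the tutorial uses a DataLoader, but a single tensor keeps the sketch short) shows DataParallel splitting the batch across the GPUs:

inputs = torch.randn(30, input_size).to(device)
outputs = model(inputs)
print("Outside: input size", inputs.size(), "output_size", outputs.size())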

and you will get output like the following:

Let's use 3 GPUs!
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])

DP only works with small models that fit on a single GPU. When a model is too large to fit into one GPU, model parallelism needs to be applied, and that ONLY works with DDP, NOT DP.
