Train a Model Faster with torch.compile and Gradient Accumulation

December 27, 2025


Training a language model with a deep transformer architecture is time-consuming. However, there are techniques you can use to accelerate training. In this article, you’ll learn:

Using torch.compile() to speed up the model
Using gradient accumulation to train a model with a larger effective batch size

Let’s get began!

Photograph by François Genon. Some rights reserved.

Overview

This article is divided into two parts; they’re:

Using torch.compile()
Gradient Accumulation

Using torch.compile

When you write your model code and run it with PyTorch, the code is executed in eager mode. This means the code is executed line by line, and the results are stored in memory. This is native to Python since it is an interpreted language. You know this is the case because when you make a mistake in your code, you will not see the error until you run that line of code.

Running a model in eager mode is slow. Starting with PyTorch 2.0, you can use torch.compile() to compile a model for improved performance. This generates a new model object that is optimized. It is not the same model object you created using nn.Module, but it shares the same tensors with the original model. You can use this compiled model for the forward pass, backward pass, and optimizer updates as usual.

Building a model and compiling it as a computation graph is how TensorFlow 1.0 was supposed to work. This makes debugging harder, since the model you execute no longer matches the code you wrote line by line. Therefore, you shouldn’t compile your model until you have run a trial and confirmed that it is error-free.

Not all models can be compiled. However, if your model supports compilation, you immediately benefit from the speedup. To compile a model, all you need to do is replace the model object right before you’re ready to use it:

    ...
    model = LlamaForPretraining(model_config).to(device)
    model.load_state_dict(checkpoint)
    model = torch.compile(model)
    ...

Don’t load the model weights after compilation. This is because the compiled model is an object that shares the same weights as the original model. During compilation, the computation graph is constructed referencing the weight tensors of the original model. If you load the weights after compilation, the model may not work as expected.
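As a minimal sketch of the weight sharing described above, the snippet below compiles a tiny stand-in network (not the Llama model from this article) and checks that the compiled wrapper holds the very same parameter tensors and produces the same output. The `TinyNet` class and the `backend="eager"` argument are assumptions made for illustration: `backend="eager"` skips code generation so the example can run without a GPU or compiler toolchain, whereas the default backend is the one expected to deliver the speedup.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a real model."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


torch.manual_seed(0)
model = TinyNet()
# backend="eager" is an assumption for portability; omit it to use the default backend
compiled = torch.compile(model, backend="eager")

# The wrapper keeps a reference to the original module...
same_module = compiled._orig_mod is model
# ...and exposes the same parameter objects, not copies
shared = all(p is q for p, q in zip(model.parameters(), compiled.parameters()))

x = torch.randn(2, 8)
with torch.no_grad():
    outputs_match = torch.allclose(model(x), compiled(x))

print(same_module, shared, outputs_match)
```

Because the parameters are shared, an optimizer built from either `model.parameters()` or `compiled.parameters()` updates the same tensors.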

Similarly, to save the compiled model, you should refer to the original model’s state dict, as follows:

    torch.save(getattr(model, "_orig_mod", model).state_dict(), "model.pth")

The original model can be accessed from the compiled model using model._orig_mod. In the code above, we use getattr(model, "_orig_mod", model) to get the original model if it exists, or use model itself if it doesn’t. This line of code works for both compiled and original models.
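The save-then-reload round trip can be sketched as follows. This is an illustration under assumptions, not the article's training script: the `unwrap` helper name, the `nn.Linear` stand-in model, the in-memory buffer used in place of a file on disk, and `backend="eager"` (chosen only so the snippet runs without a compiler toolchain) are all hypothetical choices.

```python
import io

import torch
import torch.nn as nn


def unwrap(model):
    """Return the original module whether or not `model` was compiled."""
    return getattr(model, "_orig_mod", model)


torch.manual_seed(0)
model = nn.Linear(4, 2)  # hypothetical stand-in for a real model
compiled = torch.compile(model, backend="eager")  # backend chosen for portability

# Save through the unwrapped module so the keys match the original model
buffer = io.BytesIO()  # in-memory file, used here instead of a file on disk
torch.save(unwrap(compiled).state_dict(), buffer)
buffer.seek(0)

# Load into a fresh, uncompiled model first, then compile it
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, weights_only=True))
restored = torch.compile(restored, backend="eager")

saved_keys = sorted(unwrap(compiled).state_dict().keys())
print(saved_keys)  # ['bias', 'weight']
```

Saving through the unwrapped module keeps the checkpoint loadable by an uncompiled model. The compiled wrapper's own state dict typically prefixes every key with `_orig_mod.`, which an uncompiled model would not recognize.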

Gradient Accumulation

When you train a model, you likely spend two to three times more time on the backward pass than the forward pass. This is because the backward pass is more computationally intensive and uses more memory.

One easy trick to speed up training is to perform fewer backward passes. This can be achieved by increasing the batch size: with the same number of data samples, a larger batch size means fewer batches to process.

However, a larger batch size requires more memory. In a memory-constrained environment, you can mimic a larger batch size by running multiple forward passes and accumulating the gradients. This is called gradient accumulation.

It’s easier to explain this idea with code:

    ...
    accumulate_steps = 4

    for epoch in range(num_epochs):
        optimizer.zero_grad()
        for i, batch in enumerate(dataloader):
            # get batched data
            input_ids, target_ids = batch
            # create attention mask: causal mask + padding mask
            attn_mask = (create_causal_mask(input_ids.shape[1], device) +
                         create_padding_mask(input_ids, PAD_TOKEN_ID, device))
            # extract output from model
            logits = model(input_ids, attn_mask)
            # compute loss: cross-entropy between logits and target, ignoring padding tokens
            loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
            loss = loss / accumulate_steps
            # Run backward, but update only once every `accumulate_steps` steps
            loss.backward()
            if (i + 1) % accumulate_steps == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                optimizer.zero_grad()
                scheduler.step()

The training loop above is an excerpt from the previous article for training a Llama model on your local GPU.

Normally, when you run a forward pass, you calculate the loss. You then call loss.backward() to backpropagate the loss gradient through the model parameters. In PyTorch, the backward() method is cumulative, meaning gradients are added up. Therefore, you need to call optimizer.zero_grad() explicitly to clear the gradients before running the backward pass.
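The cumulative behavior of backward() can be seen on a single scalar. The sketch below is a toy illustration using a hypothetical parameter `w` with the loss `w * w`, rather than a real model and optimizer.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w * w).backward()
first = w.grad.item()

# Second backward pass without clearing: the new gradient is added to the old one
(w * w).backward()
second = w.grad.item()

# Clearing resets the accumulator; optimizer.zero_grad() does this for every parameter
w.grad = None
(w * w).backward()
third = w.grad.item()

print(first, second, third)  # 4.0 8.0 4.0
```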

In the code above, you deliberately don’t call optimizer.zero_grad() in every iteration. Instead, you run backpropagation for the loss divided by accumulate_steps. This way, the gradients are scaled down but accumulated over accumulate_steps iterations. Once every accumulate_steps iterations, you run the optimizer to adjust the model parameters.
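To see why dividing the loss by accumulate_steps is the right scaling, the sketch below compares the gradient from one large batch with the gradient accumulated over four smaller ones. The linear model, the mean-squared-error loss, and the batch sizes are assumptions chosen to keep the example small; they are not the model or loss used in this article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
accumulate_steps = 4
micro_batch_size = 8  # assumed size, for illustration only

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()  # averages over the batch, like a mean-reduced cross-entropy
x = torch.randn(accumulate_steps * micro_batch_size, 10)
y = torch.randn(accumulate_steps * micro_batch_size, 1)

# Reference: one backward pass over the full batch
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: several smaller backward passes, each loss divided by accumulate_steps
model.zero_grad()
for xb, yb in zip(x.chunk(accumulate_steps), y.chunk(accumulate_steps)):
    loss = loss_fn(model(xb), yb) / accumulate_steps
    loss.backward()
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print(grads_match)
```

The match is exact (up to floating-point error) here because every mini-batch has the same number of samples. When the loss averages over a varying count, such as the non-padding tokens in each mini-batch, the accumulated gradient is only an approximation of the large-batch gradient.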

This approach yields results comparable to using a larger batch size. However, since you run fewer optimizer updates, the learning rate schedule needs to be adjusted accordingly. This means you need to initialize the scheduler with a different number of steps:

    ...
    num_training_steps = (len(dataloader) // accumulate_steps) * num_epochs
    cosine_scheduler = lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=num_training_steps - num_warmup_steps,
        eta_min=0
    )
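The step count in this formula can be checked against the training loop shown earlier. The sketch below simulates only the loop's counting logic in plain Python; the function name and the sizes are hypothetical and chosen so that the number of batches is not a multiple of accumulate_steps.

```python
def count_optimizer_steps(num_batches, accumulate_steps, num_epochs):
    """Simulate the training loop and count how often optimizer.step() would run."""
    steps = 0
    for _ in range(num_epochs):
        for i in range(num_batches):
            if (i + 1) % accumulate_steps == 0:
                steps += 1
    return steps


# Hypothetical sizes, chosen so the batches do not divide evenly
num_batches, accumulate_steps, num_epochs = 1002, 4, 3

simulated = count_optimizer_steps(num_batches, accumulate_steps, num_epochs)
formula = (num_batches // accumulate_steps) * num_epochs
print(simulated, formula)  # 750 750
```

The floor division matters. In this loop, the last two batches of each epoch are backpropagated, but their gradients are cleared by optimizer.zero_grad() at the start of the next epoch, so they never trigger an optimizer or scheduler step.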

Summary

In this article, you learned that using torch.compile() can help you speed up the model by compiling the computation graph. You also learned that gradient accumulation is a technique for training with a larger effective batch size by accumulating gradients from multiple mini-batches. Since you run fewer optimizer updates this way, you save time on parameter updates, although every mini-batch still requires its own forward and backward pass.
