Evaluating Perplexity on Language Models

December 27, 2025

A language model is a probability distribution over sequences of tokens. When you train a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you'll learn about the perplexity metric. Specifically, you'll learn:

What perplexity is and how to compute it
How to evaluate the perplexity of a language model with sample data

Let’s get began.

Evaluating Perplexity on Language Models
Photo by Lucas Davis. Some rights reserved.

Overview

This article is divided into two parts; they are:

What Is Perplexity and How to Compute It
Evaluate the Perplexity of a Language Model with the HellaSwag Dataset

What Is Perplexity and How to Compute It

Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities of the tokens in the sample. Mathematically, perplexity is defined as:

$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i)^{-1/L} = \exp\big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i)\big)
$$

Perplexity is a function of a particular sequence of tokens. In practice, it is more convenient to compute perplexity from the mean of the log probabilities, as shown in the formula above.

Perplexity is a metric that quantifies how much a language model hesitates about the next token on average. If the language model is perfectly certain, the perplexity is 1. If the language model is completely uncertain, then every token in the vocabulary is equally likely, and the perplexity is equal to the vocabulary size. You shouldn't expect the perplexity of a trained model to go beyond this range; only a model that is confidently wrong, and therefore worse than uniform guessing, would exceed the vocabulary size.
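To make the definition concrete, here is a minimal sketch in plain Python. The helper name `perplexity`, the tiny vocabulary size, and the toy probability lists are illustrative assumptions, not part of the evaluation code used later in this article. The sketch computes perplexity from the mean log probability and checks the two boundary cases described above.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability the model assigned to each token."""
    # Average the log probabilities, then exponentiate the negated mean
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)

vocab_size = 8  # hypothetical tiny vocabulary, for illustration only

# A perfectly certain model assigns probability 1 to every observed token
certain = perplexity([1.0, 1.0, 1.0])

# A completely uncertain model assigns 1/vocab_size to every observed token
uniform = perplexity([1.0 / vocab_size] * 5)

# A mixed case: the result matches the inverse of the geometric mean
probs = [0.5, 0.25, 0.125]
mixed = perplexity(probs)
inverse_geometric_mean = math.prod(probs) ** (-1.0 / len(probs))

print(certain, uniform, mixed, inverse_geometric_mean)
```

Working in log space, as this sketch does, avoids multiplying many small probabilities together, which would underflow for long sequences.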

Evaluate the Perplexity of a Language Model with the HellaSwag Dataset

Perplexity is a dataset-dependent metric. One dataset you can use is HellaSwag. It is a dataset with train, test, and validation splits. It is available on the Hugging Face Hub, and you can load it with the following code:

    import datasets

    # Download all three splits from the Hugging Face Hub
    dataset = datasets.load_dataset("HuggingFaceFW/hellaswag")
    print(dataset)

    # Print the first validation sample only
    for sample in dataset["validation"]:
        print(sample)
        break

Running this code will print the following:

    DatasetDict({
        train: Dataset({
            features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                       'source_id', 'split', 'split_type', 'label'],
            num_rows: 39905
        })
        test: Dataset({
            features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                       'source_id', 'split', 'split_type', 'label'],
            num_rows: 10003
        })
        validation: Dataset({
            features: ['ind', 'activity_label', 'ctx_a', 'ctx_b', 'ctx', 'endings',
                       'source_id', 'split', 'split_type', 'label'],
            num_rows: 10042
        })
    })
    {'ind': 24, 'activity_label': 'Roof shingle removal',
     'ctx_a': 'A man is sitting on a roof.', 'ctx_b': 'he',
     'ctx': 'A man is sitting on a roof. he', 'endings': [
         'is using wrap to wrap a pair of skis.', 'is ripping level tiles off.',
         "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'
     ], 'source_id': 'activitynet~v_-JhWjGDPHMY', 'split': 'val', 'split_type': 'indomain',
     'label': '3'}

You can see that the validation split has 10,042 samples. This is the split you'll use in this article. Each sample is a dictionary. The key "activity_label" describes the activity category, and the key "ctx" provides the context that needs to be completed. The model is expected to complete the sequence by selecting one of the four endings. The key "label", with values 0 to 3, indicates which ending is correct.
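As a small illustration of how these fields fit together, the sketch below hard-codes the sample printed above, reduced to four fields, so that it runs without downloading the dataset. The helper name `candidates` is hypothetical and is not used elsewhere in this article.

```python
# One HellaSwag validation sample, reduced to the fields discussed here
sample = {
    "activity_label": "Roof shingle removal",
    "ctx": "A man is sitting on a roof. he",
    "endings": [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubik's cube.",
        "starts pulling up roofing on a roof.",
    ],
    "label": "3",
}

def candidates(sample):
    """Return the context joined with each of the four endings."""
    return [sample["ctx"] + " " + ending for ending in sample["endings"]]

# The label is stored as a string, so convert it before indexing
groundtruth = int(sample["label"])
correct_sentence = candidates(sample)[groundtruth]
print(correct_sentence)
```

Note that the label is a string in the dataset, which is why it has to be converted with `int()` before it can index the list of endings.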

With this, you can write a short program to evaluate your own language model. Let's use a small model from Hugging Face as an example:

    import datasets
    import torch
    import torch.nn.functional as F
    import tqdm
    import transformers

    model = "openai-community/gpt2"

    # Load the model
    torch.set_default_device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = transformers.AutoTokenizer.from_pretrained(model)
    model = transformers.AutoModelForCausalLM.from_pretrained(model)

    # Load the dataset: HellaSwag has train, test, and validation splits
    dataset = datasets.load_dataset("hellaswag", split="validation")

    # Evaluate the model: Compute the perplexity of each ending
    num_correct = 0
    for sample in tqdm.tqdm(dataset):
        # tokenize text from the sample
        text = tokenizer.encode(" " + sample["activity_label"] + ". " + sample["ctx"])
        endings = [tokenizer.encode(" " + x) for x in sample["endings"]]  # 4 endings
        groundtruth = int(sample["label"])  # integer, 0 to 3
        # generate logits for each ending
        perplexities = [0.0] * 4
        for i, ending in enumerate(endings):
            # run the entire input and ending through the model
            input_ids = torch.tensor(text + ending).unsqueeze(0)
            with torch.no_grad():  # inference only, no gradients needed
                output = model(input_ids).logits
            # extract the logits for each token in the ending
            logits = output[0, len(text)-1:, :]
            token_probs = F.log_softmax(logits, dim=-1)
            # accumulate the log probability of generating the ending
            log_prob = 0.0
            for j, token in enumerate(ending):
                log_prob += token_probs[j, token]
            # convert the sum of log probabilities to perplexity
            perplexities[i] = torch.exp(-log_prob / len(ending))
        # print the perplexity of each ending
        print(sample["activity_label"] + ". " + sample["ctx"])
        correct = perplexities[groundtruth] == min(perplexities)
        for i, p in enumerate(perplexities):
            if i == groundtruth:
                symbol = "(O)" if correct else "(!)"
            elif p == min(perplexities):
                symbol = "(X)"
            else:
                symbol = "   "
            print(f"Ending {i}: {p:.4g} {symbol} - {sample['endings'][i]}")
        if correct:
            num_correct += 1

    print(f"Accuracy: {num_correct}/{len(dataset)} = {num_correct / len(dataset):.4f}")

This code loads the smallest GPT-2 model from the Hugging Face Hub. It is a 124M-parameter model that you can easily run on a modest computer. The model and tokenizer are loaded using the Hugging Face transformers library. You also load the HellaSwag validation dataset.

In the for-loop, you tokenize the activity label and the context. You also tokenize each of the four endings. Note that tokenizer.encode() is the method for using the tokenizer from the transformers library. It is different from the tokenizer object you used in the previous article.

Next, for each ending, you run the concatenated input and ending through the model. The input_ids tensor is a 2D tensor of integer token IDs with batch dimension 1. The model returns an object, from which you extract the output logits tensor. This is different from the model you built in the previous article, as this is a model object from the transformers library. You can easily swap it with your own trained model object with minor changes.

GPT-2 is a decoder-only transformer model. It processes the input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ corresponds to the model's estimate of the token at position $p+1$, conditioned on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert the logits to log probabilities and compute the average over the length of each ending.
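The offset arithmetic is easy to get wrong, so here is a minimal sketch that replaces the model with made-up logits over a five-token vocabulary. Everything in it (the `log_softmax` and `ending_perplexity` helpers, the token IDs, and the logit values) is an assumption for illustration; only the indexing mirrors the evaluation code above.

```python
import math

def log_softmax(logits):
    """Convert one row of logits to log probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def ending_perplexity(logits, n, ending):
    """Perplexity of `ending`, given logits for the context (length n) plus ending.

    logits[p] is the prediction for the token at position p + 1,
    so the first ending token is predicted by the row at offset n - 1.
    """
    log_prob = 0.0
    for j, token in enumerate(ending):
        log_prob += log_softmax(logits[n - 1 + j])[token]
    return math.exp(-log_prob / len(ending))

# Hypothetical token IDs: a 3-token context followed by a 2-token ending
context, ending = [4, 0, 2], [1, 3]

# Made-up logits, one row per input position (5 rows, vocabulary of 5)
logits = [
    [0.0, 0.0, 0.0, 0.0, 0.0],            # position 0: predicts context[1], unused
    [0.0, 0.0, 0.0, 0.0, 0.0],            # position 1: predicts context[2], unused
    [0.0, math.log(4.0), 0.0, 0.0, 0.0],  # position 2: predicts ending[0], p(token 1) = 0.5
    [0.0, 0.0, 0.0, 0.0, 0.0],            # position 3: predicts ending[1], p(token 3) = 0.2
    [0.0, 0.0, 0.0, 0.0, 0.0],            # position 4: predicts past the end, unused
]

ppl = ending_perplexity(logits, len(context), ending)
print(f"Perplexity of the ending: {ppl:.4f}")
```

The last row of logits is never read: it predicts the token that would follow the ending, which is why the loop over the ending stops one row short of the end of the sliced tensor.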

The value token_probs[j, token] is the log probability at position j for the token with ID token. The mean log probability of the tokens in the ending is used to compute the perplexity. A good model is expected to assign the lowest perplexity to the correct ending. You can evaluate a model by counting the number of correct predictions over the entire HellaSwag validation dataset. When you run this code, you will see the following:

    …
    Finance and Business. [header] How to buy a peridot [title] Look at a variety of stones…
    Ending 0: 13.02 (X) - You will want to watch several of the gems, particularly eme…
    Ending 1: 30.19     - Not only are they among the delicates among them, but they can be…
    Ending 2: 34.96 (!) - Familiarize yourself with the different shades that it comes in, …
    Ending 3: 28.85     - Neither peridot nor many other jade or allekite stones are necess…
    Family Life. [header] How to tell if your teen is being abused [title] Pay attention to…
    Ending 0: 16.58     - Try to figure out why they are dressing something that is frowned…
    Ending 1: 22.01     - Read the following as a rule for determining your teen's behaviou…
    Ending 2: 15.21 (O) - [substeps] For instance, your teen may try to hide the signs of a…
    Ending 3: 23.91     - [substeps] Ask your teen if they have black tights (with stripper…
    Accuracy: 3041/10042 = 0.3028

The code prints the perplexity of each ending. It marks the correct answer with (O) when the model picks it and with (!) when it does not; in the latter case, the model's wrong prediction is marked with (X). You can see that GPT-2 assigns perplexities of roughly 10 to 35 in these samples, even for a correct answer. Advanced LLMs can achieve perplexity below 10, even with a much larger vocabulary size than GPT-2. More important is whether the model can identify the correct ending: the one that naturally completes the sentence. It should be the one with the lowest perplexity; otherwise, the model is unlikely to generate the correct ending. GPT-2 achieves only about 30% accuracy on this dataset, not far above the 25% expected from random guessing among four endings.
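The marking logic can be checked in isolation. The sketch below feeds in the perplexities printed above for the two samples shown. The function name `predict` is hypothetical, and the sketch assumes the correct ending for both samples is index 2, as the (!) and (O) markers indicate.

```python
def predict(perplexities):
    """Index of the ending with the lowest perplexity, i.e. the model's choice."""
    return min(range(len(perplexities)), key=lambda i: perplexities[i])

# (perplexities of the four endings, index of the correct ending)
results = [
    ([13.02, 30.19, 34.96, 28.85], 2),  # first sample shown above
    ([16.58, 22.01, 15.21, 23.91], 2),  # second sample shown above
]

num_correct = 0
for perplexities, groundtruth in results:
    choice = predict(perplexities)
    num_correct += int(choice == groundtruth)
    print(f"chose ending {choice}, correct ending {groundtruth}")

print(f"Accuracy: {num_correct}/{len(results)} = {num_correct / len(results):.4f}")
```

Only the ranking of the four perplexities matters for accuracy, not their absolute values, which is why a model can score well on this task even when its perplexities look high.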

You can also repeat the code with a different model. Here are the results:

model openai-community/gpt2: This is the smallest GPT-2 model with 124M parameters, used in the code above. The accuracy is 3041/10042 or 30.28%
model openai-community/gpt2-medium: This is the larger GPT-2 model with 355M parameters. The accuracy is 3901/10042 or 38.85%
model meta-llama/Llama-3.2-1B: This is the smallest model in the Llama 3.2 family with 1B parameters. The accuracy is 5731/10042 or 57.07%

As expected, accuracy increases with model size.

Note that you should not compare perplexities across models with vastly different architectures. Since perplexity is a metric in the range of 1 to the vocabulary size, it depends heavily on the tokenizer. You can see the reason when you compare the perplexity in the code above after replacing GPT-2 with Llama 3.2 1B: The perplexity is an order of magnitude higher for Llama 3.2, but the accuracy is indeed better. This is because GPT-2 has a vocabulary size of only 50,257, whereas Llama 3.2 1B has a vocabulary size of 128,256.
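To see how far the vocabulary size alone moves the scale, the following sketch computes the perplexity of a hypothetical model that guesses uniformly over each tokenizer's vocabulary. It uses the two vocabulary sizes quoted above and nothing from the real models, so it only illustrates the uniform-guess end of the range, not the perplexities the models actually achieve. The function name `uniform_perplexity` is an assumption for illustration.

```python
import math

def uniform_perplexity(vocab_size, num_tokens=10):
    """Perplexity when every token is assigned probability 1 / vocab_size."""
    log_probs = [math.log(1.0 / vocab_size)] * num_tokens
    return math.exp(-sum(log_probs) / num_tokens)

vocab_sizes = {"GPT-2": 50257, "Llama 3.2 1B": 128256}
for name, size in vocab_sizes.items():
    print(f"{name}: uniform-guess perplexity = {uniform_perplexity(size):.0f}")

# How much larger the uniform-guess baseline is for the larger vocabulary
ratio = vocab_sizes["Llama 3.2 1B"] / vocab_sizes["GPT-2"]
print(f"ratio of the uniform baselines: {ratio:.2f}")
```

The two baselines differ by a factor of roughly 2.5, so perplexities from the two models sit on different scales and are not directly comparable.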

Summary

In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. Specifically, you learned:

Perplexity measures how much a model hesitates about the next token on average.
Perplexity is a metric sensitive to vocabulary size.
Computing perplexity means computing the inverse of the geometric mean of the probabilities of the tokens in the sample.
