
The perplexity of the $i^{th}$ token in the $k^{th}$ sequence is

$$ P_{ki} = \frac{1}{p(t_{ki})} $$

The perplexity aggregated for the $k^{th}$ sequence is then

$$ P_{k} = \left(\prod_{i=1}^N P_{ki}\right)^{1/N} = \left(\prod_{i=1}^N \frac{1}{p(t_{ki})} \right)^{1/N} $$

which is the geometric mean of the perplexities of the tokens. This makes sense, as it is essentially the multiplicative inverse of the probability the model assigns to the whole sequence, normalized by the sequence length.
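For concreteness, a minimal NumPy sketch of this per-sequence computation (the probability values and variable names such as `token_probs` are made up for illustration):

```python
import numpy as np

# Hypothetical per-token probabilities p(t_ki) assigned by the model
# to one sequence of N = 4 tokens.
token_probs = np.array([0.25, 0.10, 0.50, 0.05])

# Per-token perplexity: P_ki = 1 / p(t_ki)
token_ppl = 1.0 / token_probs

# Per-sequence perplexity: geometric mean of the token perplexities,
# computed in log space for numerical stability.
seq_ppl = np.exp(np.mean(np.log(token_ppl)))
# Equivalently: np.exp(-np.mean(np.log(token_probs)))

print(token_ppl)  # [ 4. 10.  2. 20.]
print(seq_ppl)    # ~6.32
```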

Now my question is how to aggregate the perplexities of several sequences. From various places, including the Hugging Face tutorial, the prescription seems to be to take the arithmetic mean of the sequence perplexities:

$$ P = \frac{1}{m} \sum_{k=1}^m P_k $$
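For concreteness, a small numeric sketch of that prescription (the per-sequence perplexity values are made up):

```python
import numpy as np

# Hypothetical per-sequence perplexities P_k for m = 3 sequences.
seq_ppls = np.array([12.0, 8.0, 150.0])

# Arithmetic mean across sequences, as prescribed in the tutorial.
P = seq_ppls.mean()
print(P)  # ~56.67
```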

I do not quite understand what it means to take the arithmetic mean of inverse probabilities. What is this actually capturing?

Borun Chowdhury

1 Answer


The choice of the arithmetic mean for averaging perplexities is mainly a matter of simplicity and convention: it provides a straightforward summary of overall model performance across multiple sequences.

Each $P_k$ is already a summary statistic (the perplexity of one sequence), so averaging them gives a global view of how well the model performs across the collection.

In some cases, alternatives to the arithmetic mean of perplexities may be more appropriate, for example a weighted average that assigns each sequence a weight proportional to its length.
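A minimal sketch of both options, assuming made-up per-sequence perplexities and token counts:

```python
import numpy as np

# Hypothetical per-sequence perplexities and sequence lengths (in tokens).
seq_ppls = np.array([12.0, 8.0, 150.0])
seq_lens = np.array([100, 400, 20])

# Plain arithmetic mean: every sequence counts equally.
plain_mean = seq_ppls.mean()

# Length-weighted mean: longer sequences contribute proportionally more.
weighted_mean = np.average(seq_ppls, weights=seq_lens)

print(plain_mean)     # ~56.67
print(weighted_mean)  # ~14.23
```

In this sketch the short but hard third sequence dominates the plain mean, while the length-weighted mean is pulled toward the long, easier sequences.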

Kulin Patel