
The perplexity of the $i^{th}$ token in the $k^{th}$ sequence is

$$ P_{ki} = \frac{1}{p(t_{ki})} $$

The perplexity aggregated for the $k^{th}$ sequence is then

$$ P_{k} = \left(\prod_{i=1}^N P_{ki}\right)^{1/N} = \left(\prod_{i=1}^N \frac{1}{p(t_{ki})} \right)^{1/N} $$

which is the geometric mean of the perplexities of the tokens. This makes sense, as it is essentially the multiplicative inverse of the probability the model assigns to the whole sequence, normalized by the sequence length.
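For concreteness, a minimal NumPy sketch of this per-sequence computation (the probability values and variable names such as `token_probs` are made up for illustration):

```python
import numpy as np

# Hypothetical per-token probabilities p(t_ki) assigned by the model
# to one sequence of N = 4 tokens.
token_probs = np.array([0.25, 0.10, 0.50, 0.05])

# Per-token perplexity: P_ki = 1 / p(t_ki)
token_ppl = 1.0 / token_probs

# Per-sequence perplexity: geometric mean of the token perplexities,
# computed in log space for numerical stability.
seq_ppl = np.exp(np.mean(np.log(token_ppl)))
# Equivalently: np.exp(-np.mean(np.log(token_probs)))

print(token_ppl)  # [ 4. 10.  2. 20.]
print(seq_ppl)    # ~6.32
```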

Now my question is how to aggregate the perplexities of several sequences. From various places, including the Hugging Face tutorial, the prescription seems to be to take the arithmetic mean of the sequence perplexities:

$$ P = \frac{1}{m} \sum_{k=1}^m P_k $$
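For concreteness, a small numeric sketch of that prescription (the per-sequence perplexity values are made up):

```python
import numpy as np

# Hypothetical per-sequence perplexities P_k for m = 3 sequences.
seq_ppls = np.array([12.0, 8.0, 150.0])

# Arithmetic mean across sequences, as prescribed in the tutorial.
P = seq_ppls.mean()
print(P)  # ~56.67
```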

I do not quite understand what it means to take the arithmetic mean of inverse probabilities. What is this actually capturing?

Borun Chowdhury

1 Answer


The choice of the arithmetic mean for averaging perplexities is mainly a matter of simplicity and convention: it provides a straightforward summary of overall model performance across multiple sequences.

Each $P_k$ is already a summary statistic (the perplexity of one sequence), so averaging them gives a global view of how well the model performs across the collection.

In some cases, alternatives to the arithmetic mean of perplexities may be more appropriate, for example a weighted average that assigns each sequence a weight proportional to its length.
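A minimal sketch of both options, assuming made-up per-sequence perplexities and token counts:

```python
import numpy as np

# Hypothetical per-sequence perplexities and sequence lengths (in tokens).
seq_ppls = np.array([12.0, 8.0, 150.0])
seq_lens = np.array([100, 400, 20])

# Plain arithmetic mean: every sequence counts equally.
plain_mean = seq_ppls.mean()

# Length-weighted mean: longer sequences contribute proportionally more.
weighted_mean = np.average(seq_ppls, weights=seq_lens)

print(plain_mean)     # ~56.67
print(weighted_mean)  # ~14.23
```

In this sketch the short but hard third sequence dominates the plain mean, while the length-weighted mean is pulled toward the long, easier sequences.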

Kulin Patel