
pass@k is a metric for evaluating models that generate code; it was used, for example, to evaluate Codex. To evaluate pass@k, you have a dataset of natural-language/code pairs, and you pass each natural-language prompt to the model. For each prompt, the model generates k code snippets. If at least one of those snippets is correct, the model succeeded at that prompt within k samples. pass@k is the fraction of prompts at which the model succeeded in this sense.

The samples for each prompt are drawn via some stochastic decoding procedure based on the model's output probability distribution over the vocabulary, such as sampling the next token from that distribution (possibly with a temperature).

So in the Codex paper for example, we see these figures for the largest Codex model:

  • pass@1: 28%
  • pass@100: 72%

These numbers make no sense to me. Every time we sample the model's output for a given prompt, we get back a random code snippet, which is either correct or incorrect, and each trial is independent. So if the probability of a single output being correct is $p$, the probability of at least one of 100 outputs being correct should be $1 - (1 - p)^{100}$. Here $p = 0.28$, so pass@100 should be something like 99.999999%.
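For concreteness, here is that naive calculation, treating the 28% pass@1 figure as if it were a single per-sample success probability shared by all prompts (which, as it turns out, it is not):

```python
# Naive independence calculation: assume every sample for every prompt
# succeeds with the same probability p = pass@1 = 0.28.
p = 0.28
naive_pass_at_100 = 1 - (1 - p) ** 100
print(naive_pass_at_100)  # ≈ 1.0 (off by only about 5e-15)
```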

Are the trials not independent? What's going on?

nbro
Jack M

3 Answers


Very late to this question, but pass@k doesn't behave the way you describe because each single pass@k measurement is itself a union of k independent events, so averaging those measurements over prompts doesn't behave like a union of (number of samples) independent events.

Here's an illustration. Say our dataset has two problems, such that on problem 1 our model always gets it right, and on problem 2 our model never gets it right. Then pass@1 = 50%. Now do 100 passes on each problem: we get 100 correct answers on problem 1 (counts as solved) and 0 correct answers on problem 2 (counts as unsolved), so pass@100 = 50%. In fact pass@k = 50% for every k in this toy setting.
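The toy setting above can be checked with a short simulation (a sketch; the `pass_at_k` helper here is made up for illustration):

```python
import random

def pass_at_k(success_probs, k, trials=10_000):
    """Empirical pass@k: average, over problems, of the chance that
    at least one of k independent samples succeeds."""
    total = 0.0
    for p in success_probs:
        # repeat the k-sample experiment many times for this problem
        hits = sum(
            any(random.random() < p for _ in range(k))
            for _ in range(trials)
        )
        total += hits / trials
    return total / len(success_probs)

# Problem 1 always solved (p = 1.0), problem 2 never solved (p = 0.0).
probs = [1.0, 0.0]
print(pass_at_k(probs, k=1))    # 0.5
print(pass_at_k(probs, k=100))  # still 0.5 -- extra tries never help on problem 2
```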


Let me also take a stab at answering this.

If the question had a purely probabilistic answer, say "guess the color of my pen" with 7 colors to choose from (red, orange, yellow, green, blue, indigo, violet), then pass@1 would be 1/7, about 14.3%, and pass@100 would indeed be nearly 100%.

However, the questions asked in such benchmarks involve "skill" and "understanding". Just as when we are asked to solve a complex math problem with many, many steps, it is possible that we get certain steps wrong and arrive at the wrong answer. We may understand most of what we are doing but make mistakes in some places, or even make an educated guess at the answer to certain intermediate steps.

So, without being told the correct answer, we are asked to come up with 100 possible answers to such a question. It's a really tough question, but we have some level of understanding, so it is possible to get the right answer by chance, and the more tries we are given, the more likely we are to get at least one right answer. But this is not equivalent to guessing blindly.

Our level of competence has a direct bearing on how likely we are to get at least one right answer, and this chance does increase the more tries we are given, but I hope it is now obvious why it is not $1 - (1 - p)^{\text{num\_of\_tries}}$.

Am I understanding this correctly?

Andz

It behaves like a probability, just not the one you are thinking of.

Say Codex has $N$ problems to solve. You can think of each sample for a problem as a Bernoulli trial. Since each problem has a different difficulty, you have $N$ (possibly different) probabilities of success $p_i$, which you don't know.

If you give only one try to each problem, the expected fraction of correct answers (successes) is: $$ pass@1 = \frac{1}{N}\sum_{i=1}^{N} p_i $$

If you give $k$ tries to each problem, then what you have is a binomial distribution. If you count a problem as a success when you get it right at least once, then the probability is: $$ P(X_i\ge1) = \sum_{j=1}^{k} P(X_i=j)= \sum_{j=1}^{k}\binom{k}{j}p_i^{j}(1-p_i)^{k-j} $$ or, equivalently, $P(X_i\ge1) = 1-(1-p_i)^{k}$, the complement of all $k$ tries failing. That holds for each of the $N$ problems, and averaging them you get: $$ pass@k = \frac{1}{N}\sum_{i=1}^{N}P(X_i\ge1) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k}\binom{k}{j}p_i^{j}(1-p_i)^{k-j} $$
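As a sanity check, this averaged formula can be computed directly from per-problem probabilities using the closed form $1-(1-p_i)^k$, and compared with the naive formula from the question. The $p_i$ values below are made up for illustration:

```python
# pass@k from per-problem success probabilities p_i, using
# P(X_i >= 1) = 1 - (1 - p_i)^k, then averaging over problems.
def pass_at_k(probs, k):
    return sum(1 - (1 - p) ** k for p in probs) / len(probs)

probs = [0.9, 0.3, 0.02, 0.0]  # hypothetical per-problem probabilities

p1 = pass_at_k(probs, 1)       # just the mean of probs: ~0.305
print(p1)
print(pass_at_k(probs, 100))   # ~0.72: the p=0 problem never passes
print(1 - (1 - p1) ** 100)     # naive formula: ~1.0, a large overestimate
```

The gap between the last two numbers is exactly the asker's puzzle: averaging $1-(1-p_i)^k$ over problems is not the same as plugging the average $p_i$ into $1-(1-p)^k$.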

eindzl