
I couldn't find any publication by LIGO that explains how we should interpret this value. The closest I have found is the following quote:

This means that a noise event mimicking GW150914 would be exceedingly rare - indeed we expect an event as strong as GW150914 to appear by chance only once in about 200,000 years of such data! This false alarm rate can be translated into a number of "sigma" (denoted by σ), which is commonly used in statistical analysis to measure the significance of a detection claim. This search identifies GW150914 as a real event, with a significance of more than 5 sigma.

http://www.ligo.org/science/Publication-GW150914/index.php

From my reading, it appears that $5.1\sigma$ significance refers to:

The probability of observing such a signal given that the model of background noise correctly describes all input to the detectors at the time of the signal.

I would like to verify that the above interpretation is correct, and that it is different from the probability that GW150914:

  1. arose due to chance
  2. was caused by a gravitational wave
  3. was caused by a BH-BH merger

I ask because I have seen posts on this site and elsewhere (both news and blogs) that seem to imply differently. I worry I may be misunderstanding some terminology specific to astrophysics.

Also, does anyone know what calculations were used to convert the false alarm rate to a number of sigmas? This detail seems to have been left out of the papers, so I assume it is something trivial that I am missing due to a lack of background in this area.

Edit:

Let me clarify interpretation #1 above (which I have since learned is incorrect). This is Bayes' rule:

$ p(H|O)=\frac{p(H)p(O|H)}{p(O)} \tag{1} $
where,
$$H=\text{Hypothesis (model of background noise describes}\\ \text{all input to the detectors at the time of the signal)}$$ $$O= \text{Observation (the GW150914 signal)}$$

Just to be 100% clear:
$$ p(H|O)=\text{The probability H is true given O has been observed} \\ p(O|H)=\text{The probability of observing O given H is true} \\ p(H)=\text{The probability H is true} \textit{ independent } \text{of observation O} \\ p(O)=\text{The probability of observing O} \textit{ independent } \text{of whether H is true} $$

The last term can be rewritten as: $$ p(O)= p(H)p(O|H)+p(\neg H)p(O|\neg H) \tag{2} $$ where the probability H is false is denoted by
$$p(\neg H)=1-p(H)\tag{3}$$

In the answers, we established that the $\sigma$-level is a simple transformation of the p-value, which equals $p(O|H)$. It is clear that $p(H|O)$ must have a different numerical value than the p-value except under some very specific circumstances, i.e. when $p(H)=p(O)$. The p-value is calculated under the assumption that $H$ is true, and from equations 1, 2, and 3 we see that $p(H|O)$ explicitly depends on both $p(H)$ and the probability of observing such a signal if $H$ is false: $p(O|\neg H)$.

If our hypothesis is true, I think we all agree the only way to get a signal like GW150914 is a chance coincidence of noise patterns between the two LIGO detectors. So when writing we often use shorthand such as: $$H=\text{any signal is due to, i.e. caused by, chance coincidence}$$ or $$H=\text{any signal is not real}$$

There are many shorthand ways of saying the same thing, which confuses matters. The point is that the p-value is not the probability that GW150914 was caused by (arose from; is due to) chance (background noise; random coincidence). Nor is it the probability that GW150914 "isn't real", or "how unlikely" it is that GW150914 is due to chance.

In this case, the p-value is apparently $p(O|H)\approx2\times10^{-7}$. Also, apparently the only other plausible explanation is a BH-BH merger. In an earlier question we estimated the prior probability of this to be $\approx10^{-4} \text{ to } 10^{-1}$. If we suppose that is the only other possible explanation, that must be the probability that H is false independent of observing GW150914: $p(\neg H)$.

First, let's use the lower bound: $p(\neg H)\approx10^{-4}$. From equation 3, $p(H)\approx0.9999$. Also, GW150914 apparently matched the predicted waveform exactly, so the probability of seeing such a signal given that $H$ is false is $p(O|\neg H)\approx1$. Plugging in these values we get:

$$p(H|O)=\frac{0.9999\times2\times10^{-7}}{0.9999\times2\times10^{-7}+10^{-4}\times1}\approx0.002$$

Doing the same for the upper bound, I get $p(H|O)\approx 1.8\times10^{-6}$. Now we can say "the probability GW150914 occurred due to chance ranges from $2\times10^{-3}$ to $1.8\times10^{-6}$," which is quite different from the p-value. Any mistakes in this reasoning?
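For concreteness, here is a minimal Python sketch of equations 1-3 using the illustrative numbers above (the priors are the rough bounds quoted earlier, not values from LIGO):

```python
def posterior_null(p_O_given_H, p_not_H, p_O_given_not_H=1.0):
    """Posterior probability of the null hypothesis H via Bayes' rule."""
    p_H = 1.0 - p_not_H                                # equation 3
    numerator = p_H * p_O_given_H                      # p(H) p(O|H)
    evidence = numerator + p_not_H * p_O_given_not_H   # equation 2
    return numerator / evidence                        # equation 1

p_value = 2e-7                    # p(O|H), the quoted p-value
for p_not_H in (1e-4, 1e-1):      # lower and upper bounds on the prior p(not H)
    print(p_not_H, posterior_null(p_value, p_not_H))
# prints roughly 2.0e-03 and 1.8e-06, matching the two posteriors above
```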

Livid

3 Answers


I see where you are going with your question. Let me feed the flames.

The sigma value that is quoted is equivalent to a false alarm probability. It tells you how unlikely it is for your experiment, given your theoretical and empirical understanding of the noise characteristics, to have produced a signal that looked like GWs from a merging BH.

Personally, I prefer the statement in the text you quote. Such an event would have been seen (in both detectors) about once every 200,000 years. Given that the observations spanned 16 days, the expected number of such events in the data is $2.2 \times 10^{-7}$, i.e. a one in 4.6 million chance.

The LIGO team have simply converted this number into a significance in sigma, using the integral under one tail of the normal distribution. Using one of the readily available calculators, e.g. http://www.danielsoper.com/statcalc3/calc.aspx?id=20, we see that z-scores of 5.0-5.1$\sigma$ correspond to p-values of $2.9\times 10^{-7}$ to $1.7\times10^{-7}$, bracketing the value found above.
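For anyone who wants to reproduce this, here is a short sketch using SciPy's normal-distribution tail functions, with the 16-day span and one-per-200,000-years rate taken from the text above:

```python
from scipy.stats import norm

# Expected number of background events as strong as GW150914 in 16 days
# of data, given a false alarm rate of one per ~200,000 years. For such
# tiny expected counts this is effectively the one-tailed p-value.
p = 16 / (200_000 * 365.25)          # ~2.2e-7
print(norm.isf(p))                   # inverse survival function: ~5.05 sigma

# And the other way round, the tail probabilities bracketing it:
print(norm.sf(5.0), norm.sf(5.1))    # ~2.9e-7 and ~1.7e-7
```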

However, this is not the confidence level that this is a gravitational wave or a merging black hole. There is always the possibility that some unanticipated source of error could have crept in that mimics a GW signal (but note that it needs to affect both detectors), or that some other astrophysical source could be capable of producing the signal. As far as I am aware, apart from the usual conspiracy theories (yawn), nobody has come up with a plausible alternative to GWs from a merging BH.

ProfRob

In all frequentist hypothesis testing, one finds a so-called $p$-value: the probability of obtaining such "extreme" observations (i.e. such an extreme test-statistic) were the null hypothesis true.

The null hypothesis is rejected iff the $p$-value is less than a pre-specified significance level. Otherwise, the null is not accepted or confirmed; it is merely not rejected.

In this case, the null hypothesis is that

model of background noise correctly describes all input to the detectors

and it was rejected at high confidence.

The $p$-values are conventionally converted into one-tailed Gaussian significances, i.e. a number of standard deviations such that an identical probability is in the tail of a Gaussian distribution, $$ Z = \Phi^{-1}(1 - p) $$ where $p$ is the $p$-value and $\Phi^{-1}$ is the inverse of a Gaussian CDF. This convention is annoying, as the relation between $p$-value and significance isn't algebraic or easy to approximate. It would make more sense to simply report a $p$-value.
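For illustration, a short SciPy sketch of this conversion, which also shows why the naive $1-p$ form is numerically delicate for very small $p$-values:

```python
from scipy.stats import norm

p = 1.7e-7
# Z = Phi^{-1}(1 - p); forming 1 - p loses floating-point precision as p
# shrinks, so the inverse survival function is the safer way to compute it.
print(norm.ppf(1.0 - p))   # ~5.10, via the inverse CDF
print(norm.isf(p))         # ~5.10, directly from the upper tail
```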

In your comment, you allude to Bayes' theorem and a calculation of the probability or plausibility of the null hypothesis. The LIGO hypothesis testing is, however, strictly frequentist. Only the probability of data and pseudo-data is considered. Since the data appears to be so strong in this case, there shouldn't be any qualitative differences in the conclusions of Bayesian or frequentist methods.

You are correct, of course, that $$ P(\text{Any signal-like features due to chance}|\text{data}) $$ is not equal to $$ P(\text{data}|\text{Any signal-like features due to chance}) $$ They are related by Bayes' theorem. Frequentist methods, including LIGO's methodology, consider only the latter.

You may find arXiv:1609.01668 interesting, as it discusses differences between Bayesian and frequentist analyses of LIGO signals. Remarkably, even small significances could correspond to colossal Bayes-factors. The $5.1\sigma$ event had a Bayes-factor of $10^{125}$, which is the largest number I've seen in this context.

innisfree

It's a p-value, written in terms of a z-score.

Any computation of a chance is predicated on a model; sometimes this is even enshrined in the name Null Hypothesis. For the first direct sighting of a gravitational wave, the Null Hypothesis could be that gravitational waves don't exist, but your detectors can react to noise.

Now, the computation isn't as simple as the chance of getting any one particular set of data. You actually order the data into those that look like the predicted wave data and those that don't. And then, within those that look like the predicted wave data, you order them by strength.

And then you find the chance that the detectors react with that strong a signal ... or stronger (and that "or stronger" part is what these last two paragraphs are all about). And that's your p-value. It really is about the chance of erroneously saying you saw a signal when that data sometimes happens by chance ... given the null hypothesis.
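As a cartoon of that "or stronger" tail, here is a toy Monte Carlo with Gaussian noise (purely illustrative; LIGO's actual background is estimated empirically from time-shifted detector data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy picture: the detection statistic under pure noise is standard normal,
# and the "observed" event scored 5.1. The p-value is the fraction of noise
# trials at least that extreme, i.e. the "or stronger" tail described above.
observed = 5.1
noise = rng.standard_normal(10_000_000)
print((noise >= observed).mean())
# Expect ~1.7e-7, but even 1e7 trials barely resolve a tail this small,
# which is why estimating such backgrounds takes enormous effective data.
```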

Finally you take the probability computed from above and find the z-score cutoff that has that probability as its tail. And then you report that z-score in "units" of $\sigma.$

The point is that such a standard can decrease how often we announce discoveries to each other that were really just noise. And physicists have a pretty high standard (compared to p-values of 0.05 or 0.01).

Can someone give the exact calculation of getting from a false alarm rate of $1/200,000$ years to $5.1\sigma$?

Intuitively, you look at the theory to identify things called signals, and then look at the detectors to find out how often they produce results that look like those signals just from noise. So it involves knowing what the signals look like and how the detectors react to noise; both are things you should know if you are designing a detector. Neither is going to be a simple calculation. The theory required many very long and tedious calculations and hours of computer time. The noise is also hard to compute, since so many things were put in to reduce it: they literally adjusted how the arms work so that some frequency bands have less noise than the zero-point energy produces naturally.

But you have the set of signals and the model of how the detector reacts to noise. The rate (in time) at which the detector generates (from noise) results that look like the signals will depend on the lengths of the different signals. A short signal has many chances to appear in a 200,000 year time period; a longer signal has fewer.

It isn't a simple calculation when you have lots of different signals of different lengths and shapes; you can't just look it up in a table. You can look up the p-value to z-score conversion in a table, but the conversion to a rate in time will depend on how often the machine is put into data collection mode and how long the possible signals take to collect when the machine is on.
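For concreteness, here is a minimal sketch of the rate-to-sigma chain, assuming (as a simplification) that background events arrive as a Poisson process at the quoted false alarm rate:

```python
import math
from scipy.stats import norm

far = 1 / (200_000 * 365.25 * 86_400)   # false alarm rate, events per second
t_obs = 16 * 86_400                     # 16 days of coincident data, seconds

mu = far * t_obs                        # expected background count, ~2.2e-7
p = -math.expm1(-mu)                    # Poisson P(at least one event); ~mu here
print(p, norm.isf(p))                   # ~2.2e-7 and ~5.05 sigma
```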



To address the subtext: if you sit on your hands and don't report a 5$\sigma$ result, then you shouldn't have built your detector. That doesn't mean any particular alternative to the Null Hypothesis is correct. It means that's the agreed-upon standard for when to report your results.

It's designed to not have too many reports about things that are just noise.

Timaeus