
I am not sure which SE site this question belongs on, but since I learnt Shannon entropy in the context of statistical physics, I am asking it here.

In Shannon's information theory, the information $I_i$ associated with the $i^{th}$ event is defined as

$$ I_i = -\log_2 P_i \qquad \qquad \forall \, i=1,\dots,n. $$

Based on this definition we further define Shannon Entropy as average information,

$$ S_\text{Shannon} =\langle I\rangle = -\sum\limits_{i=1}^n P_i\log_2P_i .$$
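To make the definition concrete, here is a small numerical check (a sketch in Python; the two-outcome distribution is just an example I made up):

```python
import math

# Example: a biased coin (probabilities chosen purely for illustration).
P = [0.9, 0.1]

# Information ("surprisal") of each outcome in bits: I_i = -log2(P_i).
I = [-math.log2(p) for p in P]
print(I)   # [0.152..., 3.321...] -- the rare outcome carries more bits

# Shannon entropy = average information: S = -sum_i P_i * log2(P_i).
S = sum(p * i for p, i in zip(P, I))
print(S)   # ~0.469 bits, compared with exactly 1 bit for a fair coin
```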

My question is: what is the motivation behind defining information as a quantity that is inversely related to probability? My professor told me that the lower the probability of an event, the more information it carries, but I am still not convinced of this.

Secondly, what is the reason for choosing the logarithmic function in this definition? Are there situations where this definition of information is abandoned?

user35952

2 Answers


This is in addition to my answer posted elsewhere, since the OP wanted a more general answer. That example captured the essence through the idea of how information can be encoded -- it is a somewhat constructive argument in spirit.

Another way of thinking about the amount of information is as follows:

  • If a very probable event happens, you cannot get much information out of it -- it was going to happen anyway. On the other hand, if something unusual happens, that should give you something to think about; such an event carries more "information". (For example: the occurrence of an event that is certain conveys no information.)

  • If two independent events happen, then the information you glean from them must "add up". Since their probabilities multiply to give the probability of the combined event, the information gleaned from each event must be proportional to the $\log$ of its probability.

In typical treatments, one solves a functional equation for the dependence of the information on the probability, subject to the two conditions above. The latter condition gives the form $a \log[\,\cdot\,] + b$, while the former fixes the additive constant $b$ to zero. The scale factor $a$ depends on the base to which you take the logarithm.
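Spelled out, the functional equation looks like this (a sketch of the standard argument, with $f(P)$ denoting the information assigned to an event of probability $P$):

$$ f(P_1 P_2) = f(P_1) + f(P_2) \qquad \text{for independent events}, $$

whose continuous, monotonic solutions have the form $f(P) = a\log P + b$. Demanding that a certain event ($P = 1$) convey no information forces $b = 0$, and choosing $a = -1$ with base-2 logarithms recovers $I = -\log_2 P$ in bits.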

Siva

Shannon performed his work in the context of communication engineering -- considering a communication system that sends a sequence of symbols from some alphabet $\mathcal A$ to communicate information. Wireless communication systems today do this by modulating amplitude, phase, frequency, code, or some combination of these. His original paper, "A Mathematical Theory of Communication", starts off by giving the historical reasoning for the logarithmic information measure.

If an information source has a lot of redundancy, the same message can be conveyed without transmitting it literally. For instance, if the message were a string consisting of 50 copies of the letter "A", then instead of transmitting "AAAAAAAA...." you could design a source coding scheme that transmits "A" together with metadata meaning "repeat 50 times", and the receiver reconstructs the original message with much less overhead. Incidentally, Shannon's theory is also the basis of data compression today.
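As a toy illustration of that redundancy argument (a sketch only; the (symbol, count) format is made up here and is not Shannon's actual coding scheme):

```python
# Toy run-length encoder: exploits redundancy so "AAAA..." need not be sent literally.

def rle_encode(message: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated symbols into (symbol, run_length) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in message:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Reconstruct the original message from the (symbol, run_length) pairs."""
    return "".join(ch * n for ch, n in runs)

message = "A" * 50
encoded = rle_encode(message)
print(encoded)                       # [('A', 50)] -- one pair instead of 50 symbols
assert rle_decode(encoded) == message
```

Real source codes (Huffman coding, arithmetic coding, etc.) push this further, approaching the entropy of the source.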

Robert L.