## information entropy

### expected value of the amount of information delivered by a message; quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process; average amount of information produced by a stochastic data source

Information entropy, often just entropy, is a basic quantity in information theory associated with any random variable, which can be interpreted as the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes. The concept of information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication". The entropy is the expected value of the self-information, a related quantity also introduced by Shannon. The self-information quantifies the level of information or surprise associated with one particular outcome or event of a random variable, whereas the entropy quantifies how "informative" or "surprising" the entire random variable is, averaged over all its possible outcomes.
Shannon originally introduced entropy as part of his theory of communication, in which a data communication system is composed of three elements: a source of data, a communication channel, and a receiver. In Shannon's theory, the "fundamental problem of communication", as Shannon expressed it, is for the receiver to be able to identify what data was generated by the source, based on the signal it receives through the channel. Shannon considered various ways to encode, compress, and transmit messages from a data source, and proved in his famous source coding theorem that the entropy represents an absolute mathematical limit on how well data from the source can be losslessly compressed onto a perfectly noiseless channel. Shannon strengthened this result considerably for noisy channels in his noisy-channel coding theorem.
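The compression limit can be illustrated with a small sketch. Huffman coding is not mentioned in the text above; it is used here only as a convenient lossless symbol code, and the function name `huffman_code_lengths` and the example distributions are assumptions made for illustration. For such a code, the average codeword length should never fall below the entropy in bits.

```python
import heapq
import math
from itertools import count

def huffman_code_lengths(probs):
    """Return the codeword length (in bits) Huffman coding gives each symbol."""
    if len(probs) == 1:
        return [1]
    tie = count()  # tie-breaker so the heap never has to compare lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        # Merge the two least probable nodes.
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each symbol under the merged node moves one bit deeper
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return lengths

def average_length_and_entropy(probs):
    """Return (average codeword length, entropy in bits) for a distribution."""
    lengths = huffman_code_lengths(probs)
    avg_len = sum(p * n for p, n in zip(probs, lengths))
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return avg_len, h

# Illustrative distributions (assumed, not from the article).
dyadic_len, dyadic_h = average_length_and_entropy([0.5, 0.25, 0.125, 0.125])
skewed_len, skewed_h = average_length_and_entropy([0.6, 0.3, 0.1])
```

With probabilities that are powers of 1/2, the average length meets the entropy exactly; for the skewed distribution it stays above the entropy but within one bit of it.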
The entropy can also be interpreted as the average rate at which information is produced by a stochastic source of data. When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more "information" than when the data source produces a high-probability value. This notion of "information" is formally represented by Shannon's self-information quantity, and is also sometimes interpreted as "surprisal". The amount of information conveyed by each individual event then becomes a random variable whose expected value is the information entropy.
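As a minimal sketch of this idea, the self-information of an outcome with probability p can be computed as the negative logarithm of p. The function name `self_information` and the example probabilities are illustrative assumptions.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Surprisal -log_b(p) of an outcome with probability p, for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p, base)

# A rare outcome carries more information than a common one.
rare = self_information(1 / 1024)  # about 10 bits
common = self_information(1 / 2)   # about 1 bit
certain = self_information(1.0)    # 0 bits: a sure outcome is no surprise
```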
Given a random variable $X$, with possible outcomes $x_i$, each with probability $P_X(x_i)$, the entropy $H(X)$ of $X$ is as follows:

$$H(X) = -\sum_i P_X(x_i) \log_b P_X(x_i) = \sum_i P_X(x_i) I_X(x_i) = \operatorname{E}[I_X]$$

where $I_X(x_i)$ is the self-information associated with a particular outcome; $I_X$ is the self-information of the random variable $X$ in general, treated as a new derived random variable; $\operatorname{E}[I_X]$ is the expected value of this new random variable, equal to the sum of the self-information of each outcome, weighted by the probability of each outcome occurring; and $b$, the base of the logarithm, is a new parameter that can be set different ways to determine the choice of units for information entropy.
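The definition translates fairly directly into code. The sketch below assumes the distribution is given as a plain list of probabilities and adopts the usual convention that outcomes with zero probability contribute nothing to the sum; the function name `entropy` and the example distributions are illustrative.

```python
import math
from typing import Sequence

def entropy(probs: Sequence[float], base: float = 2.0) -> float:
    """H(X) = -sum_i p_i * log_b(p_i), i.e. the expected self-information."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Skip p == 0 terms: by convention they contribute 0 to the sum.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])    # 1 bit
biased_coin = entropy([0.9, 0.1])  # about 0.469 bits
fair_die = entropy([1 / 6] * 6)    # log2(6), about 2.585 bits
```

The biased coin has lower entropy than the fair one because its outcome is easier to predict.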
Information entropy is typically measured in bits (alternatively called "shannons"), corresponding to base 2 in the above equation. It is also sometimes measured in "natural units" (nats), corresponding to base e, or decimal digits (called "dits", "bans", or "hartleys"), corresponding to base 10.
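Changing the base only rescales the result, which the following sketch checks for a fair six-sided die. The helper `entropy` and the variable names are assumptions made for illustration.

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

die = [1 / 6] * 6
h_bits = entropy(die, base=2)           # about 2.585 bits (shannons)
h_nats = entropy(die, base=math.e)      # about 1.792 nats
h_hartleys = entropy(die, base=10)      # about 0.778 hartleys

# Conversion factors: 1 bit = ln(2) nats = log10(2) hartleys.
nats_from_bits = h_bits * math.log(2)
hartleys_from_bits = h_bits * math.log10(2)
```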
Shannon's definition is essentially unique, in that it is the only one with certain properties: it is determined entirely by the probability distribution of the data source, it is additive for independent sources, it is maximized at the uniform distribution, it is minimized (and equal to zero) when a single event occurs with probability 1, and it obeys a certain derived version of the chain rule of probability. Axiomatic derivations of entropy are explained further below on the page.
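Some of these properties can be checked numerically. The sketch below is a spot check on assumed example distributions, not a proof; the helper `entropy` and the distributions `x` and `y` are illustrative.

```python
import math
from itertools import product

def entropy(probs, base=2.0):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

x = [0.5, 0.25, 0.25]
y = [0.7, 0.3]

# Independent sources: the joint probability of each pair is the product.
joint = [px * py for px, py in product(x, y)]
additive = math.isclose(entropy(joint), entropy(x) + entropy(y))

# The uniform distribution over three outcomes has at least as much entropy as x.
uniform_is_max = entropy([1 / 3] * 3) >= entropy(x)

# A source that always emits the same outcome has zero entropy.
deterministic_is_zero = entropy([1.0, 0.0, 0.0]) == 0.0
```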
The definition of entropy used in information theory is directly analogous to the definition used in statistical thermodynamics, a relationship which is detailed on the page Entropy in thermodynamics and information theory.
