Self-Information, Entropy, and Cross-Entropy
Q: What's the relationship between cross-entropy, entropy, and self-information?
In Information Theory, these three concepts form a hierarchy. You can think of them as building blocks: Self-Information is the atom, Entropy is the molecule built from those atoms, and Cross-Entropy is how that molecule interacts with a different molecule.
Here is the relationship in a nutshell, followed by the deep dive: $$\text{Cross-Entropy} = \text{Entropy} + \text{KL Divergence}$$
The Atom: Self-Information
Before we can understand the average, we must understand the individual unit. Self-information (or “surprisal”) measures the surprise associated with a single outcome.
- Intuition: If an event is highly probable (e.g., the sun rising), it has low self-information (0 surprise). If an event is rare (e.g., winning the lottery), it has high self-information (high surprise).
- The Math: For a single event $x$ with probability $P(x)$:
$$I(x) = -\log(P(x))$$
(Note: We usually use log base 2 to measure this in “bits”.)
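To make the numbers concrete, here is a minimal pure-Python sketch (the probabilities below are made-up round values chosen only for illustration):

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Surprisal of a single outcome with probability p (in bits when base=2)."""
    return -math.log(p, base)

# A near-certain event carries almost no surprise...
print(self_information(0.999))   # ~0.0014 bits
# ...while a one-in-a-hundred-million event carries a lot.
print(self_information(1e-8))    # ~26.6 bits
```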
The Molecule: Entropy ($H$)
Entropy is simply the average self-information of an entire probability distribution. It tells you how unpredictable a system is overall.
- Intuition: If you want to transmit a message from a source $P$, Entropy is the average number of bits you need to encode one event if you use the perfect code for that source.
- The Math: It is the expected value of the self-information:
$$ H(P) = E_{x \sim P} [I(x)] = - \sum_{x} P(x) \log(P(x)) $$
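A small sketch of that expectation in plain Python (the coin distributions are illustrative toy values):

```python
import math

def entropy(p: list[float], base: float = 2.0) -> float:
    """Average self-information of a distribution (in bits when base=2)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit   -- a fair coin is maximally unpredictable
print(entropy([0.9, 0.1]))   # ~0.47 bits -- a biased coin is easier to predict
```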
The Interaction: Cross-Entropy
Cross-Entropy ($H(P, Q)$) measures the average cost of encoding data from the true distribution ($P$) using a model/code designed for a different distribution ($Q$).
- Concept: This is the cost of the true chaos ($P$) plus the extra cost of your model ($Q$) being wrong.
- Formula:
$$H(P, Q) = -\sum_{x} P(x) \log(Q(x))$$
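Here is a short pure-Python sketch that makes the “cost of using the wrong code” concrete; the distributions `p` and `q` below are arbitrary toy values:

```python
import math

def cross_entropy(p: list[float], q: list[float], base: float = 2.0) -> float:
    """Average cost (in bits) of encoding outcomes from P with a code built for Q."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # the true source
q = [0.9, 0.1]   # a mismatched model
print(cross_entropy(p, p))   # 1.0 bit   -- equals H(P) when the model matches the truth
print(cross_entropy(p, q))   # ~1.74 bits -- extra cost caused by the mismatch
```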
The Fundamental Relationship
The relationship is defined by the KL Divergence ($D_{KL}$), which is the “distance” or “error” between the truth $P$ and your model $Q$ (not a true distance mathematically, since it is not symmetric).
$$H(P, Q) = H(P) + D_{KL}(P || Q)$$
Interpretation
Cross-Entropy (Total Cost) = Entropy (Unavoidable Cost) + KL Divergence (Avoidable Error)
- $H(P)$: The “Floor.” You cannot get lower than this. This is the inherent noise in the data.
- $D_{KL}(P || Q)$: The “Penalty.” These are the extra bits you are wasting because your model $Q$ is not equal to the truth $P$.
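You can check this decomposition numerically. The sketch below (plain Python, with toy distributions chosen only for illustration) computes each piece separately and confirms that the total cost equals the floor plus the penalty:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # the "truth"
q = [0.6, 0.2, 0.2]   # an imperfect model

floor = entropy(p)              # unavoidable cost H(P)
penalty = kl_divergence(p, q)   # avoidable error D_KL(P||Q)
total = cross_entropy(p, q)     # total cost H(P, Q)
print(floor, penalty, total)
print(abs(total - (floor + penalty)) < 1e-12)   # True: H(P,Q) = H(P) + D_KL(P||Q)
```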
| Metric | Symbol | Definition |
|---|---|---|
| Self-Information | I(x) | −log P(x) |
| Entropy | H(P) | ∑ P(x) I(x) = −∑ P(x) log P(x) |
| Cross-Entropy | H(P,Q) | −∑ P(x) log Q(x) = H(P) + D_KL(P∥Q) |
Why Machine Learning Optimizes Cross-Entropy
In Deep Learning, we try to minimize the Cross-Entropy loss. Because the Entropy of the real world, $H(P)$, is fixed (it does not depend on the model $Q$), minimizing Cross-Entropy is mathematically equivalent to minimizing the KL Divergence (the error):
$$\arg\min_{Q} H(P, Q) = \arg\min_{Q} D_{KL}(P || Q)$$
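To make the “constant floor” argument concrete, here is a small pure-Python sketch (the true distribution and the candidate models are hypothetical toy values): every candidate's cross-entropy exceeds its KL divergence by the same constant $H(P)$, so ranking models by one is the same as ranking them by the other.

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]        # fixed "true" distribution
candidates = [
    [0.5, 0.3, 0.2],       # hypothetical model A
    [0.7, 0.2, 0.1],       # hypothetical model B (exactly right)
    [0.4, 0.4, 0.2],       # hypothetical model C
]

for q in candidates:
    # Cross-entropy and KL differ by H(P), which is the same for every candidate,
    # so sorting models by either quantity gives the same ranking.
    print(round(cross_entropy(p, q), 4), round(kl_divergence(p, q), 4))
```

In a typical classification setup the target $P$ is a one-hot label, so $H(P) = 0$ and the cross-entropy loss coincides with the KL divergence exactly.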