Self-information, Entropy and Cross Entropy

Q: What's the relationship between cross-entropy, entropy, and self-information?

In Information Theory, these three concepts form a hierarchy. You can think of them as building blocks: Self-Information is the atom, Entropy is the molecule built from those atoms, and Cross-Entropy is how that molecule interacts with a different molecule.

Here is the relationship in a nutshell, followed by the deep dive: $$\text{Cross-Entropy} = \text{Entropy} + \text{KL Divergence}$$

The Atom: Self-Information

Before we can understand the average, we must understand the individual unit. Self-information (or “surprisal”) measures the surprise associated with a single outcome.

  • Intuition: If an event is highly probable (e.g., the sun rising), it has low self-information (almost no surprise). If an event is rare (e.g., winning the lottery), it has high self-information (high surprise).

  • The Math: For a single event $x$ with probability $P(x)$:

$$I(x) = -\log(P(x))$$

(Note: We usually use log base 2 to measure this in “bits”.)
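
To make this concrete, here is a minimal Python sketch (the probabilities below are made up for illustration) that computes self-information in bits:

```python
import math

def self_information(p: float) -> float:
    """Self-information (surprisal) of an event with probability p, in bits."""
    return -math.log2(p)

# A near-certain event carries almost no surprise...
print(self_information(0.999))           # ~0.0014 bits
# ...while a rare event carries a lot of it.
print(self_information(1 / 1_000_000))   # ~19.93 bits
```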

The Molecule: Entropy ($H$)

Entropy is simply the average self-information of an entire probability distribution. It tells you how unpredictable a system is overall.

Intuition: If you want to transmit a message from a source $P$, Entropy is the average number of bits you need to encode one event if you use the perfect code for that source.

The Math: It is the expected value of the self-information: $$ H(P) = E_{x \sim P} [I(x)] = - \sum_{x} P(x) \log(P(x)) $$
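
As a small sketch using a few hand-picked toy distributions, entropy is just the probability-weighted average of each outcome's surprisal:

```python
import math

def entropy(p: list[float]) -> float:
    """Entropy H(P) in bits: the expected self-information under P."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))                 # 1.0 bit   (a fair coin: maximally unpredictable)
print(entropy([0.9, 0.1]))                 # ~0.47 bits (a biased coin: more predictable)
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits  (a fair four-sided die)
```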

Cross-Entropy (The Interaction)

Cross-Entropy ($H(P, Q)$) measures the average cost of encoding data from the true distribution ($P$) using a model/code designed for a different distribution ($Q$).

Concept: This is the unavoidable cost of the true randomness in $P$, plus the extra cost of your model $Q$ being wrong.

Formula:

$$H(P, Q) = -\sum_{x} P(x) \log(Q(x))$$
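
Continuing the sketch with a made-up truth $P$ and model $Q$ over the same two outcomes: the weights come from $P$, but the code lengths come from $Q$.

```python
import math

def cross_entropy(p: list[float], q: list[float]) -> float:
    """Cross-entropy H(P, Q) in bits: the expected code length when data drawn
    from P is encoded with a code optimized for Q."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.8, 0.2]   # the "truth"
q = [0.5, 0.5]   # the model's (wrong) belief

print(cross_entropy(p, q))   # 1.0 bit
print(cross_entropy(p, p))   # ~0.72 bits -- equals H(P), the best achievable
```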

The Fundamental Relationship

The relationship is defined by the KL Divergence ($D_{KL}$), which measures the “distance” or “error” between the truth $P$ and your model $Q$ (strictly speaking it is a divergence, not a true distance, because it is not symmetric).

$$H(P, Q) = H(P) + D_{KL}(P || Q)$$

Interpretation

Cross-Entropy (Total Cost) = Entropy (Unavoidable Cost) + KL Divergence (Avoidable Error)

  • $H(P)$: The “Floor.” You cannot get lower than this. This is the inherent noise in the data.
  • $D_{KL}(P || Q)$: The “Penalty.” These are the extra bits you are wasting because your model $Q$ is not equal to the truth $P$.
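
A quick numeric check of the decomposition, reusing the toy $P$ and $Q$ from the sketch above:

```python
import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: the avoidable extra cost of coding with Q instead of P."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.8, 0.2]
q = [0.5, 0.5]

h_p  = entropy(p)            # ~0.722 bits  (the unavoidable floor)
d_kl = kl_divergence(p, q)   # ~0.278 bits  (the avoidable penalty)
h_pq = cross_entropy(p, q)   #  1.0  bit    (the total cost)

print(abs(h_pq - (h_p + d_kl)) < 1e-12)   # True: H(P, Q) = H(P) + D_KL(P || Q)
```
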
| Metric | Symbol | Definition |
| --- | --- | --- |
| Self-Information | $I(x)$ | $-\log P(x)$ |
| Entropy | $H(P)$ | $\sum_x P(x)\, I(x)$ |
| Cross-Entropy | $H(P, Q)$ | $H(P) + D_{KL}(P \parallel Q)$ |

Why Machine Learning Optimizes Cross-Entropy

In Deep Learning, we try to minimize the Cross-Entropy loss. Because the entropy of the true data distribution, $H(P)$, is fixed, minimizing Cross-Entropy is mathematically equivalent to minimizing the KL Divergence (the error).

$$\arg\min_{Q} H(P, Q) = \arg\min_{Q} D_{KL}(P || Q)$$
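
As a concrete toy illustration: for classification with a one-hot label, $P$ is a degenerate distribution with $H(P) = 0$, so the cross-entropy loss equals the KL divergence exactly and reduces to the negative log-likelihood of the true class. The sketch below uses the natural log (nats), as is conventional for ML losses; the softmax output is made up.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy in nats (natural log), as ML losses typically use."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# One-hot "true" label for class 1, and a model's softmax output.
p = [0.0, 1.0, 0.0]
q = [0.1, 0.7, 0.2]

# With a one-hot P, H(P) = 0, so H(P, Q) = D_KL(P || Q) = -log Q(true class).
print(cross_entropy(p, q))   # ~0.357 nats
print(-math.log(q[1]))       # same value: the familiar negative log-likelihood
```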