
Posit AI Blog: Infinite surprise


Among deep learning practitioners, Kullback-Leibler divergence (KL divergence) is perhaps best known for its role in training variational autoencoders (VAEs). To learn an informative latent space, we don't just optimize for good reconstruction. Rather, we also impose a prior on the latent distribution, and aim to keep the two close, usually by minimizing KL divergence.

In this role, KL divergence acts like a watchdog; it is a constraining, regularizing factor, and if anthropomorphized, would seem stern and severe. If we leave it at that, however, we have seen just one side of its character, and are missing out on its complement, a picture of playfulness, adventure, and curiosity. In this post, we'll take a look at that other side.

While inspired by a series of tweets by Simon DeDeo, enumerating applications of KL divergence in an enormous variety of disciplines,

we don't aspire to provide a comprehensive write-up here; as mentioned in the initial tweet, the topic could easily fill a whole semester of study.

The much more modest goals of this post, then, are

  1. to quickly recap the role of KL divergence in training VAEs, and mention applications of a similar character;
  2. to illustrate that more playful, adventurous "other side" of its character; and
  3. in a not-so-entertaining, but hopefully useful way, to differentiate KL divergence from related concepts such as cross entropy, mutual information, or free energy.

First though, we start with a definition and some terminology.

KL divergence in a nutshell

KL divergence is the expected value of the logarithmic difference in probabilities according to two distributions, \(p\) and \(q\). Here it is in its discrete-probabilities variant:

\[\begin{equation}
D_{KL}(p||q) = \sum\limits_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)
\tag{1}
\end{equation}\]

Notably, it is asymmetric; that is, \(D_{KL}(p||q)\) is not the same as \(D_{KL}(q||p)\). (Which is why it is a divergence, not a distance.) This aspect will play an important role in section 2, devoted to the "other side."

Reflecting this asymmetry, KL divergence is sometimes called relative information (as in "information of \(p\) relative to \(q\)"), or information gain. We agree with one of our sources that because of its universality and importance, KL divergence would probably have deserved a more informative name, such as, precisely, information gain. (Which is less ambiguous, pronunciation-wise, as well.)
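To make the definition concrete, here is a minimal sketch (Python with numpy; the two example distributions are invented for illustration) that computes equation (1) and confirms the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in bits; 0 * log(0 / q) is taken to be 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.9, 0.1]   # a distribution that is quite sure of the first state
q = [0.5, 0.5]   # a maximally uncertain one

print(kl_divergence(p, q))   # roughly 0.53 bits
print(kl_divergence(q, p))   # roughly 0.74 bits: not the same, KL divergence is asymmetric
```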

KL divergence, “villain”

In many machine learning algorithms, KL divergence appears in the context of variational inference. Often, for realistic data, exact computation of the posterior distribution is infeasible. Thus, some form of approximation is required. In variational inference, the true posterior \(p^*\) is approximated by a simpler distribution, \(q\), from some tractable family.
To ensure we have a good approximation, we minimize (in theory, at least) the KL divergence of \(q\) relative to \(p^*\), thus replacing inference by optimization.

In practice, again for reasons of intractability, the KL divergence actually minimized is that of \(q\) relative to an unnormalized distribution \(\widetilde{p}\):

\[\begin{equation}
J(q) = D_{KL}(q||\widetilde{p})
\tag{2}
\end{equation}\]

where \(\widetilde{p}\) is the joint distribution of parameters and data:

\[\begin{equation}
\widetilde{p}(\mathbf{x}) = p(\mathbf{x}, \mathcal{D}) = p^*(\mathbf{x}) \ p(\mathcal{D})
\tag{3}
\end{equation}\]

and \(p^*\) is the true posterior:

\[\begin{equation}
p^*(\mathbf{x}) = p(\mathbf{x}|\mathcal{D})
\tag{4}
\end{equation}\]

Equivalent to that formulation (eq. (2)), for a derivation see (Murphy 2012), is this one, which shows the optimization objective to be an upper bound on the negative log-likelihood (NLL):

\[\begin{equation}
J(q) = D_{KL}(q||p^*) - \log \ p(\mathcal{D})
\tag{5}
\end{equation}\]

Yet another formulation (again, see (Murphy 2012) for details) is the one we actually use when training, for example, VAEs. It corresponds to the expected NLL plus the KL divergence between the approximation \(q\) and the imposed prior \(p\):

\[\begin{equation}
J(q) = D_{KL}(q||p) + E_q[-\log p(\mathcal{D}|\mathbf{x})]
\tag{6}
\end{equation}\]

Negated, this formulation is also known as the ELBO, for evidence lower bound. For VAEs, the ELBO is usually written

\[\begin{equation}
ELBO = E[\log p(x|z)] - KL(q(z)||p(z))
\tag{7}
\end{equation}\]

with \(z\) denoting the latent variables (\(q(z)\) being the approximation, \(p(z)\) the prior, typically a multivariate normal).
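To make the KL term in this loss tangible, here is a minimal sketch (Python with numpy; the names and toy values are made up, and we assume, as is common, a diagonal-Gaussian approximate posterior and a standard normal prior, for which the KL divergence has a closed form) of the negative ELBO of equation (6):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions (nats)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def negative_elbo(reconstruction_nll, mu, logvar):
    """Equation (6): expected reconstruction NLL plus the KL "watchdog" term."""
    return reconstruction_nll + kl_to_standard_normal(mu, logvar)

# toy values standing in for an encoder's output on a single datapoint
mu = np.array([0.3, -1.2])
logvar = np.array([-0.5, 0.1])
print(negative_elbo(reconstruction_nll=85.2, mu=mu, logvar=logvar))
```

A posterior that matched the prior exactly (all means zero, all variances one) would make the KL term vanish, leaving only the reconstruction term.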

Beyond VAEs

Generalizing this "conservative" mode of action of KL divergence beyond VAEs, we can say that it expresses the quality of approximations. An important area where approximation takes place is (lossy) compression. KL divergence provides a way to quantify how much information is lost when we compress data.

Summing up, in these and similar applications, KL divergence is "bad"; although we don't expect it to be zero (or else, why bother using the algorithm at all?), we certainly want to keep it low. So now, let's see the other side.

KL divergence, good guy

In a second class of applications, KL divergence is not something to be minimized. In these domains, KL divergence is indicative of surprise, disagreement, exploratory behavior, or learning: this really is the perspective of information gain.

Surprise

One domain where surprise, not information per se, governs behavior is perception. For example, eyetracking studies (e.g., (Itti and Baldi 2005)) showed that surprise, as measured by KL divergence, was a better predictor of visual attention than information, measured by entropy. While these studies seem to have popularized the expression "Bayesian surprise," this compound is, I think, not the most informative one, as neither part adds much information to the other. In Bayesian updating, the magnitude of the difference between prior and posterior reflects the degree of surprise brought about by the data; surprise is an integral part of the concept.

Thus, with KL divergence linked to surprise, and surprise rooted in the fundamental process of Bayesian updating, a process that could be used to describe the course of life itself, KL divergence itself becomes fundamental. We could get tempted to see it everywhere. Accordingly, it has been used in many fields to quantify unidirectional divergence.

For example, (Zanardo 2017) have applied it in trading, measuring how much a person disagrees with the market belief. Higher disagreement then corresponds to higher expected gains from betting against the market.

Closer to the realm of deep learning, it is used in intrinsically motivated reinforcement learning (e.g., (Sun, Gomez, and Schmidhuber 2011)), where an optimal policy should maximize the long-term information gain. This is possible because, like entropy, KL divergence is additive.

Although its asymmetry is relevant whether you use KL divergence for regularization (section 1) or surprise (this section), it becomes especially evident when used for learning and surprise.

Asymmetry in action

Looking again at the KL formula

\[\begin{equation}
D_{KL}(p||q) = \sum\limits_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)
\tag{1}
\end{equation}\]

the roles of \(p\) and \(q\) are fundamentally different. For one, the expectation is computed over the first distribution (\(p\) in (1)). This aspect is important because the "order" (the respective roles) of \(p\) and \(q\) may have to be chosen according to tractability (which distribution can we average over).

Secondly, the fraction inside the \(\log\) means that if \(q\) is ever zero at a point where \(p\) isn't, the KL divergence will "blow up." What this means for distribution estimation in general is nicely detailed in Murphy (2012). In the context of surprise, it means that if I learn something I used to think had probability zero, I will be "infinitely surprised."

To avoid infinite surprise, we can make sure our prior probability is never zero. But even then, the interesting thing is that how much information we gain in any one instance depends on how much information we had before. Let's look at a simple example.

Assume that in my current understanding of the world, black swans probably don't exist, but they might ... maybe 1 percent of swans are black. Put differently, my prior belief that a swan, should I encounter one, is black is \(q = 0.01\).

Now I actually do encounter one, and it is black.
The information I have gained is:

\[\begin{equation}
l(p,q) = 0 \cdot \log_2\left(\frac{0}{0.99}\right) + 1 \cdot \log_2\left(\frac{1}{0.01}\right) \approx 6.6 \ bits
\tag{8}
\end{equation}\]

Conversely, suppose I had been much more undecided before; say I had thought the odds were 50:50.
On seeing a black swan, I get a lot less information:

\[\begin{equation}
l(p,q) = 0 \cdot \log_2\left(\frac{0}{0.5}\right) + 1 \cdot \log_2\left(\frac{1}{0.5}\right) = 1 \ bit
\tag{9}
\end{equation}\]
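These numbers are easy to check. Below is a small sketch (Python with numpy; the helper simply applies equation (1) in base 2) that reproduces equations (8) and (9) and also shows the "infinite surprise" of a zero prior:

```python
import numpy as np

def information_gain(p, q):
    """D_KL(p || q) in bits, with 0 * log(0 / q) taken to be 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

observed = [0.0, 1.0]   # what I just saw: the swan is black, not white

print(information_gain(observed, [0.99, 0.01]))   # about 6.64 bits, equation (8)
print(information_gain(observed, [0.5, 0.5]))     # 1 bit, equation (9)
print(information_gain(observed, [1.0, 0.0]))     # inf: a zero prior means infinite surprise
                                                  # (numpy warns about the division by zero)
```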

This view of KL divergence, in terms of surprise and learning, is inspiring; it could lead one to seeing it in action everywhere. However, we still have the third and final task to handle: quickly compare KL divergence to other concepts in the area.

Entropy

It all starts with entropy, or uncertainty, or information, as formulated by Claude Shannon.
Entropy is the expected negative log probability of a distribution:

\[\begin{equation}
H(X) = - \sum\limits_{i=1}^n p(x_i) \log(p(x_i))
\tag{10}
\end{equation}\]

As nicely described in (DeDeo 2016), this formulation was chosen to satisfy four criteria, one of which is what we commonly picture as its "essence," and one of which is especially interesting.

As to the former, if there are \(n\) possible states, entropy is maximal when all states are equiprobable. For a coin flip, for example, uncertainty is highest when the coin's bias is 0.5.

The latter has to do with coarse-graining, a change in "resolution" of the state space. Say we have 16 possible states, but we don't really care at that level of detail. We do care about 3 individual states, but all the rest are basically the same to us. Then entropy decomposes additively: total (fine-grained) entropy is the entropy of the coarse-grained space, plus the entropy of the "lumped-together" group, weighted by its probability. Both properties are illustrated in the sketch below.
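Here is a minimal numpy sketch (the distributions are invented for illustration) of both properties, maximal entropy for equiprobable states and the additive decomposition under coarse-graining:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (equation 10); zero-probability states contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# entropy is maximal when all states are equiprobable
print(entropy([0.5, 0.5]))   # 1.0 bit, the fair coin
print(entropy([0.9, 0.1]))   # about 0.47 bits, a biased coin

# coarse-graining: lump the last three fine-grained states into one group
fine = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
coarse = np.array([0.5, 0.25, 0.25])          # third entry is the lumped group
within = fine[2:] / fine[2:].sum()            # distribution inside the lump
print(entropy(fine))                                        # 1.875 bits
print(entropy(coarse) + fine[2:].sum() * entropy(within))   # 1.875 bits as well
```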

Subjectively, entropy reflects our uncertainty whether an event will happen. Interestingly though, it exists in the physical world as well: for example, when ice melts, it becomes more uncertain where individual particles are. As reported by (DeDeo 2016), the number of bits released when one gram of ice melts is about 100 billion terabytes!

As fascinating as it is, information per se may, in many cases, not be the best means of characterizing human behavior. Going back to the eyetracking example, it is completely intuitive that people look at surprising parts of images, not at white noise areas, which are the maximum you could get in terms of entropy.

As a deep learning practitioner, you have probably been waiting for the point at which we would mention cross entropy, the most commonly used loss function in classification.

Cross entropy

The cross entropy between distributions \(p\) and \(q\) is the entropy of \(p\) plus the KL divergence of \(p\) relative to \(q\). If you have ever implemented your own classification network, you probably recognize the sum on the very right:

\[\begin{equation}
H(p,q) = H(p) + D_{KL}(p||q) = - \sum p \log(q)
\tag{11}
\end{equation}\]

In information-theory speak, \(H(p,q)\) is the expected message length per datum when \(q\) is assumed but \(p\) is true.
Closer to the world of machine learning, for fixed \(p\), minimizing cross entropy is equivalent to minimizing KL divergence.
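Both identities can be checked numerically. A minimal sketch (Python with numpy; the one-hot label and the predicted probabilities are made up):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def cross_entropy(p, q):
    """- sum p log q, the familiar classification loss (equation 11), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([1.0, 0.0, 0.0])   # one-hot "true" label
q = np.array([0.7, 0.2, 0.1])   # a classifier's predicted probabilities

print(cross_entropy(p, q))      # about 0.51 bits
print(entropy(p) + kl(p, q))    # the same, per equation (11); with a one-hot p,
                                # H(p) is 0, so cross entropy equals KL divergence
```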

Mutual information

Another extremely important quantity, used in many contexts and applications, is mutual information. Again citing DeDeo, "you can think of it as the most general form of correlation coefficient that you can measure."

With two variables \(X\) and \(Y\), we can ask: how much uncertainty about \(X\) remains once we know the value of \(Y\) for an individual case, \(Y=y\)? Averaged over all \(y\), this is the conditional entropy:

\[\begin{equation}
H(X|Y) = \sum\limits_{i} P(y_i) \ H(X|Y=y_i)
\tag{12}
\end{equation}\]

Now mutual information is entropy minus conditional entropy:

\[\begin{equation}
I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
\tag{13}
\end{equation}\]

This quantity, as required for a measure representing something like correlation, is symmetric: if two variables \(X\) and \(Y\) are related, the amount of information \(X\) gives you about \(Y\) is equal to that which \(Y\) gives you about \(X\).
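A quick numerical check (Python with numpy; the joint distribution is made up) that the two ways of computing mutual information in equation (13) agree:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly joint) distribution."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# a made-up joint distribution of two binary variables, X (rows) and Y (columns)
pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# conditional entropies via the chain rule H(X|Y) = H(X,Y) - H(Y), equivalent to equation (12)
h_x_given_y = entropy(pxy) - entropy(py)
h_y_given_x = entropy(pxy) - entropy(px)

print(entropy(px) - h_x_given_y)   # about 0.125 bits
print(entropy(py) - h_y_given_x)   # the same value: mutual information is symmetric
```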

KL divergence is part of a family of divergences, called f-divergences, used to measure directed difference between probability distributions. Let's also quickly look at another information-theoretic measure that, unlike those, is a distance.

Jensen-Shannon distance

In math, a distance, or metric, besides being non-negative, has to satisfy two other criteria: it must be symmetric, and it must obey the triangle inequality.

Both criteria are met by the Jensen-Shannon distance, the square root of the Jensen-Shannon divergence defined below. With \(m\) a mixture distribution:

\[\begin{equation}
m_i = \frac{1}{2}(p_i + q_i)
\tag{14}
\end{equation}\]

the Jensen-Shannon divergence is an average of KL divergences, one of \(p\) relative to \(m\), the other of \(q\) relative to \(m\):

\[\begin{equation}
JSD = \frac{1}{2}(KL(p||m) + KL(q||m))
\tag{15}
\end{equation}\]

This would be a good candidate to use were we interested in (undirected) distance between, not directed surprise caused by, distributions.
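For completeness, here is a sketch (Python with numpy; the distributions are invented) of the Jensen-Shannon divergence, illustrating that it is symmetric and stays finite even where plain KL divergence would blow up:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits, with 0 * log(0 / q) taken to be 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (equations 14 and 15); its square root is a true metric."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.3, 0.4, 0.3])

print(jensen_shannon(p, q))   # about 0.33 bits
print(jensen_shannon(q, p))   # identical: symmetric, unlike KL
# note: kl(q, p) would be infinite here, since p assigns zero probability to the third state
```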

Finally, let's wrap up with one last term, restricting ourselves to a quick glimpse at something whole books could be written about.

(Variational) Free Energy

Reading papers on variational inference, you are quite likely to hear people talking not "just" about KL divergence and/or the ELBO (which, as soon as you know what it stands for, is just what it is), but also about something mysteriously called free energy (or variational free energy, in that context).

For practical purposes, it suffices to know that variational free energy is the negative of the ELBO, that is, it corresponds to equation (2). But for those interested, there is also free energy as a central concept in thermodynamics.

In this post, we are mainly interested in how concepts are related to KL divergence, and for this, we follow the characterization John Baez gives in his talk on the subject.

Free energy, that is, energy in useful form, is the expected energy minus temperature times entropy:

\[\begin{equation}
F = \langle E \rangle - T \ H
\tag{16}
\end{equation}\]

Then, the extra free energy of a system \(Q\), compared to a system in equilibrium \(P\), is proportional to their KL divergence, that is, the information of \(Q\) relative to \(P\):

\[\begin{equation}
F(Q) - F(P) = k \ T \ KL(q||p)
\tag{17}
\end{equation}\]

Speaking of free energy, there is also the (not uncontroversial) free energy principle posited in neuroscience (Friston 2010). But at some point, we have to stop, and we do it here.

Conclusion

Wrapping up, this post has tried to do three things: having in mind a reader with a background mainly in deep learning, start with the "routine" use in training variational autoencoders; then show the (probably less familiar) "other side"; and finally, provide a synopsis of related terms and their applications.

If you are interested in digging deeper into the many and varied applications, in a range of different fields, there is no better place to start than the Twitter thread, mentioned above, that gave rise to this post. Thanks for reading!

DeDeo, Simon. 2016. "Information Theory for Intelligent People."

Friston, Karl. 2010. "The Free-Energy Principle: A Unified Brain Theory?" Nature Reviews Neuroscience 11 (February): 127–38. https://doi.org/10.1038/nrn2787.

Itti, Laurent, and Pierre Baldi. 2005. "Bayesian Surprise Attracts Human Attention." In Advances in Neural Information Processing Systems 18 (NIPS 2005), 547–54. http://papers.nips.cc/paper/2822-bayesian-surprise-attracts-human-attention.

Murphy, Kevin. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.

Sun, Yi, Faustino J. Gomez, and Juergen Schmidhuber. 2011. "Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments." CoRR abs/1103.5708. http://arxiv.org/abs/1103.5708.

Zanardo, Enrico. 2017. "How to Measure Disagreement?"


