
KLDivLoss and F.kl_div compute KL(Q || P) rather than KL(P || Q) #57459

@BenoitDalFerro

🐛 Bug

Executive summary:

The positional arguments of KLDivLoss and F.kl_div are inverted: target plays the role usually given to input, and input the role usually given to target. This is a real problem, because the KL divergence is not symmetric.

Moreover, the actual behaviour is more akin to KL(Q || exp(P)), and not even exactly that, because the exponential applies to the log PDF values that were passed in, not to a distribution as a whole...
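A minimal sketch illustrating both points (the variable names p and q below are mine, not from the docs): the first check shows what the function actually evaluates elementwise, the second shows how the intuitive call order goes wrong.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = F.softmax(torch.randn(5), dim=0)   # a probability vector, intended as "P"
q = F.softmax(torch.randn(5), dim=0)   # a probability vector, intended as "Q"

# What F.kl_div(input, target) actually computes, elementwise:
out = F.kl_div(q.log(), p, reduction='none')
print(torch.allclose(out, p * (p.log() - q.log())))   # True: target * (log(target) - input)

# Calling it the "intuitive" way, with plain probabilities in (P, Q) order:
print(F.kl_div(p, q, reduction='sum'))   # comes out negative, cf. #32520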

Further points

torch.nn.functional.kl_div() swaps the positional arguments of the source (P) and target (Q) distributions, and the documentation expands the logarithm of the fraction incorrectly for the visible part. This results in various problems, up to and including negative KL divergence (as reported in #32520), a Jensen-Shannon divergence no longer bounded in [0, 1] bits (i.e. [0, log(2) = 0.6931...] nats), and, when computing the Jensen-Shannon metric (the square root of the JSD), NaN issues (KL underflow leads the JSM computation to take the square root of negative numbers...).

The documentation also seems wrong (and partly explains the errors experienced). It states

l(x, y) = L = {l_1, ..., l_N},  l_n = y_n * (log(y_n) - x_n)

which should actually read

l(x, y) = L = {l_1, ..., l_N},  l_n = x_n * (log(x_n) - log(y_n)) = x_n * log(x_n) - x_n * log(y_n)

because

l(x, y) = (1/N) * Σ_n x_n * log(x_n / y_n),  with  l_n = x_n * log(x_n / y_n) = x_n * (log(x_n) - log(y_n)) = x_n * log(x_n) - x_n * log(y_n)

Now, taking into account that x_n and y_n are swapped and that x is a log_softmax output, i.e. x_n = log(w_n), the documented formula becomes

l_n = y_n * (log(y_n) - x_n) = y_n * (log(y_n) - log(w_n))

so we do fall back onto the KL expression, but with the x_n and y_n terms inverted, or more exactly with one of them swapped and the antilog taken of the other. Any reader will read the documented formula as KL(x || y), when in fact it computes KL(y || w) with x = log(w).
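That last identity can be checked numerically; a small sketch, where logits, y, w and x are illustrative names:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6)
y = F.softmax(torch.randn(6), dim=0)    # probability vector y
w = F.softmax(logits, dim=0)            # w = antilog of the log_softmax input
x = F.log_softmax(logits, dim=0)        # x = log(w), what kl_div expects first

kl_from_f = F.kl_div(x, y, reduction='sum')
kl_y_w = (y * (y / w).log()).sum()      # KL(y || w) by hand
print(torch.allclose(kl_from_f, kl_y_w))   # True: it is KL(y || w), not KL(x || y)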

In fact the naming itself is misleading: this is not quite a KL divergence, since F.log_softmax has to be applied to the input beforehand. I understand the numerical-stability issues in computing the log of F.softmax(), but that should be explicit in the documentation; the reader should be forewarned, because this issue is not trivial to grasp and it needs to be made very clear.

To Reproduce

Steps to reproduce the behavior:

import torch
import torch.nn.functional as F

def jensen_shannon_metric(p, q, dim=-1):   # wrapper name is illustrative
    M = 0.5 * (p + q)
    JSD = 0
    # first argument: log-probabilities of the mixture M; second: probabilities of p / q
    JSD = JSD + 0.5 * F.kl_div(F.softmax(M, dim=dim).log(), F.softmax(p, dim=dim), reduction='none')
    JSD = JSD + 0.5 * F.kl_div(F.softmax(M, dim=dim).log(), F.softmax(q, dim=dim), reduction='none')
    return torch.sqrt(JSD)
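For example (a hypothetical call, with p and q standing in for raw logit tensors of the same shape):

torch.manual_seed(0)
p = torch.randn(3, 10)
q = torch.randn(3, 10)

out = jensen_shannon_metric(p, q, dim=-1)
# With this argument order and reduction='none', individual KL terms can come out
# negative, and torch.sqrt then yields NaN entries, as described above.
print(out)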

Expected behavior

The first two positional arguments are in the wrong order, consistent with the mathematical error in the documentation. For the general Kullback-Leibler divergence the call should be something like

F.kl_div(F.softmax(p, dim=dim), F.log_softmax(q, dim=dim), reduction='none')

and in the context of the JSD's specifics:

F.kl_div(F.softmax(p, dim=dim), F.log_softmax(M, dim=dim), reduction='none')

A second question: why do we need to pass a F.log_softmax output for the first positional argument (which should in fact be the second) in the first place? I understand that computing F.log_softmax directly is numerically more stable than calling .log() on F.softmax(M, dim=dim), but couldn't that be handled as part of a routine treatment of that term inside the function? Otherwise it is not a KL divergence and the name is misleading!
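For completeness, the conventional KL(P || Q) can still be obtained from the current API by feeding the arguments the other way round; a minimal sketch, assuming p and q are probability vectors rather than logits:

import torch
import torch.nn.functional as F

p = F.softmax(torch.randn(5), dim=0)
q = F.softmax(torch.randn(5), dim=0)

# Pass log(Q) as `input` and P as `target` to obtain KL(P || Q)
kl_pq = F.kl_div(q.log(), p, reduction='sum')
print(torch.allclose(kl_pq, (p * (p / q).log()).sum()))   # True

Recent versions of F.kl_div also accept a log_target flag, so the target can itself be supplied as log-probabilities.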

Proposed fix for KL underflow

The KL numerical underflow issue is actually documented in Cover & Thomas (2006), 2.3 Relative Entropy and Mutual Information, p. 45: "we use the convention that 0 log 0/0 = 0 and the convention (based on continuity arguments) that 0 log 0/q = 0 and p log p/0 = inf". Therefore the post-processing fix is:

# p and q are probability tensors; M = 0.5 * (p + q) as above
D_pM = p * (torch.log(p) - F.log_softmax(M, dim=dim))
D_qM = q * (torch.log(q) - F.log_softmax(M, dim=dim))
# 0 * log(0) terms produce NaN; set them to 0 following the convention above
D_pM[torch.isnan(D_pM)] = 0
D_qM[torch.isnan(D_qM)] = 0
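An alternative sketch that applies the same 0 log 0 = 0 convention by neutralising the problematic terms before the multiplication, rather than patching NaNs afterwards (p, q, M and dim as above):

log_M = F.log_softmax(M, dim=dim)

# Replace log(0) by 0 wherever the weight in front is 0, so that 0 * log 0 contributes 0
log_p_safe = torch.where(p > 0, torch.log(p), torch.zeros_like(p))
log_q_safe = torch.where(q > 0, torch.log(q), torch.zeros_like(q))

D_pM = p * (log_p_safe - log_M)
D_qM = q * (log_q_safe - log_M)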

Environment

Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Famille
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 457.63
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] numpydoc==1.1.0
[pip3] pytorch-metric-learning==0.9.98
[pip3] torch==1.7.0 (UPDATED TO 1.8.1 problem remains)
[pip3] torchaudio==0.7.0
[pip3] torchvision==0.8.1
[conda] blas 2.16 mkl conda-forge
[conda] cudatoolkit 10.2.89 hb195166_8 conda-forge
[conda] libblas 3.8.0 16_mkl conda-forge
[conda] libcblas 3.8.0 16_mkl conda-forge
[conda] liblapack 3.8.0 16_mkl conda-forge
[conda] liblapacke 3.8.0 16_mkl conda-forge
[conda] mkl 2020.1 216
[conda] numpy 1.19.1 py37hae9e721_0 conda-forge
[conda] numpydoc 1.1.0 py_1 conda-forge
[conda] pytorch 1.7.0 py3.7_cuda102_cudnn7_0 pytorch
[conda] pytorch-metric-learning 0.9.98 pyh39e3cac_0 metric-learning
[conda] torchaudio 0.7.0 py37 pytorch
[conda] torchvision 0.8.1 py37_cu102 pytorch

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @brianjo @mruberry @albanD @walterddr @rgommers @heitorschueroff

Labels

high priority, module: correctness (silent), module: docs, module: nn, module: numpy, module: special, triaged
