
KLDivLoss and F.kl_div compute KL(Q || P) rather than KL(P || Q) #57459

@BenoitDalFerro

🐛 Bug

Executive summary:

The positional arguments of KLDivLoss and F.kl_div are inverted: target plays the role usually given to input, and input the role usually given to target. This is a real problem, because the KL divergence is not symmetric.

Moreover, the actual behaviour is more akin to KL(Q || exp(P)), and not even exactly that, because the exponential applies to the log PDF values that were passed in, not to a distribution as a whole...
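A minimal sketch illustrating both points (the variable names p and q below are mine, not from the docs): the first check shows what the function actually evaluates elementwise, the second shows how the intuitive call order goes wrong.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = F.softmax(torch.randn(5), dim=0)   # a probability vector, intended as "P"
q = F.softmax(torch.randn(5), dim=0)   # a probability vector, intended as "Q"

# What F.kl_div(input, target) actually computes, elementwise:
out = F.kl_div(q.log(), p, reduction='none')
print(torch.allclose(out, p * (p.log() - q.log())))   # True: target * (log(target) - input)

# Calling it the "intuitive" way, with plain probabilities in (P, Q) order:
print(F.kl_div(p, q, reduction='sum'))   # comes out negative, cf. #32520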

Further points

torch.nn.functional.kl_div() swaps the positional arguments of the source (P) and target (Q) distributions, and the documentation expands the logarithm of the fraction incorrectly for the visible part. This results in various problems, up to and including negative KL divergence (as reported in #32520), a Jensen-Shannon divergence no longer bounded in [0, 1] bits (i.e. [0, log(2) = 0.6931...] nats), and, when computing the Jensen-Shannon metric (the square root of the JSD), NaN issues (KL underflow leads the JSM computation to take the square root of negative numbers...).

The documentation also seems wrong (and partly explains the errors experienced). It states

l(x, y) = L = {l_1, ..., l_N},  l_n = y_n * (log(y_n) - x_n)

which should actually read

l(x, y) = L = {l_1, ..., l_N},  l_n = x_n * (log(x_n) - log(y_n)) = x_n * log(x_n) - x_n * log(y_n)

because

l(x, y) = (1/N) * Σ_n x_n * log(x_n / y_n),  with  l_n = x_n * log(x_n / y_n) = x_n * (log(x_n) - log(y_n)) = x_n * log(x_n) - x_n * log(y_n)

Now, taking into account that x_n and y_n are swapped and that x is a log_softmax output, i.e. x_n = log(w_n), the documented formula becomes

l_n = y_n * (log(y_n) - x_n) = y_n * (log(y_n) - log(w_n))

so we do fall back onto the KL expression, but with the x_n and y_n terms inverted, or more exactly with one of them swapped and the antilog taken of the other. Any reader will read the documented formula as KL(x || y), when in fact it computes KL(y || w) with x = log(w).
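That last identity can be checked numerically; a small sketch, where logits, y, w and x are illustrative names:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6)
y = F.softmax(torch.randn(6), dim=0)    # probability vector y
w = F.softmax(logits, dim=0)            # w = antilog of the log_softmax input
x = F.log_softmax(logits, dim=0)        # x = log(w), what kl_div expects first

kl_from_f = F.kl_div(x, y, reduction='sum')
kl_y_w = (y * (y / w).log()).sum()      # KL(y || w) by hand
print(torch.allclose(kl_from_f, kl_y_w))   # True: it is KL(y || w), not KL(x || y)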

In fact the naming itself is misleading: this is not quite a KL divergence, since F.log_softmax has to be applied to the input beforehand. I understand the numerical-stability issues in computing the log of F.softmax(), but that should be explicit in the documentation; the reader should be forewarned, because this issue is not trivial to grasp and it needs to be made very clear.

To Reproduce

Steps to reproduce the behavior:

import torch
import torch.nn.functional as F

def jensen_shannon_metric(p, q, dim=-1):   # wrapper name is illustrative
    M = 0.5 * (p + q)
    JSD = 0
    # first argument: log-probabilities of the mixture M; second: probabilities of p / q
    JSD = JSD + 0.5 * F.kl_div(F.softmax(M, dim=dim).log(), F.softmax(p, dim=dim), reduction='none')
    JSD = JSD + 0.5 * F.kl_div(F.softmax(M, dim=dim).log(), F.softmax(q, dim=dim), reduction='none')
    return torch.sqrt(JSD)
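For example (a hypothetical call, with p and q standing in for raw logit tensors of the same shape):

torch.manual_seed(0)
p = torch.randn(3, 10)
q = torch.randn(3, 10)

out = jensen_shannon_metric(p, q, dim=-1)
# With this argument order and reduction='none', individual KL terms can come out
# negative, and torch.sqrt then yields NaN entries, as described above.
print(out)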

Expected behavior

The first two positional arguments are in the wrong order, consistent with the mathematical error in the documentation. For the general Kullback-Leibler divergence the call should be something like

F.kl_div(F.softmax(p, dim=dim), F.log_softmax(q, dim=dim), reduction='none')

and in the context of the JSD's specifics:

F.kl_div(F.softmax(p, dim=dim), F.log_softmax(M, dim=dim), reduction='none')

A second question: why do we need to pass a F.log_softmax output for the first positional argument (which should in fact be the second) in the first place? I understand that computing F.log_softmax directly is numerically more stable than calling .log() on F.softmax(M, dim=dim), but couldn't that be handled as part of a routine treatment of that term inside the function? Otherwise it is not a KL divergence and the name is misleading!
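For completeness, the conventional KL(P || Q) can still be obtained from the current API by feeding the arguments the other way round; a minimal sketch, assuming p and q are probability vectors rather than logits:

import torch
import torch.nn.functional as F

p = F.softmax(torch.randn(5), dim=0)
q = F.softmax(torch.randn(5), dim=0)

# Pass log(Q) as `input` and P as `target` to obtain KL(P || Q)
kl_pq = F.kl_div(q.log(), p, reduction='sum')
print(torch.allclose(kl_pq, (p * (p / q).log()).sum()))   # True

Recent versions of F.kl_div also accept a log_target flag, so the target can itself be supplied as log-probabilities.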

Proposed fix for KL underflow

The KL numerical underflow issue is actually documented in Cover & Thomas (2006), 2.3 Relative Entropy and Mutual Information, p. 45: "we use the convention that 0 log 0/0 = 0 and the convention (based on continuity arguments) that 0 log 0/q = 0 and p log p/0 = inf". Therefore the post-processing fix is:

# p and q are probability tensors; M = 0.5 * (p + q) as above
D_pM = p * (torch.log(p) - F.log_softmax(M, dim=dim))
D_qM = q * (torch.log(q) - F.log_softmax(M, dim=dim))
# 0 * log(0) terms produce NaN; set them to 0 following the convention above
D_pM[torch.isnan(D_pM)] = 0
D_qM[torch.isnan(D_qM)] = 0
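An alternative sketch that applies the same 0 log 0 = 0 convention by neutralising the problematic terms before the multiplication, rather than patching NaNs afterwards (p, q, M and dim as above):

log_M = F.log_softmax(M, dim=dim)

# Replace log(0) by 0 wherever the weight in front is 0, so that 0 * log 0 contributes 0
log_p_safe = torch.where(p > 0, torch.log(p), torch.zeros_like(p))
log_q_safe = torch.where(q > 0, torch.log(q), torch.zeros_like(q))

D_pM = p * (log_p_safe - log_M)
D_qM = q * (log_q_safe - log_M)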

Environment

Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Famille
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 457.63
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] numpydoc==1.1.0
[pip3] pytorch-metric-learning==0.9.98
[pip3] torch==1.7.0 (UPDATED TO 1.8.1 problem remains)
[pip3] torchaudio==0.7.0
[pip3] torchvision==0.8.1
[conda] blas 2.16 mkl conda-forge
[conda] cudatoolkit 10.2.89 hb195166_8 conda-forge
[conda] libblas 3.8.0 16_mkl conda-forge
[conda] libcblas 3.8.0 16_mkl conda-forge
[conda] liblapack 3.8.0 16_mkl conda-forge
[conda] liblapacke 3.8.0 16_mkl conda-forge
[conda] mkl 2020.1 216
[conda] numpy 1.19.1 py37hae9e721_0 conda-forge
[conda] numpydoc 1.1.0 py_1 conda-forge
[conda] pytorch 1.7.0 py3.7_cuda102_cudnn7_0 pytorch
[conda] pytorch-metric-learning 0.9.98 pyh39e3cac_0 metric-learning
[conda] torchaudio 0.7.0 py37 pytorch
[conda] torchvision 0.8.1 py37_cu102 pytorch

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @brianjo @mruberry @albanD @walterddr @rgommers @heitorschueroff

Labels

high priority, module: correctness (silent), module: docs, module: nn, module: numpy, module: special, triaged
