Conversation

mdhaber
Contributor

@mdhaber mdhaber commented Nov 13, 2022

Reference issue

gh-16241

What does this implement/fix?

This adds a CDF method to multivariate_t.

Additional information

Rather than wrapping Genz's Fortran code, I translated his (permissively-licensed) Matlab code so that it would be easier to use NumPy's PRNG (and avoid the equivalent of gh-16142).

That said, the number of random variates needed is always 10*dim. My plan was to compile this with Pythran in a follow-up PR, and to do that I would have to pass in those random variates, because I don't think we can pass a Generator or RandomState into a Pythran function (correct me if I'm wrong). In that case, I suppose we could apply the same strategy to the Fortran code, slightly modified to accept all the random variates as an additional input. The advantage is that we wouldn't really need to review the Fortran source; the disadvantage is that we'd need to wrap it.
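
(For illustration only, a minimal sketch of that strategy; `_cdf_driver` and `_compiled_kernel` are hypothetical names, not functions in this PR. The Generator stays at the Python level and the compiled routine only ever receives a plain array.)

```python
import numpy as np

def _compiled_kernel(u):
    # hypothetical stand-in for the Pythran/Cython-compiled inner loop;
    # it sees a plain ndarray, never a Generator or RandomState
    return u.mean(axis=-1)

def _cdf_driver(dim, n_batches, rng):
    # draw the 10*dim variates for every batch up front, outside the compiled code
    u = rng.random((n_batches, 10 * dim))
    return _compiled_kernel(u)

print(_cdf_driver(dim=3, n_batches=4, rng=np.random.default_rng(1234)))
```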

I still need to look for rough edges and polish them. I suspect that the output shapes are not perfect, and I don't know what happens when x is empty, etc. Update: it handles edge cases the same way multivariate_t.pdf does.

I went ahead and wrote my own translation because comparing existing translations may be a useful part of review.

@mdhaber mdhaber marked this pull request as draft November 13, 2022 02:01
@rkern
Member

rkern commented Nov 13, 2022

Ah, I should have made my translation public earlier. I've been sitting on it for a while.

https://gist.github.com/rkern/8518cb3e3f57fa81d21dbd9eabcfaa87

Yeah, lifting the draws from the Generator out of the loop is pretty easy. Then the inner loop would be Pythranizable. The main blocker then is adding the necessary special functions to Pythran. I'm not sure you can do that except by releasing a new version of Pythran with those implemented.

The CBC lattice in Genz's later version is quite a bit superior to the Richtmyer lattice. @tupui I don't know if you want to consider adding the CBC lattice as a generic QMC engine. It's finicky (more than Sobol' even) in that you have to draw exactly the prime number of samples that it was configured to produce. The core algorithm is here to use, if you want.

I have an implementation of the generic multivariate normal integrand that you can integrate yourself with any QMCEngine to compare techniques. In my findings, the CBC lattice here nearly always edges out Sobol' in terms of accuracy and runtime, but they are not far apart.

If you want a test problem, integrating a multivariate normal with the covariance matrix with all diagonals 1 and all off-diagonals 0.5 over the negative orthant has an exact true answer (1 / (ndim + 1)). And it's a nontrivial covariance matrix that appears in real-life problems (Dunnett's test with equal group sizes).
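
(For concreteness, that test problem can be spot-checked with the existing multivariate_normal.cdf; a sketch only, the new multivariate_t.cdf would be exercised the same way.)

```python
import numpy as np
from scipy import stats

ndim = 4
cov = np.full((ndim, ndim), 0.5)
np.fill_diagonal(cov, 1.0)
# Probability of the negative orthant == CDF evaluated at the origin.
p = stats.multivariate_normal(mean=np.zeros(ndim), cov=cov).cdf(np.zeros(ndim))
print(p, 1 / (ndim + 1))  # should agree to the integrator's tolerance
```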

@mdhaber
Contributor Author

mdhaber commented Nov 14, 2022

Thanks! Yeah, I knew you already translated it, but I figured one could be tested against the other. Or maybe we'll end up going for Fortran anyway.
I'll add that test problem, thanks!

@serge-sans-paille any chance we can get special.gammaincinv and ndtri in Pythran? (And ideally ndtr, but it looks like we can do that with erf if need be.)
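
(For reference, the erf fallback is just the identity ndtr(x) = 0.5 * (1 + erf(x / sqrt(2))); a quick check:)

```python
import numpy as np
from math import erf, sqrt
from scipy import special

x = 0.7
assert np.isclose(special.ndtr(x), 0.5 * (1 + erf(x / sqrt(2))))
```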

@rkern
Member

rkern commented Nov 14, 2022

I believe that the CBC QMC scheme used in the version that I translated is superior in performance to the scheme in the Fortran code. I think Pythranization is the way to go.

We should probably wrap the Fortran code for the bivariate and trivariate cases, which can be done more accurately.

@mdhaber
Contributor Author

mdhaber commented Nov 15, 2022

I think Pythranization is the way to go.

OK. I like that if we can get the special functions. If not, there's always Cython.

We should probably wrap the Fortran code for the bivariate and trivariate cases, which can be done more accurately.

Good idea. But let's leave that as a followup PR. This will get big enough as it is.


I had planned to wait until @tupui gets back next month before doing too much more here, but I suppose I should ask - would you like to review this, or would you like to submit your code as a PR and I will review?

In either case, I'd like to suggest that given the same input (including random numbers), the translation should produce the same output up to machine precision. As long as we hit all the code paths with tests, this would make it very easy to check for accuracy against the Matlab originals. Is that still true in your translation?

@rkern
Member

rkern commented Nov 15, 2022

Not quite. The CBC lattice construction involves doing some convolutions via FFTs and argmin() on the results. Sometimes there are near ties (i.e. they would be true ties in infinite-precision arithmetic but aren't in float64). Due to the way MATLAB sometimes rounds small numbers down to 0, it sometimes breaks a near tie in a different way than we do. They are equally valid lattices but will cause different results. I was checking my results against the MATLAB code as evaluated by Octave and discovered that.

@serge-sans-paille
Contributor

@serge-sans-paille any chance we can get special.gammaincinv and ndtri in Pythran? (And ideally ndtr, but it looks like we can do that with erf if need be.)

I can give it a try. Can you open an issue in the Pythran bug tracker to keep a note of it?

@serge-sans-paille
Contributor

@mdhaber I didn't plan to do a pythran release soon, but once you have that one ready, please tell me and I'll do a minor version release.

@mdhaber
Contributor Author

mdhaber commented Nov 29, 2022

Thanks @serge-sans-paille!

@mdhaber
Contributor Author

mdhaber commented Dec 8, 2022

Welcome back @tupui! What would you prefer to do here?

  1. Continue with Genz's qsimvtv (what we have here).
  2. Have me translate Genz's qsilatmvtv (improved) in the same style (i.e. roughly line-by-line translation).
  3. Start with 1, and do 2 in a followup.
  4. Replace this with Robert's translation.

@tupui
Member

tupui commented Dec 8, 2022

Thanks 😃

I still need to have a look and make a decision. I was busy with other reviews so far. I will try to have an answer tomorrow.

@tupui
Member

tupui commented Dec 26, 2022

Sorry for the delay.

Welcome back @tupui! What would you prefer to do here?

  1. Continue with Genz's qsimvtv (what we have here).
  2. Have me translate Genz's qsilatmvtv (improved) in the same style (i.e. roughly line-by-line translation).
  3. Start with 1, and do 2 in a followup.
  4. Replace this with Robert's translation.

How about we use @rkern's version with CBC and keep your current version as a validation test?

@mdhaber mdhaber added the enhancement (A new feature or improvement) label Jan 6, 2023
@mdhaber
Contributor Author

mdhaber commented Jan 6, 2023

OK @tupui I updated per your recommendation. The Cirrus failures seem unrelated. Other failures seem to be due to an fftpack import; I'll see if I can switch that out.
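
(If the failures really are just about the legacy namespace, the swap is mechanical; a sketch, since the exact functions _qmvnt.py needs may differ:)

```python
# legacy import
# from scipy.fftpack import fft, ifft

# preferred modern namespace with the same 1-D call signatures
from scipy.fft import fft, ifft
```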

@rkern how do you want to be credited? Before merging, we could do an interactive rebase in which you would author the first commit (maybe just adding the file _qmvnt.py) and we'd squash all the rest of my commits on top. Alternatively, would you be OK being listed as a co-author when we squash-merge (easier)?

@mdhaber mdhaber marked this pull request as ready for review January 6, 2023 01:51
@rkern
Member

rkern commented Jan 6, 2023

Whatever's easiest.

Member

@tupui tupui left a comment

Sorry for the delay.

I did a first pass. I am not sure about the methods we need in the end in _qmvnt.py. There is a lot of unused code, so before I dive into the functions it would help to clarify what is to stay.

Comment on lines 2394 to 2399
A = rng.random(size=(dim, dim))
cov = A @ A.T           # random positive semidefinite covariance matrix
mean = rng.random(dim)  # random location
a = -rng.random(dim)    # random lower integration limits
b = rng.random(dim)     # random upper integration limits
df = rng.random() * 5   # random degrees of freedom in (0, 5)
Member

If this is not too slow, what do you think about testing more variations of all the random parts? Without going full hypothesis, we could make a loop of 5-10? (Of course, real known corner cases would be better.)
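
Something along these lines, say (a sketch with hypothetical names and the actual assertions elided):

```python
import numpy as np
import pytest

@pytest.mark.parametrize("seed", range(5))
def test_cdf_random_problems(seed):
    rng = np.random.default_rng(seed)
    dim = 3
    A = rng.random(size=(dim, dim))
    cov = A @ A.T                         # random PSD covariance
    mean = rng.random(dim)
    a, b = -rng.random(dim), rng.random(dim)
    df = rng.random() * 5
    ...  # evaluate the CDF and compare against the reference implementation
```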

return prob, est_error, n_samples


def _bounds(m, t, alternative):
Member

Provision?

Member

It's for the Dunnett test. Should be removed from here because I don't think this file will be where that gets implemented.

Contributor Author

@mdhaber mdhaber left a comment

I'm getting warnings when the covariance matrix is singular.

The MATLAB code's extra masking stops the NaN from leaking out and replaces it with a 0, but the actual value that comes out is not quite correct for the cusp (which does seem to have a finite, nonzero limiting value).

So should we warn the user about a loss of accuracy then?

For singular covariance matrices, I'm also getting a lot of substantial discrepancies between this and _qsimvtv, without much improvement when I increase the number of points in _qsimvtv. Would you expect _qsimvtv to be very inaccurate in this case? I tried using Matlab as a reference, but its mvtcdf doesn't allow a singular matrix.

@rkern
Member

rkern commented Feb 3, 2023

So should we warn the user about a loss of accuracy then?

What I was talking about is not a case that arises with the CBC lattice implemented here. It would arise if we did use qmc_quad() with Sobol'. There are several other reasons why I don't think that would be an improvement, regardless of the boundary issue. Either way, I don't think there's any significant loss of accuracy in the cases where the mask is triggered.

@rkern
Member

rkern commented Feb 3, 2023

Would you expect _qsimvtv to be very inaccurate in this case?

I don't expect any implementation to be particularly accurate in the singular case.

@rkern
Member

rkern commented Feb 5, 2023

I've updated my gist with a notebook with some more test problems. It turns out that the analytically-known case that I used in my benchmark is a special case of a slightly wider family of structured covariance matrices where the multivariate normal integral can be reduced to a univariate integral that is straightforward to integrate numerically with quad().
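
(As a rough sketch of what such a reduction looks like for the equicorrelated case; illustrative only, not the notebook's code. Conditioning on the shared factor leaves a one-dimensional integral:)

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def mvn_negative_orthant(n, rho):
    # P(X_1 <= 0, ..., X_n <= 0) for N(0, Sigma) with unit diagonal and
    # constant off-diagonal rho, via conditioning on the shared factor.
    c = np.sqrt(rho / (1 - rho))
    integrand = lambda z: stats.norm.pdf(z) * stats.norm.cdf(c * z)**n
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

for n in (2, 3, 5):
    print(mvn_negative_orthant(n, 0.5), 1 / (n + 1))  # rho = 1/2 recovers 1/(n+1)
```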

A related special case family creates a singular matrix, and it too can be integrated numerically.

And finally, it does turn out that one similar special-case family can turn the multivariate t into a double integral that dblquad() can handle fine. I think it is likely that the strategy used in the singular MVN case can be ported over to the singular MVT case if we really care about it, but I'm definitely not going through the math to derive it.

Co-authored-by: Robert Kern <robert.kern@gmail.com>
Member

@tupui tupui left a comment

I've updated my gist with a notebook with some more test problems. It turns out that the analytically-known case that I used in my benchmark is a special case of a slightly wider family of structured covariance matrices where the multivariate normal integral can be reduced to a univariate integral that is straightforward to integrate numerically with quad().

I am not sure I understand what conclusion you want to draw.

I am not going to block this. It's in relatively good shape and the main points have been discussed. Let's finish up the last ones and we can merge this soon.

@tupui tupui added this to the 1.11.0 milestone Feb 6, 2023
@rkern
Member

rkern commented Feb 6, 2023

I am not sure I understand what conclusion you want to draw.

That notebook has no particular conclusion. I am offering it as an additional source of test problems to be used in our test suite rather than testing by comparing two ports of only slightly different versions of basically the same MATLAB code.

It can also be used to do more thorough comparisons on the relative accuracy between the CBC lattice and Sobol' on a wider variety of problems. I ran those benchmarks last night. My findings from the more restricted problem hold up. If there is interest, I can clean up the visualizations of those results.

Here you are proposing 2 other QMC methods instead of the one we already have. I think it's legitimate to ask why we should add other methods just for this purpose and not follow our existing infrastructure.

Just one, we can drop the Richtmyer lattice; it's always worse.

On the whole, I have not questioned the legitimacy of your questions. I have simply answered them. I anticipated the question about using qmc_quad() with Sobol' (I was curious myself, and hoping that it would be better!), which is why I did the comparison ahead of this PR and published the testing code before you asked your questions.

The only thing I did question was the relevance of one complaint about the fairness of how the comparison was done with respect to the boundary issue, particularly because there was no mechanism by which restoring "fairness" on that particular issue could work in Sobol''s favor in terms of accuracy or time. It was a secondary issue that could not affect the primary conclusion.

In the event, it didn't matter. All combinations (doing the fastest thing for both, doing the safest thing for both even if it was unnecessary, doing the optimal thing for both) don't change the relative orderings of accuracy and time.

If in this case you are certain it's faster and converges faster, then OK, we have our answer and can accept the additional maintenance cost knowingly.

I am very confident of this, particularly with the new benchmark problems. Yes, I will maintain this code. The inline generation of the QMC points is, by a long stretch, the easiest part of this code to understand and maintain.

Just a note that our QMC engines are not that slow, as they are implemented in Cython

For sure! They are not slow, by any means. Their API, though it is a very good general-purpose API, is just less well-adapted to this particular problem. The integrand is arranged such that it can deal with each dimension one by one, vectorizing across all of the samples. I don't need to generate whole blocks of QMC samples, just one column at a time. Lattice rules work really well for that. But even if we did have lattice QMCEngines, their APIs and that of qmc_quad() don't let us make use of that.
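
(For illustration, a randomly shifted rank-1 lattice makes that column-at-a-time generation trivial; the generating vector below is arbitrary, not the CBC one used in this PR.)

```python
import numpy as np

def lattice_column(j, z, n, shift):
    # coordinate j of all n points of a shifted rank-1 lattice:
    # x_k[j] = frac(k * z[j] / n + shift[j]); each dimension is independent of
    # the others, so columns can be produced one at a time as they are needed.
    k = np.arange(n)
    return (k * z[j] / n + shift[j]) % 1.0

rng = np.random.default_rng(0)
n = 1021                               # prime number of points
z = np.array([1, 433, 787])            # illustrative generating vector only
shift = rng.random(len(z))             # random shift for randomization
col0 = lattice_column(0, z, n, shift)  # peel off one coordinate at a time
```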

@tupui
Member

tupui commented Feb 6, 2023

The integrand is arranged such that it can deal with each dimension one by one, vectorizing across all of the samples. I don't need to generate whole blocks of QMC samples, just one column at a time. Lattice rules work really well for that. But even if we did have lattice QMCEngines, their APIs and that of qmc_quad() don't let us make use of that.

This is something I have in the back of my mind: adding a dimension to a sample/engine. A lot of QMC methods, Sobol' to start with, allow adding a dimension (and are built like that, as here). I am mostly waiting for more people to ask for this before really putting some thought into it. I also have a use case from some people doing DoE, but that might be too niche (and also arguable) for now.

@mdhaber
Contributor Author

mdhaber commented Feb 6, 2023

I fixed the PEP8 issues in the docstrings and removed the unused code or added comments about it. I also added tests against generic integrators, Matlab, and the analytical result for the singular case mentioned in #17410 (comment).

Member

@tupui tupui left a comment

LGTM and CI is happy. Let's get this in. Thanks all for working on this and for the discussions. (Unrelated failures on Cirrus, as is all too usual.)

ref = _qsimvtv(20000, df, cov, a - mean, b - mean, rng)[0]
assert_allclose(res, ref, atol=1e-4, rtol=1e-3)

def test_cdf_against_generic_integrators(self):
Member

+1, I had very similar code locally when doing some tests. I also had a simple 1D case with an empirical CDF, but this should be more robust and cover it.

@tupui tupui merged commit 9b1195e into scipy:main Feb 6, 2023
@rkern
Member

rkern commented Feb 6, 2023

the analytical result for the singular case mentioned in #17410 (comment).

Just for the record, this case is not singular.

Also, I presented it as an analytic result for the multivariate normal. Fortunately, it also seems to be true for the multivariate t, regardless of df, at least numerically. I don't have a derivation for it, though, like I do for the multivariate normal case.

@rkern
Member

rkern commented Feb 7, 2023

Ah, I do have a derivation. For the negative orthant bounds, all of the w terms that interact with df disappear. The w terms can all be moved outside of the u integral and evaluate to 1. Then the remaining integrand reduces to the same Gaussian integral from the corresponding multivariate normal case.
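
(An equivalent way to see the df-independence over the negative orthant, via the scale-mixture representation rather than the w/u substitution; my framing, just a sketch:)

$$T \overset{d}{=} \frac{Z}{\sqrt{W/\nu}}, \qquad Z \sim N(0, \Sigma), \quad W \sim \chi^2_\nu \ \text{independent of}\ Z,$$

and since $\sqrt{W/\nu} > 0$ almost surely, $\{T \le 0\} = \{Z \le 0\}$, so $P\big(T \in (-\infty, 0]^n\big) = P\big(Z \in (-\infty, 0]^n\big)$ for every $\nu$.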

@mdhaber
Contributor Author

mdhaber commented Sep 9, 2024

@rkern there is an effort to remove very old Fortran 77 code (gh-18566) and one of the items on the list is mvndst. We plan to compile _qmvnt.py, and its _qmvn function should be able to eliminate the reliance on _mvn.mvnun (and resolve some open issues, like gh-16142 and maybe gh-8367). The other use of mvndst is in gaussian_kde.integrate_box, where it calls mvnun_weighted.

Would this functionality be difficult to incorporate into _qmvnt.py? Would you be willing to add it?
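
(For context, this is the call pattern a weighted replacement would need to reproduce, shown with the current API; the numbers are arbitrary.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(2, 200))   # gaussian_kde expects shape (ndim, npoints)
kde = stats.gaussian_kde(data)
# integrate_box currently delegates to the Fortran mvnun_weighted routine
p = kde.integrate_box(low_bounds=[-1.0, -1.0], high_bounds=[1.0, 1.0])
print(p)
```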
