ENH: stats.multivariate_t: add cdf method #17410
Conversation
Ah, I should have made my translation public earlier. I've been sitting on it for a while: https://gist.github.com/rkern/8518cb3e3f57fa81d21dbd9eabcfaa87

Yeah, lifting the drawing from [...]. The CBC lattice in Genz's later version is quite a bit superior to the Richtmyer lattice. @tupui I don't know if you want to consider adding the CBC lattice as a generic QMC engine. It's finicky (more so than Sobol', even) in that you have to draw exactly the prime number of samples that it was configured to produce. The core algorithm is here to use, if you want.

I have an implementation of the generic multivariate normal integrand that you can integrate yourself with any [...]. If you want a test problem: integrating a multivariate normal with the covariance matrix with diagonals all 1 and off-diagonals all 0.5 over the negative orthant has an exact true answer [...].
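To make that test problem concrete, here is a minimal sketch (my own, not taken from the gist) that checks the equicorrelated case against scipy's existing `multivariate_normal.cdf`; for off-diagonal correlation 0.5, the negative-orthant probability works out analytically to 1/(dim + 1).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the test problem described above: covariance with 1 on the
# diagonal and 0.5 off-diagonal, integrated over the negative orthant.
# For correlation 0.5 the orthant probability is exactly 1/(dim + 1).
dim = 5
cov = 0.5 * (np.eye(dim) + np.ones((dim, dim)))
exact = 1.0 / (dim + 1)
approx = multivariate_normal(mean=np.zeros(dim), cov=cov).cdf(np.zeros(dim))
print(exact, approx)  # the two values should agree to several digits
```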
Thanks! Yeah, I knew you had already translated it, but I figured one could be tested against the other. Or maybe we'll end up going for Fortran anyway. @serge-sans-paille any chance we can get [...]?

I believe that the CBC QMC scheme used in the version that I translated is superior in performance to the scheme in the Fortran code. I think Pythranization is the way to go. We should probably wrap the Fortran code for the bivariate and trivariate cases, which can be done more accurately.
OK. I like that if we can get the special functions. If not, there's always Cython.
Good idea. But let's leave that as a followup PR. This will get big enough as it is. I had planned to wait until @tupui gets back next month before doing too much more here, but I suppose I should ask: would you like to review this, or would you like to submit your code as a PR and I will review? In either case, I'd like to suggest that given the same input (including random numbers), the translation should produce the same output up to machine precision. As long as we hit all the code paths with tests, this would make it very easy to check for accuracy against the Matlab originals. Is that still true in your translation?

Not quite. The CBC lattice construction involves doing some convolutions via FFTs and [...]
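For reference, here is a tiny self-contained sketch (my own illustration, not the gist's code) of the FFT-based circular convolution that the fast CBC construction leans on, i.e. the numerical operation that would need `np.fft` support under Pythran:

```python
import numpy as np

# Circular convolution via the FFT, checked against the direct O(n^2) sum.
# The real CBC search applies this to permuted index orderings, but the
# core numerical operation is the same.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 0.25, 0.125, 0.0625])
fft_conv = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(y), n=len(x))
direct = np.array([sum(x[j] * y[(k - j) % len(x)] for j in range(len(x)))
                   for k in range(len(x))])
print(np.allclose(fft_conv, direct))  # True
```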
I can give it a try. Can you open an issue in the Pythran bug tracker to keep note of it?

@mdhaber I didn't plan to do a Pythran release soon, but once you have that one ready, please tell me and I'll do a minor version release.

Thanks @serge-sans-paille!
Welcome back @tupui! What would you prefer to do here?
Thanks 😃 I still need to have a look and make a decision. I was busy with other reviews so far. I will try to have an answer tomorrow.
Sorry for the delay.
How about we use @rkern's version with CBC and use your current version as a validation test?
OK @tupui, I updated per your recommendation. Those Cirrus failures seem unrelated. Other failures seem to be due to an fftpack import; I'll see if I can switch that out. @rkern how do you want to be credited? Before merging, we could do an interactive rebase in which you would author the first commit (maybe just adding the file [...]).

Whatever's easiest.
Sorry for the delay.

I did a first pass. I am not sure about the methods we need in the end in _qmvnt.py. There is a lot of unused code, so before I dive into any particular function it would help to clarify what is to stay.
A = rng.random(size=(dim, dim))
cov = A @ A.T
mean = rng.random(dim)
a = -rng.random(dim)
b = rng.random(dim)
df = rng.random() * 5
If this is not too slow, what do you think about testing more variations of all the random parts? Without going full Hypothesis, we could make a loop of 5-10? (Of course, real known corner cases would be better.)
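A minimal sketch of what such a loop could look like, using pytest parametrization over seeds. It mirrors the random construction in the diff above and assumes the `cdf` method added by this PR; the comparison against a reference implementation is elided (only a basic sanity check is shown), so the test name, the seed count, and the final assertion are placeholders:

```python
import numpy as np
import pytest
from scipy import stats

# Repeat the randomized construction for several seeds instead of one.
@pytest.mark.parametrize("seed", range(5))
def test_cdf_random_inputs(seed):
    rng = np.random.default_rng(seed)
    dim = 3
    A = rng.random(size=(dim, dim))
    cov = A @ A.T
    mean = rng.random(dim)
    b = rng.random(dim)
    df = rng.random() * 5
    res = stats.multivariate_t(loc=mean, shape=cov, df=df).cdf(b)
    # placeholder check; the real test would compare against a reference
    assert np.isfinite(res) and 0.0 <= res <= 1.0
```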
scipy/stats/_qmvnt.py
return prob, est_error, n_samples

def _bounds(m, t, alternative):
Provision?
It's for the Dunnett test. Should be removed from here because I don't think this file will be where that gets implemented.
I'm getting warnings when the covariance matrix is singular. The MATLAB code's extra masking stops the NaN from leaking out, replacing it with a 0, but the actual value that comes out is not quite correct at the cusp (which does seem to have a finite, nonzero limiting value). So should we warn the user about a loss of accuracy, then?

For singular covariance, I'm also getting a lot of substantial discrepancies between this and _qsimvtv, without much improvement when I increase the number of points in _qsimvtv. Would you expect _qsimvtv to be very inaccurate in this case? I tried using Matlab as a reference, but its mvtcdf doesn't allow a singular matrix.
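For what it's worth, a minimal sketch (my own construction, not the PR's tests) of building the kind of singular covariance matrix in question, in case it helps reproduce the warnings:

```python
import numpy as np

# A rank-deficient Gram matrix: rank 2 in 4 dimensions, hence singular.
rng = np.random.default_rng(1234)
dim, rank = 4, 2
A = rng.random(size=(dim, rank))
cov = A @ A.T
print(np.linalg.matrix_rank(cov))   # 2
print(np.linalg.eigvalsh(cov))      # the two smallest eigenvalues are ~0
```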
What I was talking about is not a case that arises with the CBC lattice implemented here. It would arise if we did use [...].

I don't expect any implementation to be particularly accurate in the singular case.
I've updated my gist with a notebook with some more test problems. It turns out that the analytically-known case that I used in my benchmark is a special case of a slightly wider family of a structured covariance matrix where the multivariate normal integral can be reduced down to a univariate integral that is straightforward to numerically integrate with quad(). A related special-case family creates a singular matrix, and it too can be integrated numerically. And finally, it does turn out that one similar special-case family can turn the multivariate t into a double integral that [...]
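As an illustration of the kind of reduction being described (my own sketch under the standard equicorrelation decomposition, not code from the notebook): writing X_i = sqrt(rho)*Z + sqrt(1-rho)*eps_i turns the orthant probability into a one-dimensional integral that quad() handles easily.

```python
import numpy as np
from scipy import integrate, stats

def equicorrelated_orthant(dim, rho, b=0.0):
    # P(X_1 <= b, ..., X_dim <= b) for a standard multivariate normal with
    # constant correlation rho, reduced to a 1-D integral over the shared
    # factor Z.
    def integrand(z):
        return (stats.norm.pdf(z)
                * stats.norm.cdf((b - np.sqrt(rho) * z)
                                 / np.sqrt(1 - rho)) ** dim)
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

print(equicorrelated_orthant(5, 0.5))  # ~1/6 for rho = 0.5, dim = 5
```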
Co-authored-by: Robert Kern <robert.kern@gmail.com>
I've updated my gist with a notebook with some more test problems. It turns out that the analytically-known case that I used in my benchmark is a special case of a slightly wider family of a structured covariance matrix where the multivariate normal integral can be reduced down to a univariate integral that is straightforward to numerically integrate with quad().
I am not sure I understand what conclusion you want to draw.
I am not going to block this. It's in relatively OK shape and the main points have been discussed. Let's finish up the last ones and we can merge this soon.

That notebook has no particular conclusion. I am offering it as an additional source of test problems to be used in our test suite, rather than testing by comparing two ports of only slightly different versions of basically the same MATLAB code. It can also be used to do more thorough comparisons of the relative accuracy between the CBC lattice and Sobol' on a wider variety of problems. I ran those benchmarks last night. My findings from the more restricted problem hold up. If there is interest, I can clean up the visualizations of those results.
Just one: we can drop the Richtmyer lattice; it's always worse.

On the whole, I have not questioned the legitimacy of your questions. I have simply answered them. I anticipated the question about using [...].

The only thing I did question was the relevancy of one complaint about the fairness of how the comparison was done with respect to the boundary issue, most particularly because there was no mechanism by which restoring "fairness" on that particular issue could work in Sobol''s favor in terms of accuracy or time. It was a secondary issue that could not affect the primary conclusion. In the event, it didn't matter. None of the combinations (doing the fastest thing for both, doing the safest thing for both even if it was unnecessary, doing the optimal thing for both) change the relative orderings of accuracy and time.
I am very confident of this, particularly with the new benchmark problems. Yes, I will maintain this code. The inline generation of the QMC points is, by a long stretch, the easiest part of this code to understand and maintain.
For sure! They are not slow, by any means. Their API, though it is a very good general-purpose API, is just less well-adapted to this particular problem. The integrand is arranged such that it can deal with each dimension one by one, vectorizing across all of the samples. I don't need to generate whole blocks of QMC samples, just one column at a time. Lattice rules work really well for that. But even if we did have lattice [...]
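To illustrate the column-at-a-time point above (a sketch with a made-up generating vector, not the PR's actual lattice parameters): for a rank-1 lattice, every coordinate of all n points is a single vectorized expression, so each dimension's column can be produced independently as the integration recursion proceeds.

```python
import numpy as np

n = 31                        # prime number of lattice points
z = np.array([1, 12, 20])     # hypothetical generating vector for 3 dims
i = np.arange(n)
col_1 = ((i * z[1]) % n) / n  # all n samples of dimension 1 at once
print(col_1[:5])
```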
This is something I have in the back of my mind: adding a dimension to a sample/engine. A lot of QMC methods, Sobol' to start with, allow adding a dimension (and are built like that, as here). I am mostly waiting for more people to ask for this before really putting some thought into it. I also have a use case from some people doing DoE, but that might be too niche (and also arguable) for now.

I fixed the PEP8 issues in the docstrings and removed the unused code or added comments about it. I also added tests against generic integrators, Matlab, and the analytical result for the singular case mentioned in #17410 (comment).
LGTM and CI is happy. Let's get this in. Thanks all for working on this and for the discussions. (Unrelated failures on Cirrus, as usual.)
ref = _qsimvtv(20000, df, cov, a - mean, b - mean, rng)[0]
assert_allclose(res, ref, atol=1e-4, rtol=1e-3)

def test_cdf_against_generic_integrators(self):
+1, I had very similar code locally when doing some tests. I also had a simple 1D case with an empirical CDF, but this should be more robust and cover it.
Just for the record, this case is not singular. Also, I presented it as an analytic result for the multivariate normal. Fortunately, it also seems to be true for the multivariate t, regardless of [...]
Ah, I do have a derivation. For the negative orthant bounds, all of the [...]
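A quick numerical check of that observation (a sketch assuming the cdf method added in this PR): for the equicorrelated shape matrix with rho = 0.5, the negative-orthant probability of the multivariate t comes out as 1/(dim + 1) for any df, since scaling the underlying normal vector by a positive random factor does not change the signs of its components.

```python
import numpy as np
from scipy import stats

dim = 4
shape = 0.5 * (np.eye(dim) + np.ones((dim, dim)))
for df in (1, 3, 30):
    dist = stats.multivariate_t(loc=np.zeros(dim), shape=shape, df=df)
    p = dist.cdf(np.zeros(dim))
    print(df, p, 1 / (dim + 1))  # all approximately 0.2
```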
@rkern there is an effort to remove very old Fortran 77 code (gh-18566), and one of the items on the list is [...]. Would this functionality be difficult to incorporate into [...]?
Reference issue
gh-16241
What does this implement/fix?
This adds a CDF method to multivariate_t.

Additional information
Rather than wrapping Genz's Fortran code, I translated his (permissively-licensed) Matlab code so that it would be easier to use NumPy's PRNG (and avoid the equivalent of gh-16142).
That said, the number of random variates needed is always 10*dim. My plan was to compile this with Pythran in a follow-up PR, and to do that I would have to pass in those random variates because I don't think we can pass a Generator or RandomState into a Pythran function (correct me if I'm wrong). In that case, I suppose we could apply the same strategy with the Fortran code, slightly modified to accept all the random variates as an additional input. The advantage is that we wouldn't really need to review the Fortran source; the disadvantage is that we'd need to wrap it.

I still need to look for rough edges and polish them. I suspect that the output shapes are not perfect, I don't know what happens when x is empty, etc. Update: it handles edge cases like multivariate_t.pdf does.

I went ahead and wrote my own translation because comparing existing translations may be a useful part of review.
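A toy sketch of the "pass the variates in" strategy described above (the kernel, its signature, and the estimator are stand-ins, not the PR's actual code): all randomness is drawn from a NumPy Generator outside the function, and the function itself only sees plain arrays, which is the form a Pythran-compiled kernel would need.

```python
import numpy as np
from scipy.special import ndtri

def _toy_kernel(uniform_shifts, a, b):
    # Randomized-QMC-style estimate of P(a < Z < b) for a standard normal,
    # using only the externally supplied uniform shifts for randomness.
    n = 128
    grid = (np.arange(n) + 0.5) / n
    estimates = []
    for shift in uniform_shifts:
        u = (grid + shift) % 1.0
        z = ndtri(u)
        estimates.append(np.mean((z > a) & (z < b)))
    return np.mean(estimates), np.std(estimates)

rng = np.random.default_rng(0)
shifts = rng.random(10)                # all random draws happen out here
print(_toy_kernel(shifts, -1.0, 1.0))  # ~(0.6827, small spread)
```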