Skip to content

Conversation

ev-br
Copy link
Member

@ev-br ev-br commented Mar 1, 2025

Reference issue

A follow-up to #22298

What does this implement/fix?

Use a new _qmvn multivariate normal integrator in gaussian_kde. Previously, gh-22298 replaced the F77 integrator, _mvn.mvnun, with a improved version. Now this PR replaces _mvn.mvnun_weighted, which completes the removal of the F77 original.

Additional information

The first commits only adds tests of the "old" implementation vs multivariate_normal; the second commit changes the implementation, and the third commit removes the Fortran code.

ev-br added 3 commits March 1, 2025 22:36
See scipy/stats/_qmvnt.py for the replacement.
Note that

- mvndst.f was code by Alan Genz
- _qmvnt.py code is a python translation of a(n improved) Matlab
  implementation, also by Alan Genz
@ev-br ev-br added scipy.stats maintenance Items related to regular maintenance tasks labels Mar 1, 2025
@ev-br ev-br requested a review from rgommers as a code owner March 1, 2025 22:45
@github-actions github-actions bot added Fortran Items related to the internal Fortran code base Meson Items related to the introduction of Meson as the new build system for SciPy labels Mar 1, 2025
@ev-br ev-br changed the title Kde mvn Rewrite gaussian_kde.integrate_box, remove _mvn F77 extension Mar 1, 2025
@ev-br ev-br requested review from rkern and mdhaber March 1, 2025 22:45
@lucascolley lucascolley changed the title Rewrite gaussian_kde.integrate_box, remove _mvn F77 extension MAINT: stats: rewrite gaussian_kde.integrate_box, remove _mvn F77 extension Mar 2, 2025
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look at the nitty gritty of the tests yet. Before I do, just wanted to check what prompted the addition of PDF tests - is there nothing like them there already?

Copy link
Member

@rkern rkern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! One question on the API.

Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, too; just one question about whether there's a reason to use atol instead of rtol. It may be fine here, but it's not immediately obvious because MVN pdf/cdf can easily get quite small, so I'd default to rtol unless there's a reason to use atol.

warnings.warn(msg, stacklevel=2)

return value
low, high = low_bounds - self.dataset.T, high_bounds - self.dataset.T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that dataset is atleast_2d. Also, it looks like low_bounds and high_bounds are only supposed to work with a single point at a time, and that will always be aligned along the -1 axis (which is why we transpose dataset).

high, lower_limit=low, cov=self.covariance, maxpts=maxpts
)
# XXX: xp.linalg.vecdot is much faster than (a*b).sum(axis=-1)
return (values * self.weights).sum(axis=-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also checked that weights is already normalized. 👍

assert_allclose(
gkde(xx),
stats.norm.pdf(xx[:, None], loc=loc, scale=scale).sum(axis=-1) / gkde.n,
atol=1e-14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use atol throughout? If not, I'd default to rtol. Especially when looking at multivariate normal later, it's not immediately obvious that assertions about the PDF with atol=1e-14 really test anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

¯\_(ツ)_/¯ this is just my default to "nearly equals in double precision", with an order of magnitude of a wiggle room.

Copy link
Contributor

@mdhaber mdhaber Mar 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then please consider changing that default to rtol=1e-14, at least when working on stats and special functions.

arg = xx[:, None, :] - gkde.dataset.T
pdf = stats.multivariate_normal.pdf
assert_allclose(
gkde(xx.T),
Copy link
Contributor

@mdhaber mdhaber Mar 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance,

gkde(xx.T)
# [5.35496592e-02 3.93564062e-04 6.68839968e-14]

so with atol=1e-14, the test at point [5, 6] is not doing much, and it's allowing quite different relative precision for the other values. Seeing [5, 6] as the argument to MVN is actually what prompted the comment.

rtol=1e-14 typically aligns better with the intent "nearly equals in double precision"1.

Footnotes

  1. assert_array_max_ulp is even better (slightly), except for the name. It might be worth redefining xp_assert_close and such in terms of ULPs, though, or at least allowing a ulp option.

@mdhaber
Copy link
Contributor

mdhaber commented Mar 3, 2025

Alright, thanks @ev-br and reviewers!

Oops looks like there is some value in the history here, so I'll squash the last three commits into one before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Fortran Items related to the internal Fortran code base maintenance Items related to regular maintenance tasks Meson Items related to the introduction of Meson as the new build system for SciPy scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants