Fix for scipy issues 18506 and 18511 #987

mborland · 2023-05-23T10:40:57Z

All of the member variables of the hypergeometric distribution were of type unsigned, leading to overflow when n*N exceeded UINT_MAX. Replace them with std::uint64_t. Similar to #939.
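As a rough illustration of the overflow described above (not code from the PR), the sketch below compares a product taken in 32-bit unsigned arithmetic with the same product taken in 64 bits. The function names `product_u32` and `product_u64` are hypothetical and exist only for this example.

```cpp
#include <cstdint>

// Hypothetical helper: multiply in 32-bit unsigned arithmetic.
// The result wraps modulo 2^32 once the true product exceeds UINT32_MAX.
inline std::uint32_t product_u32(std::uint32_t a, std::uint32_t b)
{
    return static_cast<std::uint32_t>(a * b);
}

// Hypothetical helper: the same product with 64-bit operands.
// This only wraps once the true product exceeds UINT64_MAX.
inline std::uint64_t product_u64(std::uint64_t a, std::uint64_t b)
{
    return a * b;
}
```

With both arguments equal to 70000, the true product is 4900000000, which does not fit in 32 bits: `product_u32` returns 605032704, while `product_u64` returns the exact value.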

x-ref: scipy/scipy#18506 and scipy/scipy#18511

mborland · 2023-05-23T13:19:31Z

@jzmaddock Can you please take a look at this?

jzmaddock · 2023-05-23T16:46:02Z

OK this is nearly right (but see below).

The changes to use [u]int64_t are fine.

However, matching changes are required in detail/hypergeometric_pdf.hpp, detail/hypergeometric_cdf.hpp and detail/hypergeometric_quantile.hpp, otherwise the arguments will simply get narrowed to unsigned internally.

The changes to some of the functions are not; consider the original version of variance:

   template <class RealType, class Policy>
   inline RealType variance(const hypergeometric_distribution<RealType, Policy>& dist)
   {
      RealType r = static_cast<RealType>(dist.defective());
      RealType n = static_cast<RealType>(dist.sample_count());
      RealType N = static_cast<RealType>(dist.total());
      return n * r  * (N - r) * (N - n) / (N * N * (N - 1));
   } // RealType variance(const hypergeometric_distribution<RealType, Policy>& dist)

This I think does the right thing - which is to say carry out all computations as floating point arithmetic, and your changes I think break that.

The reason the old mean was failing in scipy/scipy#18511 was because we hadn't applied this workaround there, and computation was as integers:

   template <class RealType, class Policy>
   inline RealType mean(const hypergeometric_distribution<RealType, Policy>& dist)
   {
      return static_cast<RealType>(dist.defective() * dist.sample_count()) / dist.total();
   } // RealType mean(const hypergeometric_distribution<RealType, Policy>& dist)

So that should use a bit of casting as per the original variance implementation to carry out FP arithmetic not integer.
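A minimal sketch of the casting suggested here, written as free functions rather than against the real `hypergeometric_distribution` class. The names `mean_integer_sketch` and `mean_fp_sketch` are hypothetical, and this is not the code that was merged. It only contrasts multiplying as integers with converting each argument to floating point first.

```cpp
#include <cstdint>

// Hypothetical: multiply as integers, then convert. The product r * n
// can wrap before the conversion to RealType ever happens.
template <class RealType>
RealType mean_integer_sketch(std::uint64_t r, std::uint64_t n, std::uint64_t N)
{
    return static_cast<RealType>(r * n) / static_cast<RealType>(N);
}

// Hypothetical: convert each argument first, in the style of the
// original variance(), so all arithmetic is floating point.
template <class RealType>
RealType mean_fp_sketch(std::uint64_t r, std::uint64_t n, std::uint64_t N)
{
    RealType rr = static_cast<RealType>(r);
    RealType nn = static_cast<RealType>(n);
    RealType NN = static_cast<RealType>(N);
    return rr * nn / NN;
}
```

For r = n = 2^33 and N = 2^34 the integer product is 2^66, which wraps to 0 even in 64 bits, so the integer version returns 0. The floating-point version returns 2^32, the correct mean.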

Now for the open question: I assume that we're calculating the complement of the CDF here? And that the OP in scipy/scipy#18506 is just upping the population size until some limit is reached (or something breaks)? If so then using int64_t will help, but we will still break in the end... but I'm conflicted here because the arguments are logically integers!

mborland · 2023-05-24T07:49:38Z

> OK this is nearly right (but see below).

Hit all of the changes in the latest commit.

> Now for the open question: I assume that we're calculating the complement of the CDF here? And that the OP in scipy/scipy#18506 is just upping the population size until some limit is reached (or something breaks)? If so then using int64_t will help, but we will still break in the end... but I'm conflicted here because the arguments are logically integers!

I think like #939 we try bumping it up to uint64_t, and see if the users hit another wall. We could try conditionally using __int128, but I am not quite sure how pybind11 works with extension types.

WarrenWeckesser · 2023-05-24T18:12:22Z

+1 for changing the type to uint64_t (and for fixing the mean calculation). This moves the "breaking point" pretty far out there, and it seems like a reasonable change even without the motivation from the SciPy issues.

On the SciPy side, we need to ensure that Python integers that are too large to be represented as 64 bit integers are handled appropriately instead of allowing them to end up in 64 bit variables.

mborland · 2023-05-25T07:16:21Z

@jzmaddock The only CI failure is a failure to clone, so this should be good if you want to take another look.

jzmaddock · 2023-05-25T17:45:16Z

Looks good to me - we just need to update the docs to int64_t.

mborland added 3 commits May 23, 2023 10:52

Replace 32 bit unsigned with 64 bits

f968bec

Fix stack overflow

54b1722

Add tests for issue 18511

29fbc5b

Change types in hypergeometric cdf and pdf impls

94c68f8

Update docs

ba36dbe

mborland merged commit 6bfe581 into boostorg:develop May 31, 2023

mborland deleted the 18511 branch May 31, 2023 07:39

mdhaber mentioned this pull request Jun 1, 2023

MAINT: stats.hypergeom.mean: correct for large args scipy/scipy#18602

Merged
