
Conversation

@AnirudhDagar (Member) commented Feb 16, 2022

Reference PR

gh-8707

What does this implement/fix?

# Example: loading the ascent dataset with the new module
>>> from scipy import datasets
>>> datasets.ascent()
array([[ 83,  83,  83, ..., 117, 117, 117],
       [ 82,  82,  83, ..., 117, 117, 117],
       [ 80,  81,  83, ..., 117, 117, 117],
       ...,
       [178, 178, 178, ...,  57,  59,  57],
       [178, 178, 178, ...,  56,  57,  57],
       [178, 178, 178, ...,  57,  57,  58]])

A couple of years ago, gh-8707 proposed introducing a datasets submodule and moving a handful of dataset functions from the misc module into it. Thanks to @WarrenWeckesser for the initial work. With this PR (indeed inspired by gh-8707) we resume those efforts, with some improvements (explained below), and move away from the scipy.misc module, finally deprecating it (in a separate PR, #15901).

  • Add the scipy.datasets submodule
  • Utilize Pooch to handle the dataset downloading and caching.
  • Enable Meson build support for the scipy.datasets submodule

TODO Later:

  • Update the base URL and registry dataset links once we decide on the approach and create the respective repos in the scipy organization.

Additional information

Why should we use Pooch?
OR
What are some other possible options (e.g. intake, one of the suggestions made earlier), and why not prefer them?

There are a few good reasons for choosing Pooch:

  1. With Pooch we can easily decouple the datasets currently kept inside the SciPy repository and move them to their own new repositories. For example, see https://github.com/scipy-datasets, where each dataset has its own repository. This leads to a lighter SciPy wheel, decreasing the download size for future releases. Whether to keep the datasets in individual repos or in a single scipy-datasets repo is a point of discussion; note that GitHub has a 2 GB limit on repository size. If we plan to add relatively large new datasets in the future, keeping the datasets in the SciPy repo would not be a wise choice, and even keeping all the datasets in a single separate dataset repo might break after a while. The best approach IMO would be to group them by dataset type into a few repositories, or just keep them individually (i.e. each dataset has its own repo).

  2. Dependency: Pooch is an extremely light package with only a few dependencies, so if we add it as a new dependency we can expect it to stay small and not pull in a lot of sub-dependencies. On the other hand, something like intake has a few more deps (see its requirements.txt), and the basic intake package with some additional extras will also install packages like dask that are not required for SciPy datasets. Both of them use a library called appdirs, which is quite small and useful.

  3. IMO Pooch has a simpler API overall and seems easy for developers to implement and maintain, while still supporting the key features like caching, hash checks, and downloading multiple files (a minimal usage sketch is shown below). Intake is slightly more complex with its catalog-style YAML files, and it certainly has some extra features, but we surely don't need those to solve the problem of dataset downloading and caching.
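For illustration, a minimal sketch of the intended Pooch usage; the base URL is a placeholder (the real location is still TBD, see the TODO above), and hash verification is skipped in the sketch, whereas in practice each file would get a pinned sha256 hash:

import pooch

# Placeholder base URL; the real dataset repos/URLs are still to be decided.
fetcher = pooch.create(
    path=pooch.os_cache("scipy-data"),    # per-user cache directory
    base_url="https://example.com/scipy-datasets/",
    registry={"ascent.dat": None},        # None skips hash verification; real code pins a sha256
)

fname = fetcher.fetch("ascent.dat")       # downloads on first call, reuses the cache afterwards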

P.S. scikit-image also recently (in 2020) made the move to Pooch for some of the same reasons.

cc @rgommers

@stefanv (Member) commented Feb 17, 2022

/cc @grlee77 @hmaarrfk Any lessons from the skimage fetcher that should apply here?

@rgommers (Member)

2. Dependency: Pooch is an extremely light package and has only a few dependencies, so if we were to add a new dependency i.e. Pooch, we can expect it to be small and at the same time it won't add a lot of sub-dependencies

scikit-image doesn't depend on Pooch; it's an optional dependency (e.g., pip install scikit-image[data]). We should probably do something similar, since dragging it in by default seems like a bit much. Xref scikit-image/scikit-image#5105 for the relevant discussion.

@ilayn (Member) commented Feb 17, 2022

The caching and downloading parts can be done without extra dependencies. Requests is quite a hefty dependency for a lightweight lib, in my opinion. We can simply check a file against a predefined hash and otherwise get it with some urllib machinery.

Also, if scikit-image is providing these, then we should probably delegate to it, since they are the gold standard for image-related stuff.

Note, however, that we need a long deprecation period, as we witnessed before with imresize and other related functions that haunted us for at least four major versions.

@andyfaff (Contributor)

I'm with Ralf on the dependencies; I definitely wouldn't want to be adding more hard dependencies, only optional ones (if at all).

In any case, it's probably worth asking ourselves "Would we add the same files to the project nowadays?" and "Do we think it's worthwhile to add more datasets/images?". If the answer to these questions is no then why don't we just deprecate and remove them completely?

@ev-br (Member) commented Feb 17, 2022

What Andrew said. If we go down to pip install scipy[data] or some such, why not go all the way to a separate scipy_datasets repo/project in the scipy organization?

And this is orthogonal to moving/removing derivative/central_diff_weights; best keep these two things separate.

@rgommers (Member)

In any case, it's probably worth asking ourselves "Would we add the same files to the project nowadays?" and "Do we think it's worthwhile to add more datasets/images?". If the answer to these questions is no then why don't we just deprecate and remove them completely?

That's a very fair question. I think the answer is that we do actually want these for examples, tutorials and tests, and ideally want some more.

For image data we could ask: why not just use skimage.data (https://scikit-image.org/docs/stable/api/skimage.data.html) and add new images there? That has a few pros and cons.

In addition, we also want:

  • time series data (like electrocardiogram)
  • stats data (like iris, and test/benchmark data like the NIST data we already ship)
  • optimize benchmark data, like the Netlib LP data in benchmarks/benchmarks/linprog_benchmark_files/
  • all the special functions reference data in scipy/special/tests/data/

@AnirudhDagar didn't really focus on the yes/no for scipy.datasets here, because the linked PR from @WarrenWeckesser got thorough and largely positive feedback (#8707 (comment)) which still applies. In addition, I think the importance of sdist/wheel size reduction has grown since then, and there's more to gain than was pointed out in that thread. I recently counted, and came to about 18 MB of text/data files that we ship (uncompressed, didn't carefully measure how much wheel sizes reduce with all that removed).

So my impression is that:

  • the case for having a single place for all our datasets needs is strong,
  • depending on scikit-image for image data, scikit-learn for iris, etc. is not a good solution,
  • so the choice is between scipy.datasets and a separate scipy-datasets. For that:
    • functionality is the same either way
    • we need it to make scipy.test() work
    • it's more discoverable as part of the main scipy docs
    • release/docs/maintenance-wise it's easier to have it integrated rather than having an optional dependency on a (pinned version of) scipy-datasets.

Regarding file sizes: my impression is that >95% of users will not need scipy.datasets contents, unless it becomes much more popular in the future, so we can assume this is disk space & network bandwidth saved. I did a little calculation for PyPI:

  • assuming the 18 MB on disk translates to a ~3 MB wheel size reduction, then with the current 35M downloads/month we save 35e6 * 12 * 3e6 / 1e15 # downloads/month * months * bytes-saved-per-wheel / bytes-per-PB = 1.3 PB/yr
  • PyPI bandwidth usage is about 1 PB/day (see https://dustingram.com/articles/2021/04/14/powering-the-python-package-index-in-2021/)
  • So it's about 0.35% of total PyPI bandwidth size (a little handwaving, but I'm reasonably sure it's >0.1% unless I made a calculation error)

@charris (Member) commented Feb 18, 2022

I expect the sdists need to include tests, but do the binary wheels need them?

@rgommers (Member)

I expect the sdists need to include tests, but do the binary wheels need them?

We've always wanted this so users can actually diagnose install issues, and I personally use it a lot as well. It could be separated out as another package (e.g., scipy-testsuite), but also that is quite a bit of work. The main downside of actually including tests is data size, so if we can solve that here with a datasets package, that'd be my preference.

@hmaarrfk (Contributor)

Personally, I think Pooch is a great optional dependency to have.
The authors have been very responsive in the past and mindful about backward compatibility, and Pooch does things like checking the hash for you. I don't think it is worth recreating that yourself.

I think the dataset API should be abstracted away from files. Users should see data and python objects, not filenames.

We had trouble in scikit-image with users relying on knowing the data directory, and with some of our examples being written in terms of file paths traversing a known location.

I think if you wanted to get away from requests, you would have better luck making patches to Pooch and suggesting them in PRs.

@rgommers (Member)

I think the dataset API should be abstracted away from files. Users should see data and python objects, not filenames.

Yes, I agree.

I think if you wanted to get away from requests, you might have better effect making patches to pooch and suggesting them in PRs.

There's no need to specifically avoid requests. I saw @ilayn's comment higher up, but urllib may get deprecated and itself recommends requests. If we optionally depend on Pooch, this is just something we shouldn't care about (and if we did, requests is the gold standard).

@ilayn (Member) commented Feb 24, 2022

urllib may get deprecated and itself recommends requests

That's really good news for me. Miraculously I missed the whole PSF takeover. Makes many things so much simpler.

@AnirudhDagar (Member, Author) commented Feb 28, 2022

Thanks all for the feedback; I completely understand why it makes sense to have pooch as an optional dependency. Does this mean that SciPy should pass all the tests without pooch after a build (we could somehow skip the datasets module tests)? Or can we add pooch as a requirement for the tests on CI, so that the tests naturally pass (as done in the PR at the moment)?

I guess a user won't care about tests, and it is fine to hard-depend on pooch for the tests to pass, as we already do in the case of matplotlib. (On second thought, I'm not sure about this.)

When using SciPy, the dependency still remains optional: until someone actually imports the datasets module or calls one of the dataset functions, all is good. IIUC, when scipy.datasets is imported, an ImportError should be raised if pooch is missing, asking the user to install it at the same time.

@rgommers (Member)

I guess a user won't care about tests and it is fine to hard depend on pooch for tests to pass, as we do depend in the case of matplotlib as well. (on a second thought, not sure about this)

It's fine to have a hard dependency on Pooch for the test suite. It'll be hard to avoid, too many tests will depend on it. Matplotlib is a different story - it's literally only a handful of tests that depend on it.

@hmaarrfk (Contributor)

Some downstream build systems might have restrictions on connecting to the internet during their testing phase.

Finally, there can be spurious failures when downloading large datasets in CI.

We actually allow users to run the tests without pooch installed:

https://github.com/scikit-image/scikit-image/blob/725e8f6bcccc7d4e78ba0d64168f8ed76b6dc818/skimage/data/_fetchers.py#L143

@hmaarrfk (Contributor)

I mostly wanted to point to our implementation rather than to say what you should do.

@rgommers (Member)

Some build systems downstream might have restrictions on connecting to the internet. During their testing phase.

This is easy to avoid by downloading in the install phase in CI scripts (more efficient anyway probably). So when you do:

python dev.py --build-only   # or `pip install .`

to install SciPy, then right after run whatever the command will be to download all the datasets used in CI.

@hmaarrfk (Contributor)

to install SciPy, then right after do whatever the command will be to download all datasets used in CI.

I believe some are even restricted against doing this.

I think one example is Debian.

Anyway, what we did was to skip any test where pooch failed for whatever reason.

@AnirudhDagar (Member, Author)

Anyway, what we did was to skip any test where pooch failed for whatever reason.

This sounds like a good idea to me, provided we skip only when the failure is due to a problem with pooch downloading the dataset.
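A rough sketch of what that could look like (not the scikit-image code, just an illustration; it assumes the ascent() fetcher from this PR and its known 512x512 shape):

import pytest

pooch = pytest.importorskip("pooch")   # skip these tests entirely if pooch is not installed


def test_ascent():
    from scipy import datasets
    try:
        img = datasets.ascent()
    except Exception as exc:           # e.g. a download failure on a network-restricted box
        pytest.skip(f"could not fetch dataset: {exc}")
    assert img.shape == (512, 512)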

@rgommers (Member)

I think one example is debian.

It must be allowed to run multiple commands I'd think. We already require:

git clone ...
git submodule update --init

That should be the "obtain sources" step. If needed we could make it possible to run some script straight after without first building SciPy.
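For what it's worth, a minimal sketch of what such a standalone download script could look like (the URL and file names are placeholders; it only needs pooch, not a built SciPy):

import pooch

BASE_URL = "https://example.com/scipy-datasets/"   # placeholder, see the TODO in the PR description
# file name -> pinned hash (None skips the check; real code would pin sha256 hashes)
REGISTRY = {"ascent.dat": None, "face.dat": None, "ecg.dat": None}


def download_all(path=None):
    """Fetch every registered data file into the local cache."""
    path = path or pooch.os_cache("scipy-data")
    for fname, known_hash in REGISTRY.items():
        pooch.retrieve(url=BASE_URL + fname, known_hash=known_hash, fname=fname, path=path)


if __name__ == "__main__":
    download_all()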

@oscarbenjamin

Maybe the problem is that this does not install the pooch dependency:

pip install git+https://github.com/scipy/scipy@main

Maybe the dependency should be added in setup.py.

@AnirudhDagar (Member, Author) commented Aug 15, 2022

Thanks @oscarbenjamin for the report. Yes, pooch is an optional dependency, and thus we don't require it. I'm curious, does SymPy use the datasets module already? Or are you running all the SciPy tests in SymPy CI (I've not checked SymPy CI yet, will only have time to look later today)? This warning should only hit if you try to import scipy.datasets or use a method from that module.

All other sections/submodules within SciPy can still be used without installing/requiring Pooch.

@AnirudhDagar (Member, Author)

Ohh interesting, is it just the install from git that is triggering the issue?

@oscarbenjamin

No, SymPy does not use the datasets module. Somehow this PR has affected SymPy importing other things from SciPy as part of lambdify. The code that does this is quite convoluted, so it's not easy to say exactly what import statement is being used:
https://github.com/sympy/sympy/blob/master/sympy/utilities/lambdify.py

@oscarbenjamin

Ohh interesting, is it just the install from git that is triggering the issue?

Yes, somehow SymPy CI fails with SciPy main since this PR was merged. More discussion in sympy/sympy#23932.

@AnirudhDagar (Member, Author)

The file is indeed convoluted and I need to look closely, but at first glance, maybe it is L106 importing everything from SciPy.

https://github.com/sympy/sympy/blob/b1f35f8daf5efa9f801f905cf1037000d69b0278/sympy/utilities/lambdify.py#L106

I'll be happy to help out and dig deeper into the SymPy code later in the day (slightly busy atm); I hope it's not too urgent, and it's fine if I reply late evening or by tomorrow morning. Thanks!

@rgommers (Member) commented Aug 15, 2022

Yes, from scipy import * will import scipy.datasets, because datasets is added to __all__. I suggest fixing it by removing from scipy import *; that is really bad practice, and it is then expected that optional components get pulled in. There is also not much of interest in the main scipy namespace that even makes it useful to star-import.
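As a small generic illustration of that mechanism (hypothetical package names): listing a submodule in a package's __all__ makes from package import * import that submodule, even though the package's __init__.py never imports it.

# pkg/__init__.py
__all__ = ["datasets"]            # submodule listed here, but NOT imported in this file

# pkg/datasets.py
import some_optional_dependency   # raises ImportError if the optional dependency is missing

# user code
import pkg                        # fine: pkg/datasets.py is never executed
from pkg import *                 # imports pkg.datasets because of __all__, so it can fail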

@oscarbenjamin

To be clear (since I haven't managed to build scipy main locally): does this PR make it an ImportError to do from scipy import * without having pooch installed?

@AnirudhDagar (Member, Author)

To be clear since I haven't managed to build scipy main locally does this PR make it an importerror to do from scipy import * without having pooch installed?

Importing everything will also include the scipy.datasets submodule, which depends on pooch; since we don't make that a requirement, then yes, importing everything will raise an ImportError asking you to install pooch. But as Ralf said, import * is bad practice and not something you want to do.

@oscarbenjamin

I agree that the sympy code should be changed, but does this not sound like a scipy regression to you?

@rgommers (Member)

Not really, no. There are many non-idiomatic things one can do, and in this case:

@oscarbenjamin

The sympy code should be changed in any case (although I would prefer not to have to do that at release candidate stage) so let's not be concerned with that here.

I imagine that from scipy import * has worked just fine for a long time, even if you don't consider it idiomatic, and I expect that a lot of code depends on it. Obviously it's your call, but to me this seems like a fairly gratuitous breakage.

@rgommers (Member)

Okay, let's open an issue for it, tag it with the 1.10.0 milestone, and see if we need to do something here before the next release. I don't think we want to change it right now; main is actually the right place to make changes and smoke out other possible issues.

@AnirudhDagar (Member, Author) commented Aug 15, 2022

It honestly will make a lot of users unhappy; I'm not sure if there is a way to delay the ImportError being raised.

Something like this inside datasets/__init__.py would avoid the error, and we could shift the error to _fetchers.py, but it is obviously not the best idea, because a user who wants to use this would then need to install pooch and re-import, since the module would have no functions attached to it:

import warnings

try:
    import pooch
    from ._fetchers import face, ascent, electrocardiogram  # noqa: E402
    __all__ = ['ascent', 'electrocardiogram', 'face']

except ImportError:
    warnings.warn("Pooch not found! Please install pooch before using scipy.datasets. "
                  "After installing pooch, try importing scipy.datasets again .....")

from scipy._lib._testutils import PytestTester
test = PytestTester(__name__)
del PytestTester
....

@rgommers (Member)

After installing pooch, try importing scipy.datasets again

A second import statement for the same module is a no-op, so that may not do what you intend.
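A small generic illustration of why re-importing doesn't help (standard Python import caching, nothing SciPy-specific):

import importlib
import sys

import json                    # first import: executes the module body
first = sys.modules["json"]

import json                    # second import: a no-op, served from the sys.modules cache
assert sys.modules["json"] is first

importlib.reload(json)         # only an explicit reload re-executes the module body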

@AnirudhDagar (Member, Author)

Ohhh I see, that's correct; it will not work until the user restarts the process. I don't have any ideas atm to avoid this.

@oscarbenjamin

not sure if there is a way to delay the import error raised.

There are lots of ways that you can make this work if you want to make it work. The code here goes out of its way to raise ImportError. If you wanted the module to be importable but without some of its features when pooch is not installed then you just do something like:

try:
    import pooch
except ImportError:
    pooch = None
    data = None
else:
    data = pooch.create(...)   # registry, base_url, etc. as before

def fetch_data(*args, **kwargs):
    if data is None:
        # raise an error saying that pooch is not installed
        raise ImportError("pooch is not installed; it is required for scipy.datasets")
    return data.fetch(*args, **kwargs)

Then everywhere you currently call data.fetch you can just call fetch_data instead.

will not work until the user restarts the process. I don't have any ideas atm to avoid this

In general it is always necessary to restart the process after installing or uninstalling things, so that doesn't seem like a big deal. If you did want to avoid that, though, then you would just move import pooch into the fetch_data function rather than having it at top level.
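A minimal sketch of that lazy-import variant (the base URL and file names are placeholders, same caveats as above):

def fetch_data(dataset_name):
    try:
        import pooch           # deferred to call time, so installing pooch later "just works"
    except ImportError as e:
        raise ImportError("Missing optional dependency 'pooch' required for scipy.datasets "
                          "module. Please use pip or conda to install 'pooch'.") from e
    fetcher = pooch.create(
        path=pooch.os_cache("scipy-data"),
        base_url="https://example.com/scipy-datasets/",   # placeholder
        registry={"ascent.dat": None, "face.dat": None, "ecg.dat": None},
    )
    return fetcher.fetch(dataset_name)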

@AnirudhDagar (Member, Author) commented Aug 17, 2022

Thanks @oscarbenjamin, this is good and should work. Sorry for the negligence on my side; this was quite straightforward. Moving the pooch import into scipy/datasets/_fetchers.py will do the trick. @rgommers, do you think it is okay to not raise a warning/error if someone does from scipy import datasets? Now that I think of it, I feel it should have been that way from the start. Got that wrong :/

If that's okay, I'll send a PR to fix this now.

After the change:

>>> from scipy import datasets # this does not raise
>>> datasets.ascent()  # raises only on the function call
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gollum/Desktop/Work/scipy/build-install/lib/python3.10/site-packages/scipy/datasets/_fetchers.py", line 76, in ascent
    fname = fetch_data("ascent.dat")
  File "/Users/gollum/Desktop/Work/scipy/build-install/lib/python3.10/site-packages/scipy/datasets/_fetchers.py", line 35, in fetch_data
    raise ImportError(msg)
ImportError: Missing optional dependency 'pooch' required for scipy.datasets module. Please use pip or conda to install 'pooch'.

@asmeurer

I disagree that SymPy should change here. from mod import * is how lambdify works, to import all the names into the namespace (see the "how it works" section here https://docs.sympy.org/dev/modules/utilities/lambdify.html#sympy.utilities.lambdify.lambdify).

If from scipy import * hard-imports an external library, then that library has effectively become a hard dependency of scipy. SymPy is hardly going to be the only project doing an import *, so this will likely affect more than just it.

I'm actually surprised that you can add something to __all__ without it being imported by a plain import scipy. I never realized __all__ worked like that.

@rgommers (Member)

I disagree that SymPy should change here. from mod import * is how lambdify works, to import all the names into the namespace (see the "how it works" section here https://docs.sympy.org/dev/modules/utilities/lambdify.html#sympy.utilities.lambdify.lambdify).

I can't understand from the link why from scipy import * is needed. You're not also recursively star-importing scipy.submodule, right? 99.x% of scipy's functionality is present in those submodules, there's only a handful of items in the main namespace. So it looks to me like this is not actually necessary. And it's most definitely bad practice.

@asmeurer

It's pulling everything from scipy and scipy.special. I guess as Oscar showed at sympy/sympy#23945, we really only need scipy.special.

And it's most definitely bad practice.

import * is bad practice in certain situations, but perfectly acceptable in others. If you really believe that it's universally bad, then why does scipy even define __all__?

@rgommers (Member)

then why does scipy even define __all__?

__all__ has always been used to define all public symbols, and we use it also to verify that all public functions are present in the API docs.

AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Oct 7, 2022
download_all.py can be run as a simple script to download
all the data files required by scipy.datasets at once.
This is useful for packages that may have some restrictions
e.g, scipy#15607 (comment)
rgommers pushed a commit that referenced this pull request Nov 24, 2022
`download_all.py` can be run as a simple script to download
all the data files required by `scipy.datasets` at once.
This is useful for packages that may have some restrictions
e.g, #15607 (comment)

[skip azp] [skip github]