-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
ENH: Add scipy.datasets
submodule
#15607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/cc @grlee77 @hmaarrfk Any lessons from the skimage fetcher that should apply here? |
Scikit-image doesn't depend on Pooch, it's an optional dependency (e.g., |
Caching and downloading parts can be done without the dependencies. Requests is quite a hefty dependence for a light weight lib in my opinion. We can simply check a file with a predefined hash and otherwise get it with some urllib machinary. Also if scikit image is providing these then probably we should delegate to it since they are the golden standard for image related stuff. Note however that we need a long deprecation time as we witnessed before with imresize and other related functions that haunted us for at least four major versions. |
I'm with Ralf on the dependencies, I definitely wouldn't want to be adding more hard dependencies, only optional ones (if at all). In any case, it's probably worth asking ourselves "Would we add the same files to the project nowadays?" and "Do we think it's worthwhile to add more datasets/images?". If the answer to these questions is no then why don't we just deprecate and remove them completely? |
What Andrew said. If we go down to And this is orthogonal to moving/removing |
That's a very fair question. I think the answer is that we do actually want these for examples, tutorials and tests, and ideally want some more. For image data we could say why not just use
In addtion, we also want:
@AnirudhDagar didn't really focus on the yes/no for So my impression is that:
Regarding file sizes: my impression is that >95% of users will not need
|
I expect the sdists need to include tests, but do the binary wheels need them? |
We've always wanted this so users can actually diagnose install issues, and I personally use it a lot as well. It could be separated out as another package (e.g., |
Personally, i think Pooch is a great optional dependency to have. I think the dataset API should be abstracted away from files. Users should see data and python objects, not filenames. We had trouble with scikit image with users knowing the data directory and some of our examples being in the form of filepath traversing a known path. I think if you wanted to get away from requests, you might have better effect making patches to pooch and suggesting them in PRs. |
Yes, I agree.
There's no need to specifically avoid |
That's really good news for me. Miraculously I missed the whole PSF takeover. Makes many things so much simpler. |
Thanks all for the feedback, completely understand why it makes sense to have pooch as an optional dependency. Does this mean that I guess a user won't care about tests and it is fine to hard depend on pooch for tests to pass, as we do depend in the case of When using SciPy, the dependency still remains optional, until someone actually imports the datasets module or calls one of the datasets functions, all is good. IIUC when |
3af2bb2
to
cb955bc
Compare
It's fine to have a hard dependency on Pooch for the test suite. It'll be hard to avoid, too many tests will depend on it. Matplotlib is a different story - it's literally only a handful of tests that depend on it. |
Some build systems downstream might have restrictions on connecting to the internet. During their testing phase. Finally, there can be spurious failures with downloading large datasets in CIs We actually allow users to test without pooch install: |
i mostly wanted to point to our implementation rather than to say you should do. |
This is easy to avoid by downloading in the install phase in CI scripts (more efficient anyway probably). So when you do:
to install SciPy, then right after do whatever the command will be to download all datasets used in CI. |
fe40cec
to
f90d303
Compare
I believe some are even restricted against doing this. I think one example is debian. Anyway, what we did was to skip any test where pooch failed for whatever reason. |
This sounds like a good idea to me if we decide to skip only if the failure is due to a problem with pooch downloading the dataset. |
It must be allowed to run multiple commands I'd think. We already require:
That should be the "obtain sources" step. If needed we could make it possible to run some script straight after without first building SciPy. |
Maybe the problem is that this does not install the pooch dependency:
Maybe the dependency should be added in |
Thanks @oscarbenjamin for the report. Yes, All other sections/submodules within SciPy can still be used without installing/requiring |
Ohh interesting, is it just the install from git that is triggering the issue? |
No SymPy does not use the datasets module. Somehow this PR has affected SymPy importing other things from SciPy as part of lambdify. The code that does this is quite convoluted so it's not easy to say exactly what import statement is being used: |
Yes, somehow SymPy CI fails with SciPy main since this PR was merged. More discussion in sympy/sympy#23932. |
The file indeed is convoluted, and I need to look closely, but from the first glance, maybe it is L106 importing everything from SciPy. I'll be happy to help out and dig deeper in SymPy code later in the day (slightly busy atm); I hope it's not too urgent, and is fine if I reply late evening or by tomorrow morning. Thanks! |
Yes, |
To be clear since I haven't managed to build scipy main locally does this PR make it an importerror to do |
Importing everything will also include |
I agree that the sympy code should be changed but does this not sound like a scipy regression to you? |
Not really, no. There's many non-idiomatic things one can do, and in this case:
|
The sympy code should be changed in any case (although I would prefer not to have to do that at release candidate stage) so let's not be concerned with that here. I imagine that |
Okay, let's open an issue for it, tag it with the |
It honestly will make a lot of users unhappy, not sure if there is a way to delay the import error raised. Something like this inside
|
A second import statement for the same module is a no-op, so that may not do what you intend. |
Ohhh I see, that's correct, will not work until the user restarts the process. I don't have any ideas atm to avoid this. |
There are lots of ways that you can make this work if you want to make it work. The code here goes out of its way to raise ImportError. If you wanted the module to be importable but without some of its features when pooch is not installed then you just do something like: try:
import pooch
except ImportError:
pooch = None
data = None
else:
data = pooch.create(...)
def fetch_data(...):
if data is None:
raise an error saying that pooch is not installed
return data.fetch(...) Then everywhere you currently call
In general it is always necessary to restart the process after install or uninstalling things so that doesn't seem like a big deal. If you did want to avoid that though then you would just move |
Thanks @oscarbenjamin this is good and should work. Sorry for the negligence on my side, this was quite straightfwd. Moving pooch import into If that's okay, I'll send a PR to fix this now. After the change: >>> from scipy import datasets # this does not raise
>>> datasets.ascent() # raise only on function call
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/gollum/Desktop/Work/scipy/build-install/lib/python3.10/site-packages/scipy/datasets/_fetchers.py", line 76, in ascent
fname = fetch_data("ascent.dat")
File "/Users/gollum/Desktop/Work/scipy/build-install/lib/python3.10/site-packages/scipy/datasets/_fetchers.py", line 35, in fetch_data
raise ImportError(msg)
ImportError: Missing optional dependency 'pooch' required for scipy.datasets module. Please use pip or conda to install 'pooch'. |
I disagree that SymPy should change here. If SciPy is hard importing an external library with I'm actually surprised that you can add something to |
I can't understand from the link why |
It's pulling everything from scipy and scipy.special. I guess as Oscar showed at sympy/sympy#23945, we really only need scipy.special.
|
|
download_all.py can be run as a simple script to download all the data files required by scipy.datasets at once. This is useful for packages that may have some restrictions e.g, scipy#15607 (comment)
`download_all.py` can be run as a simple script to download all the data files required by `scipy.datasets` at once. This is useful for packages that may have some restrictions e.g, #15607 (comment) [skip azp] [skip github]
Reference PR
gh-8707
What does this implement/fix?
With gh-8707, a couple of years ago, SciPy wanted to introduce the
datasets
submodule and move a handful of dataset functions from the currentmisc
module to this newdatasets
submodule. Thanks to @WarrenWeckesser for the initial work. With this PR (indeed inspired by gh-8707) we resume those efforts after making some improvements (explained below) and move away from thescipy.misc
module, finally deprecating it (in a separate PR #15901).scipy.datasets
submodulescipy.datasets
submoduleTODO Later:
scipy.stats
has its own test datasets within the repo) to their respective new repos (explained below). This is something that can be done after landing this PR once we have a concrete datasets API and approach defined for adding new datasets.Additional information
Why should we use Pooch?
OR
What are some other possible options (eg. intake, one of the suggestions made earlier) and why not prefer them?
There are a few good reasons for choosing Pooch:
With Pooch we can easily decouple the datasets that are currently present within the SicPy repository and move them to their own new repository. For example, see https://github.com/scipy-datasets, where each dataset has its own repository. This will lead to a lightweight SciPy wheel decreasing the download size for future releases. Keeping the datasets in individual repos or a single
scipy-datasets
repo is a point of discussion. Note that we need to take into account that GitHub has a 2Gb limit on the repo size. In case, in future, we plan to add new datasets which are relatively large, then this idea of keeping the datasets in the SciPy repo may not be a wise choice. In fact, keeping all the datasets in a separate single dataset repo might also break after a while. The best approach IMO would be to maybe group them by type of datasets into a few repositories or just keep them individually (i.e each dataset has its own repo).Dependency: Pooch is an extremely light package and has only a few dependencies, so if we were to add a new dependency i.e. Pooch, we can expect it to be small and at the same time it won't add a lot of sub-dependencies. On the other hand, something like intake has a few more deps, see requirements.txt. With the basic intake package and some additional sections, it will also install packages like
dask
etc.(not required for scipy datasets). Both of them use a library calledappdirs
which is quite small and useful.IMO Pooch has a simpler API overall and seems easy to implement/maintain for developers while also supporting the key features like caching (hash checks, multi download). Intake is slightly more complex with its design of catalog like YAML files and definitely has some extra features but we surely don't need those to solve the problem of dataset downloading and caching.
Ps.
scikit-image
also recently (in 2020) made a move to make use of Pooch for some of the same reasons.cc @rgommers