1 of 3: Build a graphframes
Python package during the build process
#512
Conversation
…s.txt and split out requirements-dev.txt. Version bumps.
@SauronShepherd @SemyonSinchenko @bjornjorgensen please take a look
python/setup.cfg
Outdated
Why not use pyproject.toml instead? It seems to me that setup.cfg is a legacy approach.
@SemyonSinchenko Because setup.py is already there and I needed to wrap this quickly so I could see the graphframes.tutorials package from other Python scripts. A later PR can do that; I'm good at poetry.
@SemyonSinchenko okay, if I need extras I am going to poetrize it.
@rjurney I did not mean moving to poetry! To be honest I was expecting that you leave setup.py as is and just move the package metadata to the pyproject file with backend "setuptools"... Are you 100% sure that we should use poetry here?
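For reference, the suggestion above — keep setup.py as-is and move only the package metadata into a pyproject.toml with the setuptools backend — could be sketched roughly like this (field values are illustrative assumptions, not the project's final config):

```toml
# pyproject.toml -- hypothetical sketch with the setuptools backend
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "graphframes-py"
version = "0.8.4"
requires-python = ">=3.8"
dependencies = [
    "pyspark>=2.0.0",
]
```

With this layout `pip install .` and `python -m build` keep working through setuptools, without adopting a new tool.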
Apache Spark does not use poetry, but there are some mails about the project's Python build at @dev
Ah, thank you. I did not think about it. I do think we should try to use poetry, and deal with the merge when we merge... I don't want to rewrite this again. Poetry is 100% cleaner than the setup.py/cfg method I was just using. It was very easy, just got to fix CI. It is also easy to publish to PyPI.
My opinion is that if we are moving to a new tool we should move to uv. Or let's stay on setuptools.
+1 for uv
Ok, let's continue with poetry.
python/setup.cfg
Outdated
include_package_data = True
install_requires =
    pyspark>=2.0.0
    click==8.1.8
Why do we need it in main requirements?
@SemyonSinchenko Because graphframes.tutorials.download uses it. Click has no dependencies except for colorama on Windows, so it seems a safe inclusion. If you want me to pull it into an extra, I think that could confuse new users who try to run the tutorials...
edit: I'll do a tutorials extra.
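A tutorials extra in setup.cfg could be sketched like this (the pins are copied from the requirements discussed in this thread; a sketch, not the final config):

```toml
[options.extras_require]
tutorials =
    click==8.1.8
    numpy>=1.7
    py7zr==0.22.0
    requests==2.32.3
```

Users would then opt in with `pip install graphframes-py[tutorials]`, keeping the base install lean.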
python/setup.cfg
Outdated
    click==8.1.8
    numpy>=1.7
    py7zr==0.22.0
    requests==2.32.3
It seems to me that this should be a part of dev / additional dependencies
@SemyonSinchenko Okay, so I could do a tutorials extra. I can do that. It is more than one package it adds... alright :)
python/graphframes/tests.py
Outdated
"spark.submit.pyFiles", | ||
os.path.abspath("python/dist/graphframes-{VERSION}-py3-none-any.whl"), | ||
) | ||
cls.sc = SparkContext(master="local[4]", appName="GraphFramesTests", conf=cls.conf) |
Could we avoid it at all? The problem is that Spark Connect does not support SparkContext at all. And I don't see actual usage of cls.sc. Can we rely on the SparkSession only? As I remember, that is also recommended by the PySpark documentation.
Why do you add a Dockerfile in the PR? Convert the tests to pytest and at some point you will have things set up like Apache Spark does.
python/graphframes/tests.py
Outdated
except:
    raise TypeError("invalid minor version")
try:
-   version_info['minor'] = int(m.group(2))
+   version_info["minor"] = int(m.group(2))
what is this?
python/graphframes/tests.py
Outdated
assert gtu.spark_at_least_of_version("1.7")
assert gtu.spark_at_least_of_version("2.0")
assert gtu.spark_at_least_of_version("2.0.1")
assert gtu.spark_at_least_of_version("2.0.2")
what is this?
@bjornjorgensen I do not know, this is not an addition I made. It is not in my PR.
?
I converted it from unittest to pytest. I can alter the logic of the tests, but that was not the purpose of this PR.
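The unittest-to-pytest conversion mentioned here mostly means replacing TestCase classes and self.assert* methods with plain functions and bare assert statements. A generic before/after sketch (names are illustrative, not from the PR):

```python
# Before: unittest style -- a TestCase subclass with assert* methods
import unittest

class VersionTest(unittest.TestCase):
    def test_minor(self):
        self.assertEqual(int("3.5".split(".")[1]), 5)

# After: pytest style -- a plain top-level function with a bare assert;
# pytest discovers any function named test_* without a class or runner.
def test_minor():
    assert int("3.5".split(".")[1]) == 5
```

pytest still collects and runs unittest.TestCase classes, which is why the two styles can coexist during an incremental conversion.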
python/graphframes/tests.py
Outdated
all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]
-   edges = self.spark.createDataFrame(all_edges, ["src", "dst"])
+   edgesDF = self.spark.createDataFrame(all_edges, ["src", "dst"])
this is not the same
edges != edgesDF
@bjornjorgensen Sorry, you don't like the use of the edgesDF variable name? The edges are defined right there, a few lines above.
@bjornjorgensen fixed
@bjornjorgensen I didn't add a Dockerfile. It was already there, I just cleaned it up a little.
I didn't want to include formatting in this PR as I need to get the other stuff in before the hackathon.
Okay, if I have to do a
@rjurney have 1 (one) PR for each thing... not 1 for 40 half-finished things.
I'm going to finish them, but obviously need help in reviewing the code. This wasn't easy to pull apart, as the tutorials depend on the existence of a Python package.
This takes a lot of time when you are posting PRs with over 1000 lines of code changes and... no.
I'm a little confused by your comments - this PR is small other than fairly simple changes to convert unittest to pytest, which had to be done in one go. You're reviewing code I did not write. I will remove the Dockerfile changes. This is stuff that was required or my PR wouldn't build, does that make sense?
@bjornjorgensen okay, I see your point there are a few things going on here. I would ask for your patience - I tried to do one simple thing, create a graphframes motif finding tutorial and update the README to contain helpful content... it was bare. I added references to the motif finding tutorial and cleaned up the docs generally. This was intended to be one PR. Then the following happened:
All of this got REALLY hairy for me and I am trying to get the docs updated before the Hackathon. That is my driving mission.
In this PR you have
Have one PR for each of them.
@bjornjorgensen can you please refresh and look at my edited / updated comments? I agree that would be the way to go, but it does build and the hackathon looms.
A lot of this stuff is all part of building a package. Don't you see the unifying theme of these things?
This is so we don't install the dev dependencies when we build the package.
These are files created by the package build process.
Removed.
Okay, I hear you. I will pull this out. Okay, I am going to do what you ask... it's just very time consuming.
@bjornjorgensen oh, re: poetry... I need to create extras like
@bjornjorgensen this is getting closer to ready for review... let me know what you think :) I could use @SauronShepherd's and @SemyonSinchenko's takes as well.
.github/workflows/python-ci.yml
Outdated
python -m pip install --upgrade pip wheel
pip install -r ./python/requirements.txt
pip install pyspark==${{ matrix.spark-version }}
python -m pip install --upgrade poetry
I would recommend using an action instead.
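For example, something along these lines — the action name and version here are assumptions, since there are several community-maintained poetry-install actions to choose from:

```yaml
# Workflow-step sketch; action name/version are assumptions
- name: Install poetry
  uses: abatilo/actions-poetry@v3
- name: Install dependencies
  run: poetry install
```

Using an action keeps the poetry version pinned and cached by the runner instead of re-installing it via pip on every job.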
run: |
export SPARK_HOME=$(python -c "import os; from importlib.util import find_spec; print(os.path.join(os.path.dirname(find_spec('pyspark').origin)))")
./python/run-tests.sh
export SPARK_HOME=$(poetry run python -c "import os; from importlib.util import find_spec; spec = find_spec('pyspark'); print(os.path.join(os.path.dirname(spec.origin)))")
Why do we need it? Tests will work even without SPARK_HOME
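The export line in the workflow locates pyspark's install directory via importlib.util.find_spec. The same pattern, demonstrated on a stdlib package so the sketch runs anywhere (with pyspark installed, passing "pyspark" would yield the SPARK_HOME value):

```python
import os
from importlib.util import find_spec

def package_dir(package_name: str) -> str:
    """Return the directory a package would be imported from,
    mirroring the SPARK_HOME one-liner in the workflow."""
    spec = find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ImportError(f"cannot locate package {package_name!r}")
    # spec.origin is the package's __init__.py; its dirname is the package root
    return os.path.dirname(spec.origin)

print(package_dir("json"))  # prints the stdlib json package directory
```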
@@ -0,0 +1,48 @@
[tool.poetry]
name = "graphframes-py"
version = "0.8.4"
Let's use something like this: https://pypi.org/project/poetry-dynamic-versioning/?
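If we go that route, the plugin is configured roughly like this — a sketch based on the plugin's documentation; exact keys may differ between plugin versions:

```toml
# Sketch of poetry-dynamic-versioning configuration
[tool.poetry-dynamic-versioning]
enable = true
vcs = "git"

[build-system]
requires = ["poetry-core>=1.0.0", "poetry-dynamic-versioning>=1.0.0"]
build-backend = "poetry_dynamic_versioning.backend"
```

The version in `[tool.poetry]` then becomes a placeholder that is overwritten from git tags at build time, so releases no longer require editing pyproject.toml.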
[tool.poetry.group.dev.dependencies]
black = "^25.1.0"
flake8 = "^7.1.1"
isort = "^6.0.0"
It looks like pytest is missing.
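Adding it would be one more line in the dev group — the pytest version constraint below is an assumption:

```toml
[tool.poetry.group.dev.dependencies]
black = "^25.1.0"
flake8 = "^7.1.1"
isort = "^6.0.0"
pytest = "^8.0"
```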
LGTM! I think there is still some place for improvements, but we can do it in next PRs. Thank you @rjurney !
Thanks!
…orial (#518)
* Minimized the PR to just these files
* Created tutorials dependency group to minimize main bloat
* Make motif.py execute in whole again
* Minor isort format and cleanup of download.py
* Minor isort format and cleanup of utils.py
* Removed case sensitivity from the script - that was confusing people who just pasted or tried to run the code without a new SparkSession.
* motif.py now matches tutorial code, runs and handles case insensitivity.
* 1 of 3: Build a `graphframes` Python package during the build process (#512)
* Converted tests to pytest. Build a Python package. Update requirements.txt and split out requirements-dev.txt. Version bumps.
* Restore Python .gitignore
* Extra newline removed
* Added VERSION file set to 0.8.5
* isort; fixed edgesDF variable name.
* Back out Dockerfile changes
* Back out version change in build.sbt
* Back out changes to config and run-tests
* Back out pytest conversion
* Back out version changes to make nose tests pass
* Remove changes to requirements
* Put nose back in requirements.txt
* Remove version bump to version.sbt
* Remove packages related to testing
* Remove old setup.py / setup.cfg
* New pyproject.toml and poetry.lock
* Short README for Python package, poetry won't allow a ../README.md path
* Remove requirements files in favor of pyproject.toml
* Try to poetrize CI build
* pyspark min 3.4
* Local python README in pyproject.toml
* Trying to remove the working folder to debug scala issue
* Set Python working directory again
* Accidental newline
* Install Python for test...
* Run tests from python/ folder
* Try running tests from python/
* poetry run the unit tests
* poetry run the tests
* Try just using 'python' instead of a path
* poetry run the last line, graphframes.main
* Remove test/ folder from style paths, it doesn't exist
* Remove .vscode
* VERSION back to 0.8.4
* Remove tutorials reference
* VERSION is a Python thing, it belongs in python/
* Include the README.md and LICENSE in the Python package
* Some classifiers for pyproject.toml
* Trying poetry install action instead of manual install
* Removing SPARK_HOME
* Returned SPARK_HOME settings
* Setup a 'graphframes stackexchange' command.
* Make graphframes.tutorials.motif use a unique checkpoint dir from SparkSession.sparkContext. Use click.echo instead of print
* Use spark.sparkContext.setCheckpointDir directly instead of instantiating a SparkContext. print-->click.echo
* Using 'from __future__ import annotations' instead of List and Tuple
* Now retry three times if we can't connect for any reason in the 'graphframes stackexchange' command.
I did some improvements in the project's build and testing in the course of #473 that I have split out into this PR.
CI Changes
- requirements-dev.txt
- .gitignore

Build a Python Package
Really I want to poetrize the package but I wanted to do this first.
- MANIFEST.in
- python/setup.cfg - config file for package
- python/setup.py - filled out dependencies automatically

What changes were proposed in this pull request?
Why are the changes needed?
graphframes.tutorials and just in general.