
Conversation

@rjurney rjurney commented Feb 16, 2025

I made some improvements to the project's build and testing in the course of #473, which I have split out into this PR.

CI Changes

  • split out a requirements-dev.txt

.gitignore

  • Ignore python build folders

Build a Python Package

Really I want to poetrize the package, but I wanted to do this first.

  • Filled out the MANIFEST.in
  • python/setup.cfg - config file for package
  • python/setup.py - filled out dependencies automatically

What changes were proposed in this pull request?

Why are the changes needed?

  1. We need a Python package to reference from graphframes.tutorials, and just in general.

…s.txt and split out requirements-dev.txt. Version bumps.
@rjurney rjurney requested a review from WeichenXu123 February 16, 2025 05:01

rjurney commented Feb 16, 2025

This has to go in before #511 or #513 will work... I refer to graphframes.tutorials, which only makes sense if a package exists.


rjurney commented Feb 16, 2025

@SauronShepherd @SemyonSinchenko @bjornjorgensen please take a look

python/setup.cfg Outdated
Collaborator

Why not use pyproject.toml instead? It seems to me that setup.cfg is a legacy approach.

Collaborator Author

@rjurney rjurney Feb 16, 2025

@SemyonSinchenko Because setup.py is already there and I needed to wrap this up quickly so I could see the graphframes.tutorials package from other Python scripts. A later PR can do that; I'm good at Poetry.

Collaborator Author

@SemyonSinchenko okay, if I need extras I am going to poetrize it.

Collaborator

@rjurney I did not mean moving to poetry! To be honest I was expecting that you leave setup.py as is and just move the package metadata to the pyproject file with backend "setuptools"... Are you 100% sure that we should use poetry here?
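What this describes, keeping setuptools as the build backend while moving the metadata into pyproject.toml, might look roughly like the following minimal sketch (the name, version, and dependency pins here are illustrative, not the project's actual metadata):

```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "graphframes-py"
version = "0.8.4"
requires-python = ">=3.9"
dependencies = [
    "pyspark>=2.0.0",
]
```

With this in place, `pip install .` builds through setuptools without a setup.cfg, and setup.py can remain as a thin shim or be removed.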

Contributor

Apache Spark does not use Poetry, but there are some mails about the project's Python build at @dev

Collaborator Author

Ah, thank you. I did not think about that. I do think we should try to use Poetry and deal with the merge when we merge... I don't want to rewrite this again. Poetry is far cleaner than the setup.py/setup.cfg method I was just using. It was very easy; I just have to fix CI. It is also easy to publish to PyPI.

Collaborator

My opinion is that if we are moving to a new tool, we should move to uv. Otherwise, let's stay on setuptools.

Contributor

+1 for uv

Collaborator

Ok, let's continue with poetry.

python/setup.cfg Outdated
include_package_data = True
install_requires =
pyspark>=2.0.0
click==8.1.8
Collaborator

Why do we need it in the main requirements?

Collaborator Author

@rjurney rjurney Feb 16, 2025

@SemyonSinchenko Because graphframes.tutorials.download uses it. Click has no dependencies except colorama on Windows, so it seems a safe inclusion. If you want me to pull it into an extra, I think that could confuse new users who try to run the tutorials...

Edit: I'll do a tutorials extra.
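A tutorials extra under Poetry might look like the following pyproject.toml fragment (a sketch only; the package list mirrors the dependencies discussed in this thread, and the version pins are illustrative):

```toml
[tool.poetry.dependencies]
python = ">=3.9"
pyspark = ">=3.4"
click = { version = "^8.1", optional = true }
py7zr = { version = "^0.22", optional = true }
requests = { version = "^2.32", optional = true }

[tool.poetry.extras]
tutorials = ["click", "py7zr", "requests"]
```

Users who want the tutorials would then run `pip install 'graphframes-py[tutorials]'`, while a plain install pulls in only pyspark.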

python/setup.cfg Outdated
click==8.1.8
numpy>=1.7
py7zr==0.22.0
requests==2.32.3
Collaborator

It seems to me that this should be part of the dev / additional dependencies.

Collaborator Author

@SemyonSinchenko Okay, so I could do a tutorials extra. I can do that. It adds more than one package... alright :)

"spark.submit.pyFiles",
os.path.abspath("python/dist/graphframes-{VERSION}-py3-none-any.whl"),
)
cls.sc = SparkContext(master="local[4]", appName="GraphFramesTests", conf=cls.conf)
Collaborator

Could we avoid it altogether? The problem is that Spark Connect does not support SparkContext at all, and I don't see any actual usage of cls.sc. Can we rely on the SparkSession only? As I remember, that is also what the PySpark documentation recommends.

@bjornjorgensen
Contributor

Why do you add a Dockerfile in this PR?
The Dockerfile has nothing to do with the rest of the PR.

Convert tests to PyTest

Somewhere you will have things like "Apache Spark does it", like in this comment #504 (comment), but now you will convert the tests to pytest..

except:
raise TypeError("invalid minor version")
try:
version_info['minor'] = int(m.group(2))
version_info["minor"] = int(m.group(2))
Contributor

what is this?

assert gtu.spark_at_least_of_version("1.7")
assert gtu.spark_at_least_of_version("2.0")
assert gtu.spark_at_least_of_version("2.0.1")
assert gtu.spark_at_least_of_version("2.0.2")
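For context, a helper like gtu.spark_at_least_of_version typically just parses and compares version tuples. A hypothetical standalone sketch follows (the real helper presumably reads the running Spark's version; here both versions are passed explicitly, and this is not the project's actual implementation):

```python
import re

def spark_at_least_of_version(spark_version: str, required: str) -> bool:
    """Return True if spark_version is at least the required version,
    comparing numeric dotted components left to right."""
    def parse(version: str) -> tuple[int, ...]:
        # Keep only the leading dotted-numeric part, e.g. "3.5.0-SNAPSHOT" -> (3, 5, 0)
        m = re.match(r"(\d+(?:\.\d+)*)", version)
        if not m:
            raise TypeError(f"invalid version string: {version!r}")
        return tuple(int(part) for part in m.group(1).split("."))

    a, b = parse(spark_version), parse(required)
    # Pad with zeros so "2.0" and "2.0.0" compare as equal.
    length = max(len(a), len(b))
    a += (0,) * (length - len(a))
    b += (0,) * (length - len(b))
    return a >= b
```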
Contributor

what is this?

Collaborator Author

@bjornjorgensen I do not know; this is not an addition I made. It is not in my PR.

Contributor

?

Collaborator Author

I converted it from unittest to pytest. I can alter the logic of the tests, but that was not the purpose of this PR.
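Such a conversion is largely mechanical: unittest's TestCase classes and self.assert* calls become plain functions with bare asserts. An illustrative before/after (not code from this PR; the helper function is hypothetical):

```python
# Before, unittest style:
#
#     import unittest
#
#     class VersionTest(unittest.TestCase):
#         def test_minor(self):
#             self.assertEqual(minor_of("2.4.8"), 4)
#
# After, pytest style: a plain function with a bare assert.

def minor_of(version: str) -> int:
    """Extract the minor component of a dotted version string."""
    return int(version.split(".")[1])

def test_minor():
    assert minor_of("2.4.8") == 4
```

pytest discovers test_* functions automatically, so no runner class or nose dependency is needed.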

all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]
edges = self.spark.createDataFrame(all_edges, ["src", "dst"])
edgesDF = self.spark.createDataFrame(all_edges, ["src", "dst"])
Contributor

this is not the same
edges != edgesDF

Collaborator Author

@rjurney rjurney Feb 16, 2025

@bjornjorgensen Sorry, you don't like the use of the edgesDF variable name? The edges are defined right there, a few lines above.



rjurney commented Feb 16, 2025

Why do you add a Dockerfile in this PR? The Dockerfile has nothing to do with the rest of the PR.

Convert tests to PyTest

@bjornjorgensen I didn't add a Dockerfile. It was already there; I just cleaned it up a little.

Somewhere you will have things like "Apache Spark does it", like in this comment #504 (comment), but now you will convert the tests to pytest..

I didn't want to include formatting changes in this PR, as I need to get the other stuff in before the hackathon.


rjurney commented Feb 16, 2025

Okay, if I have to do a tutorials extra I am just going to poetrize the whole thing...

@bjornjorgensen
Contributor

@rjurney have 1 (one) PR for each thing..

not 1 for 40 half-finished things..


rjurney commented Feb 16, 2025

@rjurney have 1 (one) PR for each thing..

not 1 for 40 half-finished things..

I'm going to finish them, but I obviously need help reviewing the code. This wasn't easy to pull apart, as the tutorials depend on the existence of a Python package.

@bjornjorgensen
Contributor

@rjurney have 1 (one) PR for each thing..
not 1 for 40 half-finished things..

I'm going to finish them, but I obviously need help reviewing the code. This wasn't easy to pull apart, as the tutorials depend on the existence of a Python package.

This takes a lot of time when you are posting PRs with over 1000 lines of code changes, and.. no..
As I said in your last PR, start with one thing, like updating the Dockerfile, and nothing more in that PR..
We can then review it more easily, and if needed we can revert it too..


rjurney commented Feb 16, 2025

@rjurney have 1 (one) PR for each thing..
not 1 for 40 half-finished things..

I'm going to finish them, but I obviously need help reviewing the code. This wasn't easy to pull apart, as the tutorials depend on the existence of a Python package.

This takes a lot of time when you are posting PRs with over 1000 lines of code changes, and.. no.. As I said in your last PR, start with one thing, like updating the Dockerfile, and nothing more in that PR.. We can then review it more easily, and if needed we can revert it too..

I'm a little confused by your comments: this PR is small, other than fairly simple changes to convert unittest to pytest, which had to be done in one go. You're reviewing code I did not write. I will remove the Dockerfile changes. This is stuff that was required or my PR wouldn't build; does that make sense?


rjurney commented Feb 16, 2025

@bjornjorgensen okay, I see your point; there are a few things going on here. I would ask for your patience. I tried to do one simple thing: create a graphframes motif finding tutorial and update the README to contain helpful content... it was bare. I added references to the motif finding tutorial and cleaned up the docs generally. This was intended to be one PR.

Then the following happened:

  1. I could not import GraphFrames from my tutorial because there was no actual package. pytest did not execute in spark-submit or pyspark, so there was no way to reference it from tests. This has always been a weird thing about GraphFrames. So I set up a package. I'm redoing this in Poetry now, as a separate PR.
  2. I wanted the code that builds the Stack Exchange dataset to be in the graphframes module, because we need a substantial dataset to run unit tests on. So I included that code in graphframes.tutorials.
  3. I couldn't use GraphFrames with a modern Python because nose required Python 3.9. It is a very old and very bad package, so I removed it and converted the unit tests to pytest. This was done in one go. I didn't evaluate the logic of the tests to improve them; I just literally converted them from unittest to pytest. I'm all for improving testing, I just didn't plan on including more in this PR.
  4. My branch would not build for unknown reasons. It wasn't clear from the error messages, so I fiddled with versions and things until it built. That is how you wind up with the stuff in this PR.

All of this got REALLY hairy for me and I am trying to get the docs updated before the Hackathon. That is my driving mission.

@bjornjorgensen
Contributor

In this PR you have

  • upgraded the version

  • various upgrades in CI, like upgraded Python and Scala

  • something with requirements-dev

  • added files and folders to .gitignore

  • updated the Dockerfile

  • converted nose tests to pytest

  • built a Python package

  • removed Python 2 from the run-tests script

  • added a new function, parse_requirements

  • changed Python sys to os for file operations

Have one PR for each of them.


rjurney commented Feb 16, 2025

@bjornjorgensen can you please refresh and look at my edited/updated comments? I agree that would be the way to go, but it does build, and the hackathon looms.


rjurney commented Feb 16, 2025

In this PR you have

  • upgraded the version
  • various upgrades in CI, like upgraded Python and Scala
  • something with requirements-dev
  • added files and folders to .gitignore
  • updated the Dockerfile
  • converted nose tests to pytest
  • built a Python package
  • removed Python 2 from the run-tests script
  • added a new function, parse_requirements
  • changed Python sys to os for file operations

Have one PR for each of them.

A lot of this stuff is part of building a package. Don't you see the unifying theme of these things?

  • something with requirements-dev

This is so we don't install the dev dependencies when we build the package.

  • added files and folders to .gitignore

These are files created by the package build process.

  • updated the Dockerfile

Removed.

  • converted nose tests to pytest

Okay, I hear you. I will pull this out.

Okay, I am going to do what you ask... it's just very time consuming.
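The parse_requirements function mentioned in the list above is presumably a small helper along these lines: read a pip requirements file so setup.py can fill install_requires from requirements.txt while requirements-dev.txt stays out of the package. This is a hypothetical sketch, not the PR's exact code:

```python
from pathlib import Path

def parse_requirements(path: str) -> list[str]:
    """Return the requirement specifiers in a pip requirements file,
    skipping blank lines and comments."""
    specs = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            specs.append(line)
    return specs

# In setup.py this would feed setuptools, e.g.:
#     setup(install_requires=parse_requirements("requirements.txt"), ...)
```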


rjurney commented Feb 17, 2025

@bjornjorgensen oh, re: poetry... I need to create extras like graphframes[tutorials] due to feedback on that PR. The old setup.py with the old build tools can't support extras without pip. I would rather adopt poetry than pip if we have to choose one.


rjurney commented Feb 17, 2025

@bjornjorgensen this is getting closer to ready for review... let me know what you think :) I could use @SauronShepherd's and @SemyonSinchenko's takes as well.

python -m pip install --upgrade pip wheel
pip install -r ./python/requirements.txt
pip install pyspark==${{ matrix.spark-version }}
python -m pip install --upgrade poetry
Collaborator

I would recommend using an action instead.

run: |
export SPARK_HOME=$(python -c "import os; from importlib.util import find_spec; print(os.path.join(os.path.dirname(find_spec('pyspark').origin)))")
./python/run-tests.sh
export SPARK_HOME=$(poetry run python -c "import os; from importlib.util import find_spec; spec = find_spec('pyspark'); print(os.path.join(os.path.dirname(spec.origin)))")
Collaborator

Why do we need it? Tests will work even without SPARK_HOME
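The exported SPARK_HOME above is just the directory of the installed pyspark package, located via importlib. A standalone sketch of the same lookup (demonstrated here with a stdlib package so it runs without Spark installed):

```python
import os
from importlib.util import find_spec

def package_home(package: str) -> str:
    """Return the directory containing an installed package's entry module,
    mirroring the SPARK_HOME lookup in the CI step."""
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"cannot locate package {package!r}")
    return os.path.dirname(spec.origin)

# For Spark this would be: os.environ["SPARK_HOME"] = package_home("pyspark")
```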

@@ -0,0 +1,48 @@
[tool.poetry]
name = "graphframes-py"
version = "0.8.4"
Collaborator

Let's use something like this: https://pypi.org/project/poetry-dynamic-versioning/?

[tool.poetry.group.dev.dependencies]
black = "^25.1.0"
flake8 = "^7.1.1"
isort = "^6.0.0"
Collaborator

It looks like pytest is missing.

@rjurney rjurney changed the title Build a graphframes Python package during the build process 1 of 3: Build a graphframes Python package during the build process Feb 18, 2025
Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

LGTM! I think there is still some room for improvement, but we can do that in the next PRs. Thank you @rjurney!


rjurney commented Feb 20, 2025

Thanks!

@rjurney rjurney merged commit fb14eff into master Feb 20, 2025
6 checks passed
rjurney added a commit that referenced this pull request Feb 21, 2025
…orial (#518)

* Minimized the PR to just these files

* Created tutorials dependency group to minimize main bloat

* Make motif.py execute in whole again

* Minor isort format and cleanup of download.py

* Minor isort format and cleanup of utils.py

* Removed case sensitivity from the script - that was confusing people who just pasted or tried to run the code without a new SparkSession.

* motif.py now matches tutorial code, runs and handles case insensitivity.

* 1 of 3: Build a `graphframes` Python package during the build process (#512)

* Converted tests to pytest. Build a Python package. Update requirements.txt and split out requirements-dev.txt. Version bumps.

* Restore Python .gitignore

* Extra newline removed

* Added VERSION file set to 0.8.5

* isort; fixed edgesDF variable name.

* Back out Dockerfile changes

* Back out version change in build.sbt

* Backout changes to config and run-tests

* Back out pytest conversion

* Back out version changes to make nose tests pass

* Remove changes to requirements

* Put nose back in requirements.txt

* Remove version bump to version.sbt

* Remove packages related to testing

* Remove old setup.py / setup.cfg

* New pyproject.toml and poetry.lock

* Short README for Python package, poetry won't allow a ../README.md path

* Remove requirements files in favor of pyproject.toml

* Try to poetrize CI build

* pyspark min 3.4

* Local python README in pyproject.toml

* Trying to remove the working folder to debug a Scala issue

* Set Python working directory again

* Accidental newline

* Install Python for test...

* Run tests from python/ folder

* Try running tests from python/

* poetry run the unit tests

* poetry run the tests

* Try just using 'python' instead of a path

* poetry run the last line, graphframes.main

* Remove test/ folder from style paths, it doesn't exist

* Remove .vscode

* VERSION back to 0.8.4

* Remove tutorials reference

* VERSION is a Python thing, it belongs in python/

* Include the README.md and LICENSE in the Python package

* Some classifiers for pyproject.toml

* Trying poetry install action instead of manual install

* Removing SPARK_HOME

* Returned SPARK_HOME settings

* Set up a 'graphframes stackexchange' command.

* Make graphframes.tutorials.motif use a unique checkpoint dir, obtained from SparkSession.sparkContext. Use click.echo instead of print

* Use spark.sparkContext.setCheckpointDir directly instead of instantiating a SparkContext. print-->click.echo

* Using 'from __future__ import annotations' instead of List and Tuple

* Now retry three times if we can't connect for any reason in 'graphframes stackexchange' command.
4 participants