
Conversation

@rjurney rjurney commented Feb 16, 2025

I made some improvements to the project's build and testing in the course of #473, which I have split out into this PR.

CI Changes

  • split out a requirements-dev.txt

.gitignore

  • Ignore python build folders

Build a Python Package

Really I want to poetrize the package, but I wanted to do this first.

  • Filled out the MANIFEST.in
  • python/setup.cfg - config file for package
  • python/setup.py - filled out dependencies automatically

What changes were proposed in this pull request?

Why are the changes needed?

  1. We need a Python package to reference from graphframes.tutorials, and just in general.

…s.txt and split out requirements-dev.txt. Version bumps.
@rjurney rjurney requested a review from WeichenXu123 February 16, 2025 05:01

rjurney commented Feb 16, 2025

This has to go in before #511 or #513 will work... I refer to graphframes.tutorials, which only makes sense if a package exists.


rjurney commented Feb 16, 2025

@SauronShepherd @SemyonSinchenko @bjornjorgensen please take a look

python/setup.cfg Outdated
Collaborator

Why not use pyproject.toml instead? It seems to me that setup.cfg is a legacy approach.

Collaborator Author

@rjurney rjurney Feb 16, 2025

@SemyonSinchenko Because setup.py is already there and I needed to wrap this up quickly so I could see the graphframes.tutorials package from other Python scripts. A later PR can do that; I'm good at Poetry.

Collaborator Author

@SemyonSinchenko okay, if I need extras I am going to poetrize it.

Collaborator

@rjurney I did not mean moving to poetry! To be honest I was expecting that you leave setup.py as is and just move the package metadata to the pyproject file with backend "setuptools"... Are you 100% sure that we should use poetry here?
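What this describes, keeping setuptools as the build backend while moving the metadata into pyproject.toml, might look roughly like the following minimal sketch (the name, version, and dependency pins here are illustrative, not the project's actual metadata):

```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "graphframes-py"
version = "0.8.4"
requires-python = ">=3.9"
dependencies = [
    "pyspark>=2.0.0",
]
```

With this in place, `pip install .` builds through setuptools without a setup.cfg, and setup.py can remain as a thin shim or be removed.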

Contributor

Apache Spark does not use Poetry, but there are some mails about the project's Python build at @dev

Collaborator Author

Ah, thank you. I did not think about that. I do think we should try to use Poetry and deal with the merge when we merge... I don't want to rewrite this again. Poetry is far cleaner than the setup.py/setup.cfg method I was just using. It was very easy; I just have to fix CI. It is also easy to publish to PyPI.

Collaborator

My opinion is that if we are moving to a new tool, we should move to uv. Otherwise, let's stay on setuptools.

Contributor

+1 for uv

Collaborator

Ok, let's continue with poetry.

python/setup.cfg Outdated
include_package_data = True
install_requires =
pyspark>=2.0.0
click==8.1.8
Collaborator

Why do we need it in the main requirements?

Collaborator Author

@rjurney rjurney Feb 16, 2025

@SemyonSinchenko Because graphframes.tutorials.download uses it. Click has no dependencies except colorama on Windows, so it seems a safe inclusion. If you want me to pull it into an extra, I think that could confuse new users who try to run the tutorials...

Edit: I'll do a tutorials extra.
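A tutorials extra under Poetry might look like the following pyproject.toml fragment (a sketch only; the package list mirrors the dependencies discussed in this thread, and the version pins are illustrative):

```toml
[tool.poetry.dependencies]
python = ">=3.9"
pyspark = ">=3.4"
click = { version = "^8.1", optional = true }
py7zr = { version = "^0.22", optional = true }
requests = { version = "^2.32", optional = true }

[tool.poetry.extras]
tutorials = ["click", "py7zr", "requests"]
```

Users who want the tutorials would then run `pip install 'graphframes-py[tutorials]'`, while a plain install pulls in only pyspark.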

python/setup.cfg Outdated
click==8.1.8
numpy>=1.7
py7zr==0.22.0
requests==2.32.3
Collaborator

It seems to me that this should be part of the dev / additional dependencies.

Collaborator Author

@SemyonSinchenko Okay, so I could do a tutorials extra. I can do that. It adds more than one package... alright :)

"spark.submit.pyFiles",
os.path.abspath("python/dist/graphframes-{VERSION}-py3-none-any.whl"),
)
cls.sc = SparkContext(master="local[4]", appName="GraphFramesTests", conf=cls.conf)
Collaborator

Could we avoid it altogether? The problem is that Spark Connect does not support SparkContext at all, and I don't see any actual usage of cls.sc. Can we rely on the SparkSession only? As I remember, that is also what the PySpark documentation recommends.

@bjornjorgensen
Contributor

Why do you add a Dockerfile in this PR?
The Dockerfile has nothing to do with the rest of the PR.

Convert tests to PyTest

Somewhere you will have things like "Apache Spark does it", like in this comment #504 (comment), but now you will convert the tests to pytest..

except:
raise TypeError("invalid minor version")
try:
version_info['minor'] = int(m.group(2))
version_info["minor"] = int(m.group(2))
Contributor

what is this?

assert gtu.spark_at_least_of_version("1.7")
assert gtu.spark_at_least_of_version("2.0")
assert gtu.spark_at_least_of_version("2.0.1")
assert gtu.spark_at_least_of_version("2.0.2")
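For context, a helper like gtu.spark_at_least_of_version typically just parses and compares version tuples. A hypothetical standalone sketch follows (the real helper presumably reads the running Spark's version; here both versions are passed explicitly, and this is not the project's actual implementation):

```python
import re

def spark_at_least_of_version(spark_version: str, required: str) -> bool:
    """Return True if spark_version is at least the required version,
    comparing numeric dotted components left to right."""
    def parse(version: str) -> tuple[int, ...]:
        # Keep only the leading dotted-numeric part, e.g. "3.5.0-SNAPSHOT" -> (3, 5, 0)
        m = re.match(r"(\d+(?:\.\d+)*)", version)
        if not m:
            raise TypeError(f"invalid version string: {version!r}")
        return tuple(int(part) for part in m.group(1).split("."))

    a, b = parse(spark_version), parse(required)
    # Pad with zeros so "2.0" and "2.0.0" compare as equal.
    length = max(len(a), len(b))
    a += (0,) * (length - len(a))
    b += (0,) * (length - len(b))
    return a >= b
```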
Contributor

what is this?

Collaborator Author

@bjornjorgensen I do not know; this is not an addition I made. It is not in my PR.

Contributor

?

Collaborator Author

I converted it from unittest to pytest. I can alter the logic of the tests, but that was not the purpose of this PR.
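Such a conversion is largely mechanical: unittest's TestCase classes and self.assert* calls become plain functions with bare asserts. An illustrative before/after (not code from this PR; the helper function is hypothetical):

```python
# Before, unittest style:
#
#     import unittest
#
#     class VersionTest(unittest.TestCase):
#         def test_minor(self):
#             self.assertEqual(minor_of("2.4.8"), 4)
#
# After, pytest style: a plain function with a bare assert.

def minor_of(version: str) -> int:
    """Extract the minor component of a dotted version string."""
    return int(version.split(".")[1])

def test_minor():
    assert minor_of("2.4.8") == 4
```

pytest discovers test_* functions automatically, so no runner class or nose dependency is needed.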

all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]
edges = self.spark.createDataFrame(all_edges, ["src", "dst"])
edgesDF = self.spark.createDataFrame(all_edges, ["src", "dst"])
Contributor

this is not the same
edges != edgesDF

Collaborator Author

@rjurney rjurney Feb 16, 2025

@bjornjorgensen Sorry, you don't like the use of the edgesDF variable name? The edges are defined right there, a few lines above.



rjurney commented Feb 16, 2025

Why do you add a Dockerfile in this PR? The Dockerfile has nothing to do with the rest of the PR.

Convert tests to PyTest

@bjornjorgensen I didn't add a Dockerfile. It was already there; I just cleaned it up a little.

Somewhere you will have things like "Apache Spark does it", like in this comment #504 (comment), but now you will convert the tests to pytest..

I didn't want to include formatting changes in this PR, as I need to get the other stuff in before the hackathon.


rjurney commented Feb 16, 2025

Okay, if I have to do a tutorials extra I am just going to poetrize the whole thing...

@bjornjorgensen
Contributor

@rjurney have 1 (one) PR for each thing..

not 1 for 40 half-finished things..


rjurney commented Feb 16, 2025

@rjurney have 1 (one) PR for each thing..

not 1 for 40 half-finished things..

I'm going to finish them, but I obviously need help reviewing the code. This wasn't easy to pull apart, as the tutorials depend on the existence of a Python package.

@bjornjorgensen
Contributor

@rjurney have 1 (one) PR for each thing..
not 1 for 40 half-finished things..

I'm going to finish them, but I obviously need help reviewing the code. This wasn't easy to pull apart, as the tutorials depend on the existence of a Python package.

This takes a lot of time when you are posting PRs with over 1000 lines of code changes, and.. no..
As I said in your last PR, start with one thing, like updating the Dockerfile, and nothing more in that PR..
We can then review it more easily, and if needed we can revert it too..


rjurney commented Feb 16, 2025

@rjurney have 1 (one) PR for each thing..
not 1 for 40 half-finished things..

I'm going to finish them, but I obviously need help reviewing the code. This wasn't easy to pull apart, as the tutorials depend on the existence of a Python package.

This takes a lot of time when you are posting PRs with over 1000 lines of code changes, and.. no.. As I said in your last PR, start with one thing, like updating the Dockerfile, and nothing more in that PR.. We can then review it more easily, and if needed we can revert it too..

I'm a little confused by your comments: this PR is small, other than fairly simple changes to convert unittest to pytest, which had to be done in one go. You're reviewing code I did not write. I will remove the Dockerfile changes. This is stuff that was required or my PR wouldn't build; does that make sense?


rjurney commented Feb 16, 2025

@bjornjorgensen okay, I see your point; there are a few things going on here. I would ask for your patience. I tried to do one simple thing: create a graphframes motif finding tutorial and update the README to contain helpful content... it was bare. I added references to the motif finding tutorial and cleaned up the docs generally. This was intended to be one PR.

Then the following happened:

  1. I could not import GraphFrames from my tutorial because there was no actual package. pytest did not execute in spark-submit or pyspark, so there was no way to reference it from tests. This has always been a weird thing about GraphFrames. So I set up a package. I'm redoing this in Poetry now, as a separate PR.
  2. I wanted the code that builds the Stack Exchange dataset to be in the graphframes module, because we need a substantial dataset to run unit tests on. So I included that code in graphframes.tutorials.
  3. I couldn't use GraphFrames with a modern Python because nose required Python 3.9. It is a very old and very bad package, so I removed it and converted the unit tests to pytest. This was done in one go. I didn't evaluate the logic of the tests to improve them; I just literally converted them from unittest to pytest. I'm all for improving testing, I just didn't plan on including more in this PR.
  4. My branch would not build for unknown reasons. It wasn't clear from the error messages, so I fiddled with versions and things until it built. That is how you wind up with the stuff in this PR.

All of this got REALLY hairy for me and I am trying to get the docs updated before the Hackathon. That is my driving mission.

@bjornjorgensen
Contributor

In this PR you have

  • upgraded the version

  • various upgrades in CI, like upgraded Python and Scala

  • something with requirements-dev

  • added files and folders to .gitignore

  • updated the Dockerfile

  • converted nose tests to pytest

  • built a Python package

  • removed Python 2 from the run-tests script

  • added a new function, parse_requirements

  • changed Python sys to os for file operations

Have one PR for each of them.


rjurney commented Feb 16, 2025

@bjornjorgensen can you please refresh and look at my edited/updated comments? I agree that would be the way to go, but it does build, and the hackathon looms.


rjurney commented Feb 16, 2025

In this PR you have

  • upgraded the version
  • various upgrades in CI, like upgraded Python and Scala
  • something with requirements-dev
  • added files and folders to .gitignore
  • updated the Dockerfile
  • converted nose tests to pytest
  • built a Python package
  • removed Python 2 from the run-tests script
  • added a new function, parse_requirements
  • changed Python sys to os for file operations

Have one PR for each of them.

A lot of this stuff is part of building a package. Don't you see the unifying theme of these things?

  • something with requirements-dev

This is so we don't install the dev dependencies when we build the package.

  • added files and folders to .gitignore

These are files created by the package build process.

  • updated the Dockerfile

Removed.

  • converted nose tests to pytest

Okay, I hear you. I will pull this out.

Okay, I am going to do what you ask... it's just very time consuming.
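The parse_requirements function mentioned in the list above is presumably a small helper along these lines: read a pip requirements file so setup.py can fill install_requires from requirements.txt while requirements-dev.txt stays out of the package. This is a hypothetical sketch, not the PR's exact code:

```python
from pathlib import Path

def parse_requirements(path: str) -> list[str]:
    """Return the requirement specifiers in a pip requirements file,
    skipping blank lines and comments."""
    specs = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            specs.append(line)
    return specs

# In setup.py this would feed setuptools, e.g.:
#     setup(install_requires=parse_requirements("requirements.txt"), ...)
```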


rjurney commented Feb 17, 2025

@bjornjorgensen oh, re: poetry... I need to create extras like graphframes[tutorials] due to feedback on that PR. The old setup.py with the old build tools can't support extras without pip. I would rather adopt poetry than pip if we have to choose one.


rjurney commented Feb 17, 2025

@bjornjorgensen this is getting closer to ready for review... let me know what you think :) I could use @SauronShepherd's and @SemyonSinchenko's takes as well.

python -m pip install --upgrade pip wheel
pip install -r ./python/requirements.txt
pip install pyspark==${{ matrix.spark-version }}
python -m pip install --upgrade poetry
Collaborator

I would recommend using an action instead.

run: |
export SPARK_HOME=$(python -c "import os; from importlib.util import find_spec; print(os.path.join(os.path.dirname(find_spec('pyspark').origin)))")
./python/run-tests.sh
export SPARK_HOME=$(poetry run python -c "import os; from importlib.util import find_spec; spec = find_spec('pyspark'); print(os.path.join(os.path.dirname(spec.origin)))")
Collaborator

Why do we need it? Tests will work even without SPARK_HOME
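The exported SPARK_HOME above is just the directory of the installed pyspark package, located via importlib. A standalone sketch of the same lookup (demonstrated here with a stdlib package so it runs without Spark installed):

```python
import os
from importlib.util import find_spec

def package_home(package: str) -> str:
    """Return the directory containing an installed package's entry module,
    mirroring the SPARK_HOME lookup in the CI step."""
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"cannot locate package {package!r}")
    return os.path.dirname(spec.origin)

# For Spark this would be: os.environ["SPARK_HOME"] = package_home("pyspark")
```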

@@ -0,0 +1,48 @@
[tool.poetry]
name = "graphframes-py"
version = "0.8.4"
Collaborator

Let's use something like this: https://pypi.org/project/poetry-dynamic-versioning/?

[tool.poetry.group.dev.dependencies]
black = "^25.1.0"
flake8 = "^7.1.1"
isort = "^6.0.0"
Collaborator

It looks like pytest is missing.

@rjurney rjurney changed the title Build a graphframes Python package during the build process 1 of 3: Build a graphframes Python package during the build process Feb 18, 2025
Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

LGTM! I think there is still some room for improvement, but we can do that in the next PRs. Thank you @rjurney!


rjurney commented Feb 20, 2025

Thanks!

@rjurney rjurney merged commit fb14eff into master Feb 20, 2025
6 checks passed
rjurney added a commit that referenced this pull request Feb 21, 2025
…orial (#518)

* Minimized the PR to just these files

* Created tutorials dependency group to minimize main bloat

* Make motif.py execute in whole again

* Minor isort format and cleanup of download.py

* Minor isort format and cleanup of utils.py

* Removed case sensitivity from the script - that was confusing people who just pasted or tried to run the code without a new SparkSession.

* motif.py now matches tutorial code, runs and handles case insensitivity.

* 1 of 3: Build a `graphframes` Python package during the build process (#512)

* Converted tests to pytest. Build a Python package. Update requirements.txt and split out requirements-dev.txt. Version bumps.

* Restore Python .gitignore

* Extra newline removed

* Added VERSION file set to 0.8.5

* isort; fixed edgesDF variable name.

* Back out Dockerfile changes

* Back out version change in build.sbt

* Backout changes to config and run-tests

* Back out pytest conversion

* Back out version changes to make nose tests pass

* Remove changes to requirements

* Put nose back in requirements.txt

* Remove version bump to version.sbt

* Remove packages related to testing

* Remove old setup.py / setup.cfg

* New pyproject.toml and poetry.lock

* Short README for Python package, poetry won't allow a ../README.md path

* Remove requirements files in favor of pyproject.toml

* Try to poetrize CI build

* pyspark min 3.4

* Local python README in pyproject.toml

* Trying to remove the working folder to debug a Scala issue

* Set Python working directory again

* Accidental newline

* Install Python for test...

* Run tests from python/ folder

* Try running tests from python/

* poetry run the unit tests

* poetry run the tests

* Try just using 'python' instead of a path

* poetry run the last line, graphframes.main

* Remove test/ folder from style paths, it doesn't exist

* Remove .vscode

* VERSION back to 0.8.4

* Remove tutorials reference

* VERSION is a Python thing, it belongs in python/

* Include the README.md and LICENSE in the Python package

* Some classifiers for pyproject.toml

* Trying poetry install action instead of manual install

* Removing SPARK_HOME

* Returned SPARK_HOME settings

* Set up a 'graphframes stackexchange' command.

* Make graphframes.tutorials.motif use a unique checkpoint dir, obtained from SparkSession.sparkContext. Use click.echo instead of print

* Use spark.sparkContext.setCheckpointDir directly instead of instantiating a SparkContext. print-->click.echo

* Using 'from __future__ import annotations' instead of List and Tuple

* Now retry three times if we can't connect for any reason in 'graphframes stackexchange' command.
4 participants