Adding motif finding tutorial using the stats.meta.stackexchange.com data dump #473

rjurney · 2024-12-25T05:30:31Z

This PR makes the following additions to create a tutorial on motif finding using the stats.meta.stackexchange.com data dump from the Internet Archive. Teaching the concepts behind this powerful tool will drive increased adoption of GraphFrames.

- A new tutorial on motif finding in docs/motif-tutorial.md
- Code for the demo in python/graphframes/examples: download.py, xml_to_parquet.py, graph.py and motif.py
- A new data folder, python/graphframes/examples/data
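As a rough illustration of the concept the tutorial teaches, the sketch below finds two-hop paths (the shape a motif pattern such as `(a)-[e1]->(b); (b)-[e2]->(c)` describes) in a tiny edge list. It is plain Python, not the GraphFrames API, and the function name `two_hop_paths` and the sample node IDs are hypothetical, chosen only for this example.

```python
def two_hop_paths(edges):
    """Return (a, b, c) for every pair of edges a->b and b->c."""
    # Index destinations by source so each edge can be extended by one hop.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)
    # Conceptually a self-join of the edge list on e1.dst == e2.src.
    return [(a, b, c) for a, b in edges for c in by_src.get(b, [])]


# Hypothetical toy graph: a user linked to a question, which links to two answers.
edges = [("u1", "q1"), ("q1", "a1"), ("q1", "a2")]
paths = two_hop_paths(edges)
print(paths)
```

On a real dataset the same idea would be expressed as a motif pattern and evaluated as DataFrame joins rather than Python loops.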

This code was originally written by me under the MIT License for a class at Connected Data London 2024 called Full Stack Graph Machine Learning. It can be found at https://github.com/Graphlet-AI/graphml-class/tree/main/graphml_class/stats

codecov-commenter · 2024-12-25T05:31:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.20%. Comparing base (bc487ef) to head (74432f7).
Report is 1 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
- Coverage   91.43%   91.20%   -0.24%     
==========================================
  Files          18       18              
  Lines         829      864      +35     
  Branches       52      101      +49     
==========================================
+ Hits          758      788      +30     
- Misses         71       76       +5



rjurney · 2025-02-07T09:09:11Z

@SauronShepherd @bjornjorgensen Can you please review this PR? My motif finding tutorial is finally ready :) I want to ship it and then cut a new release. It includes a new extended README and other improvements. Please forgive me for the size - it got out of hand - I'll create smaller PRs in the future.

SemyonSinchenko · 2025-02-08T11:31:07Z

@rjurney Can we split this PR into a series of smaller PRs? At least separate the infrastructure part (CI, build, gitignore, etc.) from the tutorial itself?

SauronShepherd · 2025-02-08T11:48:54Z

I agree on that, Sem. I've only reviewed the first ones, but I have some doubts:

- The only differences in many lines seem to be the end character. Is that ok? (13c5e74)
- Why explicitly mention a concrete IDE in the .gitignore? Maybe that's something every developer should do on their own, according to their IDE. .vscode (1319434)
- Why exclude a data folder that maybe shouldn't be located inside the project in the first place? python/graphframes/examples/data. Why download local test data inside the project? (8d84baa)

These are not critical points; they just crossed my mind while having a look at the PR.

rjurney · 2025-02-08T23:34:35Z

Offhand I didn't know how to break it up, but I will figure it out and do so.

SemyonSinchenko · 2025-02-09T06:02:19Z

> Offhand I didn't know how to break it up, but I will figure it out and do so.

The most important changes, the ones blocking other PRs, are related to setup.py, the requirements and the changes in the Python tests. Can you separate these changes from the tutorial itself and the downloading scripts?

bjornjorgensen · 2025-02-09T21:45:52Z

Dockerfile

@@ -1,16 +1,16 @@
 FROM ubuntu:22.04

-ARG PYTHON_VERSION=3.8
+ARG PYTHON_VERSION=3.9


This is for a Dockerfile that we don't use here on GitHub.
Put this in its own PR, something like "Update Dockerfile".

bjornjorgensen · 2025-02-09T21:47:20Z

README.md

+
+```bash
+# Interactive Scala/Java
+$ spark-shell --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12


graphframes:0.8.3 should be 0.8.4, I believe?

bjornjorgensen · 2025-02-09T21:48:29Z

README.md

+
+## GraphFrames Internals
+
+To learn how GraphFrames works internally to combine graph and relational queries, check out the paper [GraphFrames: An Integrated API for Mixing Graph and


Add a note about the Google user group?

bjornjorgensen · 2025-02-09T21:49:06Z

README.md

-This project is compatible with Spark 2.4+.  However, significant speed improvements have been
-made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest
-Spark version.
+This project is compatible with Spark 2.4+.  However, significant speed improvements have been made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest Spark version.


Spark 3.4 or something like that.

bjornjorgensen · 2025-02-09T21:52:36Z

docs/README.md

-subprojects into the `docs` directory (and then also into the `_site` directory). We use a
-jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it
-may take some time as it generates all of the scaladoc.  The jekyll plugin also generates the
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc.  The jekyll plugin also generates the


Put this, dev/release_guide.md and docs/_config.yml in their own PR: "Update docs".

bjornjorgensen · 2025-02-09T22:07:17Z

python/graphframes/tests.py

+            .withVertexColumn(
+                "rank",
+                F.lit(1.0 / numVertices),
+                F.coalesce(Pregel.msg(), F.lit(0.0)) * F.lit(1.0 - alpha)


This is not the same as before.
lit(0.0)) * lit(1.0 - alpha) + lit(alpha / numVertices)) seems to have been changed to F.lit(0.0)) * F.lit(1.0 - alpha)
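To show why the `alpha / numVertices` term would matter if it really were dropped, here is a minimal plain-Python sketch of one PageRank update step. It is not the GraphFrames Pregel API; `pagerank_step`, `out_edges` and the three-node cycle are hypothetical names and data for illustration. The hunk quoted above is truncated, so the term may simply sit on a following line of the reformatted code.

```python
def pagerank_step(ranks, out_edges, alpha, include_reset_term=True):
    """One synchronous PageRank update; alpha is the reset probability."""
    n = len(ranks)
    incoming = {v: 0.0 for v in ranks}
    # Each vertex splits its current rank evenly among its out-neighbours.
    for src, dsts in out_edges.items():
        share = ranks[src] / len(dsts)
        for dst in dsts:
            incoming[dst] += share
    reset = alpha / n if include_reset_term else 0.0
    return {v: incoming[v] * (1.0 - alpha) + reset for v in ranks}


# Hypothetical 3-cycle a -> b -> c -> a, starting from a uniform rank.
out_edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
start = {v: 1.0 / 3 for v in out_edges}

with_term = pagerank_step(start, out_edges, alpha=0.15)
without_term = pagerank_step(start, out_edges, alpha=0.15, include_reset_term=False)

print(sum(with_term.values()))     # total rank is preserved
print(sum(without_term.values()))  # total rank shrinks by a factor of (1 - alpha)
```

Without the reset term the total rank decays every iteration, so a test comparing against expected ranks would be expected to fail if the term were actually missing.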

bjornjorgensen · 2025-02-09T22:09:30Z

python/graphframes/tests.py

        resultRows = ranks.sort(ranks.id).collect()
-        result = map(lambda x: x.rank, resultRows)
+        result = list(map(lambda x: x.rank, resultRows))


I don't think you need a list here, since you are using zip three lines down.
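A small sketch of the behaviour in question, using made-up rows rather than the test's Spark `Row` objects: in Python 3, `map` returns a lazy iterator that `zip` can consume directly, but only once. The `list(...)` wrapper matters only if the result is iterated again, indexed or passed to `len`.

```python
# Hypothetical stand-ins for the collected rows and expected values.
rows = [{"rank": 0.1}, {"rank": 0.2}]
expected = [0.1, 0.2]

# A lazy map object works fine as input to a single zip.
lazy = map(lambda r: r["rank"], rows)
pairs = list(zip(lazy, expected))

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(lazy)

# Materialising with list() allows repeated iteration and len().
eager = list(map(lambda r: r["rank"], rows))
print(pairs, leftover, len(eager))
```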

bjornjorgensen · 2025-02-09T22:10:19Z

python/graphframes/tests.py

        for a, b in zip(result, expected):
-            self.assertAlmostEqual(a, b, delta = 1e-3)
+            assert a == pytest.approx(b, abs=1e-3)


What happens with delta?
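For comparison, the sketch below runs both styles on the same made-up numbers. It assumes pytest is installed; the helper names `close_unittest` and `close_pytest` are hypothetical. With `abs=` given and no `rel=`, `pytest.approx` applies only the absolute tolerance, which plays the same role as `delta` in `assertAlmostEqual`.

```python
import unittest

import pytest


def close_unittest(a, b, tol):
    """True if unittest's assertAlmostEqual accepts a and b within delta=tol."""
    try:
        unittest.TestCase().assertAlmostEqual(a, b, delta=tol)
        return True
    except AssertionError:
        return False


def close_pytest(a, b, tol):
    """True if a equals b within an absolute tolerance, pytest style."""
    return a == pytest.approx(b, abs=tol)


# Made-up values: the first pair is within 1e-3, the second is not.
print(close_unittest(0.3330, 0.3333, 1e-3), close_pytest(0.3330, 0.3333, 1e-3))
print(close_unittest(0.30, 0.3333, 1e-3), close_pytest(0.30, 0.3333, 1e-3))
```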

bjornjorgensen · 2025-02-09T22:15:12Z

python/graphframes/tests.py

        assert len(all1) == 1
        labels2 = labels.filter("id >= 5").select("label").collect()
-        all2 = set([x.label for x in labels2])
+        all2 = {row.label for row in labels2}


What is this?
Changing a set to a dict?
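A quick check with made-up labels shows that braces around a comprehension without `key: value` pairs build a set, not a dict, so the new line is equivalent to the old `set([...])` form.

```python
# Hypothetical label values with a duplicate.
labels = [1, 1, 2]

old_style = set([x for x in labels])   # list comprehension wrapped in set()
new_style = {x for x in labels}        # set comprehension

# A dict comprehension needs key: value pairs inside the braces.
as_dict = {x: None for x in labels}

print(type(new_style).__name__, old_style == new_style, type(as_dict).__name__)
```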

bjornjorgensen · 2025-02-09T22:16:45Z

python/graphframes/tests.py

        all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]
-        edges = self.spark.createDataFrame(all_edges, ["src", "dst"])
+        edgesDF = self.spark.createDataFrame(all_edges, ["src", "dst"])


No.
edges is a different DataFrame.

rjurney · 2025-02-11T06:44:59Z

@SemyonSinchenko @bjornjorgensen thank you guys VERY much for these reviews! Would you recommend I split it up before addressing the issues, or address the issues before splitting it up into multiple PRs?

SemyonSinchenko · 2025-02-11T07:07:51Z

> @SemyonSinchenko @bjornjorgensen thank you guys VERY much for these reviews! Would you recommend I split it up before addressing the issues, or address the issues before splitting it up into multiple PRs?

I would recommend leaving it as is for now and opening a series of small PRs related to CI, pytest, build, etc.

rjurney · 2025-02-16T03:54:40Z

Okay guys, diving into splitting this PR up...

rjurney · 2025-02-16T04:34:31Z

@SauronShepherd @SemyonSinchenko @bjornjorgensen please have a look at #511 - the actual documentation portion of the PR. I will do a second and third one now for the docs code and build improvement stuff.

rjurney · 2025-02-16T05:44:37Z

@SauronShepherd @SemyonSinchenko @bjornjorgensen @WeichenXu123 okay also created #512 and #513. I want to try to merge these and ship a new release this coming week, in advance of the GraphFrames Hackathon.

rjurney · 2025-02-18T14:22:06Z

Closing in favor of #511, #512 and #518.

Make python/graphframes/examples/data folder exist

f76668d

rjurney added 28 commits December 25, 2024 00:36

Update user-guide.md section on motif finding to point at motif finding tutorial.

13c5e74

Added motif finding tutorial to index.md

1145f6f

Ignore examples data dir, .vscode

1319434

Minor grammatical fixes to docs release guide

a50c5a3

We are all grownups here. We can wrap our own text :)

e3a5ca3

Text wrapping for README.md

76a7ea7

Ankur Dave's last name is Dave, not Ankur

a013c5f

Refer to motif finding tutorial from website header

9a8d845

Major README overhaul. I am classically bad at Scala

77f0233

Added motif output

333dc1b

Added requirements for motif finding tutorial: click, py7zr and requests. Later will make these extras?

ee843b3

Script for motif finding tutorial, to download and uncompress Stack Exchange Data Dump from Internet Archive

058cdfb

Working stackexchange data dump graph building script

8d84baa

Minor README improvements.

d1cbfe4

In progress motif tutorial. Covered downloading the data and building the graph.

0654a3e

Added section 'Learn GraphFrames' that links to the motif finding tutorial and two Databricks blog posts on GraphFrames.

d5becef

install and use

781a13e

More on entity resolution re: connected components

e872602

Shorten line into two lines

a323879

More on entity resolution

e87a985

Split long lines

c18f06f

Moved example graph.py to stackexchange.py due to existence of graphs.py

6dd0375

Changed graph.py path in motif tutorial to stackexchange.py

fc80c4e

Removed memory settings

f40afe2

Added utils file for motif tutorial

f644db8

Now loading the graph nodes/edges and counting the types

8dc6432

Motif finding tutorial script

d65ff29

Long note explaining there is one node type in a GraphFrame.

587c79f

Now using properties in an aggregation of motif paths

ea89dac

rjurney added 3 commits February 7, 2025 01:12

Remove unused images

eb57303

More unused images

4dc9cc1

Sync'd tutorial with motif.py

74432f7

rjurney requested a review from WeichenXu123 February 8, 2025 01:03

rjurney self-assigned this Feb 8, 2025

rjurney added the documentation label Feb 8, 2025

rjurney linked an issue Feb 8, 2025 that may be closed by this pull request

Add a motif finding tutorial #493

Closed

rjurney mentioned this pull request Feb 8, 2025

chore: use pyproject for python dependencies and extras #505

Closed

rjurney mentioned this pull request Feb 9, 2025

feat: SparkConnect support #506

Merged

bjornjorgensen suggested changes Feb 9, 2025

View reviewed changes

Merge branch 'master' of github.com:graphframes/graphframes into rjurney/motif-tutorial

e78c654

rjurney mentioned this pull request Feb 16, 2025

3 of 3: Documentation cleanup and update. Added a motif finding tutorial. #511

Merged

This was referenced Feb 16, 2025

1 of 3: Build a graphframes Python package during the build process #512

Merged

New Python tutorial module graphframes.tutorial #513

Closed

rjurney mentioned this pull request Feb 17, 2025

2 of 3: New minimized PR for a Python tutorial module graphframes.tutorial #518

Merged

rjurney closed this Feb 18, 2025

rjurney mentioned this pull request Feb 21, 2025

Rjurney/motif tutorial code min #520

Merged

rjurney deleted the rjurney/motif-tutorial branch April 15, 2025 00:33

