rjurney
Collaborator

@rjurney rjurney commented Dec 25, 2024

This PR adds a tutorial on motif finding that uses the stats.meta.stackexchange.com data dump at the Internet Archive. Teaching the concepts behind this powerful tool will drive increased adoption of GraphFrames.

  • A new tutorial on motif finding in docs/motif-tutorial.md
  • Code for the demo in python/graphframes/examples - download.py, xml_to_parquet.py, graph.py and motif.py
  • A new data folder python/graphframes/examples/data

I originally wrote this code under the MIT License for a class at Connected Data London 2024 called Full Stack Graph Machine Learning. It can be found at https://github.com/Graphlet-AI/graphml-class/tree/main/graphml_class/stats
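
For readers new to the idea, here is a minimal plain-Python sketch of what a two-hop motif such as `(a)-[e]->(b); (b)-[e2]->(c)` asks for: every pair of directed edges that chain through a shared middle vertex. It is an illustration only, not the GraphFrames implementation or the tutorial's code; the function name `find_two_hop_paths` and the toy vertex names are hypothetical.

```python
def find_two_hop_paths(edges):
    """Return every (a, b, c) where a->b and b->c are both directed edges."""
    # Index destination vertices by source so each edge can be extended cheaply.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    return [(a, b, c) for a, b in edges for c in successors.get(b, [])]


# Hypothetical toy graph: a question links to an answer, which links to a user.
toy_edges = [
    ("question1", "answer1"),
    ("answer1", "user1"),
    ("question1", "user2"),
]
paths = find_two_hop_paths(toy_edges)
print(paths)
```

Only the chain question1 -> answer1 -> user1 matches, because no edge leaves user1 or user2.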

@codecov-commenter
codecov-commenter commented Dec 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.20%. Comparing base (bc487ef) to head (74432f7).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
- Coverage   91.43%   91.20%   -0.24%     
==========================================
  Files          18       18              
  Lines         829      864      +35     
  Branches       52      101      +49     
==========================================
+ Hits          758      788      +30     
- Misses         71       76       +5     

…orial and two Databricks blog posts on GraphFrames.
@rjurney
Collaborator Author

rjurney commented Feb 7, 2025

@SauronShepherd @bjornjorgensen Can you please review this PR? My motif finding tutorial is finally ready :) I want to ship it and then cut a new release. It includes a new extended README and other improvements. Please forgive me for the size - it got out of hand - I'll create smaller PRs in the future.

@rjurney rjurney requested a review from WeichenXu123 February 8, 2025 01:03
@rjurney rjurney self-assigned this Feb 8, 2025
@rjurney rjurney linked an issue Feb 8, 2025 that may be closed by this pull request
@SemyonSinchenko
Collaborator

@rjurney Can we split this PR into a series of smaller PRs? At least separate the infrastructure part (CI, build, gitignore, etc.) from the tutorial itself?

@SauronShepherd
Contributor

SauronShepherd commented Feb 8, 2025 via email

@rjurney
Collaborator Author

rjurney commented Feb 8, 2025

Offhand I didn't know how to break it up, but I will figure it out and do so.

@SemyonSinchenko
Collaborator

> Offhand I didn't know how to break it up, but I will figure it out and do so.

The most important changes, the ones that block other PRs, are those to setup.py, the requirements and the Python tests. Can you separate these changes from the tutorial itself and the downloading scripts?

@@ -1,16 +1,16 @@
FROM ubuntu:22.04

-ARG PYTHON_VERSION=3.8
+ARG PYTHON_VERSION=3.9

Contributor

This is for a Dockerfile that we don't use here on GitHub.
Put this in its own PR, something like "Update Dockerfile".


```bash
# Interactive Scala/Java
$ spark-shell --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12
```

Contributor

graphframes:0.8.3 should be 0.8.4, I believe?


## GraphFrames Internals

To learn how GraphFrames works internally to combine graph and relational queries, check out the paper [GraphFrames: An Integrated API for Mixing Graph and

Contributor

Add a note about the Google user group?

-This project is compatible with Spark 2.4+. However, significant speed improvements have been
-made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest
-Spark version.
+This project is compatible with Spark 2.4+. However, significant speed improvements have been made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest Spark version.

Contributor

This should say Spark 3.4 or something like that.

-subprojects into the `docs` directory (and then also into the `_site` directory). We use a
-jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it
-may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the

Contributor

Put this, dev/release_guide.md and docs/_config.yml in their own PR: "Update docs".

.withVertexColumn(
"rank",
F.lit(1.0 / numVertices),
F.coalesce(Pregel.msg(), F.lit(0.0)) * F.lit(1.0 - alpha)

Contributor

This is not the same as before.
`lit(0.0)) * lit(1.0 - alpha) + lit(alpha / numVertices))` seems to have been changed to `F.lit(0.0)) * F.lit(1.0 - alpha)`.
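
To see why the missing term would matter, here is a small plain-Python sketch of the two update rules quoted in the comment. It is not the Pregel code from the PR; `updated_rank`, the three-vertex cycle and `alpha = 0.15` are assumptions chosen for illustration. Note also that the diff excerpt above is cut off, so the `+ alpha / numVertices` term may simply sit on a line that is not shown.

```python
def updated_rank(message_sum, alpha, num_vertices, include_reset_term):
    """One rank update for a single vertex.

    message_sum is the summed incoming rank, or None when no message arrived
    (the role played by coalesce(msg, lit(0.0)) in the quoted expression).
    """
    incoming = 0.0 if message_sum is None else message_sum
    rank = incoming * (1.0 - alpha)
    if include_reset_term:
        # The term the reviewer noticed was absent from the new expression.
        rank += alpha / num_vertices
    return rank


alpha = 0.15  # assumed reset probability
num_vertices = 3
# Hypothetical directed 3-cycle: every vertex starts at 1/3 and forwards
# its whole rank to its single out-neighbour.
messages = [1.0 / num_vertices] * num_vertices

with_term = [updated_rank(m, alpha, num_vertices, True) for m in messages]
without_term = [updated_rank(m, alpha, num_vertices, False) for m in messages]

print(sum(with_term))
print(sum(without_term))
```

With the reset term the ranks still sum to about 1.0 after the step; without it the total shrinks by a factor of `1 - alpha` on every iteration.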

resultRows = ranks.sort(ranks.id).collect()
-result = map(lambda x: x.rank, resultRows)
+result = list(map(lambda x: x.rank, resultRows))

Contributor

I don't think you need a list here when you are using zip three lines down.
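
A short sketch of the trade-off being discussed, in plain Python 3: `zip` accepts the lazy `map` object directly, so `list(...)` is not required for a single pass, but the lazy object can be consumed only once. The `Row` namedtuple below is a hypothetical stand-in for the collected Spark rows.

```python
from collections import namedtuple

# Hypothetical stand-in for the rows returned by collect().
Row = namedtuple("Row", ["id", "rank"])
result_rows = [Row(0, 0.1), Row(1, 0.2), Row(2, 0.3)]
expected = [0.1, 0.2, 0.3]

# Lazy map: fine for one pass through zip, empty on the second pass.
lazy = map(lambda x: x.rank, result_rows)
first_pass = list(zip(lazy, expected))
second_pass = list(zip(lazy, expected))

# Materialized list: reusable and supports len().
materialized = list(map(lambda x: x.rank, result_rows))
reusable_first = list(zip(materialized, expected))
reusable_second = list(zip(materialized, expected))

print(len(first_pass), len(second_pass), len(materialized))
```

So both versions work for the single loop in the test; the list only matters if the values are reused or their length is checked.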

for a, b in zip(result, expected):
-    self.assertAlmostEqual(a, b, delta = 1e-3)
+    assert a == pytest.approx(b, abs=1e-3)

Contributor

What happens with delta?
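
As far as I understand the pytest documentation, `pytest.approx(b, abs=1e-3)` with no `rel` argument applies only the absolute tolerance, which is the same check `delta=1e-3` performs in `unittest`. The sketch below illustrates that equivalence using only the standard library: `math.isclose` with `rel_tol=0.0` stands in for `pytest.approx`, and the sample values are made up.

```python
import math
import unittest

case = unittest.TestCase()
tolerance = 1e-3
expected = 0.3330

# A value inside the tolerance: both styles accept it.
near = 0.3334
case.assertAlmostEqual(near, expected, delta=tolerance)  # no exception raised
near_ok = math.isclose(near, expected, rel_tol=0.0, abs_tol=tolerance)

# A value outside the tolerance: both styles reject it.
far = 0.3350
try:
    case.assertAlmostEqual(far, expected, delta=tolerance)
    delta_rejects = False
except AssertionError:
    delta_rejects = True
abs_rejects = not math.isclose(far, expected, rel_tol=0.0, abs_tol=tolerance)

print(near_ok, delta_rejects, abs_rejects)
```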

assert len(all1) == 1
labels2 = labels.filter("id >= 5").select("label").collect()
-all2 = set([x.label for x in labels2])
+all2 = {row.label for row in labels2}

Contributor

What is this?
Changing a set to a dict?
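
For reference, braces around a comprehension with no `key: value` pair build a set, not a dict, so the two lines in the diff should produce the same value. A minimal check, using a hypothetical `Row` namedtuple in place of the Spark rows:

```python
from collections import namedtuple

# Hypothetical stand-in for the rows returned by collect().
Row = namedtuple("Row", ["label"])
labels2 = [Row(5), Row(5), Row(7)]

old_style = set([x.label for x in labels2])    # list comprehension, then set()
new_style = {row.label for row in labels2}     # set comprehension
as_dict = {row.label: row for row in labels2}  # a colon is what makes a dict

print(type(new_style).__name__, old_style == new_style, type(as_dict).__name__)
```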

all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]
-edges = self.spark.createDataFrame(all_edges, ["src", "dst"])
+edgesDF = self.spark.createDataFrame(all_edges, ["src", "dst"])

Contributor

No.
`edges` is another DataFrame.
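
For context on the line just above the rename, this is what the `all_edges` comprehension does in plain Python: it emits each edge in both directions. The input pairs here are made up for illustration.

```python
# Hypothetical input: two directed (src, dst) pairs.
edges = [(0, 1), (1, 2)]

# For every pair, emit it once as given and once reversed.
all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]

print(all_edges)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

After this line `edges` still refers to the original list of pairs; which name the resulting DataFrame should carry is the point under discussion in the thread.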

@rjurney
Collaborator Author

rjurney commented Feb 11, 2025

@SemyonSinchenko @bjornjorgensen thank you guys VERY much for these reviews! Would you recommend I split it up before addressing the issues, or address the issues before splitting it up into multiple PRs?

@SemyonSinchenko
Collaborator

> @SemyonSinchenko @bjornjorgensen thank you guys VERY much for these reviews! Would you recommend I split it up before addressing the issues, or address the issues before splitting it up into multiple PRs?

I would recommend leaving it as is for now and opening a series of small PRs related to CI, pytest, build, etc.

@rjurney
Collaborator Author

rjurney commented Feb 16, 2025

Okay guys, diving into splitting this PR up...

@rjurney
Collaborator Author

rjurney commented Feb 16, 2025

@SauronShepherd @SemyonSinchenko @bjornjorgensen please have a look at #511 - the actual documentation portion of the PR. I will do a second and third one now for the docs code and build improvement stuff.

@rjurney
Collaborator Author

rjurney commented Feb 16, 2025

@SauronShepherd @SemyonSinchenko @bjornjorgensen @WeichenXu123 okay also created #512 and #513. I want to try to merge these and ship a new release this coming week, in advance of the GraphFrames Hackathon.

@rjurney
Collaborator Author

rjurney commented Feb 18, 2025

Closing in favor of #511, #512 and #518.

@rjurney rjurney closed this Feb 18, 2025
@rjurney rjurney deleted the rjurney/motif-tutorial branch April 15, 2025 00:33
Development

Successfully merging this pull request may close these issues.

Add a motif finding tutorial
5 participants