Adding motif finding tutorial using the stats.meta.stackexchange.com data dump #473
Codecov Report
All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
- Coverage   91.43%   91.20%   -0.24%
==========================================
  Files          18       18
  Lines         829      864      +35
  Branches       52      101      +49
==========================================
+ Hits          758      788      +30
- Misses         71       76       +5
…sts. Later will make these extras?
…xchange Data Dump from Internet Archive
…orial and two Databricks blog posts on GraphFrames.
@SauronShepherd @bjornjorgensen Can you please review this PR? My motif finding tutorial is finally ready :) I want to ship it and then cut a new release. It includes a new extended README and other improvements. Please forgive me for the size. It got out of hand, and I'll create smaller PRs in the future.
@rjurney Can we split this PR into a series of smaller PRs? At least separate the infrastructure part (CI, build, gitignore, etc.) from the tutorial itself?
I agree on that, Sem.
I've only reviewed the first ones, but I have some doubts:
- The only difference in many lines seems to be the end character. Is that OK? (13c5e74)
- Why explicitly mention a specific IDE (.vscode) in the .gitignore? Maybe that's something every developer should handle on their own, according to their IDE. (1319434)
- Why exclude a data folder (python/graphframes/examples/data) that maybe shouldn't be located inside the project in the first place? Why download local test data inside the project? (8d84baa)
These are not critical points; they just crossed my mind while having a look at the PR.
On Sat, Feb 8, 2025 at 12:31, Sem wrote:
> @rjurney Can we split this PR into a series of smaller PRs? At least separate the infrastructure part (CI, build, gitignore, etc.) from the tutorial itself?
Offhand I didn't know how to break it up, but I will figure it out and do so.
The most important changes that block other PRs are related to the …
@@ -1,16 +1,16 @@
  FROM ubuntu:22.04

- ARG PYTHON_VERSION=3.8
+ ARG PYTHON_VERSION=3.9
This is for a Dockerfile that we don't use here on GitHub. Put this in its own PR, something like "Update Dockerfile".
```bash
# Interactive Scala/Java
$ spark-shell --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12
```
graphframes:0.8.3 should be 0.8.4, I believe?
## GraphFrames Internals

To learn how GraphFrames works internally to combine graph and relational queries, check out the paper [GraphFrames: An Integrated API for Mixing Graph and
Add a note about the Google user group?
- This project is compatible with Spark 2.4+. However, significant speed improvements have been
- made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest
- Spark version.
+ This project is compatible with Spark 2.4+. However, significant speed improvements have been made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest Spark version.
This should say Spark 3.4 or something like that.
- subprojects into the `docs` directory (and then also into the `_site` directory). We use a
- jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it
- may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
+ When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
Put this, dev/release_guide.md, and docs/_config.yml in their own PR: "update docs".
.withVertexColumn(
"rank",
F.lit(1.0 / numVertices),
F.coalesce(Pregel.msg(), F.lit(0.0)) * F.lit(1.0 - alpha)
This is not the same as before: `lit(0.0)) * lit(1.0 - alpha) + lit(alpha / numVertices))` seems to have been changed to `F.lit(0.0)) * F.lit(1.0 - alpha)`.
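To make this concern concrete, here is a minimal plain-Python sketch (no Spark). The `pagerank` helper, the `keep_reset_term` flag, and the three-vertex cycle are hypothetical and only for illustration; it also assumes each vertex sends `rank / out_degree` along its out-edges. It compares the two update expressions quoted above and is not a claim about what the changed file actually contains beyond the visible line.

```python
def pagerank(edges, num_vertices, alpha, iterations, keep_reset_term):
    """Toy PageRank over an edge list of (src, dst) integer pairs."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    # Every vertex starts at 1 / numVertices, as in the snippet above.
    ranks = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Sum the incoming messages for each vertex.
        msgs = [0.0] * num_vertices
        for src, dst in edges:
            msgs[dst] += ranks[src] / out_degree[src]
        # Old expression: msg * (1 - alpha) + alpha / numVertices
        # Shortened expression: msg * (1 - alpha)
        reset = alpha / num_vertices if keep_reset_term else 0.0
        ranks = [m * (1.0 - alpha) + reset for m in msgs]
    return ranks


cycle = [(0, 1), (1, 2), (2, 0)]
with_reset = pagerank(cycle, 3, alpha=0.15, iterations=10, keep_reset_term=True)
without_reset = pagerank(cycle, 3, alpha=0.15, iterations=10, keep_reset_term=False)

print(sum(with_reset))     # total rank is preserved
print(sum(without_reset))  # total rank shrinks by (1 - alpha) per iteration
```

With the `alpha / numVertices` term the ranks on this cycle keep summing to 1; without it the total decays geometrically, so the two expressions are not interchangeable.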
  resultRows = ranks.sort(ranks.id).collect()
- result = map(lambda x: x.rank, resultRows)
+ result = list(map(lambda x: x.rank, resultRows))
I don't think you need a list here, since you are using zip three lines down.
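For reference, a small sketch of the trade-off in Python 3, where `map` returns a lazy one-shot iterator. The `Row` namedtuple is a hypothetical stand-in for collected Spark rows. `zip` accepts any iterable, so the loop works either way; the `list` only matters if `result` is reused, indexed, or passed to `len`.

```python
from collections import namedtuple

# Hypothetical stand-in for rows returned by collect().
Row = namedtuple("Row", ["id", "rank"])
result_rows = [Row(0, 0.25), Row(1, 0.75)]
expected = [0.25, 0.75]

# Lazy iterator: fine for a single pass through zip.
lazy_result = map(lambda row: row.rank, result_rows)
pairs = list(zip(lazy_result, expected))

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(lazy_result)

# A materialized list can be iterated repeatedly and supports len().
list_result = [row.rank for row in result_rows]
print(pairs, leftover, len(list_result))
```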
  for a, b in zip(result, expected):
-     self.assertAlmostEqual(a, b, delta = 1e-3)
+     assert a == pytest.approx(b, abs=1e-3)
What happens with delta?
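As a point of comparison, `assertAlmostEqual(a, b, delta=d)` passes when `abs(a - b) <= d`. As I understand the pytest documentation, `pytest.approx(b, abs=d)` with only `abs` given is likewise an absolute-tolerance check. The sketch below uses only the standard library, with `math.isclose` standing in for the absolute-tolerance comparison; the sample values are made up.

```python
import math
import unittest

a, b = 0.3334, 0.3333  # made-up values that differ by 1e-4

# unittest style: raises AssertionError unless abs(a - b) <= delta.
unittest.TestCase().assertAlmostEqual(a, b, delta=1e-3)

# Absolute-only tolerance using the standard library.
close_enough = math.isclose(a, b, rel_tol=0.0, abs_tol=1e-3)

# A value 1.7e-3 away falls outside the same tolerance.
too_far = math.isclose(0.335, b, rel_tol=0.0, abs_tol=1e-3)
print(close_enough, too_far)
```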
  assert len(all1) == 1
  labels2 = labels.filter("id >= 5").select("label").collect()
- all2 = set([x.label for x in labels2])
+ all2 = {row.label for row in labels2}
What is this? Changing a set to a dict?
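For readers with the same question: braces without `key: value` pairs form a set comprehension, not a dict. A short sketch, using a hypothetical `Row` namedtuple in place of collected Spark rows:

```python
from collections import namedtuple

# Hypothetical stand-in for rows returned by collect().
Row = namedtuple("Row", ["label"])
labels2 = [Row(7), Row(7), Row(7)]

old_style = set([x.label for x in labels2])    # list first, then set
new_style = {row.label for row in labels2}     # set comprehension
as_dict = {row.label: row for row in labels2}  # a dict needs "key: value"

print(type(new_style).__name__, type(as_dict).__name__)
print(old_style == new_style)
```

Both spellings build the same set; the comprehension simply skips the intermediate list.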
  all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]
- edges = self.spark.createDataFrame(all_edges, ["src", "dst"])
+ edgesDF = self.spark.createDataFrame(all_edges, ["src", "dst"])
No. edges is another DataFrame.
@SemyonSinchenko @bjornjorgensen thank you guys VERY much for these reviews! Would you recommend I split it up before addressing the issues, or address the issues before splitting it up into multiple PRs?
I would recommend leaving it as is for now and opening a series of small PRs related to CI, pytest, build, etc.
Okay guys, diving into splitting this PR up...
…ney/motif-tutorial
@SauronShepherd @SemyonSinchenko @bjornjorgensen please have a look at #511 - the actual documentation portion of the PR. I will do a second and third one now for the docs code and build improvement stuff.
@SauronShepherd @SemyonSinchenko @bjornjorgensen @WeichenXu123 okay also created #512 and #513. I want to try to merge these and ship a new release this coming week, in advance of the GraphFrames Hackathon.
This PR makes the following additions to create a tutorial on motif finding using the stats.meta.stackexchange.com data dump at the Internet Archive. Teaching the concepts behind this powerful tool will drive increased adoption of GraphFrames.
- docs/motif-tutorial.md
- python/graphframes/examples - download.py, xml_to_parquet.py, graph.py and motif.py
- python/graphframes/examples/data
This code was originally written by me under the MIT License for a class at Connected Data London 2024 called Full Stack Graph Machine Learning. It can be found at https://github.com/Graphlet-AI/graphml-class/tree/main/graphml_class/stats
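For readers new to the idea, motif finding means searching a graph for every instance of a small structural pattern. The sketch below is a plain-Python illustration of a two-hop path pattern, roughly what the GraphFrames motif string `(a)-[e1]->(b); (b)-[e2]->(c)` expresses. The edge list and the `find_two_hop_paths` helper are hypothetical and are not taken from the tutorial code.

```python
def find_two_hop_paths(edge_list):
    """Return every (a, b, c) such that edges a->b and b->c both exist."""
    # Index destinations by source so the second hop is a dictionary lookup.
    by_src = {}
    for src, dst in edge_list:
        by_src.setdefault(src, []).append(dst)
    return [(a, b, c) for a, b in edge_list for c in by_src.get(b, [])]


# Hypothetical directed edges, e.g. "who replied to whom".
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("bob", "dave"),
]

paths = find_two_hop_paths(edges)
for path in paths:
    print(" -> ".join(path))
```

On a real dataset the same pattern would be expressed declaratively and executed as DataFrame joins rather than Python loops.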