Skip to content

Conversation

EnricoMi
Copy link
Contributor

This guards against columns that contain dots. These column names must be quoted to avoid that Spark parses them again.

Fixes following exception when your vertices DataFrame contains columns with dots (e.g. "a.name"):

Cannot resolve column name "a.name" among (id, a.name); did you mean to quote the `a.name` column?;
org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.name" among (id, a.name); did you mean to quote the `a.name` column?;
	at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:270)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
	at org.apache.spark.sql.Dataset.col(Dataset.scala:1353)
	at org.apache.spark.sql.Dataset.apply(Dataset.scala:1320)
	at org.graphframes.GraphFrame$.$anonfun$nestAsCol$1(GraphFrame.scala:820)
	...

Fixes this issue also in other places.

@codecov-commenter
Copy link

codecov-commenter commented Aug 17, 2020

Codecov Report

Merging #370 into master will increase coverage by 0.02%.
The diff coverage is 76.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage   91.26%   91.28%   +0.02%     
==========================================
  Files          18       18              
  Lines         870      872       +2     
  Branches       48       56       +8     
==========================================
+ Hits          794      796       +2     
  Misses         76       76              
Impacted Files Coverage Δ
src/main/scala/org/graphframes/GraphFrame.scala 89.27% <71.42%> (+0.09%) ⬆️
.../scala/org/graphframes/lib/GraphXConversions.scala 84.61% <75.00%> (ø)
...main/scala/org/graphframes/lib/ShortestPaths.scala 96.00% <100.00%> (ø)
...main/scala/org/graphframes/lib/TriangleCount.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7209013...32e45a5. Read the comment docs.

@EnricoMi
Copy link
Contributor Author

@mengxr @felixcheung how do you feel about this fix?

@codecov-io
Copy link

codecov-io commented Nov 16, 2020

Codecov Report

Merging #370 (b6d5812) into master (2a0eeb0) will increase coverage by 0.02%.
The diff coverage is 76.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage   91.26%   91.28%   +0.02%     
==========================================
  Files          18       18              
  Lines         870      872       +2     
  Branches       48       56       +8     
==========================================
+ Hits          794      796       +2     
  Misses         76       76              
Impacted Files Coverage Δ
src/main/scala/org/graphframes/GraphFrame.scala 89.27% <71.42%> (+0.09%) ⬆️
.../scala/org/graphframes/lib/GraphXConversions.scala 84.61% <75.00%> (ø)
...main/scala/org/graphframes/lib/ShortestPaths.scala 96.00% <100.00%> (ø)
...main/scala/org/graphframes/lib/TriangleCount.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a0eeb0...b6d5812. Read the comment docs.

@EnricoMi
Copy link
Contributor Author

@mengxr @felixcheung @WeichenXu123 ping

@EnricoMi EnricoMi force-pushed the backtick-quote-column-names branch from b6d5812 to ca9fd56 Compare February 19, 2021 11:04
@rjurney
Copy link
Collaborator

rjurney commented Sep 7, 2023

Oh thank you, I hate transforming to underscores!

@SemyonSinchenko
Copy link
Collaborator

@EnricoMi Hello! Is there any chance you can update the PR and resolve conflicts? It would be a nice feature for GraphFrames!

@EnricoMi EnricoMi force-pushed the backtick-quote-column-names branch 2 times, most recently from 5084ca8 to 104b030 Compare February 27, 2025 14:39
@EnricoMi
Copy link
Contributor Author

Rebased. Thanks for looking into this.

@SemyonSinchenko
Copy link
Collaborator

@rjurney @bjornjorgensen Hello! What do you think about this one? For me it looks very cool, I think it is very nice to have feature for GraphFrames!

@bjornjorgensen
Copy link
Contributor

cool thing.. but its something that should have been in spark not here..

@bjornjorgensen
Copy link
Contributor

.. and it is a breaking change to.

@SemyonSinchenko
Copy link
Collaborator

.. and it is a breaking change to.

To be honest I don't see why is it a breaking change? For end users everything should be exactly the same, until they are doing something similar on their side. From the other side complex pipelines with graphframes with column-names generation and similar things it should be much more smooth with that change.

@bjornjorgensen
Copy link
Contributor

.. and it is a breaking change to.

To be honest I don't see why is it a breaking change? For end users everything should be exactly the same, until they are doing something similar on their side. From the other side complex pipelines with graphframes with column-names generation and similar things it should be much more smooth with that change.

What you don't see any braking changes, but you explain them in the 3 next lines!

I don't understand what this have to do with graphframes as this project code..

@EnricoMi
Copy link
Contributor Author

.. and it is a breaking change to.

You mean if column names are backticked already? That should be doable.

Any other use case that breaks due to this change?

@bjornjorgensen
Copy link
Contributor

.. and it is a breaking change to.

You mean if column names are backticked already? That should be doable.

yes, and I think that it's very common to have.

Any other use case that breaks due to this change?

Not that I can think of right now.

@bjornjorgensen
Copy link
Contributor

like pandas json_normalize uses sep='.' as default. https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html

@EnricoMi EnricoMi force-pushed the backtick-quote-column-names branch from 104b030 to 5103ed5 Compare February 28, 2025 09:17
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Feb 28, 2025

but its something that should have been in spark not here..
.. and it is a breaking change to.

Looking more closely, this is not a breaking change. And Spark API expects the calling site to properly escape column names if they contain dots (unless they reference nested columns). Spark won't change that, otherwise, that would be a breaking change on their side. Since Spark 4.0.0, they even provide methods in QuotingUtils that can be used in user code (similar to quote implemented here).

Backticks are only required when referencing a column by its name (select, col, ...). A column name in the schema is not escaped by backticks. This pull requests only touches code where Graphframes references columns by their name, and that name is taken from the schema (not the user calling Graphframes API). Hence, the user cannot provide backticed column names, Graphframes is responsible for properly escaping the column names.

Without this fix, users of Graphframes have to workaround this by renaming affected columns:
https://github.com/G-Research/spark-dgraph-connector/blob/09dacbf/src/main/scala/uk/co/gresearch/spark/dgraph/graphframes/package.scala#L42-L44

@SemyonSinchenko
Copy link
Collaborator

@bjornjorgensen Did you have a chance to check the latest commits? It seems to me that it should cover all of your concerns and nothing else is blocking merging. Thanks in advance!

@bjornjorgensen
Copy link
Contributor

hm... well I think its ok..

If we did have branches for graphframes we didn't needed this for branch 4

I think we can have it, but if there are some user problems with it we will need to remove it.

@SemyonSinchenko
Copy link
Collaborator

@rjurney What do you think about it? For me it is ready to merge.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Mar 4, 2025

If we did have branches for graphframes we didn't needed this for branch 4

With "branch 4" I presume you refer to Spark4. What is different there? Do you imply backticks are not needed with Spark4? How are ambiguous column names resolved there?

@bjornjorgensen
Copy link
Contributor

If we did have branches for graphframes we didn't needed this for branch 4

With "branch 4" I presume you refer to Spark4. What is different there? Do you imply backticks are not needed with Spark4? How are ambiguous column names resolved there?

I was thinking about this "Note: This can be replaced with org.apache.spark.sql.catalyst.util.QuotingUtils.quoteIfNeeded

  • once support for Spark 3 has been dropped"
    and yes, branch 4 is a branch for spark 4

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Mar 5, 2025

Since Spark 4.0.0, they even provide methods in QuotingUtils that can be used in user code

With "user code" I meant code that uses the Spark API, which in our case is Graphframes code. So once Spark3 support is dropped by Graphframes, the quote implementation of this PR can be removed and replaced with calling into QuotingUtils. So most of this PR remains.

@SemyonSinchenko
Copy link
Collaborator

@bjornjorgensen Were your concerns resolved by the latest commits? If so, could you approve the PR?

cc: @rjurney @SauronShepherd could you take a look too? If you are fine with it, I would like to resolve conflicts and merge it. I see it as a very important feature.

@bjornjorgensen
Copy link
Contributor

LGTM

This guards against columns that contain dots. These column names must be quoted
to avoid that Spark parses them again.
@EnricoMi EnricoMi force-pushed the backtick-quote-column-names branch from 5103ed5 to 7a60f7a Compare March 28, 2025 18:42
@EnricoMi
Copy link
Contributor Author

Conflict resolved.

@SemyonSinchenko SemyonSinchenko self-requested a review March 28, 2025 18:50
@SauronShepherd
Copy link
Contributor

Shouldn't we add some tests to check that GraphFrames actually works fine with nested columns?

@SemyonSinchenko
Copy link
Collaborator

Shouldn't we add some tests to check that GraphFrames actually works fine with nested columns?

There are two tests in this PR already, do you think there should be more?

@SauronShepherd
Copy link
Contributor

SauronShepherd commented Mar 28, 2025 via email

@SemyonSinchenko
Copy link
Collaborator

@SauronShepherd Hi! Were your concerns resolved? I would like to merge this one to avoid even more merge conflicts, because this PR changes a lot of files.

@SauronShepherd
Copy link
Contributor

Sorry, I didn't have time to look at it. Go ahead.

@SemyonSinchenko SemyonSinchenko merged commit 2f3db30 into graphframes:master Apr 2, 2025
5 checks passed
@SemyonSinchenko
Copy link
Collaborator

Thanks for the contribution @EnricoMi !

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Apr 2, 2025

Thanks for merging!

@EnricoMi EnricoMi deleted the backtick-quote-column-names branch April 2, 2025 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants