
bug: graphframes-py requires grpc related dependencies #646

@Kimahriman

Describe the bug

Using the new Python package requires all Spark Connect related dependencies to be installed, even if you are not using Spark Connect.

To Reproduce

Steps to reproduce the behavior:

% pip install pyspark graphframes-py
% pyspark
Python 3.11.13 (main, Jun  3 2025, 18:38:25) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/17 13:57:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Using Python version 3.11.13 (main, Jun  3 2025 18:38:25)
Spark context available as 'sc' (master = local[*], app id = local-1752775048865).
SparkSession available as 'spark'.
>>> from graphframes import GraphFrame
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../test-venv/lib/python3.11/site-packages/graphframes/__init__.py", line 1, in <module>
    from .graphframe import GraphFrame
  File ".../test-venv/lib/python3.11/site-packages/graphframes/graphframe.py", line 39, in <module>
    from graphframes.connect.graphframe_client import GraphFrameConnect
  File ".../test-venv/lib/python3.11/site-packages/graphframes/connect/graphframe_client.py", line 4, in <module>
    from pyspark.sql.connect import proto
  File ".../test-venv/lib/python3.11/site-packages/pyspark/sql/connect/proto/__init__.py", line 18, in <module>
    from pyspark.sql.connect.proto.base_pb2_grpc import *
  File ".../test-venv/lib/python3.11/site-packages/pyspark/sql/connect/proto/base_pb2_grpc.py", line 19, in <module>
    import grpc
ModuleNotFoundError: No module named 'grpc'
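
For illustration only: the traceback shows that `graphframes/graphframe.py` imports `graphframes.connect.graphframe_client` at module load time, which in turn pulls in `grpc`. One common way to avoid this kind of hard dependency is to defer the optional import until it is actually needed. The sketch below is not the GraphFrames implementation or its eventual fix; `optional_import` and `require_optional` are hypothetical helper names, and only the standard library is assumed.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it (or one of its own
    dependencies, such as grpc) cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


def require_optional(module_name, feature):
    """Import an optional module at the point of use and fail with a
    descriptive message if it is unavailable."""
    module = optional_import(module_name)
    if module is None:
        raise ImportError(
            f"{feature} needs the optional module '{module_name}' "
            "and its dependencies, which are not installed."
        )
    return module


# Hypothetical usage: resolve the Connect client only when a Spark Connect
# session is in use, instead of at the top of the module.
# client = require_optional("graphframes.connect.graphframe_client", "Spark Connect support")
```

With this pattern, `from graphframes import GraphFrame` would not touch the Connect modules in a classic PySpark session, and a missing `grpc` would only surface when Spark Connect functionality is requested.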

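Expected behavior

`from graphframes import GraphFrame` should succeed without the Spark Connect dependencies (such as grpc) when Spark Connect is not being used.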

System:

  • OS: Mac
  • Python Version (if applied): Python 3.11
  • Spark / PySpark version: 4.0.0
  • GraphFrames version: 0.9.0

Component

  • Scala Core Internal
  • Scala API
  • Spark Connect Plugin
  • PySpark Classic
  • PySpark Connect

Additional context

Are you planning on creating a PR?

  • I'm willing to make a pull-request
