Skip to content

Conversation

dsmith111
Copy link
Contributor

Related Issues/PRs

Directly related issues: #2293, #1701
Issue that looks to be related, however, used an older API: #614

What changes are proposed in this pull request?

Problem

When attempting to load a Spark pipeline that includes both a non-java backed, "pure Python" transformer and a LightGBM model, an AttributeError was encountered, indicating that the module com.microsoft.azure.synapse.ml.lightgbm lacked the LightGBMClassificationModel attribute. The existing java_params_patch was not sufficient to resolve this, as it only applied to the JavaParams._from_java method and did not cover scenarios where DefaultParamsReader.loadParamsInstance was used.

The previous patch for JavaParams._from_java handled the conversion of Java objects to Python objects during deserialization but did not account for the direct instantiation of classes based on metadata in DefaultParamsReader.loadParamsInstance (as is done during Pipeline/PipelineModel stage loading). As a result, the class references within the ComplexParamsMixin superclass were not correctly resolved, leading to the AttributeError in the above issues.

Solution

In similar fashion to the existing patch, this solution adds a loadParamsInstance patch which enhances the existing Python class handling condition to correct the SynapseML class name in addition to its existing pyspark class name correction.

This PR ensures that the necessary class name transformations are correctly handled during the loading process, addressing the limitation of the previous patch.

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed bug-fix works.
  • I have trained, saved and loaded these PipelineModels with and without the patch to confirm it works.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following below steps.
  1. Find the corresponding markdown file for your new feature in website/docs/documentation folder.
    Make sure you choose the correct class estimators/transformers and namespace.
  2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
  3. Make sure the DocTable points to correct API link.
  4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each python code blocks to enable auto-tests for python samples.
  6. Make sure the WebsiteSamplesTests job pass in the pipeline.

@dsmith111 dsmith111 force-pushed the dsmith111/lightgbm-fix-pipeline branch from 3391beb to 1f1bfdb Compare March 23, 2025 02:44
@dsmith111 dsmith111 marked this pull request as ready for review March 23, 2025 02:48
@dsmith111
Copy link
Contributor Author

@mhamilton723, @svotaw, tagging maintainers that have both interacted with this problem in the past.

Comment on lines 46 to 63
def setUpClass(cls):
cls.spark = SparkSession.builder \
.appName("LightGBMSerializationTests") \
.getOrCreate()

@classmethod
def tearDownClass(cls):
cls.spark.stop()

def setUp(self):
# Define a temporary directory for saving models/pipelines.
self.temp_dir = "./tmp/lightgbm_serialization_test"
if os.path.exists(self.temp_dir):
shutil.rmtree(self.temp_dir)

def tearDown(self):
if os.path.exists(self.temp_dir):
shutil.rmtree(self.temp_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our tests manage the spark and spark context objects on behalf of the user, im not sure youll need this if you copy the structure of other tests in the repo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, reused their components. Removed handling functions.

from synapse.ml.stages import SelectColumns
import synapse.ml.lightgbm as lgbm

class StringArrayToVectorTransformer(Transformer, DefaultParamsReadable, DefaultParamsWritable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a throwaway piece of code to just test serialization of a custom model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not exactly sure on the interpretation of this but stripped a fair amount of logic from the test and changed the function name/description to reflect better what it is for. Based on the update, let me know if you think if it should be further reduced. I have the single vs pipeline serialization test to demonstrate the standard single-model Python serialization/deserialization was not affected along with the pipeline test demonstrating the mentioned issue is resolved.

@mhamilton723
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter
Copy link

codecov-commenter commented Mar 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.60%. Comparing base (b7971eb) to head (a1eec3f).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2357      +/-   ##
==========================================
- Coverage   84.66%   84.60%   -0.06%     
==========================================
  Files         331      331              
  Lines       17179    17179              
  Branches     1550     1550              
==========================================
- Hits        14544    14535       -9     
- Misses       2635     2644       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mhamilton723
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dsmith111
Copy link
Contributor Author

/azp run

Copy link

Commenter does not have sufficient privileges for PR 2357 in repo microsoft/SynapseML

@mhamilton723
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit 26baa09 into microsoft:master Apr 4, 2025
61 of 63 checks passed
@dsmith111 dsmith111 deleted the dsmith111/lightgbm-fix-pipeline branch April 4, 2025 02:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants