
XGBoost-Spark killing SparkContext on Task Failure #4826

@lnmohankumar

Description


Hi Team,

While running machine learning training jobs with XGBoost on our cluster, the SparkContext is always shut down after any training task failure exception. As a result, we have to restart the cluster every time to bring it back to a normal state.

While looking for the root cause, we found the code that causes the SparkContext to close. I am not sure why the SparkContext has to shut down on any task failure; this also terminates the other training model jobs running on the same context, which is not desired.

https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/SparkParallelismTracker.scala#L127
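For reference, the behavior described above boils down to a SparkListener that reacts to any task failure by stopping the shared SparkContext. The sketch below is a minimal illustration of that pattern, not the actual XGBoost code; the class name `StopContextOnTaskFailure` is made up for this example.

```scala
import org.apache.spark.{SparkContext, TaskFailedReason}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Illustrative listener: any failed task brings down the whole SparkContext,
// which is what terminates unrelated jobs sharing that context.
class StopContextOnTaskFailure extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.reason match {
      case reason: TaskFailedReason =>
        // A single task failure stops the shared context for every job.
        SparkContext.getOrCreate().stop()
      case _ => // successful tasks are ignored
    }
  }
}

// Registering such a listener on an existing context:
// sc.addSparkListener(new StopContextOnTaskFailure)
```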

The code above was rolled out in versions 0.82 and 0.90. Is it possible to fix this, or is there a reason for this change in the newer versions?
