[core] Unserializable Exceptions should fallback gracefully #55171

@richardliaw

Description

Ray often cannot serialize an exception because some part of the user's program is not serializable. As a result, users end up seeing an unactionable error message such as:

  File "/usr/local/lib/python3.11/dist-packages/ray/exceptions.py", line 45, in from_bytes
    return RayError.from_ray_exception(ray_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ray/exceptions.py", line 54, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception

(from #50138)

or

(TaskRunner pid=530440) Failed to unpickle serialized exception
(TaskRunner pid=530440) Traceback (most recent call last):
(TaskRunner pid=530440)   File "/usr/local/lib/python3.12/site-packages/ray/exceptions.py", line 51, in from_ray_exception
(TaskRunner pid=530440)     return pickle.loads(ray_exception.serialized_exception)
(TaskRunner pid=530440)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=530440) TypeError: BackendCompilerFailed.__init__() missing 1 required positional argument: 'inner_exception'
(TaskRunner pid=530440) 
(TaskRunner pid=530440) The above exception was the direct cause of the following exception:
(TaskRunner pid=530440) 
(TaskRunner pid=530440) Traceback (most recent call last):
(TaskRunner pid=530440)   File "/usr/local/lib/python3.12/site-packages/ray/_private/serialization.py", line 460, in deserialize_objects
(TaskRunner pid=530440)     obj = self._deserialize_object(data, metadata, object_ref)
(TaskRunner pid=530440)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=530440)   File "/usr/local/lib/python3.12/site-packages/ray/_private/serialization.py", line 342, in _deserialize_object
(TaskRunner pid=530440)     return RayError.from_bytes(obj)
(TaskRunner pid=530440)            ^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=530440)   File "/usr/local/lib/python3.12/site-packages/ray/exceptions.py", line 45, in from_bytes
(TaskRunner pid=530440)     return RayError.from_ray_exception(ray_exception)
(TaskRunner pid=530440)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=530440)   File "/usr/local/lib/python3.12/site-packages/ray/exceptions.py", line 54, in from_ray_exception
(TaskRunner pid=530440)     raise RuntimeError(msg) from e
(TaskRunner pid=530440) RuntimeError: Failed to unpickle serialized exception

(from #54341)


A workaround is for the user to wrap the failing block in try/except and re-raise a different, serializable exception. Often, though, the exception is raised inside third-party library code, where the user cannot easily intercept it.
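The workaround described above can be sketched as a decorator. This is a minimal illustration, not part of Ray: `reraise_as_serializable`, `NeedsKwargs`, and `task` are hypothetical names, and the sketch assumes that a plain `RuntimeError` carrying only a string is acceptable to the caller.

```python
import functools
import pickle
import traceback


def reraise_as_serializable(fn):
    """Re-raise any exception from `fn` as a RuntimeError that holds only text."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            details = traceback.format_exc()
            # `from None` suppresses the chained-exception display; the original
            # type, message, and traceback are already copied into the message.
            raise RuntimeError(f"{type(exc).__name__}: {exc}\n{details}") from None

    return wrapper


class NeedsKwargs(Exception):
    """Hypothetical stand-in for a third-party exception that fails to unpickle."""

    def __init__(self, message, *, response):
        super().__init__(message)
        self.response = response


@reraise_as_serializable
def task():
    raise NeedsKwargs("invalid key", response=401)


try:
    task()
except RuntimeError as err:
    # A RuntimeError built from a single string survives a pickle round trip.
    restored = pickle.loads(pickle.dumps(err))
    print(restored)
```

In a Ray program the wrapped function would presumably be what gets passed to `ray.remote`, so that only the string-based `RuntimeError` has to cross the process boundary. The cost is that callers can no longer catch the original exception type.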

Ray can continue to raise the RuntimeError, but it should also include a string representation of the original exception and its stack trace, so that users can see what actually failed.
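One way the proposed fallback could work is sketched below, using only the standard library. This is not Ray's implementation: `ExceptionEnvelope`, `pack`, and `unpack` are hypothetical names, and the sketch assumes the formatted traceback can be shipped alongside the pickled bytes.

```python
import pickle
import traceback
from dataclasses import dataclass


@dataclass
class ExceptionEnvelope:
    """Hypothetical wire format: pickled bytes plus a plain-text copy."""

    pickled: bytes
    type_name: str
    formatted: str


def pack(exc):
    """Capture the exception as bytes and, independently, as readable text."""
    formatted = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    try:
        pickled = pickle.dumps(exc)
    except Exception:
        pickled = b""  # not picklable at all; the text copy is still available
    return ExceptionEnvelope(pickled, type(exc).__qualname__, formatted)


def unpack(envelope):
    """Return the original exception, or a RuntimeError that embeds the text copy."""
    try:
        return pickle.loads(envelope.pickled)
    except Exception as err:
        fallback = RuntimeError(
            f"Failed to unpickle serialized exception ({envelope.type_name}); "
            f"original traceback:\n{envelope.formatted}"
        )
        fallback.__cause__ = err  # keep the unpickling failure for debugging
        return fallback


class NeedsKwargs(Exception):
    """Hypothetical exception whose __init__ cannot be replayed from args alone."""

    def __init__(self, message, *, response):
        super().__init__(message)
        self.response = response


try:
    raise NeedsKwargs("invalid key", response=401)
except NeedsKwargs as exc:
    envelope = pack(exc)

recovered = unpack(envelope)
print(recovered)
```

The receiving side still gets a `RuntimeError`, as today, but its message now names the original exception type and carries the original traceback text.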

Reproducible Script

import openai
import ray
from openai import AuthenticationError


def call_openai_and_error_out():
    client = openai.OpenAI(
        base_url="https://api.endpoints.anyscale.com/v1",
        api_key="test",
    )
    try:
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a chatbot."},
                {"role": "user", "content": "What is the capital of France?"},
            ],
        )
    except AuthenticationError as e:
        print("Errored as expected given API key is invalid.")
        raise e

remote_fn = ray.remote(call_openai_and_error_out)
ray.get(remote_fn.remote())

This gives a non-actionable stack trace, like:

2025-08-02 14:19:36,726 ERROR serialization.py:462 -- Failed to unpickle serialized exception
Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/exceptions.py", line 51, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
TypeError: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/serialization.py", line 460, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/serialization.py", line 342, in _deserialize_object
    return RayError.from_bytes(obj)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/exceptions.py", line 45, in from_bytes
    return RayError.from_ray_exception(ray_exception)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/exceptions.py", line 54, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
Traceback (most recent call last):
  File "/Users/rliaw/dev/proteins/_test.py", line 31, in <module>
    ray.get(remote_fn.remote())
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2782, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 931, in get_objects
    raise value
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/exceptions.py", line 51, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
TypeError: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/serialization.py", line 460, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/_private/serialization.py", line 342, in _deserialize_object
    return RayError.from_bytes(obj)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/exceptions.py", line 45, in from_bytes
    return RayError.from_ray_exception(ray_exception)
  File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/ray/exceptions.py", line 54, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
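The `TypeError` at the root of this trace can be reproduced without Ray or openai. By default, an exception is pickled as its class plus `self.args`, so unpickling calls `__init__` again with only the positional arguments that were passed to `super().__init__()`. The sketch below uses a hypothetical `StatusError` class as a stand-in for an exception with required keyword-only arguments; it is not the openai class.

```python
import pickle


class StatusError(Exception):
    """Hypothetical stand-in for an exception with required keyword-only arguments."""

    def __init__(self, message, *, response, body):
        super().__init__(message)  # only `message` ends up in self.args
        self.response = response
        self.body = body


def round_trip(exc):
    """Pickle and unpickle an exception; return the result or the TypeError raised."""
    try:
        return pickle.loads(pickle.dumps(exc))
    except TypeError as err:
        return err


# Pickling succeeds; unpickling calls StatusError("invalid key") and fails
# because `response` and `body` are not supplied.
result = round_trip(StatusError("invalid key", response=401, body=None))
print(type(result).__name__, "-", result)
```

Library authors can avoid this by defining `__reduce__` on such exceptions, but users of a third-party library usually cannot change that.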

However, it's actually possible to print the stack trace and message by wrapping the function body in a try/except:

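try:
    ...  # body of the remote function
except Exception as e:
    import traceback
    # Print the exception type and its error code (if it has one) on the worker.
    print(type(e), getattr(e, "code", None))
    # Print the full traceback while the exception is still active.
    print(traceback.format_exc())
    raise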

which then gives something much more reasonable:

(call_openai_and_error_out pid=32523) <class 'openai.NotFoundError'> None
(call_openai_and_error_out pid=32523) Traceback (most recent call last):
(call_openai_and_error_out pid=32523)   File "/Users/rliaw/dev/proteins/_test.py", line 13, in call_openai_and_error_out
(call_openai_and_error_out pid=32523)     client.chat.completions.create(
(call_openai_and_error_out pid=32523)   File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/openai/_utils/_utils.py", line 275, in wrapper
(call_openai_and_error_out pid=32523)     return func(*args, **kwargs)
(call_openai_and_error_out pid=32523)   File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 667, in create
(call_openai_and_error_out pid=32523)     return self._post(
(call_openai_and_error_out pid=32523)   File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/openai/_base_client.py", line 1208, in post
(call_openai_and_error_out pid=32523)     return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
(call_openai_and_error_out pid=32523)   File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/openai/_base_client.py", line 897, in request
(call_openai_and_error_out pid=32523)     return self._request(
(call_openai_and_error_out pid=32523)   File "/Users/rliaw/miniconda3/lib/python3.10/site-packages/openai/_base_client.py", line 988, in _request
(call_openai_and_error_out pid=32523)     raise self._make_status_error_from_response(err.response) from None
(call_openai_and_error_out pid=32523) openai.NotFoundError

All of these issues are related:

Metadata

Labels

P1 (Issue that should be fixed within a few weeks)
core (Issues that should be addressed in Ray Core)
enhancement (Request for new feature and/or capability)
jira-core
