Segmentation faults and uninitialised wait sets #478

@gbiggs

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • Binaries, from source
  • Version or commit hash:
    • Binaries: 1.2.1-1focal.20201007.210239
    • Source: f54c74b
  • DDS implementation:
    • rmw_fastrtps_cpp
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

  1. Check out the rmf_core repository into a workspace: https://github.com/osrf/rmf_core
  2. Switch to the fastdds_segfaults branch
  3. Compile the rmf_fleet_adapter package:
    rosdep install --from-paths src --ignore-src -yr
    colcon build --packages-up-to rmf_fleet_adapter
    
  4. Execute the small program that reliably triggers the segmentation fault:
    source install/setup.bash
    ./build/rmf_fleet_adapter/segfaulter
    

Expected behavior

The sample program completes successfully without any errors.

Actual behavior

After the first couple of iterations, most iterations of the sample program either fail to destroy a wait set or crash with a segmentation fault inside rmw_fastrtps_cpp code.

Example output:

$ ./build/rmf_fleet_adapter/segfaulter   
0
[INFO] [1604462249.727237249] [test_node_0]: Added a robot named [test_robot] with participant ID [0]
1
[INFO] [1604462251.764308128] [test_node_1]: Added a robot named [test_robot] with participant ID [0]
2
[INFO] [1604462253.797161121] [test_node_2]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
3
[INFO] [1604462255.833644587] [test_node_3]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
4
[INFO] [1604462257.876494584] [test_node_4]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
5
[INFO] [1604462259.924830354] [test_node_5]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
6
[INFO] [1604462261.974901964] [test_node_6]: Added a robot named [test_robot] with participant ID [0]
7
[INFO] [1604462264.032900604] [test_node_7]: Added a robot named [test_robot] with participant ID [0]
8
[INFO] [1604462266.085180435] [test_node_8]: Added a robot named [test_robot] with participant ID [0]
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
"/home/geoff/src/workspaces/ros2_foxy_debug/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp":__function__:150"failed to destroy wait set": ros discovery info listener thread will shutdown ...
zsh: segmentation fault (core dumped)  ./build/rmf_fleet_adapter/segfaulter

Additional information

We have traced both errors to the node_listener function in listener_thread.cpp.

For the wait set deletion failure, the error occurs when the context is deallocated and a new one is allocated in the same memory before the node_listener function returns. The listener thread then tries to destroy a wait set through a pointer that is now null. The null pointer check in rmw_fastrtps_shared_cpp::__rmw_destroy_wait_set catches this and returns an error, which triggers the error message.

The segmentation fault has a similar cause. The context is deallocated and a new one is allocated in the same memory. This time the listener thread tries to use a member of the zero-initialised context. That member is a null pointer, and dereferencing it triggers the segmentation fault.

In both cases, we have not been able to trace where the context is being overwritten. Both errors appear to be race conditions, and as far as we can tell they are occurring inside the rmw_fastrtps_cpp code.

The sample program is a cut-down version of one of our tests. The test passed with the version of Fast RTPS shipped in Eloquent and started failing with the move to Fast DDS in Foxy. It starts several threads to handle ROS messages at the rclcpp level, and the test itself hammers the ROS initialisation and finalisation machinery, creating and destroying contexts constantly and rapidly.
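
To make the suspected lifetime problem concrete, here is a minimal, self-contained sketch of the pattern in plain C++. It is not the rmw_fastrtps code: Context, WaitSet, destroy_wait_set, listener_loop, create_context, destroy_context and run_stress are all hypothetical stand-ins. It only assumes what is described above, namely that a listener thread holds a raw pointer to a context and destroys a wait set through it on exit. The sketch shows a teardown ordering that avoids the hazard by joining the listener before the context memory is released. We have not confirmed that this is where the real ordering goes wrong.

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical stand-ins; these are NOT the rmw_fastrtps types.
struct WaitSet {};

struct Context {
  std::atomic<bool> running{true};
  std::unique_ptr<WaitSet> wait_set{std::make_unique<WaitSet>()};
  std::thread listener;
};

// Mirrors the null check described for __rmw_destroy_wait_set:
// a null wait set is reported as an error instead of being dereferenced.
bool destroy_wait_set(std::unique_ptr<WaitSet> & wait_set) {
  if (!wait_set) {
    return false;
  }
  wait_set.reset();
  return true;
}

// Listener loop: polls until asked to stop, then destroys the wait set
// through its raw Context pointer. This is only valid while *ctx is alive.
// If the context were freed and its memory reused before this function
// returned, ctx->wait_set could read as null (the error message) or
// another member could be dereferenced as null (the segfault).
void listener_loop(Context * ctx, std::atomic<int> * failures) {
  while (ctx->running.load()) {
    std::this_thread::sleep_for(std::chrono::microseconds(50));
  }
  if (!destroy_wait_set(ctx->wait_set)) {
    failures->fetch_add(1);
  }
}

Context * create_context(std::atomic<int> * failures) {
  auto * ctx = new Context();
  ctx->listener = std::thread(listener_loop, ctx, failures);
  return ctx;
}

// Safe teardown: signal the listener, join it, and only then free the
// memory, so the listener never observes a recycled context.
void destroy_context(Context * ctx) {
  ctx->running.store(false);
  ctx->listener.join();
  delete ctx;
}

// Creates and destroys contexts back to back, as the test does, and
// returns how many times the listener failed to destroy its wait set.
int run_stress(int iterations) {
  std::atomic<int> failures{0};
  for (int i = 0; i < iterations; ++i) {
    Context * ctx = create_context(&failures);
    destroy_context(ctx);
  }
  return failures.load();
}
```

With the join placed before the delete, run_stress(200) returns 0, because every listener finishes while its own context is still alive. Moving the delete ahead of the join, or detaching the thread, would reintroduce the use-after-free described above. That variant is undefined behaviour, so it is deliberately left out.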
