Grains on SiloA deactivated because of SiloB shutting down #4217

@emilevr

Hi there.

I've found an issue where application grains on one silo are often deactivated because another silo shut down. This seems like a serious reliability problem to me: any grain can be deactivated anywhere in the cluster, at any time, even though its hosting silo is not affected.

Hoping this is not expected behaviour, as any grain that uses timers and does not regularly receive messages could be deactivated in this way.

It's easy to reproduce (see the steps at the end), and I have been able to repro it in v1.5.1 and v1.5.3, using both memory storage and Azure Table storage (via the emulator).

Here are the entries in the log that precede the deactivation:

[2018-03-14 05:57:13.401 GMT    10      INFO    100524  Catalog 10.1.26.61:11111]       Catalog is deactivating 11 activations due to a failure of silo S10.1.26.61:11112:258702790/x177DA9FF, since it is a primary directory partition to these grain ids.
[2018-03-14 05:57:13.407 GMT    14      INFO    103004  VirtualBucketsRingProvider      10.1.26.61:11111]       Removed Server S10.1.26.61:11112:258702790/x177DA9FF. Current view: [S10.1.26.61:11111:258702592 -> <MultiRange: Size=x100000000, %Ring=100.000%>]
[2018-03-14 05:57:13.419 GMT    10      INFO    100541  Catalog 10.1.26.61:11111]       DeactivateActivations: total 11 to shutdown, out of them 11 promptly, 0 later when become idle and 0 are already being destroyed or invalid.
[2018-03-14 05:57:13.425 GMT    14      INFO    103005  VirtualBucketsRingProvider      10.1.26.61:11111]       -NotifyLocalRangeSubscribers about old <MultiRange: Size=x863E5CFB, %Ring=52.439%> new <MultiRange: Size=x100000000, %Ring=100.000%> increased? True
[2018-03-14 05:57:13.442 GMT    10      INFO    100503  Catalog 10.1.26.61:11111]       Starting DestroyActivations #0 of 11 activations
[2018-03-14 05:57:13.447 GMT    14      INFO    100612  MembershipOracle        10.1.26.61:11111]   Will watch (actively ping) 0 silos: []
[2018-03-14 05:57:13.447 GMT    13      INFO    102934  Orleans.Runtime.ReminderService.LocalReminderService    10.1.26.61:11111]    My range changed from <MultiRange: Size=x863E5CFB, %Ring=52.439%> to <MultiRange: Size=x100000000, %Ring=100.000%> increased = True
[2018-03-14 05:57:13.476 GMT     8      ERROR   -1      OrleansTest.PubSubGrain ]       !!!!!!!!!! PubSubGrain on silo S10.1.26.61:11111:258702592 being deactivated because of shutdown of another silo!

The last entry above is an error that I log myself in OnDeactivateAsync() of the grain that should not be deactivated. That grain was created on the initial silo and requests delayed deactivation in OnActivateAsync().
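To illustrate what the first Catalog message appears to describe (activations being dropped because their directory partition lived on the failed silo), here is a minimal Python sketch. It is only a model under stated assumptions: the ring positions, the grain points, and the "first silo at or after the point" ownership rule are hypothetical and are not taken from the Orleans source. Only the ring size comes from the log.

```python
from bisect import bisect_left

# Ring size as reported in the log ("Size=x100000000").
RING_SIZE = 0x100000000


def directory_owner(point, silo_positions):
    """Return the silo whose ring position is the first at or after `point`.

    Assumption: ownership wraps around to the lowest-positioned silo.
    This is an illustrative rule, not Orleans' actual range assignment.
    """
    ordered = sorted(silo_positions.items(), key=lambda item: item[1])
    positions = [position for _, position in ordered]
    index = bisect_left(positions, point % RING_SIZE)
    return ordered[index % len(ordered)][0]


def orphaned_activations(activations, silo_positions, failed_silo):
    """List grains hosted on a surviving silo whose directory entry
    was owned by the failed silo.

    `activations` maps a grain id to (hosting_silo, ring_point).
    """
    return sorted(
        grain
        for grain, (host, point) in activations.items()
        if host != failed_silo
        and directory_owner(point, silo_positions) == failed_silo
    )


if __name__ == "__main__":
    # Hypothetical ring positions for two silos.
    silos = {"SiloA": 0x40000000, "SiloB": 0xC0000000}
    # Hypothetical grains: (hosting silo, ring point of the grain id).
    activations = {
        "grain-1": ("SiloA", 0x10000000),  # hosted on A, directory on A
        "grain-2": ("SiloA", 0x80000000),  # hosted on A, directory on B
        "grain-3": ("SiloB", 0x90000000),  # hosted on B, lost with B anyway
    }
    print(orphaned_activations(activations, silos, "SiloB"))
```

In this model, grain-2 is the kind of activation I am seeing deactivated: it runs on the surviving silo, but its directory entry was held by the silo that went away.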

I've attached a simple project that makes it easy to repro and debug, by following these steps:

  • Open OrleansTest.sln in VS 2017, build it, and set the debug executable to the OrleansHost.exe in the bin/Debug directory.
  • Start the Azure Storage Emulator. Alternatively rename the included OrleansConfiguration_memorystorage.xml and OrleansConfiguration2_memorystorage.xml files to OrleansConfiguration.xml and OrleansConfiguration2.xml, respectively.
  • Start Debug session.
  • Wait for first silo to start successfully and for one of the messages to be logged, e.g. "Processing message 'some message' in grain ..."
  • Open command line in bin/Debug directory.
  • Start the second silo by executing "OrleansHost.exe SecondSilo OrleansConfiguration2.xml".
  • Wait for this second silo to start logging "Processing message 'some message' in grain ...". This indicates that a grain on the second silo is now also receiving messages.
  • Press Ctrl+C to stop the second silo.
  • About 60% of the time, the grain sending the messages (which started on the first silo) is deactivated because the second silo was stopped. The other 40% of the time, the grains from the second silo are activated on the first silo and everything continues working as expected.
  • If the incorrect deactivation does not occur, repeat from the step that starts the second silo until it does.
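When repeating the steps above, it can help to check the first silo's log mechanically instead of by eye. The following is a rough Python sketch that scans log lines for the Catalog failure message and for the error my grain logs. The function name `summarize_log` is hypothetical, and the patterns are copied from the log excerpt in this issue, so they may need adjusting for other Orleans versions or log formats.

```python
import re

# Pattern based on the Catalog message quoted in this issue (event 100524).
CATALOG_FAILURE = re.compile(
    r"Catalog is deactivating (?P<count>\d+) activations "
    r"due to a failure of silo (?P<silo>\S+),"
)

# Text of the error logged from the repro grain's OnDeactivateAsync().
GRAIN_MARKER = "being deactivated because of shutdown of another silo"


def summarize_log(lines):
    """Return ([(failed_silo, activation_count), ...], marker_seen).

    `marker_seen` is True if the repro grain reported its own deactivation.
    """
    events = []
    marker_seen = False
    for line in lines:
        match = CATALOG_FAILURE.search(line)
        if match:
            events.append((match.group("silo"), int(match.group("count"))))
        if GRAIN_MARKER in line:
            marker_seen = True
    return events, marker_seen


if __name__ == "__main__":
    import sys

    # Usage: pipe or redirect the first silo's log into this script.
    found, seen = summarize_log(sys.stdin)
    for silo, count in found:
        print(f"{count} activations deactivated after failure of {silo}")
    print("repro grain deactivated:", seen)
```

A run counts as a successful repro when the grain marker is reported as seen after the second silo has been stopped.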

Thanks.

OrleansDeactivationRepro.zip
