Hi there.
I've found an issue where app grains on one silo are often deactivated because another silo shut down. This seems like a serious reliability problem to me, as any grain can essentially be deactivated anywhere in the cluster, at any time, even though its hosting silo is not affected.
Hoping this is not expected behaviour, as any grain that uses timers and does not regularly receive messages could be deactivated in this way.
It's easy to reproduce (see steps at end), and I have been able to repro it in v1.5.1 and v1.5.3, using both memory storage and Azure Table storage (via the emulator).
Here are the entries in the log that precede the deactivation:
[2018-03-14 05:57:13.401 GMT 10 INFO 100524 Catalog 10.1.26.61:11111] Catalog is deactivating 11 activations due to a failure of silo S10.1.26.61:11112:258702790/x177DA9FF, since it is a primary directory partition to these grain ids.
[2018-03-14 05:57:13.407 GMT 14 INFO 103004 VirtualBucketsRingProvider 10.1.26.61:11111] Removed Server S10.1.26.61:11112:258702790/x177DA9FF. Current view: [S10.1.26.61:11111:258702592 -> <MultiRange: Size=x100000000, %Ring=100.000%>]
[2018-03-14 05:57:13.419 GMT 10 INFO 100541 Catalog 10.1.26.61:11111] DeactivateActivations: total 11 to shutdown, out of them 11 promptly, 0 later when become idle and 0 are already being destroyed or invalid.
[2018-03-14 05:57:13.425 GMT 14 INFO 103005 VirtualBucketsRingProvider 10.1.26.61:11111] -NotifyLocalRangeSubscribers about old <MultiRange: Size=x863E5CFB, %Ring=52.439%> new <MultiRange: Size=x100000000, %Ring=100.000%> increased? True
[2018-03-14 05:57:13.442 GMT 10 INFO 100503 Catalog 10.1.26.61:11111] Starting DestroyActivations #0 of 11 activations
[2018-03-14 05:57:13.447 GMT 14 INFO 100612 MembershipOracle 10.1.26.61:11111] Will watch (actively ping) 0 silos: []
[2018-03-14 05:57:13.447 GMT 13 INFO 102934 Orleans.Runtime.ReminderService.LocalReminderService 10.1.26.61:11111] My range changed from <MultiRange: Size=x863E5CFB, %Ring=52.439%> to <MultiRange: Size=x100000000, %Ring=100.000%> increased = True
[2018-03-14 05:57:13.476 GMT 8 ERROR -1 OrleansTest.PubSubGrain ] !!!!!!!!!! PubSubGrain on silo S10.1.26.61:11111:258702592 being deactivated because of shutdown of another silo!
The last error above is logged by my own code in OnDeactivateAsync() of the grain that should not be deactivated. That grain was created on the initial silo and requests delayed deactivation in OnActivateAsync().
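To check other silo logs for the same pattern, here is a minimal Python sketch. It assumes the Catalog message keeps the exact wording shown in the excerpt above; the function name `find_forced_deactivations` and the regular expression are my own illustration, not part of Orleans.

```python
import re

# Assumption: the wording matches the Catalog entry quoted in the log excerpt.
# "activations?" also tolerates a singular form, which is a guess.
DEACTIVATION_RE = re.compile(
    r"Catalog is deactivating (?P<count>\d+) activations? "
    r"due to a failure of silo (?P<silo>\S+),"
)


def find_forced_deactivations(log_lines):
    """Return (count, failed_silo) for each matching Catalog log entry."""
    events = []
    for line in log_lines:
        match = DEACTIVATION_RE.search(line)
        if match:
            events.append((int(match.group("count")), match.group("silo")))
    return events


# Sample line copied from the log excerpt in this report.
sample = [
    "[2018-03-14 05:57:13.401 GMT 10 INFO 100524 Catalog 10.1.26.61:11111] "
    "Catalog is deactivating 11 activations due to a failure of silo "
    "S10.1.26.61:11112:258702790/x177DA9FF, since it is a primary directory "
    "partition to these grain ids.",
]

print(find_forced_deactivations(sample))
# prints [(11, 'S10.1.26.61:11112:258702790/x177DA9FF')]
```

Running this over the surviving silo's log after each repro attempt should show whether that attempt hit the unwanted deactivation, and how many activations were affected.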
I've attached a simple project that makes it easy to repro and debug, by following these steps:
- Open OrleansTest.sln in VS 2017, build it, and set the debug executable to the OrleansHost.exe in the bin/Debug directory.
- Start the Azure Storage Emulator. Alternatively, rename the included OrleansConfiguration_memorystorage.xml and OrleansConfiguration2_memorystorage.xml files to OrleansConfiguration.xml and OrleansConfiguration2.xml, respectively.
- Start Debug session.
- Wait for first silo to start successfully and for one of the messages to be logged, e.g. "Processing message 'some message' in grain ..."
- Open command line in bin/Debug directory.
- Start the second silo by executing "OrleansHost.exe SecondSilo OrleansConfiguration2.xml".
- Wait for this second silo to start logging "Processing message 'some message' in grain ...". This indicates that a grain on the second silo is now also receiving messages.
- Press Ctrl+C to stop the second silo.
- About 60% of the time, the grain sending the messages (which started on the first silo) is deactivated because the second silo was stopped. The other 40% of the time, the grains from the second silo are activated on the first silo and everything continues working as expected.
- If the incorrect deactivation does not occur, repeat from the "Start the second silo" step above until it does.
Thanks.