Description
Adopting the error handling in the example code from range.py seems to defeat the reconnection behaviour of micropython-mqtt.
Is there an alternative reference example for invoking micropython-mqtt to include error-handling for long-lived applications where transient disconnections are possible? What is the proper invocation to combine a handful of scheduled async functions in such a way that the (implicit) async functions of micropython-mqtt have their errors handled properly, and the application doesn't terminate?
I created a routine based on range.py, but after a long and visibly successful test run (somewhere between 20 hours and 2 days), it hits what seem to be unrecoverable connection errors, which look like pages and pages of...
LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1
...followed by cycles of...
state: 5 -> 2 (2c0)
rm 0
LmacRxBlk:1
reconnect
state: 2 -> 0 (0)
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 1
cnt
LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1
state: 5 -> 2 (2c0)
And the application makes no apparent attempt to resume.
Previously, although failures happened, the logic of micropython-mqtt seemed to negotiate a reconnect. The only change, as far as I can see, is that I added back the finally clause (as per the original example code at range.py) to trigger client.close() on KeyboardInterrupt (see this commit).
This avoids LmacRxBlk errors when testing interactively (it handles keyboard interrupts more gracefully, without the repeated errors spewing to the console), but I suspect the finally clause and client.close() are also triggered when there is a socket error, which then prevents the system from auto-recovering from a disconnection.
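To illustrate the suspicion, here is a minimal sketch in plain Python rather than my actual code. `FakeClient`, `run_close_always`, `run_close_on_interrupt` and `socket_failure` are hypothetical names made up for this example; the fake client only counts `close()` calls and is not the micropython-mqtt client. It shows that a `finally` clause runs for a socket error just as it does for Ctrl-C, whereas an `except KeyboardInterrupt` clause does not.

```python
class FakeClient:
    """Hypothetical stand-in for the MQTT client; it only counts close() calls."""

    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1


def run_close_always(client, work):
    """Pattern with a finally clause: close() runs however work() ends."""
    try:
        work()
    finally:
        client.close()


def run_close_on_interrupt(client, work):
    """Alternative: close() runs only when the user presses Ctrl-C."""
    try:
        work()
    except KeyboardInterrupt:
        client.close()
        raise


def socket_failure():
    # Simulates a transient network error escaping the main routine.
    raise OSError("simulated socket error")
```

In both variants the `OSError` still propagates to the caller, so narrowing the clause would only stop the client being closed; something higher up would still need to catch the error and retry if the application is to keep running.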
I am not sure what triggers the disconnection and it seems probabilistic. There may be some configuration option which would improve things. I am not certain whether access-point mode is fully disabled by the invocations I am making, for example. However, I certainly wouldn't expect the ESP8266 to be 'out of range' given the access point is just a couple of metres away.
The 2 metre x 2 metre display, driven by a single ESP8266 (Wemos D1 Mini or equivalent clone), can be seen running in real time at https://photos.app.goo.gl/14ryl7bCXeYJbNdy2 which gives you an idea of the data rates involved: just the information for 16 segments of 3 'RGB' bytes, re-published lazily (only when segments change), and then forwarded by serial from MicroPython to an Arduino which controls the lights.
The lost connection can always be resolved by a hard reboot, after which the display comes back very quickly and runs for another several hours. However, I have to eliminate the need for a manual reset, and would like the error recovery and any reset to be handled in software instead, which would hopefully also be a bit faster. This is intended to be part of a display of 20 characters, so if they all have an MTBF of 20 hours, that means a glitch every hour. If we encounter any environmental factors which trigger wifi/network errors more frequently than once a day, I definitely need better error handling. The shorter and rarer I can make the failures, the better.
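The arithmetic behind "a glitch every hour" can be sketched as follows, assuming the characters fail independently and at a roughly constant rate (both are assumptions, and the function names are made up for this example).

```python
def fleet_hours_between_failures(unit_mtbf_hours, units):
    """Mean hours between failures anywhere in the display.

    Assumes independent units, so the failure rates simply add up.
    """
    return unit_mtbf_hours / units


def fleet_failures_per_day(unit_mtbf_hours, units):
    """Expected number of failures per 24 hours across all units."""
    return 24.0 * units / unit_mtbf_hours
```

With 20 characters and a 20 hour MTBF each, `fleet_hours_between_failures(20, 20)` gives one hour between glitches somewhere on the display.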
QUESTIONS
I seem to recall there is a fallback option on the ESP8266 of wiring an output pin to its own reset, and triggering a hard reset that way. In your experience, is a 'hardware self-reset' the best way to design a long-lived networked device which might occasionally enter an unrecoverable loop of errors from some OS-level network issue? It seems heavy-handed for just recovering from a transient wifi or socket error.
Is there a way to avoid these networking errors in the first place by configuring the application differently (e.g. turning off some Wifi feature)?
Finally, is there a better way to handle these errors than the finally clause in range.py? Ideally, the error handling would be designed to keep the application running rather than terminate it.
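For the last question, the kind of structure I have in mind is sketched below. It is plain `asyncio` and does not touch micropython-mqtt at all; `supervise`, `make_flaky` and `main` are hypothetical names, and the flaky task only simulates a disconnect. I am assuming the same pattern carries over to MicroPython's asyncio (`uasyncio` on older firmware), but I have not verified how it interacts with the library's own reconnection logic.

```python
import asyncio


async def supervise(name, factory, retry_delay=1.0):
    """Run factory() to completion, restarting it whenever it raises.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            await factory()
            return restarts
        except asyncio.CancelledError:
            raise  # let deliberate cancellation through
        except Exception as exc:
            restarts += 1
            print("task", name, "failed:", repr(exc), "- restart", restarts)
            await asyncio.sleep(retry_delay)


def make_flaky(failures):
    """Hypothetical task factory: raises OSError `failures` times, then succeeds."""
    state = {"left": failures}

    async def task():
        await asyncio.sleep(0)
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError("simulated disconnect")

    return task


async def main():
    # Two supervised tasks run side by side; a failure in one does not stop the other.
    return await asyncio.gather(
        supervise("display", make_flaky(2), retry_delay=0.01),
        supervise("heartbeat", make_flaky(0), retry_delay=0.01),
    )
```

Each scheduled function is wrapped individually, so an exception is logged and the function is restarted after a short delay instead of ending the whole event loop.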
Application background information
My code was derived from the range.py example, adding a bit of translation between byte frames dispatched over MQTT and serial frames sent to a 5V Arduino-compatible circuit. The Arduino acts as a 'co-processor' to service large arrays of WS2811 addressable LEDs, but could also potentially take over the role of a reset watchdog if really necessary (e.g. power cycle the ESP8266 if it doesn't get anything over serial for a while).
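The watchdog logic itself is simple. Here is a sketch in Python of the timeout bookkeeping only; on the Arduino it would be written in C++, and `SilenceWatchdog` and its parameters are hypothetical names. The `on_timeout` callback stands in for whatever performs the reset, for example switching the ESP8266's power from the Arduino side, or calling `machine.reset()` if the same logic ran on the MicroPython board.

```python
import time


class SilenceWatchdog:
    """Calls on_timeout() if feed() has not been called for timeout_s seconds."""

    def __init__(self, timeout_s, on_timeout, clock=time.time):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_fed = clock()

    def feed(self):
        # Call this whenever a serial frame arrives.
        self.last_fed = self.clock()

    def check(self):
        # Call this periodically from the main loop.
        if self.clock() - self.last_fed >= self.timeout_s:
            self.on_timeout()
            self.last_fed = self.clock()  # re-arm so the reset is not repeated at once
            return True
        return False
```

The main loop would call `feed()` on every received frame and `check()` on every pass; the timeout needs to be longer than the longest legitimate quiet period, since frames are only re-published when segments change.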
My current Micropython code is at uartcharactertest.py and the Arduino code is at uartcharacter.ino (although the ESP8266 receive pin isn't even wired, so the Arduino code is strictly 'downstream' and can't have any effect on the micropython board).
The controller (MQTT publisher) and MQTT broker are running on a Pi 3. They remain functional throughout and have run for many days at a time without visible fault (as demonstrated by other desktop-based subscribers to the display information). The wifi network is served by a GL iNet AR150 running OpenWRT which has also never failed so far. Repowering the micropython board is enough to bring back full function in all cases I have encountered, meaning the error exists somewhere in the ESP8266 or Micropython networking stack, or in my error handling (based on range.py).
The raw log of a failure is visible at https://raw.githubusercontent.com/cefn/retrotextual/master/gist/cockle-longrun-networking-error.log - search for LmacRxBlk to find the start of the failure.