fix: reconnect(): respect QoS and fail-safe #254
Open
BMDan wants to merge 3 commits into adafruit:main from BMDan:fix/reconnect_qos_and_drops
I wonder if the broad exception could be reduced to the MQTT exception?
To be clear, no matter what the exception is, we re-raise it (see the bare `raise` a few lines below). This is the moral equivalent of a `finally` or `defer` clause; we are neither masking nor handling the exception, merely pausing its propagation long enough to make sure our object is left in a sane state. In fact, we could do it with `finally`, if you'd prefer.

As I see it, there are three options here, in addition to what I've implemented:

- [...] `_original_subscriptions` (or `_remaining_original_subscriptions`) in the object. This costs some space and complexity, but isn't otherwise too terrible. It does introduce an edge case where we would potentially subscribe to a topic twice, but that's probably not too awful.
- [...] `IndexError` arising partway through a re-subscription) could cause us to violate our API contract and not fully re-subscribe upon reconnect.
- [...] `_subscribed_topics`). Plus, it means our guarantee of re-subscription upon reconnect cannot be relied upon.
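The equivalence between "`except` plus a bare `raise`" and `finally` described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `state` dict and the `subscribe` callable are hypothetical stand-ins for the client object and its subscribe method.

```python
def resubscribe_except(state, subscribe):
    """Restore un-resubscribed topics in `except`, then re-raise unchanged."""
    # Take the old topic list and start a fresh one (hypothetical state layout).
    pending, state["topics"] = state["topics"], []
    try:
        while pending:
            subscribe(pending[0])                    # may raise anything
            state["topics"].append(pending.pop(0))   # record only after success
    except Exception:
        state["topics"].extend(pending)  # put back what was not re-subscribed
        raise                            # bare raise: the exception is not handled


def resubscribe_finally(state, subscribe):
    """Same behavior expressed with `finally`."""
    pending, state["topics"] = state["topics"], []
    try:
        while pending:
            subscribe(pending[0])
            state["topics"].append(pending.pop(0))
    finally:
        # On success `pending` is empty, so this is a no-op.
        state["topics"].extend(pending)
```

One difference worth noting: `finally` also runs for exceptions that do not derive from `Exception` (for example `KeyboardInterrupt`), whereas `except Exception` does not catch those.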
Thanks for the detailed information about the thought process, really appreciated.

I think the key question here is whether anything besides `MMQTTException` being raised from the depths of the library code is expected to be recoverable (in general and also w.r.t. the internal MQTT object state). My take on this is that if there is, it should be wrapped in `MMQTTException`, i.e. I do not see the need for the broad exception catch.
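The wrapping approach suggested above might look roughly like this. It is a sketch under assumptions: the `MMQTTException` class below is a local stand-in so the snippet runs without the library, and `send_packet` and `sock_send` are hypothetical names, not MiniMQTT functions.

```python
class MMQTTException(Exception):
    """Local stand-in for the library's exception type (assumption)."""


def send_packet(sock_send, payload):
    """Call a low-level send function, wrapping recoverable I/O errors."""
    try:
        sock_send(payload)
    except OSError as err:
        # Callers now only need to catch MMQTTException; the original
        # error stays reachable through __cause__.
        raise MMQTTException("send failed") from err
```

With this pattern, a narrow `except MMQTTException` covers every error the library considers recoverable, and anything else propagates untouched.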
Ah, I see. Thank you for laying that out so clearly!

To work out the best path, I think it's helpful to have a real scenario. One of the lines within the try-catch is:

[...]

Let us imagine that a custom global logging handler has a PotM bug, and that an exception will be raised from `self.logger.debug` if it is called at 4:56 A.M. on any Tuesday in March of 2026.

I put myself in the position of an engineer (who doesn't control the MiniMQTT library) who became aware of this bug when it triggered last week, but doesn't quite know how to reproduce it yet. Helpfully, the custom logging handler raises a corresponding `CustomLoggingExceptions`, and my kernel's a nice, clean loop, so I can do something like:

[...]

And inside of `run_main_loop_once`, we already had something like:

[...]

Perfect! Now I've got resilient code that won't crash in the face of a CLE, but will give me lots of debugging info.

The problem is, if we happen to fail our `ping` right around 4:55 A.M., and the time ticking over to 4:56 A.M. happens to occur partway through the reconnect loop, and the next `debug` that gets called happens to be the one inside of the MiniMQTT library's `reconnect()`, then when I resume after that CLE, I will only be subscribed to a fraction of my topics. As you can see, that's quite a difficult scenario to debug.

However, I think the bigger issue is that, if that bug is found a different way that doesn't result in a partial re-subscription, then even a very skilled programmer who is tasked with working around that bug (and, let us say, is somehow prevented from directly addressing the bug itself) would almost certainly write a solution that relies on `reconnect()`'s apparent semantics, and thus would introduce a new, far more subtle bug that's incredibly difficult to reproduce. Indeed, even given an omniscient programmer who foresaw how `reconnect()` would be affected, their only options to handle it cleanly are:

- [...] `_subscribed_topics` and compare that to a locally-kept complete list, or
- [...]

Both of these require a lot more (branching) code, and both carry costs in at least two of the three categories of CPU, memory, and network traffic. Further, they require a level of defensive coding that seems unreasonable to expect from a consumer of this library.

Put simply, our API contract isn't supposed to require this sort of legwork from our upstream consumer. They were told that the `resub_topics` parameter worked in a particular way. I think it'd therefore be a bug if `reconnect(resub_topics=True)` didn't result in a full resubscription if called twice, even if the first call threw an exception of some kind, so long as the second of those calls succeeded. The whole idea of "reconnect and resubscribe" is to restore a known-good state. Let's do that.
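The scenario above can be reproduced in miniature. The following is a toy model under stated assumptions, not MiniMQTT's implementation: `ToyClient`, `FlakyLogger`, `CustomLoggingException`, and the `fail_safe` switch are all hypothetical, and the "phase of the moon" bug is simulated by a logger that raises on one specific call.

```python
class CustomLoggingException(Exception):
    """Hypothetical error raised by the buggy logging handler."""


class FlakyLogger:
    """Logger whose debug() raises exactly once, on the Nth call."""

    def __init__(self, fail_on_call):
        self.calls = 0
        self.fail_on_call = fail_on_call

    def debug(self, msg):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise CustomLoggingException(msg)


class ToyClient:
    """Toy client; `fail_safe` toggles state restoration on error."""

    def __init__(self, logger, fail_safe):
        self.logger = logger
        self.fail_safe = fail_safe
        self._subscribed_topics = []

    def subscribe(self, topic):
        self.logger.debug(f"subscribing to {topic}")
        self._subscribed_topics.append(topic)

    def reconnect(self, resub_topics=True):
        if not resub_topics:
            return
        pending, self._subscribed_topics = self._subscribed_topics, []
        try:
            while pending:
                self.subscribe(pending[0])
                pending.pop(0)
        except Exception:
            if self.fail_safe:
                # Keep the not-yet-resubscribed topics for the next attempt.
                self._subscribed_topics.extend(pending)
            raise


def run(fail_safe):
    client = ToyClient(FlakyLogger(fail_on_call=5), fail_safe)
    for topic in ("a", "b", "c"):
        client.subscribe(topic)          # debug() calls 1 to 3
    try:
        client.reconnect()               # call 5 raises partway through
    except CustomLoggingException:
        pass                             # the main loop logs it and carries on
    client.reconnect()                   # second attempt succeeds
    return sorted(client._subscribed_topics)
```

In this model, `run(fail_safe=False)` ends with only `["a"]` subscribed after the second, successful reconnect, while `run(fail_safe=True)` ends with all three topics, which is the contract argued for above.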