fix(postgresql-proxy): prevent SSL COPY stalls by draining nonblocking reads#11
Conversation
e7d5df8 to
1c51600
Compare
pinzon
left a comment
There was a problem hiding this comment.
So basically, the SSL handshake is now done incrementally in the event loop instead of blocking on accept, and once it's complete we keep draining the SSL buffer until it's empty so data doesn't get stranded. Right?
Thank you for also adding a test workflow, now we can test changes here without relying solely on LS. I'll let the final say to @bentsku or @cloutierMat
cloutierMat
left a comment
There was a problem hiding this comment.
This PR does a lot of things that it is hard to review properly. I think the issue is much simpler that this and there is a lot of rewrite of ssl/socket functionality that is most likely handle properly by their respective libraries.
Maybe it is also tackling many other issues I don't know of, but a bit of research revealed this stack overflow post.
My understanding here is that self.listen calls selector.select but as there might still be decrypted data left in the socket, it would be lost and unseen by select. There might be a smarter way to solve but using the sample in the post I could fix the ssl hanging issue by changing the service_connection to handle pending bytes in sock.pending
if recv_data:
LOG.debug('%s received data:\n%s', conn.name, recv_data)
conn.received(recv_data)
# excerpt from https://docs.python.org/3/library/ssl.html#ssl-nonblocking
# Conversely, since the SSL layer has its own framing, a SSL socket may still have data available
# for reading without select() being aware of it. Therefore, you should first call SSLSocket.recv()
# to drain any potentially available data, and then only block on a select() call if still necessary.
while isinstance(sock, ssl.SSLSocket) and sock.pending() > 0:
extra = sock.recv(min(4096, sock.pending()))
if extra:
LOG.debug('%s received pending SSL data:\n%s', conn.name, extra)
conn.received(extra)There might still be something cleaner we can do by calling sock.recv in listen but I think this PR is kinda overboard and makes a lot of changes that would be hard to measure the impact.
I love the addition of a pipeline and cleaner testing, but I would also argue that it would be easier to review and comment, if we would seperate it in multiple PRs
…g reads Co-authored-by: GitHub Copilot <copilot@github.com>
Co-authored-by: GitHub Copilot <copilot@github.com>
1c51600 to
46aa6f6
Compare
|
Thanks for the comment @cloutierMat and for the critical thinking. While reverting, I also caught a related pre-existing bug: on macOS, accepted sockets inherit Total change to |
cloutierMat
left a comment
There was a problem hiding this comment.
Given that we would like this change in the monthly release next week and there are a few points to review on both the new testing and the new ci, I think we would benefit from splitting this pr.
An easier to merge approach would be to only change the proxy.py along with the README changelog and setup.py version update in a first PR. We can then push to get the new version release while we review the testing and CI in stacked PRs.
We can still run the tests from the stacked PR or by locally installing the proxy in downstream project for the meantime and validates it fixes the issue at hand.
| # for reading without select() being aware of it. Therefore, you should first call SSLSocket.recv() | ||
| # to drain any potentially available data, and then only block on a select() call if still necessary. | ||
| while isinstance(sock, ssl.SSLSocket) and sock.pending() > 0: | ||
| extra = sock.recv(min(4096, sock.pending())) |
There was a problem hiding this comment.
nit: I know this is from the code I shared last week, but now that I read it, I don't think the min comparison is really necessary since the recv method already works that it grabs what is present until value provided in arg. We can probably safely achieve the same result with extra = sock.recv(4096).
Motivation
Fixes
UNC-470: Postgres proxy could hang forsslmode=requiresessions running largepsql -f/ COPY-heavy workloads.Root cause
selector.select()operates at the TCP layer and is unaware of data already decrypted and buffered inside the SSL layer. After an initialrecv(), buffered SSL bytes would be silently dropped until the nextselect()wake-up, stalling large transfers.Changes (
postgresql_proxy/proxy.py)recv()inservice_connection(), drain any remaining SSL-buffered bytes usingsock.pending()(per Python SSL non-blocking docs).setblocking(True)on accepted sockets before SSL negotiation. On macOS/BSD, accepted sockets inheritO_NONBLOCKfrom the listening socket (unlike Linux), causing intermittentBlockingIOErroron theMSG_PEEKrecv during SSL startup.Expected impact
Prevents SSL COPY-session stalls for large dump-like workloads without changing the proxy's overall non-blocking architecture.
Test coverage
test_psql_ssl_file_batch_stress_no_hangfor the UNC-470 scenario.