Summary
While investigating psycopg/libpq nonblocking TCP connects under Quark, I observed a second possible uring TCP connect-state issue separate from the deterministic select(2) exception-fdset bug fixed in #1373.
This issue tracks the less-proven behavior so it can be isolated before proposing a fix.
Observed Behavior
In one minimal Quark run against a real host PostgreSQL server, the client sequence was:
- create TCP socket
- set nonblocking
- connect() / connect_ex() returns EINPROGRESS
- immediately call getsockopt(SOL_SOCKET, SO_ERROR)
Under Quark, that getsockopt() call returned a stale EINPROGRESS, whereas the runc control returned 0.
That matters because libpq/psycopg checks SO_ERROR during its nonblocking connect flow.
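The sequence above can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the original run: the function name `probe_immediate_so_error` is hypothetical, and any listening TCP port can stand in for the PostgreSQL server.

```python
import errno
import socket


def probe_immediate_so_error(host, port):
    """Start a nonblocking connect and read SO_ERROR without waiting.

    Returns (connect_rc, so_error). Hypothetical helper for illustration.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        # connect_ex() returns the errno instead of raising; for a
        # nonblocking socket this is usually EINPROGRESS, though 0 is legal.
        rc = sock.connect_ex((host, port))
        # Read the pending socket error immediately, with no select()/poll()
        # in between. This is the read that reportedly differed under Quark.
        so_error = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        return rc, so_error
    finally:
        sock.close()


def describe(rc, so_error):
    """Render both values with their errno names for easy diffing."""
    def name(code):
        return "0" if code == 0 else errno.errorcode.get(code, str(code))
    return f"connect_rc={name(rc)} so_error={name(so_error)}"
```

Running `describe(*probe_immediate_so_error(host, port))` once under runc and once under Quark gives two one-line results that can be compared directly. On the runc control described above, the second field would be `so_error=0`.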
Current Reproduction Status
The exact stale immediate SO_ERROR=EINPROGRESS symptom has not been reproduced deterministically yet.
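What is reproducible:
- The select(2) exception-fdset bug is deterministic and tracked separately in #1373 (fix select exception fdset clearing).
- Without the select fix, with postgres:18-alpine, a minimal loop showed: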
summary bad_immediate=0/100 bad_select_except=100/100
psycopg_iter=7 error=ConnectionTimeout: connection timeout expired
psycopg_failures=1/20
After the select fix was restored:
summary bad_immediate=0/100 bad_select_except=0/100
psycopg_failures=0/20
So the remaining SO_ERROR concern may be timing/path dependent, or it may have been downstream of the select readiness issue in some runs.
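A loop producing a summary line of this shape could look like the sketch below. The counter definitions are assumptions, since the issue does not spell them out: here `bad_immediate` counts iterations where the immediate SO_ERROR read is nonzero, and `bad_select_except` counts iterations where select() reports the socket in its exceptional set. The function names are hypothetical, and the psycopg-level counters are not covered.

```python
import select
import socket


def run_iteration(host, port, timeout=5.0):
    """One nonblocking connect, probing SO_ERROR and then select()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        sock.connect_ex((host, port))
        immediate = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        # Pass the socket in both the write set and the exception set, as a
        # connect-completion wait typically does. A plain TCP connect with
        # no out-of-band data should not come back in the exception set.
        _, writable, exceptional = select.select([], [sock], [sock], timeout)
        return immediate, bool(writable), bool(exceptional)
    finally:
        sock.close()


def summarize(host, port, iterations=100):
    """Count suspicious iterations and format them as a summary line."""
    bad_immediate = 0
    bad_select_except = 0
    for _ in range(iterations):
        immediate, _writable, exceptional = run_iteration(host, port)
        if immediate != 0:          # assumed meaning of bad_immediate
            bad_immediate += 1
        if exceptional:             # assumed meaning of bad_select_except
            bad_select_except += 1
    return (
        f"summary bad_immediate={bad_immediate}/{iterations} "
        f"bad_select_except={bad_select_except}/{iterations}"
    )
```

Keeping the two counters separate is what lets a run distinguish the select readiness bug from the SO_ERROR concern: the first can be at 100/100 while the second stays at 0/100, as in the output above.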
Suggested Next Steps
- Build a focused repro that exercises Quark's uring TCPConnecting state directly.
- Stress the timing between nonblocking connect(), async connect completion, immediate SO_ERROR, and readiness registration.
- Compare against runc and Linux behavior for immediate SO_ERROR after loopback connect returns EINPROGRESS.
- Only propose a getsockopt(SO_ERROR) / PostConnect() change after this is reproduced independently from the select fdset bug.
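One way to approach the timing-stress step is to sweep a small delay between connect() and the SO_ERROR read, then compare the counts across runtimes. The sketch below is only a starting point under stated assumptions: `sweep_so_error` and its parameters are hypothetical, the delay values are arbitrary, and it does not touch select() at all, so any nonzero count would be independent of the select fdset bug.

```python
import socket
import time


def sweep_so_error(host, port, delays_s=(0.0, 0.0001, 0.001, 0.01), rounds=50):
    """For each delay, count connects whose SO_ERROR read was nonzero.

    Returns a dict mapping delay (seconds) to the nonzero count.
    """
    results = {}
    for delay in delays_s:
        nonzero = 0
        for _ in range(rounds):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                sock.setblocking(False)
                sock.connect_ex((host, port))
                if delay > 0:
                    # Give the async connect completion time to land
                    # before SO_ERROR is read.
                    time.sleep(delay)
                if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) != 0:
                    nonzero += 1
            finally:
                sock.close()
        results[delay] = nonzero
    return results
```

If the stale value only appears at the zero delay under Quark and never under runc or plain Linux, that would point at the window between connect submission and completion rather than at readiness reporting.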