Skip to content

Conversation

@bayoumi17m
Copy link

Fix HTTP/2 GOAWAY Race Conditions

Problem

When using HTTPX with HTTP/2 under high concurrency, users encounter errors where connections are closed for streams beyond the last_stream_id. This forces applications to implement retry logic for what should be transparent library behavior.

The root cause is a set of race conditions between h2's state machine and httpcore's event processing:

  1. State machine race: h2 transitions to CLOSED before httpcore processes the GOAWAY event
  2. Disconnect without context: Server disconnects immediately after GOAWAY, raising generic RemoteProtocolError("Server disconnected") with no retry opportunity
  3. Insufficient exception context: ConnectionNotAvailable didn't convey why the connection was unavailable, preventing informed retry decisions

A user proposed changing > to >= in the stream ID comparison (discussion #1041), but this would violate RFC 7540 Section 6.8:

"The last stream identifier in the GOAWAY frame contains the highest-numbered stream identifier for which the sender of the GOAWAY frame might have taken some action on..."

A stream matching last_stream_id may have already been processed, making automatic retry unsafe for non-idempotent operations.

Solution

This PR introduces proper GOAWAY handling that maintains RFC compliance while enabling safe automatic retries:

1. New ConnectionGoingAway Exception

A dedicated exception (subclass of ConnectionNotAvailable) that carries GOAWAY context:

  • last_stream_id / request_stream_id for RFC 7540 stream comparison
  • error_code to distinguish graceful shutdown (NO_ERROR) from errors
  • headers_sent / body_sent to track request phase at time of GOAWAY
  • Helper properties: is_safe_to_retry, is_graceful_shutdown, may_have_side_effects

This exception is exported publicly, allowing applications to implement idempotency-aware retry logic when needed.

2. New DRAINING Connection State

Connections receiving a graceful GOAWAY (NO_ERROR) transition to DRAINING rather than CLOSED:

  • Allows in-flight streams to complete normally
  • Rejects new requests (connection reports is_available() = False)
  • Connection expires once all streams complete

This fits naturally into the existing HTTPConnectionState enum pattern and makes state transitions easier to reason about.

3. Request Phase Tracking

Each stream tracks headers_sent and body_sent to determine potential side effects:

  • If GOAWAY arrives before headers are sent → definitely no side effects
  • If headers/body were sent → server may have processed the request

This information flows through ConnectionGoingAway.may_have_side_effects for retry decisions.

4. Connection Pool Retry Logic

The connection pool now handles ConnectionGoingAway with RFC-compliant retry semantics:

Condition Behavior
stream_id > last_stream_id Auto-retry (guaranteed unprocessed per RFC 7540)
Graceful shutdown + no side effects Auto-retry (likely safe)
Otherwise Raise RemoteProtocolError with context for application decision

Behavior Summary

Scenario Before After
Stream ID > last_stream_id ConnectionNotAvailable (retry worked but was implicit) Auto-retry with explicit RFC justification
Graceful shutdown, headers not sent RemoteProtocolError or ConnectionNotAvailable Auto-retry
Potentially processed request Generic error RemoteProtocolError with cause chain to ConnectionGoingAway
h2 CLOSED race condition Generic protocol error Proper ConnectionGoingAway with conservative assumptions
Server disconnect after GOAWAY RemoteProtocolError("Server disconnected") ConnectionGoingAway with retry context

Design Decisions

Why a new connection state vs. a boolean flag?
The DRAINING state fits the existing HTTPConnectionState pattern and makes state transitions explicit. A boolean would work but adds cognitive overhead when reasoning about valid state combinations.

Why track headers and body separately?
Both are relevant for side-effect determination. Applications can use this granular information - for example, a read-only GET that sent headers but no body might be safer to retry than a POST with body sent.

Why raise RemoteProtocolError instead of ConnectionGoingAway for potentially-processed requests?
Balances API simplicity with power. Most users catch RemoteProtocolError; advanced users can inspect __cause__ to access the full ConnectionGoingAway context. Since ConnectionGoingAway extends ConnectionNotAvailable, it's also catchable directly for those who need it.

Why conservative assumptions when h2 is CLOSED without GOAWAY event?
When the connection is closed but we haven't processed a GOAWAY event, we can't know the true last_stream_id. Assuming the current stream may have been processed is the safe default - better to surface an error than silently retry a potentially-processed request.

Related Issues

Test Plan

  • Unit tests for ConnectionGoingAway exception properties
  • Unit tests for DRAINING state transitions
  • Integration tests simulating GOAWAY during request
  • Integration tests for race condition scenarios (h2 CLOSED before event processing)
  • Verify existing tests still pass

Checklist

  • I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.

@bayoumi17m bayoumi17m closed this Jan 29, 2026
@bayoumi17m bayoumi17m reopened this Jan 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant