Jaehong Jung

Istio Part 3-1: 503 and Half-open Connections

The original version of this post is available on the Channel.io Tech Blog.

Hi, this is Jetty (Jaehong Jung) and Dylan from the Channel.io DevOps team.

This is Part 3-1 of our Istio Ambient mode adoption series. In Part 3, we planned to cover several issues we encountered while applying Ambient mode in production, but the most difficult 503 error deserved its own deep dive. In this post, we trace how that 503 error led us to a half-open, or stale, connection problem.

This post assumes the Envoy config structure covered in Part 2. In particular, it helps to be familiar with the connect_originate cluster, internal listeners, ORIGINAL_DST clusters, and HBONE tunnels used when a waypoint sends traffic to an in-mesh destination.

The Problem

The issue was an intermittent 503 response during workload rollouts, such as restarts or deployments. At first, we suspected common causes of Istio 503s: idle timeouts, keep-alive timeouts, or config propagation delays. But changing timeout values did not remove the symptom, and the timing in the logs did not fit a config propagation delay.

The reproduction frequency also varied by environment. It reproduced relatively often in dev, where application deployments and restarts happen more frequently, but only rarely in production.

The request path and the point where the 503 was generated can be simplified as follows.

flowchart LR
    C["Client"]
    GW["public gateway<br />(Envoy)"]
    WP["waypoint<br />(Envoy)"]
    ZT["ztunnel<br />:15008"]
    POD["newly created Pod"]

    C --> GW --> WP -->|"HBONE"| ZT --> POD

    GW -. "503 / via_upstream" .-> C
    WP -. "503 / UC<br />connection_termination" .-> GW

    style WP fill:#ffcdd2
    style ZT fill:#fff3cd

Following the logs by component revealed where the 503 was actually generated. The public gateway logged response_code: 503 and response_code_details: via_upstream, with upstream_host pointing to envoy://connect_originate/...:15008. This meant the gateway did not create the error itself. It forwarded a 503 it received from the next hop, the waypoint.

The waypoint logs showed response_code: 503, response_code_details: upstream_reset_before_response_started{connection_termination}, and response_flags: UC. UC stands for UpstreamConnectionTermination, meaning the upstream connection was terminated before the response started.
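
The distinction between a forwarded 503 and a locally generated one can be sketched as a small triage helper. This is a minimal sketch, assuming JSON-formatted access logs that expose the three fields named above; the sample entries (including the gateway's `-` flag value) and the function name `classify_503` are illustrative, not taken from our logs.

```python
import json


def classify_503(entry: dict) -> str:
    """Guess whether this hop forwarded a 503 or generated it itself."""
    if entry.get("response_code") != 503:
        return "not-503"
    details = entry.get("response_code_details", "")
    flags = entry.get("response_flags", "").split(",")
    if details == "via_upstream":
        # The proxy relayed a 503 that the next hop returned.
        return "forwarded"
    if details.startswith("upstream_reset_before_response_started") or "UC" in flags:
        # The proxy produced the 503 after its upstream connection was terminated.
        return "generated-here"
    return "unknown"


# Illustrative entries shaped like the fields discussed above (assumed JSON format).
gateway_line = (
    '{"response_code": 503, "response_code_details": "via_upstream", '
    '"response_flags": "-"}'
)
waypoint_line = (
    '{"response_code": 503, "response_code_details": '
    '"upstream_reset_before_response_started{connection_termination}", '
    '"response_flags": "UC"}'
)

for hop, line in [("gateway", gateway_line), ("waypoint", waypoint_line)]:
    print(hop, classify_503(json.loads(line)))
```

Running a helper like this per component makes it easier to see that the gateway entry is only a relay and that the waypoint is the first hop that actually observed the reset.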

But the next hop after the waypoint, ztunnel, had no abnormal logs. istio-cni did not show anything unusual either. The waypoint said the upstream had terminated the connection, but the upstream ztunnel showed no error. So the first question was simple: where exactly was the connection being terminated?

Reproducing the Issue

Debugging this directly in production was difficult. In a real service environment, Istio components generate a large volume of traffic, making it hard to isolate the relevant traces. We built a separate reproduction environment with a dummy application and dedicated gateway, waypoint, and ztunnel components.

As we narrowed the scope, we found that the same 503 could be reproduced even in a Pod -> waypoint -> Pod path with the gateway removed. From that point on, the analysis focused on the waypoint, ztunnel, and destination Pod path.

Access logs alone were not enough, so we collected two additional sources of data.

  • Debug-level logs from the waypoint Envoy
  • TCP packet captures from inside the destination Pod

For packet capture, we injected a tcpdump sidecar with NET_RAW and NET_ADMIN permissions. Since we suspected the problem was related to Pod termination, we captured the full lifecycle from Pod creation through the packets remaining after termination. The captured pcap files were uploaded to S3 for analysis.

Analysis

Finding the Difference in pcap

We became confident this was a real connection-level issue when we captured traffic directly from the destination Pod and opened it in Wireshark.

When capturing all network interfaces in the destination Pod, two kinds of packets are visible at the same time. One is the encrypted HBONE/mTLS traffic from the waypoint to ztunnel. The other is plaintext traffic after it passes through the ztunnel socket and is delivered to the application. This let us compare the encrypted traffic inside the tunnel and the plaintext traffic after decapsulation in one place.

flowchart LR
    WP["waypoint<br />(Envoy)"]
    subgraph POD["destination pod (tcpdump: all interfaces)"]
        direction LR
        ZT["ztunnel socket<br />:15008"]
        APP["application"]
    end

    WP -->|"1. encrypted<br />(HBONE / mTLS)"| ZT
    ZT -->|"2. decrypted<br />(plaintext)"| APP

    style ZT fill:#fff3cd

The packet capture from a reproduced 503 was decisive. Immediately after a new destination Pod was created, an application data stream arrived without a TLS handshake. In the normal case, TCP and TLS handshakes should happen first in the encrypted path, followed by data frames. In the abnormal case, that entire process was skipped and data frames arrived first.

In the normal case, a new HBONE tunnel is established before data frames are exchanged.

sequenceDiagram
    participant W as Waypoint
    participant Z as Ztunnel
    participant P as New Pod

    W->>Z: TCP Handshake (SYN/SYN-ACK/ACK)
    W->>Z: TLS Handshake
    Note over W,Z: HBONE tunnel established
    W->>Z: HTTP/2 Data Frame (GET /ping)
    Z->>P: Decapsulated Request
    P->>Z: 200 OK
    Z->>W: HTTP/2 Data Frame (Response)

In the abnormal case, data frames arrive without any handshake. The new Pod’s network namespace has no state for that TCP connection, so the kernel TCP stack responds with RST.

sequenceDiagram
    participant W as Waypoint
    participant Z as Ztunnel
    participant P as New Pod

    Note over W,Z: No TCP/TLS handshake (existing connection reused)
    W->>Z: HTTP/2 Data Frame
    Note over Z: New Pod netns has no state for this TCP connection
    Z-->>W: TCP RST
    W-->>W: upstream_reset -> 503 (UC)
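
The difference between the two sequences can be checked mechanically: in a capture that starts before the Pod exists, any flow whose first payload-carrying packet was never preceded by a SYN is suspicious. The following is a minimal sketch of that check over already-decoded packet summaries. The `Packet` tuple, the flow labels, and the sample capture are hypothetical; a real pcap would first need to be decoded with a packet library.

```python
from collections import namedtuple

# flow: any hashable identifier for the TCP connection (for example the 4-tuple)
# flags: set of TCP flag names; payload_len: TCP payload size in bytes
Packet = namedtuple("Packet", "flow flags payload_len")


def flows_without_handshake(packets):
    """Return flows that carried payload without a SYN earlier in the capture."""
    seen_syn = set()
    suspicious = []
    for p in packets:
        if "SYN" in p.flags:
            seen_syn.add(p.flow)
        elif p.payload_len > 0 and p.flow not in seen_syn and p.flow not in suspicious:
            suspicious.append(p.flow)
    return suspicious


# Hypothetical capture: one normal flow and one flow that starts with data.
capture = [
    Packet("normal", {"SYN"}, 0),
    Packet("normal", {"SYN", "ACK"}, 0),
    Packet("normal", {"ACK"}, 0),
    Packet("normal", {"PSH", "ACK"}, 517),  # TLS ClientHello
    Packet("stale", {"PSH", "ACK"}, 120),   # data with no handshake in the capture
    Packet("stale", {"RST"}, 0),            # kernel has no state for this flow
]

print(flows_without_handshake(capture))
```

This only works when the capture covers the Pod from creation onward, as ours did. A capture started mid-life would flag every connection that was already open.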

This narrowed the suspicion. If application data was continuing into a newly created Pod without a handshake, the sender, the waypoint, might be treating that Pod as a destination it had already connected to. In other words, Envoy was likely reusing an existing upstream connection.

The question changed: why did Envoy send application data to a new Pod over a connection on which that Pod had never completed a handshake?

Pod State Analysis

The Pod that returned the abnormal response was not unhealthy. Its probes and running state were normal. But there was an important clue: the Pod IP that received the abnormal response had been reused shortly before. A newly created Pod had received the same IP that had just been used by a deleted Pod.

Root Cause

Not IP Collision, But a Stale Connection

It is tempting to conclude that the problem was “AWS VPC CNI reused a Pod IP, causing an IP collision.” But IP reuse was not the root cause. It was a condition that made the issue visible.

The real cause was that waypoint Envoy kept an HTTP/2 connection keyed by IP:Port even after the destination Pod had terminated. When the same IP was assigned to a new Pod, Envoy could see the stale connection as an existing live connection to the same destination and reuse it. The result was the 503.

Two component behaviors interacted here.

  • waypoint Envoy manages its upstream connection pool by IP:Port.
  • ztunnel did not close the HBONE connection held by the waypoint with GOAWAY or FIN when the Pod terminated.

It is important to distinguish the traffic path. As mentioned by an Istio maintainer in istio/ztunnel#1637, the ztunnel -> ztunnel path is comparatively safe because it considers destination IP and Service Account together, and discards the HBONE connection when it receives RST. The Envoy (waypoint) -> ztunnel path is more sensitive to Envoy’s connection reuse behavior. Our issue occurred exactly on that path.

In this post, a half-open or stale connection means a connection that the new Pod and ztunnel do not know about, but the waypoint still believes is alive.

Waypoint Does Not Manage Downstream and Upstream as One Connection

This is where we need to recall Part 2. Waypoint Envoy does not manage the downstream client connection and the upstream HBONE connection to the destination Pod as one direct connection. Internally, they are separated.

One side is the downstream listener that receives client requests. The other side is the internal listener connect_originate used to create HBONE tunnels, followed by the connect_originate ORIGINAL_DST cluster. The upstream HBONE connection is stored and reused in that ORIGINAL_DST cluster’s connection pool using IP:Port as the key.

flowchart LR
    C["Client<br />(downstream)"]

    subgraph WP["Inside Waypoint (Envoy)"]
        direction LR
        DL["downstream<br />listener"]
        IL["internal listener<br />connect_originate"]
        POOL["connect_originate cluster<br />(ORIGINAL_DST)<br />connection pool<br />key = IP:Port"]
        DL -. "internal listener boundary<br />(user-space)" .-> IL
        IL --> POOL
    end

    C -->|"downstream conn"| DL
    POOL -->|"upstream conn<br />HBONE / mTLS"| ZT["ztunnel :15008"] --> P["Pod"]

    style IL fill:#e3d7ff
    style POOL fill:#fff3cd

Because of this structure, the downstream request path does not immediately reveal whether the upstream HBONE connection is stale. If Envoy’s pool still marks the connection as fully connected, a request to the same IP:Port can reuse it.
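
The behavior described above can be illustrated with a toy model. This is not Envoy code. It is a minimal sketch, assuming a pool that keys connections only by IP:Port and a peer that resets any connection it did not establish. All class names, the IP address, and the UIDs are made up for illustration.

```python
import itertools


class Pod:
    def __init__(self, uid, ip):
        self.uid, self.ip = uid, ip


class Connection:
    _ids = itertools.count(1)

    def __init__(self, pod):
        self.id = next(Connection._ids)
        self.peer_uid = pod.uid  # the Pod instance that completed the handshake


class IpPortPool:
    """Toy upstream pool keyed only by (ip, port)."""

    def __init__(self):
        self.conns = {}

    def send(self, pod, port=15008):
        key = (pod.ip, port)
        conn = self.conns.get(key)
        reused = conn is not None
        if conn is None:
            conn = self.conns[key] = Connection(pod)
        if conn.peer_uid != pod.uid:
            # The peer that held the TCP state is gone, so the new Pod answers with RST.
            del self.conns[key]
            return conn.id, reused, "503 UC"
        return conn.id, reused, "200"


pool = IpPortPool()
pod_aaa = Pod("uid-aaa", "10.0.2.7")
pod_bbb = Pod("uid-bbb", "10.0.2.7")  # same IP, different Pod instance

first = pool.send(pod_aaa)   # new connection, succeeds
second = pool.send(pod_bbb)  # same key, so the old connection is reused and reset
third = pool.send(pod_bbb)   # the pool dropped the dead connection, so a new one is created
print(first, second, third)
```

The model shows why the failure is limited to the first request after IP reuse: once the reset removes the stale entry, the next request builds a fresh connection.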

Confirming Connection Reuse with Waypoint Envoy Debug Logs

To verify the hypothesis, we analyzed waypoint debug logs. The experiment had two phases.

  • Phase 1: Send a request to Pod-aaa, which uses a new IP, so that a new HBONE connection is created.
  • Phase 2: Delete Pod-aaa, then send a request to Pod-bbb, which reuses the same IP.

The result was clear. The connection ID created in Phase 1 appeared again in Phase 2. Even though the request was going to a new Pod, Envoy did not create a new connection. It reused the existing one.

New HBONE connection log

A request to Pod-aaa, which received a new IP, creates a new HBONE connection in the connect_originate cluster. We noted its ConnectionId so that we could check whether the same connection was reused later.

Existing HBONE connection reuse log

After Pod-aaa was deleted, Pod-bbb received the same IP. The waypoint did not create a new connection and instead reused the existing one.

sequenceDiagram
    participant W as Waypoint Envoy
    participant Pool as connect_originate connection pool
    participant A as Pod-aaa (IP X)
    participant B as Pod-bbb (IP X reused)

    rect rgb(212, 237, 218)
    Note over W,A: Phase 1 - Pod-aaa with a new IP
    W->>Pool: Request upstream connection
    Pool->>A: Establish new HBONE connection
    A->>W: Request proxied successfully
    end

    Note over A,B: Previous Pod deleted, then IP X is reused

    rect rgb(255, 205, 210)
    Note over W,B: Phase 2 - Pod-bbb with the same IP
    W->>Pool: Request upstream connection
    Pool-->>W: Reuse existing fully connected connection
    Pool--xB: Send data, then reset
    end

The important part of the debug log was the reuse of an existing connection. Envoy reused the Phase 1 connection in Phase 2 because the destination had the same IP:Port.
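
Comparing connection IDs across the two phases can be automated. The sketch below assumes the debug lines carry Envoy's `[C<id>]` connection tag. The sample lines are invented placeholders that stand in for real log output and do not reproduce our actual logs or Envoy's exact message text.

```python
import re

CONN_ID = re.compile(r"\[C(\d+)\]")


def connection_ids(lines):
    """Collect every connection ID tagged as [C<id>] in the given log lines."""
    return {int(m.group(1)) for line in lines for m in CONN_ID.finditer(line)}


def reused_ids(phase1_lines, phase2_lines):
    """Connection IDs that appear in both phases, meaning no new connection was opened."""
    return sorted(connection_ids(phase1_lines) & connection_ids(phase2_lines))


# Invented placeholder lines; only the [C<id>] tag matters for this check.
phase1 = [
    "[debug][connection] [C4711] connecting to 10.0.2.7:15008",
    "[debug][connection] [C4711] connected",
]
phase2 = [
    "[debug][pool] [C4711] using existing fully connected connection",
    "[debug][connection] [C4712] connecting to 10.0.2.7:15008",
]

print(reused_ids(phase1, phase2))
```

An ID that shows up in both phases is the signal to look for: the Phase 2 request to the new Pod was carried on a connection opened during Phase 1.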

How Does ztunnel Close the Connection?

So was ztunnel cleaning up the connection when the Pod terminated? We inspected pcap files covering the full lifecycle of the destination Pod, but did not observe HTTP/2 GOAWAY or FIN packets. ztunnel did not gracefully close the HBONE connection when the Pod terminated, so the waypoint had no way to know the connection was dead.

We reported this behavior upstream to Istio. Dylan summarized the reproduction process and waypoint logs in istio/ztunnel#1637. The same symptom was reported not only in AWS VPC CNI/EKS environments, but also in Envoy -> ztunnel paths using only an ingress gateway without waypoint. Broader discussion around connection lifecycle and draining is happening in istio/ztunnel#1191.

Socket State in the Waypoint

Finally, we checked socket state. If the hypothesis was correct, even after an in-mesh Pod was deleted, the waypoint should keep a socket to that Pod IP on :15008 in ESTABLISHED state for some time.

That is exactly what we observed. After deleting an arbitrary Pod, the waypoint still had a socket to the Pod IP in ESTABLISHED state for a while. Logs, pcap, and socket state all pointed to the same conclusion: stale connection reuse was the cause.
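
A check like this can be scripted against socket listings. The sketch below assumes output in the column layout of `ss -tn` (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port). The tool choice, the addresses, and the function name `established_to` are assumptions for illustration, not a record of the exact commands we ran.

```python
def established_to(ss_output: str, pod_ip: str, port: int = 15008):
    """Return local endpoints of ESTAB sockets whose peer is pod_ip:port."""
    peer = f"{pod_ip}:{port}"
    hits = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[0] == "ESTAB" and fields[4] == peer:
            hits.append(fields[3])
    return hits


# Illustrative listing taken inside the waypoint after the Pod at 10.0.2.7 was deleted.
sample = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.1.5:43210      10.0.2.7:15008
ESTAB      0      0      10.0.1.5:43990      10.0.3.9:15008
TIME-WAIT  0      0      10.0.1.5:40000      10.0.2.7:15008
"""

print(established_to(sample, "10.0.2.7"))
```

A non-empty result for the IP of a Pod that no longer exists is the socket-level view of the stale connection.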

Mitigation

The goal was clear. Even if a reset happens in a network component, it should not propagate as a 5xx application response. We first considered root-cause fixes, then looked at what we could apply immediately.

Root Fix: Improve the Connection Pool Key

The cleanest fix is to make the connection pool key more specific than plain IP:Port. If Envoy’s ORIGINAL_DST cluster connection pool key included instance-level metadata such as Pod UID, then even if the same IP was reused, a new Pod would be treated as a different destination. That would prevent stale connection reuse at the source.

ztunnel’s HBONE connection pool already considers more than just IP. It roughly combines source identity, destination identity, destination address, and source IP into a WorkloadKey, and uses that as the pool key. The destination identity has the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>.

However, Service Account alone is not enough. New Pods from the same Deployment usually use the same Service Account. To reliably distinguish Pod-aaa from Pod-bbb, the key needs a value that changes per Pod instance, such as Pod UID.
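
The three candidate keys can be compared side by side. This is a hypothetical sketch: neither Envoy nor ztunnel exposes key functions in this form, and the `Workload` type, trust domain, namespace, and UIDs are invented for illustration.

```python
from typing import NamedTuple


class Workload(NamedTuple):
    ip: str
    identity: str  # SPIFFE identity derived from the Service Account
    pod_uid: str


def key_ip_port(w, port=15008):
    return (w.ip, port)


def key_with_identity(w, port=15008):
    return (w.ip, port, w.identity)


def key_with_uid(w, port=15008):
    return (w.ip, port, w.pod_uid)


# Two Pods from the same Deployment: same IP (reused), same Service Account.
aaa = Workload("10.0.2.7", "spiffe://cluster.local/ns/demo/sa/api", "uid-aaa")
bbb = Workload("10.0.2.7", "spiffe://cluster.local/ns/demo/sa/api", "uid-bbb")

for name, key_fn in [
    ("ip:port", key_ip_port),
    ("ip:port + identity", key_with_identity),
    ("ip:port + pod uid", key_with_uid),
]:
    print(name, "->", "collides" if key_fn(aaa) == key_fn(bbb) else "distinct")
```

Only the key that includes a per-instance value separates the two Pods, which is why adding the Service Account to the key does not help for replicas of the same Deployment.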

Root Fix: Improve Connection State Management

Another direction is for ztunnel to clean up the connection held by the waypoint when the Pod terminates. For example, GOAWAY or FIN could prevent the waypoint from reusing that HBONE connection.

But this is harder than it sounds. Ambient mode must be transparent to applications, so changing application code to send termination signals to waypoint does not fit the model. That leaves ztunnel, but when a Pod terminates, CNI may tear down the veth and network namespace, and the HBONE termination socket that ztunnel created inside the Pod network namespace may disappear as well. Even if ztunnel detects the Pod termination afterward, the path to send GOAWAY over that connection may already be gone.

sequenceDiagram
    participant WP as Waypoint Envoy
    participant ZT as Ztunnel socket in pod netns
    participant POD as Pod application

    Note over WP,ZT: The waypoint-ztunnel HBONE connection terminates inside the pod network namespace

    rect rgb(255, 205, 210)
    Note over POD: Pod terminates (SIGTERM -> exit)
    Note over ZT,POD: netns/veth are cleaned up with the Pod lifecycle
    POD--xZT: ztunnel's HBONE termination socket also disappears
    Note over ZT: After-the-fact detection may be too late to send a cleanup signal
    ZT--xWP: GOAWAY delivery fails
    end

    Note over WP: No cleanup signal received -> stale connection remains
    Note over WP,POD: Same IP assigned to a new Pod -> stale connection reused

Another complication is that GOAWAY is not a magic signal that immediately closes every connection. HBONE carries inner TCP streams inside an outer connection created with HTTP/2 CONNECT. GOAWAY mostly means no more new streams should be created. Cleaning up already active inner connections remains a separate problem.

istio/ztunnel#1191 discusses several approaches: sending GOAWAY at ShutdownStarting, using a CNI DEL hook to clean up before network teardown, having the client detect Pod deletion and remove connections from the pool, or relying on keepalive to drive timeout cleanup. Each option has different timing and complexity trade-offs.

Both of these root-fix directions require upstream changes in Istio or Envoy, so we needed a short-term mitigation.

Immediate Mitigation: Retry on RST

The practical mitigation we could apply immediately was retrying on RST.

Our existing configuration retried only on reset-before-request. By expanding it to include reset, the waypoint can automatically retry when a stale connection causes a reset. In this case, it is more accurate to view the RST not as ztunnel application logic deciding a connection is invalid, but as a reset from the kernel TCP stack because the new Pod’s network namespace has no state for the old TCP connection.

sequenceDiagram
    participant C as Client
    participant W as Waypoint (Retry Enabled)
    participant Z as Ztunnel
    participant P as New Pod

    C->>W: Request
    Note over W: First attempt - stale connection used
    W->>Z: Data (old connection)
    Z-->>W: TCP RST
    Note over W: Reset detected -> automatic retry
    W->>Z: TCP/TLS Handshake (new)
    W->>Z: Data (new connection)
    Z->>P: Forward Request
    P->>Z: 200 OK
    Z->>W: Response
    W->>C: 200 OK
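
The effect of widening the retry condition can be sketched with a small simulation. This is a toy model, not Envoy's retry implementation. It assumes that `reset` covers every upstream reset while `reset-before-request` covers only resets that happen before the request was sent. The label `reset-after-request` is our own name for the stale-connection case, not an Envoy value.

```python
class UpstreamReset(Exception):
    """Raised by an attempt when the upstream connection is reset."""


def send_with_retry(attempt, retry_on, max_retries=2):
    """Call attempt(n) until it returns, retrying only on configured reset kinds."""
    for n in range(max_retries + 1):
        try:
            return attempt(n)
        except UpstreamReset as exc:
            kind = exc.args[0]
            retriable = "reset" in retry_on or kind in retry_on
            if not retriable or n == max_retries:
                return "503 UC"


def stale_then_fresh(n):
    # Attempt 0 uses the stale pooled connection: the request is written, then RST arrives.
    if n == 0:
        raise UpstreamReset("reset-after-request")
    # Later attempts open a new connection with a full handshake.
    return "200"


print(send_with_retry(stale_then_fresh, {"reset-before-request"}))
print(send_with_retry(stale_then_fresh, {"reset"}))
```

With only `reset-before-request` configured, the stale-connection reset surfaces to the client as a 503. With `reset` included, the second attempt goes out on a fresh connection and the client sees a 200.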

We also considered aggressive HTTP/2 keepalive and HBONE idle timeout tuning. Istio has a meshConfig.hboneIdleTimeout setting that controls how long Envoy proxy keeps HBONE connections to ztunnel in the pool. Shortening this value can clean up idle stale connections more quickly. HTTP/2 keepalive follows a similar idea: detect stale connections sooner.

However, both approaches only shorten how long a stale connection survives. They do not fundamentally prevent the timing overlap between IP reuse and connection reuse. Community reports also indicated that tuning ztunnel’s KEEPALIVE_* environment variables alone did not solve the problem, so we chose retry as the short-term mitigation.
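
The timing argument can be made concrete with a little arithmetic. The sketch below assumes the idle timeout is the only mechanism that removes the pooled connection, and all timestamps are made-up example values in seconds.

```python
def hits_stale_connection(last_use_s, ip_reused_s, idle_timeout_s):
    """True if a request at ip_reused_s still finds the old pooled connection."""
    evicted_at = last_use_s + idle_timeout_s
    return ip_reused_s < evicted_at


# Example timeline: the old Pod served its last request at t=0 and was deleted soon after.
# A new Pod received the same IP and its first request arrived at t=20.
last_use, ip_reused = 0, 20

for idle_timeout in (3600, 30, 10):
    outcome = hits_stale_connection(last_use, ip_reused, idle_timeout)
    print(f"idle timeout {idle_timeout:>4}s -> stale reuse: {outcome}")
```

A shorter timeout narrows the window, but a busy connection has a very recent last use, so a quickly reused IP can still land inside it.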

There is one important caveat when enabling retry on reset. It also retries RSTs that have other causes, not only the waypoint -> ztunnel stale connection reuse issue we observed. Pod OOM kills, process crashes, and other unexpected RSTs all match the same retry condition. Before applying this policy, you need to check whether the target APIs are idempotent and whether duplicate execution can cause side effects in application state or external systems.
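
One way to approach that check is to audit routes before widening the retry condition. The following is a minimal sketch, assuming the idempotent HTTP methods defined by RFC 9110 and a hypothetical `Idempotency-Key` header convention under which the application deduplicates repeated requests. The route list is invented.

```python
# Methods defined as idempotent by RFC 9110.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE", "TRACE"}


def safe_to_retry_on_reset(method, headers=None):
    """Conservative check: retry only if repeating the request cannot double-apply it."""
    normalized = {k.lower(): v for k, v in (headers or {}).items()}
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Hypothetical convention: the application deduplicates on this header.
    return "idempotency-key" in normalized


# Invented routes for illustration.
routes = [
    ("GET", "/ping", {}),
    ("POST", "/payments", {}),
    ("POST", "/payments", {"Idempotency-Key": "abc-123"}),
]

for method, path, headers in routes:
    verdict = "ok" if safe_to_retry_on_reset(method, headers) else "review first"
    print(method, path, verdict)
```

Routes that come back as "review first" are the ones where a retried request could execute twice if the first attempt reached the application before the reset.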

Conclusion

By applying reset retry at the waypoint level, we were able to mitigate the UpstreamConnectionTermination 503 caused by stale connection reuse.

The core issue was not IP reuse itself. Envoy identified a connection by IP:Port, failed to discard it after the destination Pod disappeared, and then reused that stale connection when the same IP was assigned to a new Pod. Connection reuse, the absence of graceful close from ztunnel, and IP reuse combined into a trap specific to Ambient mode.

Unlike Sidecar mode, Ambient mode splits connection handling across ztunnel and waypoint. That means access logs alone were not enough. We had to correlate Envoy debug logs, pcap files, and socket state. That tracing process is something we can reuse when investigating similar 503 or reset issues in the future.

Thanks for reading.