cpp-httplib/test/dns_test_fixture.py
yhirose d14e4fc05f Reproducer test for #2431 (getaddrinfo_a use-after-free) (#2433)
* Add reproducer for #2431 (getaddrinfo_a use-after-free)

On Linux/glibc, getaddrinfo_with_timeout() runs DNS asynchronously via
getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb. When gai_suspend()
hits the connection timeout, gai_cancel() is called and the function
returns immediately -- but gai_cancel() is non-blocking and can return
EAI_NOTCANCELED, leaving the resolver worker thread alive and still
referencing the destroyed stack frame.
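
The ordering described above can be modelled in Python. This is only an
analogy: Python has no stack-use-after-return, so the sketch shows the
sequence of events (caller times out and returns, worker writes back
later) rather than the memory corruption itself. The function name, the
`slot` dict standing in for the stack-local gaicb, and the `log` list
are all hypothetical.

```python
import threading
import time


def resolve_with_timeout(timeout_sec, worker_delay_sec, log):
    """Model the hazard: return on timeout while the worker still holds the slot."""
    # Stands in for the stack-local gaicb owned by the caller.
    slot = {"result": None, "owner_returned": False}
    done = threading.Event()

    def worker():
        time.sleep(worker_delay_sec)  # the slow DNS answer
        # Write-back: in the C++ code a late write lands in a destroyed frame.
        log.append("late write" if slot["owner_returned"] else "in-time write")
        slot["result"] = "answer"
        done.set()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    finished = done.wait(timeout_sec)  # plays the role of gai_suspend()
    # A non-blocking cancel would go here; it does not stop the worker.
    slot["owner_returned"] = True  # the caller's frame is "gone" from here on
    return finished, thread
```

Calling `resolve_with_timeout(1.0, 3.0, log)` and then joining the
returned thread mirrors the timing used by the reproducer: the caller
gives up first and the worker writes afterwards.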

Adds three opt-in gtest cases (GetAddrInfoAsyncCancelTest.*) that
exercise the cancel path repeatedly. They are gated on Linux/glibc +
CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO at compile time, and on the
CPPHTTPLIB_TEST_ISSUE_2431=1 env var at runtime, so a normal `make
test` run is unaffected.

Also adds a dedicated CI job (issue-2431-repro) and a Docker-based
local runner (test/run_issue_2431_repro.sh) that sinkhole UDP/53 so
the timeout branch is taken, and run the test under ASAN/LSAN. With
the bug present these runs are expected to fail; with a fix applied
they should pass.

Refs: https://github.com/yhirose/cpp-httplib/issues/2431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix split build for #2431 reproducer tests

The new GetAddrInfoAsyncCancelTest cases call detail::getaddrinfo_with_timeout
directly. In split builds (make test_split) split.py moves the definition into
httplib.cc and strips `inline`, so the symbol is not declared in the public
httplib.h and test.cc fails to compile -- breaking the ubuntu/test-no-exceptions
CI jobs that the PR description says should be unaffected.

Add a forward declaration in test.cc, gated by the same #if as the tests
themselves, so it links against the split-build symbol without changing the
header-only build.

* Cap issue-2431 repro job at 5 minutes

The bug manifests as orphan getaddrinfo_a resolver workers that keep the
runner from completing job teardown -- the previous run had all steps
succeed in ~1m37s but then hung in "Cleaning up orphan processes" for
~57m before GitHub force-killed the job.

A job-level timeout-minutes makes the failure signal fast and predictable:
bug present -> killed at 5 min, bug fixed -> ~2 min pass. Step-level timeout
isn't enough since the hang is in post-job cleanup, not the test step.

* Enable ASAN detect_stack_use_after_return for #2431 repro

The bug is a textbook stack-use-after-return: a stack-local struct gaicb
is destroyed when getaddrinfo_with_timeout returns after gai_cancel()
yields EAI_NOTCANCELED, then the still-live resolver worker thread writes
back into the freed frame. ASAN's detect_stack_use_after_return is the
direct detector for exactly this pattern -- enabling it lets the failure
surface as a clear ASAN diagnostic during the test run instead of as an
orphan-process hang at job teardown.

* Revert ASAN detect_stack_use_after_return for #2431 repro

The option did not detect the bug in CI -- the resolver worker write
likely lands on the heap (via the gaicb's pai pointer) or happens after
the test process exits, neither of which stack-use-after-return can
catch. Roll back to relying on the job-level timeout: bug present ->
post-cleanup hangs ~8min then job-level timeout cancels at 10min total;
bug fixed -> job completes in ~2min.

* Switch issue-2431 repro to a delayed loopback DNS test fixture

The previous repro setup dropped UDP/53 outright, which made glibc's
resolver hang forever on every lookup -- the worker never actually
received a response and so never reached the buggy write-back path
that #2431 is about. As a result, neither the broken HEAD nor the
fix made any visible difference in CI: both produced "tests pass +
post-cleanup hangs ~10min" because the orphan resolver thread is a
structural property of *any* getaddrinfo path on a hung resolver,
not a property of the bug.

Replace the sinkhole with a small loopback test fixture
(test/dns_test_fixture.py, ~50 lines, stdlib only) that answers DNS
queries after a 3s delay -- longer than the test's 1s timeout. An
iptables NAT rule routes the test job's lookups to the fixture
without touching /etc/resolv.conf, so the rest of the runner's DNS
behaviour is unaffected.
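
The timing relationship can be checked without glibc or iptables. The
sketch below is not part of the fixture or the test suite: it starts a
one-shot delayed NXDOMAIN responder on an ephemeral loopback port (an
in-process stand-in for test/dns_test_fixture.py), sends one query with a
shorter client timeout, and confirms that the reply only arrives after
the client has already timed out. `build_query`, `delayed_nxdomain_server`
and `probe` are hypothetical names.

```python
import socket
import struct
import threading
import time


def build_query(name: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS A/IN query with the RD bit set."""
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN


def delayed_nxdomain_server(sock: socket.socket, delay_sec: float) -> None:
    """Answer a single query with NXDOMAIN after delay_sec."""
    query, addr = sock.recvfrom(2048)
    time.sleep(delay_sec)
    # QR=1, RCODE=3, requester's RD bit preserved; question section echoed.
    flags = struct.pack(">H", 0x8003 | ((query[2] & 0x01) << 8))
    reply = query[:2] + flags + struct.pack(">HHHH", 1, 0, 0, 0) + query[12:]
    sock.sendto(reply, addr)


def probe(delay_sec: float, timeout_sec: float):
    """Return (timed_out_first, rcode) for one query against the responder."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port, loopback only
    threading.Thread(
        target=delayed_nxdomain_server, args=(server, delay_sec), daemon=True
    ).start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(timeout_sec)
    client.sendto(build_query("example.invalid"), server.getsockname())

    reply = None
    try:
        reply, _ = client.recvfrom(2048)
    except socket.timeout:
        pass  # expected when delay_sec > timeout_sec
    timed_out = reply is None
    if timed_out:
        # Keep listening so the late reply can be observed.
        client.settimeout(delay_sec + 2.0)
        reply, _ = client.recvfrom(2048)

    rcode = reply[3] & 0x0F  # low nibble of the second flags byte
    client.close()
    server.close()
    return timed_out, rcode
```

`probe(3.0, 1.0)` uses the same delay and timeout as the reproducer;
smaller values such as `probe(0.3, 0.1)` exercise the same ordering
more quickly.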

With ASAN's detect_stack_use_after_return enabled, the worker's
late write-back into the destroyed gaicb stack frame is now caught
as a stack-use-after-return diagnostic, so the broken HEAD fails
fast at the test step (clear red) and the fix turns the same job
green in well under a minute.

The same fixture is wired into both the GitHub Actions job and the
Docker-based test/run_issue_2431_repro.sh script, so local repro on
macOS and CI repro on Linux exercise the identical path.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:17:19 +09:00

88 lines, 2.9 KiB, Python, executable file

#!/usr/bin/env python3
"""Delayed UDP responder used as a loopback test fixture.

This is a self-contained test fixture for the GetAddrInfoAsyncCancelTest
cases (reproducer for cpp-httplib issue #2431). It is NOT a general-purpose
nameserver and is only intended to run on 127.0.0.1 inside the test job's
own runner / container.

What it does
------------
Binds a UDP socket on 127.0.0.1:<port>, accepts well-formed DNS queries
from the test process, waits <delay_seconds>, then sends back a minimal
NXDOMAIN reply. The deliberate delay is what makes the bug reproducible:

  * The test calls getaddrinfo_with_timeout() with timeout_sec=1.
  * gai_suspend() returns EAI_AGAIN after 1s; the function returns and
    its stack frame is destroyed.
  * The fixture replies after <delay_seconds> (= 3s by default), so the
    glibc resolver worker thread receives the response *after* the
    caller's frame is gone and writes back into freed stack memory.
  * AddressSanitizer (with detect_stack_use_after_return=1) catches the
    write and aborts with a stack-use-after-return diagnostic.

Without this fixture the bug is hard to surface: dropping UDP/53 makes
the resolver hang forever, so the worker never receives anything and
never reaches the buggy write-back path.

Usage
-----
    python3 test/dns_test_fixture.py <port> [<delay_seconds>]

Only standard library; no third-party dependencies.
"""

import socket
import struct
import sys
import threading
import time


def serve(port: int, delay_sec: float) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    print(
        f"[dns_test_fixture] listening on 127.0.0.1:{port}, "
        f"reply delay={delay_sec}s",
        flush=True,
    )
    while True:
        try:
            data, addr = sock.recvfrom(2048)
        except OSError:
            return
        # One daemon thread per query so a pending reply never blocks recvfrom().
        threading.Thread(
            target=_reply_after_delay,
            args=(sock, data, addr, delay_sec),
            daemon=True,
        ).start()


def _reply_after_delay(sock, query: bytes, addr, delay_sec: float) -> None:
    time.sleep(delay_sec)
    # Anything shorter than the 12-byte DNS header cannot be a query; ignore it.
    if len(query) < 12:
        return
    # Header: copy transaction id, set QR=1 and RCODE=3 (NXDOMAIN) via 0x8003
    # (RA is left clear), preserve the requester's RD bit, then echo the
    # question section so glibc's resolver accepts the reply as matching its
    # outstanding query.
    txid = query[:2]
    rd_bit = query[2] & 0x01
    flags = struct.pack(">H", 0x8003 | (rd_bit << 8))
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    counts = struct.pack(">HHHH", 1, 0, 0, 0)
    question = query[12:]
    reply = txid + flags + counts + question
    try:
        sock.sendto(reply, addr)
    except OSError:
        pass


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print(__doc__, file=sys.stderr)
        sys.exit(2)
    port_arg = int(sys.argv[1])
    delay_arg = float(sys.argv[2]) if len(sys.argv) > 2 else 3.0
    serve(port_arg, delay_arg)