mirror of
https://github.com/yhirose/cpp-httplib.git
synced 2026-06-10 16:47:14 +00:00
* Add reproducer for #2431 (getaddrinfo_a use-after-free)

  On Linux/glibc, getaddrinfo_with_timeout() runs DNS asynchronously via getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb. When gai_suspend() hits the connection timeout, gai_cancel() is called and the function returns immediately, but gai_cancel() is non-blocking and can return EAI_NOTCANCELED, leaving the resolver worker thread alive and still referencing the destroyed stack frame.

  Adds three opt-in gtest cases (GetAddrInfoAsyncCancelTest.*) that exercise the cancel path repeatedly. They are gated on Linux/glibc + CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO at compile time, and on the CPPHTTPLIB_TEST_ISSUE_2431=1 env var at runtime, so a normal `make test` run is unaffected.

  Also adds a dedicated CI job (issue-2431-repro) and a Docker-based local runner (test/run_issue_2431_repro.sh) that sinkhole UDP/53 so the timeout branch is taken, and run the test under ASAN/LSAN. With the bug present these runs are expected to fail; with a fix applied they should pass.

  Refs: https://github.com/yhirose/cpp-httplib/issues/2431

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix split build for #2431 reproducer tests

  The new GetAddrInfoAsyncCancelTest cases call detail::getaddrinfo_with_timeout directly. In split builds (make test_split) split.py moves the definition into httplib.cc and strips `inline`, so the symbol is not declared in the public httplib.h and test.cc fails to compile -- breaking the ubuntu/test-no-exceptions CI jobs that the PR description says should be unaffected.

  Add a forward declaration in test.cc, gated by the same #if as the tests themselves, so it links against the split-build symbol without changing the header-only build.

* Cap issue-2431 repro job at 5 minutes

  The bug manifests as orphan getaddrinfo_a resolver workers that keep the runner from completing job teardown -- the previous run had all steps succeed in ~1m37s but then hung in "Cleaning up orphan processes" for ~57m before GitHub force-killed the job.

  A job-level timeout-minutes makes the failure signal fast and predictable: bug present -> killed at 5 min, bug fixed -> ~2 min pass. Step-level timeout isn't enough since the hang is in post-job cleanup, not the test step.

* Enable ASAN detect_stack_use_after_return for #2431 repro

  The bug is a textbook stack-use-after-return: a stack-local struct gaicb is destroyed when getaddrinfo_with_timeout returns after gai_cancel() yields EAI_NOTCANCELED, then the still-live resolver worker thread writes back into the freed frame. ASAN's detect_stack_use_after_return is the direct detector for exactly this pattern -- enabling it lets the failure surface as a clear ASAN diagnostic during the test run instead of as an orphan-process hang at job teardown.

* Revert ASAN detect_stack_use_after_return for #2431 repro

  The option did not detect the bug in CI -- the resolver worker write likely lands on the heap (via the gaicb's pai pointer) or happens after the test process exits, neither of which stack-use-after-return can catch. Roll back to relying on the job-level timeout: bug present -> post-cleanup hangs ~8min then job-level timeout cancels at 10min total; bug fixed -> job completes in ~2min.

* Switch issue-2431 repro to a delayed loopback DNS test fixture

  The previous repro setup dropped UDP/53 outright, which made glibc's resolver hang forever on every lookup -- the worker never actually received a response and so never reached the buggy write-back path that #2431 is about. As a result, neither the broken HEAD nor the fix made any visible difference in CI: both produced "tests pass + post-cleanup hangs ~10min" because the orphan resolver thread is a structural property of *any* getaddrinfo path on a hung resolver, not a property of the bug.

  Replace the sinkhole with a small loopback test fixture (test/dns_test_fixture.py, ~50 lines, stdlib only) that answers DNS queries after a 3s delay -- longer than the test's 1s timeout. An iptables NAT rule routes the test job's lookups to the fixture without touching /etc/resolv.conf, so the rest of the runner's DNS behaviour is unaffected.

  With ASAN's detect_stack_use_after_return enabled, the worker's late write-back into the destroyed gaicb stack frame is now caught as a stack-use-after-return diagnostic, so the broken HEAD fails fast at the test step (clear red) and the fix turns the same job green in well under a minute.

  Same fixture is wired into both the GitHub Actions job and the docker-based test/run_issue_2431_repro.sh script, so local repro on macOS and CI repro on Linux exercise the identical path.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
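The timing relationship the fixture relies on (reply delay longer than the caller's timeout) can be illustrated with a scaled-down sketch. This is not part of the repository: `delayed_echo_once` and `race_timeout_against_reply` are hypothetical helper names, the payload is an arbitrary byte string rather than a DNS message, and the delays are shortened for illustration. It only shows that the caller gives up first while the late reply still arrives afterwards.

```python
import socket
import threading
import time


def delayed_echo_once(sock: socket.socket, delay_sec: float) -> None:
    """Receive one datagram and echo it back after delay_sec (hypothetical helper)."""
    data, addr = sock.recvfrom(2048)
    time.sleep(delay_sec)
    sock.sendto(data, addr)


def race_timeout_against_reply(timeout_sec: float, delay_sec: float):
    """Return (timed_out_first, late_reply) for one request on loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    worker = threading.Thread(
        target=delayed_echo_once, args=(server, delay_sec), daemon=True
    )
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(b"ping", server.getsockname())
    client.settimeout(timeout_sec)
    try:
        client.recvfrom(2048)
        timed_out_first = False
    except socket.timeout:
        # The caller's deadline expired before the responder answered.
        timed_out_first = True

    late_reply = None
    if timed_out_first:
        # The reply is still on its way; wait generously to collect it.
        client.settimeout(delay_sec * 4 + 1.0)
        late_reply, _ = client.recvfrom(2048)

    worker.join()
    client.close()
    server.close()
    return timed_out_first, late_reply
```

In the real reproducer the "late reply" is consumed by glibc's resolver worker rather than by the caller, which is where the write into the already-destroyed frame comes from.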
88 lines
2.9 KiB
Python
Executable File
#!/usr/bin/env python3
"""Delayed UDP responder used as a loopback test fixture.

This is a self-contained test fixture for the GetAddrInfoAsyncCancelTest
cases (reproducer for cpp-httplib issue #2431). It is NOT a general-purpose
nameserver and is only intended to run on 127.0.0.1 inside the test job's
own runner / container.

What it does
------------
Binds a UDP socket on 127.0.0.1:<port>, accepts well-formed DNS queries
from the test process, waits <delay_seconds>, then sends back a minimal
NXDOMAIN reply. The deliberate delay is what makes the bug reproducible:

  * The test calls getaddrinfo_with_timeout() with timeout_sec=1.
  * gai_suspend() returns EAI_AGAIN after 1s; the function returns and
    its stack frame is destroyed.
  * The fixture replies after <delay_seconds> (= 3s by default), so the
    glibc resolver worker thread receives the response *after* the
    caller's frame is gone and writes back into freed stack memory.
  * AddressSanitizer (with detect_stack_use_after_return=1) catches the
    write and aborts with a stack-use-after-return diagnostic.

Without this fixture the bug is hard to surface: dropping UDP/53 makes
the resolver hang forever, so the worker never receives anything and
never reaches the buggy write-back path.

Usage
-----
    python3 test/dns_test_fixture.py <port> [<delay_seconds>]

Only standard library; no third-party dependencies.
"""

import socket
import struct
import sys
import threading
import time


def serve(port: int, delay_sec: float) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    print(
        f"[dns_test_fixture] listening on 127.0.0.1:{port}, "
        f"reply delay={delay_sec}s",
        flush=True,
    )
    while True:
        try:
            data, addr = sock.recvfrom(2048)
        except OSError:
            return
        threading.Thread(
            target=_reply_after_delay,
            args=(sock, data, addr, delay_sec),
            daemon=True,
        ).start()


def _reply_after_delay(sock, query: bytes, addr, delay_sec: float) -> None:
    time.sleep(delay_sec)
    if len(query) < 12:
        return
    # Header: copy transaction id, set QR=1 and RCODE=3 (NXDOMAIN; RA stays
    # clear), preserve the requester's RD bit, then echo the question section
    # so glibc's resolver accepts the reply as matching its outstanding query.
    txid = query[:2]
    rd_bit = query[2] & 0x01
    flags = struct.pack(">H", 0x8003 | (rd_bit << 8))
    counts = struct.pack(">HHHH", 1, 0, 0, 0)
    question = query[12:]
    reply = txid + flags + counts + question
    try:
        sock.sendto(reply, addr)
    except OSError:
        pass


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print(__doc__, file=sys.stderr)
        sys.exit(2)
    port_arg = int(sys.argv[1])
    delay_arg = float(sys.argv[2]) if len(sys.argv) > 2 else 3.0
    serve(port_arg, delay_arg)
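
The header arithmetic in `_reply_after_delay` can be checked offline, without sockets or delays. The sketch below is not part of the fixture: `encode_question`, `build_query`, `build_nxdomain_reply`, and `decode_header` are hypothetical helpers, and the query is a hand-built A/IN lookup for an arbitrary name. `build_nxdomain_reply` repeats the same flag and count packing as the fixture so the resulting bits can be decoded and inspected.

```python
import struct


def encode_question(name: str, qtype: int = 1, qclass: int = 1) -> bytes:
    """Encode a question: length-prefixed labels, root byte, QTYPE, QCLASS."""
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    )
    return labels + b"\x00" + struct.pack(">HH", qtype, qclass)


def build_query(txid: int, name: str) -> bytes:
    """Build a standard query with RD=1 (flags 0x0100) and one question."""
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    return header + encode_question(name)


def build_nxdomain_reply(query: bytes) -> bytes:
    """Same packing as the fixture: QR=1, RCODE=3, RD copied, question echoed."""
    rd_bit = query[2] & 0x01
    flags = struct.pack(">H", 0x8003 | (rd_bit << 8))
    counts = struct.pack(">HHHH", 1, 0, 0, 0)
    return query[:2] + flags + counts + query[12:]


def decode_header(message: bytes) -> dict:
    """Split the 12-byte DNS header into the fields of interest."""
    txid, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        ">HHHHHH", message[:12]
    )
    return {
        "txid": txid,
        "qr": (flags >> 15) & 0x1,
        "rd": (flags >> 8) & 0x1,
        "ra": (flags >> 7) & 0x1,
        "rcode": flags & 0xF,
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }
```

For example, `decode_header(build_nxdomain_reply(build_query(0x1234, "example.test")))` reports the original transaction id, QR and RD set, RCODE 3, and a single echoed question with no answer records.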