From d14e4fc05f8576c553b0888270e85abf4069841f Mon Sep 17 00:00:00 2001
From: yhirose
Date: Tue, 28 Apr 2026 18:17:19 +0900
Subject: [PATCH] Reproducer test for #2431 (getaddrinfo_a use-after-free) (#2433)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Add reproducer for #2431 (getaddrinfo_a use-after-free)

On Linux/glibc, getaddrinfo_with_timeout() runs DNS asynchronously via
getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb. When gai_suspend()
hits the connection timeout, gai_cancel() is called and the function
returns immediately, but gai_cancel() is non-blocking and can return
EAI_NOTCANCELED, leaving the resolver worker thread alive and still
referencing the destroyed stack frame.

Adds three opt-in gtest cases (GetAddrInfoAsyncCancelTest.*) that
exercise the cancel path repeatedly. They are gated on Linux/glibc +
CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO at compile time, and on the
CPPHTTPLIB_TEST_ISSUE_2431=1 env var at runtime, so a normal `make test`
run is unaffected.

Also adds a dedicated CI job (issue-2431-repro) and a Docker-based local
runner (test/run_issue_2431_repro.sh) that sinkhole UDP/53 so the
timeout branch is taken, and run the test under ASAN/LSAN. With the bug
present these runs are expected to fail; with a fix applied they should
pass.

Refs: https://github.com/yhirose/cpp-httplib/issues/2431

Co-Authored-By: Claude Opus 4.7 (1M context)

* Fix split build for #2431 reproducer tests

The new GetAddrInfoAsyncCancelTest cases call
detail::getaddrinfo_with_timeout directly. In split builds
(make test_split) split.py moves the definition into httplib.cc and
strips `inline`, so the symbol is not declared in the public httplib.h
and test.cc fails to compile -- breaking the ubuntu/test-no-exceptions
CI jobs that the PR description says should be unaffected.

Add a forward declaration in test.cc, gated by the same #if as the
tests themselves, so it links against the split-build symbol without
changing the header-only build.

* Cap issue-2431 repro job at 5 minutes

The bug manifests as orphan getaddrinfo_a resolver workers that keep the
runner from completing job teardown -- the previous run had all steps
succeed in ~1m37s but then hung in "Cleaning up orphan processes" for
~57m before GitHub force-killed the job.

A job-level timeout-minutes makes the failure signal fast and
predictable: bug present -> killed at 5 min, bug fixed -> ~2 min pass.
Step-level timeout isn't enough since the hang is in post-job cleanup,
not the test step.

* Enable ASAN detect_stack_use_after_return for #2431 repro

The bug is a textbook stack-use-after-return: a stack-local struct gaicb
is destroyed when getaddrinfo_with_timeout returns after gai_cancel()
yields EAI_NOTCANCELED, then the still-live resolver worker thread
writes back into the freed frame. ASAN's detect_stack_use_after_return
is the direct detector for exactly this pattern -- enabling it lets the
failure surface as a clear ASAN diagnostic during the test run instead
of as an orphan-process hang at job teardown.

* Revert ASAN detect_stack_use_after_return for #2431 repro

The option did not detect the bug in CI -- the resolver worker write
likely lands on the heap (via the gaicb's pai pointer) or happens after
the test process exits, neither of which stack-use-after-return can
catch. Roll back to relying on the job-level timeout: bug present ->
post-cleanup hangs ~8min then job-level timeout cancels at 10min total;
bug fixed -> job completes in ~2min.

* Switch issue-2431 repro to a delayed loopback DNS test fixture

The previous repro setup dropped UDP/53 outright, which made glibc's
resolver hang forever on every lookup -- the worker never actually
received a response and so never reached the buggy write-back path that
#2431 is about. As a result, neither the broken HEAD nor the fix made
any visible difference in CI: both produced "tests pass + post-cleanup
hangs ~10min" because the orphan resolver thread is a structural
property of *any* getaddrinfo path on a hung resolver, not a property of
the bug.

Replace the sinkhole with a small loopback test fixture
(test/dns_test_fixture.py, ~50 lines, stdlib only) that answers DNS
queries after a 3s delay -- longer than the test's 1s timeout. An
iptables NAT rule routes the test job's lookups to the fixture without
touching /etc/resolv.conf, so the rest of the runner's DNS behaviour is
unaffected.

With ASAN's detect_stack_use_after_return enabled, the worker's late
write-back into the destroyed gaicb stack frame is now caught as a
stack-use-after-return diagnostic, so the broken HEAD fails fast at the
test step (clear red) and the fix turns the same job green in well under
a minute.

Same fixture is wired into both the GitHub Actions job and the
docker-based test/run_issue_2431_repro.sh script, so local repro on
macOS and CI repro on Linux exercise the identical path.

---------

Co-authored-by: Claude Opus 4.7 (1M context)
---
 .github/workflows/test.yaml  |  86 ++++++++++++++++++++++
 test/dns_test_fixture.py     |  87 +++++++++++++++++++++++
 test/run_issue_2431_repro.sh | 102 ++++++++++++++++++++++++++
 test/test.cc                 | 134 +++++++++++++++++++++++++++++++++++
 4 files changed, 409 insertions(+)
 create mode 100755 test/dns_test_fixture.py
 create mode 100755 test/run_issue_2431_repro.sh

diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml
index 69cc8d7..12bee95 100644
--- a/.github/workflows/test.yaml
+++ b/.github/workflows/test.yaml
@@ -114,6 +114,92 @@ jobs:
       - name: build and run ThreadPool test
         run: cd test && make test_thread_pool && ./test_thread_pool
 
+  # Reproducer for https://github.com/yhirose/cpp-httplib/issues/2431.
+  # On Linux/glibc, getaddrinfo_with_timeout() schedules an asynchronous
+  # DNS lookup with getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb.
+  # When gai_suspend() hits the connection timeout, gai_cancel() is called
+  # but does not block; the resolver worker can later write back into the
+  # destroyed stack frame. To make the worker actually reach that write,
+  # the test job runs a loopback UDP responder (test/dns_test_fixture.py)
+  # that delays its reply past the test's 1s timeout, and uses an iptables
+  # NAT rule so glibc's lookups land on that fixture instead of a real
+  # nameserver. With ASAN's detect_stack_use_after_return enabled, the
+  # late write-back is reported as a stack-use-after-return.
+  issue-2431-repro:
+    runs-on: ubuntu-latest
+    if: >
+      (github.event_name == 'push') ||
+      (github.event_name == 'pull_request' &&
+       github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name) ||
+      (github.event_name == 'workflow_dispatch' && github.event.inputs.test_linux == 'true')
+    name: issue-2431 repro (Linux + ASAN)
+    # Bound the whole job in case anything in the test harness hangs
+    # unexpectedly. With the fixture in place a normal run is well under
+    # a minute either way (ASAN abort on broken HEAD, clean pass on fix).
+    timeout-minutes: 5
+    env:
+      DNS_FIXTURE_PORT: "15353"
+      DNS_FIXTURE_DELAY: "3"
+    steps:
+      - name: checkout
+        uses: actions/checkout@v4
+      - name: install libraries
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libssl-dev zlib1g-dev libbrotli-dev \
+            libzstd-dev libcurl4-openssl-dev iptables util-linux iproute2
+      - name: start loopback DNS test fixture
+        run: |
+          # Force glibc through its DNS code path: Ubuntu's default
+          # nsswitch short-circuits to NOTFOUND through mdns4_minimal,
+          # which would skip the buggy code entirely.
+          sudo sed -i 's/^hosts:.*/hosts: dns/' /etc/nsswitch.conf
+          # Run the loopback fixture (delayed UDP responder).
+          python3 test/dns_test_fixture.py "$DNS_FIXTURE_PORT" "$DNS_FIXTURE_DELAY" \
+            >/tmp/dns_fixture.log 2>&1 &
+          echo $! | sudo tee /tmp/dns_fixture.pid >/dev/null
+          # Wait for the fixture to start listening.
+          for _ in $(seq 1 50); do
+            if ss -lun "( sport = :$DNS_FIXTURE_PORT )" | grep -q ":$DNS_FIXTURE_PORT"; then
+              break
+            fi
+            sleep 0.1
+          done
+          ss -lun "( sport = :$DNS_FIXTURE_PORT )" | grep -q ":$DNS_FIXTURE_PORT" \
+            || { echo "fixture failed to start"; cat /tmp/dns_fixture.log; exit 1; }
+          # Send the test process's DNS lookups to the loopback fixture.
+          # NAT only the local OUTPUT chain; conntrack handles the reply path.
+          sudo iptables -t nat -I OUTPUT -p udp --dport 53 \
+            -j REDIRECT --to-port "$DNS_FIXTURE_PORT"
+          # Sanity check: a query must take at least the fixture delay
+          # and resolve to NXDOMAIN (proving traffic reaches the fixture).
+          start=$(date +%s)
+          getent hosts unresolvable-host.invalid >/dev/null 2>&1 || true
+          elapsed=$(( $(date +%s) - start ))
+          if [ "$elapsed" -lt 2 ]; then
+            echo "ERROR: lookup returned in ${elapsed}s; fixture not in path" >&2
+            exit 1
+          fi
+          echo "[ok] DNS lookups are routed to the test fixture (took ${elapsed}s)"
+      - name: build test binary
+        run: cd test && make test
+      - name: run GetAddrInfoAsyncCancelTest
+        run: |
+          cd test
+          ARCH=$(uname -m)
+          CPPHTTPLIB_TEST_ISSUE_2431=1 \
+          ASAN_OPTIONS=detect_stack_use_after_return=1 \
+          LSAN_OPTIONS=suppressions=lsan_suppressions.txt \
+          setarch "$ARCH" -R \
+            ./test --gtest_filter='GetAddrInfoAsyncCancelTest.*'
+      - name: tear down test fixture
+        if: always()
+        run: |
+          sudo iptables -t nat -F OUTPUT || true
+          if [ -f /tmp/dns_fixture.pid ]; then
+            sudo kill "$(cat /tmp/dns_fixture.pid)" 2>/dev/null || true
+          fi
+
   macos:
     runs-on: macos-latest
     if: >
diff --git a/test/dns_test_fixture.py b/test/dns_test_fixture.py
new file mode 100755
index 0000000..3bb55ca
--- /dev/null
+++ b/test/dns_test_fixture.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+"""Delayed UDP responder used as a loopback test fixture.
+
+This is a self-contained test fixture for the GetAddrInfoAsyncCancelTest
+cases (reproducer for cpp-httplib issue #2431). It is NOT a general-purpose
+nameserver and is only intended to run on 127.0.0.1 inside the test job's
+own runner / container.
+
+What it does
+------------
+Binds a UDP socket on 127.0.0.1:<port>, accepts well-formed DNS queries
+from the test process, waits <delay_sec>, then sends back a minimal
+NXDOMAIN reply. The deliberate delay is what makes the bug reproducible:
+
+  * The test calls getaddrinfo_with_timeout() with timeout_sec=1.
+  * gai_suspend() returns EAI_AGAIN after 1s; the function returns and
+    its stack frame is destroyed.
+  * The fixture replies after <delay_sec> (= 3s by default), so the
+    glibc resolver worker thread receives the response *after* the
+    caller's frame is gone and writes back into freed stack memory.
+  * AddressSanitizer (with detect_stack_use_after_return=1) catches the
+    write and aborts with a stack-use-after-return diagnostic.
+
+Without this fixture the bug is hard to surface: dropping UDP/53 makes
+the resolver hang forever, so the worker never receives anything and
+never reaches the buggy write-back path.
+
+Usage
+-----
+    python3 test/dns_test_fixture.py <port> [<delay_sec>]
+
+Only standard library; no third-party dependencies.
+"""
+
+import socket
+import struct
+import sys
+import threading
+import time
+
+
+def serve(port: int, delay_sec: float) -> None:
+    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+    sock.bind(("127.0.0.1", port))
+    print(
+        f"[dns_test_fixture] listening on 127.0.0.1:{port}, "
+        f"reply delay={delay_sec}s",
+        flush=True,
+    )
+    while True:
+        try:
+            data, addr = sock.recvfrom(2048)
+        except OSError:
+            return
+        threading.Thread(
+            target=_reply_after_delay,
+            args=(sock, data, addr, delay_sec),
+            daemon=True,
+        ).start()
+
+
+def _reply_after_delay(sock, query: bytes, addr, delay_sec: float) -> None:
+    time.sleep(delay_sec)
+    if len(query) < 12:
+        return
+    # Header: copy transaction id, set QR=1 and RCODE=3 (NXDOMAIN),
+    # preserve the requester's RD bit, then echo the question section so
+    # glibc's resolver accepts the reply as matching its outstanding query.
+    txid = query[:2]
+    rd_bit = query[2] & 0x01
+    flags = struct.pack(">H", 0x8003 | (rd_bit << 8))
+    counts = struct.pack(">HHHH", 1, 0, 0, 0)
+    question = query[12:]
+    reply = txid + flags + counts + question
+    try:
+        sock.sendto(reply, addr)
+    except OSError:
+        pass
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print(__doc__, file=sys.stderr)
+        sys.exit(2)
+    port_arg = int(sys.argv[1])
+    delay_arg = float(sys.argv[2]) if len(sys.argv) > 2 else 3.0
+    serve(port_arg, delay_arg)
diff --git a/test/run_issue_2431_repro.sh b/test/run_issue_2431_repro.sh
new file mode 100755
index 0000000..60aa9e9
--- /dev/null
+++ b/test/run_issue_2431_repro.sh
@@ -0,0 +1,102 @@
+#!/usr/bin/env bash
+# Reproducer runner for Issue #2431
+# (https://github.com/yhirose/cpp-httplib/issues/2431).
+#
+# Spins up an Ubuntu container, runs the loopback DNS test fixture
+# (test/dns_test_fixture.py), routes the container's DNS lookups to
+# that fixture via an iptables NAT rule, builds the test suite with
+# g++ + ASAN, and runs the GetAddrInfoAsyncCancelTest cases.
+#
+# Expected outcomes:
+#   - HEAD prior to the fix: ASAN reports stack-use-after-return inside
+#     getaddrinfo_with_timeout's getaddrinfo_a path during one of the
+#     GetAddrInfoAsyncCancelTest cases.
+#   - HEAD with the fix applied: all three cases PASS.
+#
+# Usage:
+#   bash test/run_issue_2431_repro.sh
+#
+# Requirements: Docker (Linux container support). The container needs
+# --privileged because the test binary uses `setarch -R` to disable ASLR
+# for ASAN compatibility, and because the test job manages iptables
+# rules inside the container.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+docker run --rm --privileged \
+  -v "$REPO_ROOT:/work" \
+  -w /work/test \
+  ubuntu:24.04 bash -c '
+set -euo pipefail
+export DEBIAN_FRONTEND=noninteractive
+
+apt-get update -qq
+apt-get install -y -qq --no-install-recommends \
+  ca-certificates g++ make pkg-config iptables iproute2 util-linux coreutils file \
+  python3 \
+  libssl-dev zlib1g-dev libbrotli-dev libzstd-dev libcurl4-openssl-dev \
+  >/dev/null
+
+# Force DNS-only resolution: Ubuntu defaults nsswitch.conf to
+# "hosts: files mdns4_minimal [NOTFOUND=return] dns ...", which
+# short-circuits to NOTFOUND before reaching glibc DNS code, so the
+# gai_cancel() branch never gets exercised.
+sed -i "s/^hosts:.*/hosts: dns/" /etc/nsswitch.conf
+
+# Start the loopback DNS test fixture (delayed UDP responder).
+DNS_FIXTURE_PORT=15353
+DNS_FIXTURE_DELAY=3
+python3 /work/test/dns_test_fixture.py "$DNS_FIXTURE_PORT" "$DNS_FIXTURE_DELAY" \
+  >/tmp/dns_fixture.log 2>&1 &
+FIXTURE_PID=$!
+
+# Route the container DNS lookups to the fixture; conntrack handles the
+# reply path automatically. /etc/resolv.conf is left untouched.
+iptables -t nat -I OUTPUT -p udp --dport 53 \
+  -j REDIRECT --to-port "$DNS_FIXTURE_PORT"
+
+trap '"'"'iptables -t nat -F OUTPUT 2>/dev/null || true; kill "$FIXTURE_PID" 2>/dev/null || true'"'"' EXIT
+
+# Wait for the fixture to start listening.
+for _ in $(seq 1 50); do
+  if ss -lun "( sport = :$DNS_FIXTURE_PORT )" | grep -q ":$DNS_FIXTURE_PORT"; then
+    break
+  fi
+  sleep 0.1
+done
+ss -lun "( sport = :$DNS_FIXTURE_PORT )" | grep -q ":$DNS_FIXTURE_PORT" || {
+  echo "ERROR: dns_test_fixture failed to start" >&2
+  cat /tmp/dns_fixture.log >&2 || true
+  exit 1
+}
+
+# Sanity check: a DNS lookup must take at least the fixture delay
+# (proving the NAT rule routes the query to the fixture).
+start=$(date +%s)
+getent hosts unresolvable-host.invalid >/dev/null 2>&1 || true
+elapsed=$(( $(date +%s) - start ))
+if [ "$elapsed" -lt 2 ]; then
+  echo "ERROR: lookup returned in ${elapsed}s; fixture not in DNS path" >&2
+  exit 1
+fi
+echo "[ok] DNS lookups are routed to the test fixture (took ${elapsed}s)"
+
+cd /work/test
+echo "=== building test binary (g++ + ASAN) ==="
+make CXX=g++ test 2>&1 | tail -5
+
+ARCH=$(uname -m)
+echo "=== running GetAddrInfoAsyncCancelTest with CPPHTTPLIB_TEST_ISSUE_2431=1 ==="
+set +e
+CPPHTTPLIB_TEST_ISSUE_2431=1 \
+ASAN_OPTIONS=detect_stack_use_after_return=1 \
+setarch "$ARCH" -R \
+  ./test --gtest_filter="GetAddrInfoAsyncCancelTest.*" 2>&1
+rc=$?
+set -e
+echo "=== test exit: $rc ==="
+exit $rc
+'
diff --git a/test/test.cc b/test/test.cc
index 30daf37..204c524 100644
--- a/test/test.cc
+++ b/test/test.cc
@@ -1549,6 +1549,140 @@ TEST(GetAddrInfoDanglingRefTest, LongTimeout) {
   std::this_thread::sleep_for(std::chrono::seconds(8));
 }
 
+#if defined(__linux__) && defined(__GLIBC__) && \
+    defined(CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO)
+
+// Forward declaration: in split builds split.py strips `inline` and moves the
+// definition into httplib.cc, so detail::getaddrinfo_with_timeout is not
+// visible from the public httplib.h. Re-declaring it here lets the tests link
+// against the symbol in both header-only and split builds.
+namespace httplib {
+namespace detail {
+int getaddrinfo_with_timeout(const char *node, const char *service,
+                             const struct addrinfo *hints,
+                             struct addrinfo **res, time_t timeout_sec);
+} // namespace detail
+} // namespace httplib
+
+// Reproducer for https://github.com/yhirose/cpp-httplib/issues/2431.
+//
+// On Linux/glibc, getaddrinfo_with_timeout() runs the lookup via
+// getaddrinfo_a(GAI_NOWAIT) using a stack-local `struct gaicb`. When the
+// gai_suspend() call hits the connection timeout the function calls
+// gai_cancel() and returns immediately. gai_cancel() is non-blocking and
+// can return EAI_NOTCANCELED, in which case the resolver worker thread is
+// still alive and still references the now-destroyed stack frame.
+//
+// Triggering the bug requires a DNS reply that arrives after the timeout,
+// so these tests are gated on CPPHTTPLIB_TEST_ISSUE_2431=1 and are skipped
+// during normal runs. test/run_issue_2431_repro.sh sets up the environment
+// and runs them in a container.
+namespace {
+bool should_run_issue_2431_tests() {
+  const char *v = getenv("CPPHTTPLIB_TEST_ISSUE_2431");
+  return v && *v && std::string(v) != "0";
+}
+
+std::string unique_unresolvable_host(int n) {
+  // .invalid is reserved (RFC 6761) and is never served by real DNS, but
+  // glibc still asks the configured nameserver, which is exactly the path
+  // we want to exercise. A unique label per call avoids the resolver cache.
+  auto t = std::chrono::steady_clock::now().time_since_epoch().count();
+  return "h-" + std::to_string(::getpid()) + "-" + std::to_string(t) + "-" +
+         std::to_string(n) + ".invalid";
+}
+} // namespace
+
+TEST(GetAddrInfoAsyncCancelTest, DirectCallSingleThread) {
+  if (!should_run_issue_2431_tests()) {
+    GTEST_SKIP()
+        << "Set CPPHTTPLIB_TEST_ISSUE_2431=1 (and sinkhole DNS) to run";
+  }
+
+  for (int i = 0; i < 8; ++i) {
+    struct addrinfo hints;
+    memset(&hints, 0, sizeof(hints));
+    hints.ai_family = AF_UNSPEC;
+    hints.ai_socktype = SOCK_STREAM;
+
+    auto host = unique_unresolvable_host(i);
+    struct addrinfo *result = nullptr;
+    int rc = detail::getaddrinfo_with_timeout(host.c_str(), "80", &hints,
+                                              &result, /*timeout_sec=*/1);
+    if (rc == 0 && result) { freeaddrinfo(result); }
+  }
+
+  // Give orphaned getaddrinfo_a worker threads a chance to write into the
+  // stack region they still believe holds their gaicb.
+  std::this_thread::sleep_for(std::chrono::seconds(3));
+}
+
+TEST(GetAddrInfoAsyncCancelTest, DirectCallMultiThread) {
+  if (!should_run_issue_2431_tests()) {
+    GTEST_SKIP()
+        << "Set CPPHTTPLIB_TEST_ISSUE_2431=1 (and sinkhole DNS) to run";
+  }
+
+  std::atomic<bool> stop{false};
+  std::vector<std::thread> threads;
+  for (int t = 0; t < 8; ++t) {
+    threads.emplace_back([t, &stop] {
+      int i = 0;
+      while (!stop.load(std::memory_order_relaxed)) {
+        struct addrinfo hints;
+        memset(&hints, 0, sizeof(hints));
+        hints.ai_family = AF_UNSPEC;
+        hints.ai_socktype = SOCK_STREAM;
+
+        auto host = unique_unresolvable_host(t * 100000 + i++);
+        struct addrinfo *result = nullptr;
+        int rc = detail::getaddrinfo_with_timeout(host.c_str(), "80", &hints,
+                                                  &result, /*timeout_sec=*/1);
+        if (rc == 0 && result) { freeaddrinfo(result); }
+      }
+    });
+  }
+
+  std::this_thread::sleep_for(std::chrono::seconds(8));
+  stop.store(true, std::memory_order_relaxed);
+  for (auto &th : threads) {
+    th.join();
+  }
+  std::this_thread::sleep_for(std::chrono::seconds(3));
+}
+
+TEST(GetAddrInfoAsyncCancelTest, ClientGetMultiThread) {
+  if (!should_run_issue_2431_tests()) {
+    GTEST_SKIP()
+        << "Set CPPHTTPLIB_TEST_ISSUE_2431=1 (and sinkhole DNS) to run";
+  }
+
+  std::atomic<bool> stop{false};
+  std::vector<std::thread> threads;
+  for (int t = 0; t < 8; ++t) {
+    threads.emplace_back([t, &stop] {
+      int i = 0;
+      while (!stop.load(std::memory_order_relaxed)) {
+        auto host = unique_unresolvable_host(t * 100000 + i++);
+        Client cli(host, 80);
+        cli.set_connection_timeout(1, 0);
+        cli.set_read_timeout(1, 0);
+        cli.set_write_timeout(1, 0);
+        (void)cli.Get("/");
+      }
+    });
+  }
+
+  std::this_thread::sleep_for(std::chrono::seconds(8));
+  stop.store(true, std::memory_order_relaxed);
+  for (auto &th : threads) {
+    th.join();
+  }
+  std::this_thread::sleep_for(std::chrono::seconds(3));
+}
+
+#endif // __linux__ && __GLIBC__ && CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO
+
 TEST(ConnectionErrorTest, InvalidHost) {
   auto host = "-abcde.com";
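
Note (not part of the patch): the reply header that test/dns_test_fixture.py packs can be checked in isolation. The sketch below mirrors the fixture's arithmetic (flag word 0x8003 plus the requester's RD bit) and decodes the result. The helper names build_nxdomain_reply, decode_flags and sample_query are hypothetical, and the set_ra switch is an illustrative addition that the fixture does not have. Under the standard DNS header layout, 0x8003 carries QR=1 and RCODE=3; the RA bit is 0x0080 and is only set here when asked for.

```python
import struct


def build_nxdomain_reply(query: bytes, set_ra: bool = False) -> bytes:
    """Build a minimal NXDOMAIN reply for a raw DNS query.

    Hypothetical helper; same packing as the fixture's _reply_after_delay,
    minus the delay and the socket. set_ra is an illustrative extra.
    """
    if len(query) < 12:
        raise ValueError("a DNS header is 12 bytes; query is too short")
    txid = query[:2]                      # echo the transaction id
    rd_bit = query[2] & 0x01              # RD is the low bit of flags byte 0
    flags = 0x8003 | (rd_bit << 8)        # QR=1, RCODE=3, RD copied
    if set_ra:
        flags |= 0x0080                   # RA lives in the second flags byte
    counts = struct.pack(">HHHH", 1, 0, 0, 0)  # QDCOUNT=1, no records
    return txid + struct.pack(">H", flags) + counts + query[12:]


def decode_flags(message: bytes) -> dict:
    """Split the 16-bit flag word of a DNS message into named fields."""
    (flags,) = struct.unpack(">H", message[2:4])
    return {
        "qr": flags >> 15,
        "rd": (flags >> 8) & 1,
        "ra": (flags >> 7) & 1,
        "rcode": flags & 0x000F,
    }


def sample_query(name: str, txid: int = 0x1234) -> bytes:
    """Assemble a one-question A/IN query with RD set (test input only)."""
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN


if __name__ == "__main__":
    query = sample_query("h-1.invalid")
    print(decode_flags(build_nxdomain_reply(query)))
    print(decode_flags(build_nxdomain_reply(query, set_ra=True)))
```

Echoing the transaction id and the question section unchanged is what lets the resolver match the reply to its outstanding query, so those two parts are copied byte for byte.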