mirror of
https://github.com/yhirose/cpp-httplib.git
synced 2026-06-10 16:47:14 +00:00
* Add reproducer for #2431 (getaddrinfo_a use-after-free)

On Linux/glibc, getaddrinfo_with_timeout() runs DNS asynchronously via getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb. When gai_suspend() hits the connection timeout, gai_cancel() is called and the function returns immediately -- but gai_cancel() is non-blocking and can return EAI_NOTCANCELED, leaving the resolver worker thread alive and still referencing the destroyed stack frame.

Adds three opt-in gtest cases (GetAddrInfoAsyncCancelTest.*) that exercise the cancel path repeatedly. They are gated on Linux/glibc + CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO at compile time, and on the CPPHTTPLIB_TEST_ISSUE_2431=1 env var at runtime, so a normal `make test` run is unaffected.

Also adds a dedicated CI job (issue-2431-repro) and a Docker-based local runner (test/run_issue_2431_repro.sh) that sinkhole UDP/53 so the timeout branch is taken, and run the test under ASAN/LSAN. With the bug present these runs are expected to fail; with a fix applied they should pass.

Refs: https://github.com/yhirose/cpp-httplib/issues/2431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix split build for #2431 reproducer tests

The new GetAddrInfoAsyncCancelTest cases call detail::getaddrinfo_with_timeout directly. In split builds (make test_split) split.py moves the definition into httplib.cc and strips `inline`, so the symbol is not declared in the public httplib.h and test.cc fails to compile -- breaking the ubuntu/test-no-exceptions CI jobs that the PR description says should be unaffected.

Add a forward declaration in test.cc, gated by the same #if as the tests themselves, so it links against the split-build symbol without changing the header-only build.

* Cap issue-2431 repro job at 5 minutes

The bug manifests as orphan getaddrinfo_a resolver workers that keep the runner from completing job teardown -- the previous run had all steps succeed in ~1m37s but then hung in "Cleaning up orphan processes" for ~57m before GitHub force-killed the job.

A job-level timeout-minutes makes the failure signal fast and predictable: bug present -> killed at 5 min, bug fixed -> ~2 min pass. Step-level timeout isn't enough since the hang is in post-job cleanup, not the test step.

* Enable ASAN detect_stack_use_after_return for #2431 repro

The bug is a textbook stack-use-after-return: a stack-local struct gaicb is destroyed when getaddrinfo_with_timeout returns after gai_cancel() yields EAI_NOTCANCELED, then the still-live resolver worker thread writes back into the freed frame.

ASAN's detect_stack_use_after_return is the direct detector for exactly this pattern -- enabling it lets the failure surface as a clear ASAN diagnostic during the test run instead of as an orphan-process hang at job teardown.

* Revert ASAN detect_stack_use_after_return for #2431 repro

The option did not detect the bug in CI -- the resolver worker write likely lands on the heap (via the gaicb's pai pointer) or happens after the test process exits, neither of which stack-use-after-return can catch.

Roll back to relying on the job-level timeout: bug present -> post-cleanup hangs ~8min then job-level timeout cancels at 10min total; bug fixed -> job completes in ~2min.

* Switch issue-2431 repro to a delayed loopback DNS test fixture

The previous repro setup dropped UDP/53 outright, which made glibc's resolver hang forever on every lookup -- the worker never actually received a response and so never reached the buggy write-back path that #2431 is about. As a result, neither the broken HEAD nor the fix made any visible difference in CI: both produced "tests pass + post-cleanup hangs ~10min" because the orphan resolver thread is a structural property of *any* getaddrinfo path on a hung resolver, not a property of the bug.

Replace the sinkhole with a small loopback test fixture (test/dns_test_fixture.py, ~50 lines, stdlib only) that answers DNS queries after a 3s delay -- longer than the test's 1s timeout. An iptables NAT rule routes the test job's lookups to the fixture without touching /etc/resolv.conf, so the rest of the runner's DNS behaviour is unaffected.

With ASAN's detect_stack_use_after_return enabled, the worker's late write-back into the destroyed gaicb stack frame is now caught as a stack-use-after-return diagnostic, so the broken HEAD fails fast at the test step (clear red) and the fix turns the same job green in well under a minute.

Same fixture is wired into both the GitHub Actions job and the docker-based test/run_issue_2431_repro.sh script, so local repro on macOS and CI repro on Linux exercise the identical path.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
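For readers who want a feel for the delayed-responder idea, here is a minimal Python sketch. It is not the contents of test/dns_test_fixture.py, which this page does not show. The names `build_nxdomain`, `_question_end` and `serve` are hypothetical, and the NXDOMAIN answer is an assumption taken from the workflow's sanity-check comment.

```python
import socket
import struct
import sys
import threading


def _question_end(msg: bytes, qdcount: int) -> int:
    """Return the offset just past the question section of a DNS message."""
    pos = 12  # the fixed DNS header is 12 bytes
    for _ in range(qdcount):
        while True:
            if pos >= len(msg):
                raise ValueError("truncated question name")
            length = msg[pos]
            if length == 0:             # root label terminates the name
                pos += 1
                break
            if length & 0xC0 == 0xC0:   # compression pointer: 2 bytes, ends the name
                pos += 2
                break
            pos += 1 + length           # ordinary label
        pos += 4                        # QTYPE + QCLASS
    if pos > len(msg):
        raise ValueError("truncated question section")
    return pos


def build_nxdomain(query: bytes) -> bytes:
    """Build an NXDOMAIN reply echoing the query's ID and question section."""
    if len(query) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    ident, flags, qdcount = struct.unpack("!HHH", query[:6])
    # QR=1 (response), opcode and RD copied from the query, RA=1, RCODE=3.
    reply_flags = 0x8000 | (flags & 0x7900) | 0x0080 | 0x0003
    header = struct.pack("!HHHHHH", ident, reply_flags, qdcount, 0, 0, 0)
    return header + query[12:_question_end(query, qdcount)]


def serve(port: int, delay: float, host: str = "127.0.0.1") -> None:
    """Answer every UDP query on host:port after `delay` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        query, client = sock.recvfrom(4096)
        try:
            reply = build_nxdomain(query)
        except ValueError:
            continue  # ignore malformed packets
        # A timer per query keeps concurrent lookups from queueing behind
        # each other, so each reply arrives `delay` seconds after its query.
        threading.Timer(delay, sock.sendto, args=(reply, client)).start()


if __name__ == "__main__" and len(sys.argv) == 3:
    serve(int(sys.argv[1]), float(sys.argv[2]))
```

Run as `python3 sketch.py 15353 3`, this mirrors the port and delay arguments the workflow passes to the real fixture. The point of replying late rather than never is that the resolver worker does receive an answer, after the caller has already timed out and returned.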
86
.github/workflows/test.yaml
vendored
@@ -114,6 +114,92 @@ jobs:
       - name: build and run ThreadPool test
         run: cd test && make test_thread_pool && ./test_thread_pool

+  # Reproducer for https://github.com/yhirose/cpp-httplib/issues/2431.
+  # On Linux/glibc, getaddrinfo_with_timeout() schedules an asynchronous
+  # DNS lookup with getaddrinfo_a(GAI_NOWAIT) using a stack-local gaicb.
+  # When gai_suspend() hits the connection timeout, gai_cancel() is called
+  # but does not block; the resolver worker can later write back into the
+  # destroyed stack frame. To make the worker actually reach that write,
+  # the test job runs a loopback UDP responder (test/dns_test_fixture.py)
+  # that delays its reply past the test's 1s timeout, and uses an iptables
+  # NAT rule so glibc's lookups land on that fixture instead of a real
+  # nameserver. With ASAN's detect_stack_use_after_return enabled, the
+  # late write-back is reported as a stack-use-after-return.
+  issue-2431-repro:
+    runs-on: ubuntu-latest
+    if: >
+      (github.event_name == 'push') ||
+      (github.event_name == 'pull_request' &&
+       github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name) ||
+      (github.event_name == 'workflow_dispatch' && github.event.inputs.test_linux == 'true')
+    name: issue-2431 repro (Linux + ASAN)
+    # Bound the whole job in case anything in the test harness hangs
+    # unexpectedly. With the fixture in place a normal run is well under
+    # a minute either way (ASAN abort on broken HEAD, clean pass on fix).
+    timeout-minutes: 5
+    env:
+      DNS_FIXTURE_PORT: "15353"
+      DNS_FIXTURE_DELAY: "3"
+    steps:
+      - name: checkout
+        uses: actions/checkout@v4
+      - name: install libraries
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libssl-dev zlib1g-dev libbrotli-dev \
+            libzstd-dev libcurl4-openssl-dev iptables util-linux iproute2
+      - name: start loopback DNS test fixture
+        run: |
+          # Force glibc through its DNS code path: Ubuntu's default
+          # nsswitch short-circuits to NOTFOUND through mdns4_minimal,
+          # which would skip the buggy code entirely.
+          sudo sed -i 's/^hosts:.*/hosts: dns/' /etc/nsswitch.conf
+          # Run the loopback fixture (delayed UDP responder).
+          python3 test/dns_test_fixture.py "$DNS_FIXTURE_PORT" "$DNS_FIXTURE_DELAY" \
+            >/tmp/dns_fixture.log 2>&1 &
+          echo $! | sudo tee /tmp/dns_fixture.pid >/dev/null
+          # Wait for the fixture to start listening.
+          for _ in $(seq 1 50); do
+            if ss -lun "( sport = :$DNS_FIXTURE_PORT )" | grep -q ":$DNS_FIXTURE_PORT"; then
+              break
+            fi
+            sleep 0.1
+          done
+          ss -lun "( sport = :$DNS_FIXTURE_PORT )" | grep -q ":$DNS_FIXTURE_PORT" \
+            || { echo "fixture failed to start"; cat /tmp/dns_fixture.log; exit 1; }
+          # Send the test process's DNS lookups to the loopback fixture.
+          # NAT only the local OUTPUT chain; conntrack handles the reply path.
+          sudo iptables -t nat -I OUTPUT -p udp --dport 53 \
+            -j REDIRECT --to-port "$DNS_FIXTURE_PORT"
+          # Sanity check: a query must take at least the fixture delay
+          # and resolve to NXDOMAIN (proving traffic reaches the fixture).
+          start=$(date +%s)
+          getent hosts unresolvable-host.invalid >/dev/null 2>&1 || true
+          elapsed=$(( $(date +%s) - start ))
+          if [ "$elapsed" -lt 2 ]; then
+            echo "ERROR: lookup returned in ${elapsed}s; fixture not in path" >&2
+            exit 1
+          fi
+          echo "[ok] DNS lookups are routed to the test fixture (took ${elapsed}s)"
+      - name: build test binary
+        run: cd test && make test
+      - name: run GetAddrInfoAsyncCancelTest
+        run: |
+          cd test
+          ARCH=$(uname -m)
+          CPPHTTPLIB_TEST_ISSUE_2431=1 \
+          ASAN_OPTIONS=detect_stack_use_after_return=1 \
+          LSAN_OPTIONS=suppressions=lsan_suppressions.txt \
+          setarch "$ARCH" -R \
+            ./test --gtest_filter='GetAddrInfoAsyncCancelTest.*'
+      - name: tear down test fixture
+        if: always()
+        run: |
+          sudo iptables -t nat -F OUTPUT || true
+          if [ -f /tmp/dns_fixture.pid ]; then
+            sudo kill "$(cat /tmp/dns_fixture.pid)" 2>/dev/null || true
+          fi
+
   macos:
     runs-on: macos-latest
     if: >