"Building a Desktop LLM App with cpp-httplib" (#2403)

This commit is contained in:
yhirose
2026-03-21 23:31:55 -04:00
committed by GitHub
parent c2bdb1c5c1
commit 7178f451a4
35 changed files with 8889 additions and 35 deletions

Under the hood, it uses blocking I/O with a thread pool. It's not built for hand
## Documentation
- [A Tour of cpp-httplib](tour/) — A step-by-step tutorial covering the basics. Start here if you're new
- [Building a Desktop LLM App](llm-app/) — A hands-on guide to building a desktop app with llama.cpp, step by step
## Stay Tuned
- [Cookbook](cookbook/) — A collection of recipes organized by topic. Jump to whatever you need


---
title: "1. Setting Up the Project Environment"
order: 1
---
Let's incrementally build a text translation REST API server using llama.cpp as the inference engine. By the end, a request like this will return a translation result.
```bash
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The weather is nice today. Shall we go for a walk?", "target_lang": "ja"}'
```
```json
{
"translation": "今日はいい天気ですね。散歩に行きましょうか?"
}
```
The "Translation API" is just one example. By swapping out the prompt, you can adapt this to any LLM application you like, such as summarization, code generation, or a chatbot.
Here's the full list of APIs the server will provide.
| Method | Path | Description | Chapter |
| -------- | ---- | ---- | -- |
| `GET` | `/health` | Returns server status | 1 |
| `POST` | `/translate` | Translates text and returns JSON | 2 |
| `POST` | `/translate/stream` | SSE streaming on a per-token basis | 3 |
| `GET` | `/models` | Model list (available / downloaded / selected) | 4 |
| `POST` | `/models/select` | Select a model (automatically downloads if not yet downloaded) | 4 |
In this chapter, let's set up the project environment. We'll fetch the dependency libraries, create the directory structure, configure the build settings, and grab the model file, so that we're ready to start writing code in the next chapter.
## Prerequisites
- A C++20-compatible compiler (GCC 10+, Clang 10+, MSVC 2019 16.8+)
- CMake 3.20 or later
- OpenSSL (used for the HTTPS client in Chapter 4. macOS: `brew install openssl`, Ubuntu: `sudo apt install libssl-dev`)
- Sufficient disk space (model files can be several GB)
## 1.1 What We Will Use
Here are the libraries we'll use.
| Library | Role |
| ----------- | ------ |
| [cpp-httplib](https://github.com/yhirose/cpp-httplib) | HTTP server/client |
| [nlohmann/json](https://github.com/nlohmann/json) | JSON parser |
| [cpp-llamalib](https://github.com/yhirose/cpp-llamalib) | llama.cpp wrapper |
| [llama.cpp](https://github.com/ggml-org/llama.cpp) | LLM inference engine |
| [webview/webview](https://github.com/webview/webview) | Desktop WebView (used in Chapter 6) |
cpp-httplib, nlohmann/json, and cpp-llamalib are header-only libraries. You could just download a single header file with `curl` and `#include` it, but in this book we use CMake's `FetchContent` to fetch them automatically. Declare them in `CMakeLists.txt`, and `cmake -B build` downloads and builds everything for you. webview is used in Chapter 6, so you don't need to worry about it for now.
## 1.2 Directory Structure
The final structure will look like this.
```ascii
translate-app/
├── CMakeLists.txt
├── models/
│ └── (GGUF files)
└── src/
└── main.cpp
```
We don't include library source code in the project. CMake's `FetchContent` fetches them automatically at build time, so all you need is your own code.
Let's create the project directory and initialize a git repository.
```bash
mkdir translate-app && cd translate-app
mkdir src models
git init
```
## 1.3 Obtaining the GGUF Model File
You need a model file for LLM inference. GGUF is the model format used by llama.cpp, and you can find many models on Hugging Face.
Let's start by trying a small model. The quantized version of Google's Gemma 2 2B (~1.6 GB) is a good starting point. It's lightweight but supports multiple languages and works well for translation tasks.
```bash
curl -L -o models/gemma-2-2b-it-Q4_K_M.gguf \
https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_M.gguf
```
In Chapter 4, we'll add the ability to download models from within the app using cpp-httplib's client functionality.
## 1.4 CMakeLists.txt
Create a `CMakeLists.txt` in the project root. By declaring dependencies with `FetchContent`, CMake will automatically download and build them for you.
<!-- data-file="CMakeLists.txt" -->
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-server CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp (LLM inference engine)
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib (HTTP server/client)
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json (JSON parser)
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib (header-only llama.cpp wrapper)
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
add_executable(translate-server src/main.cpp)
target_link_libraries(translate-server PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
)
```
`FetchContent_Declare` tells CMake where to find each library, and `FetchContent_MakeAvailable` fetches and builds them. The first `cmake -B build` will take some time because it downloads all libraries and builds llama.cpp, but subsequent runs will use the cache.
Just link with `target_link_libraries`, and each library's CMake configuration sets up include paths and build settings for you.
## 1.5 Creating the Skeleton Code
We'll use this skeleton code as a base and add functionality chapter by chapter.
<!-- data-file="main.cpp" -->
```cpp
// src/main.cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
// Health check
svr.Get("/health", [](const auto &, auto &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// Stub implementations for each endpoint (replaced with real ones in later chapters)
svr.Post("/translate",
[](const auto &req, auto &res) {
res.set_content(json{{"translation", "TODO"}}.dump(), "application/json");
});
svr.Post("/translate/stream",
[](const auto &req, auto &res) {
res.set_content("data: \"TODO\"\n\ndata: [DONE]\n\n", "text/event-stream");
});
svr.Get("/models",
[](const auto &req, auto &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const auto &req, auto &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
// Start the server
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
## 1.6 Building and Verifying
Build the project, start the server, and verify that requests work with curl.
```bash
cmake -B build
cmake --build build -j
./build/translate-server
```
From another terminal, try it with curl.
```bash
curl http://localhost:8080/health
# => {"status":"ok"}
```
If you see JSON come back, the setup is complete.
## Next Chapter
Now that the environment is set up, in the next chapter we'll implement the translation REST API on top of this skeleton. We'll run inference with llama.cpp and expose it as an HTTP endpoint with cpp-httplib.
**Next:** [Integrating llama.cpp to Build a REST API](../ch02-rest-api)

---
title: "2. Integrating llama.cpp to Build a REST API"
order: 2
---
In the skeleton from Chapter 1, `/translate` simply returned `"TODO"`. In this chapter we integrate llama.cpp inference and turn it into an API that actually returns translation results.
Calling the llama.cpp API directly makes the code quite long, so we use a thin wrapper library called [cpp-llamalib](https://github.com/yhirose/cpp-llamalib). It lets you load a model and run inference in just a few lines, keeping the focus on cpp-httplib.
## 2.1 Initializing the LLM
Simply pass the path to a model file to `llamalib::Llama`, and model loading, context creation, and sampler configuration are all taken care of. If you downloaded a different model in Chapter 1, adjust the path accordingly.
```cpp
#include <cpp-llamalib.h>
int main() {
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
// LLM inference takes time, so set a longer timeout (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
// ... Build and start the HTTP server ...
}
```
If you want to change the number of GPU layers, context length, or other settings, you can specify them via `llamalib::Options`.
```cpp
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf", {
.n_gpu_layers = 0, // CPU only
.n_ctx = 4096,
}};
```
## 2.2 The `/translate` Handler
We replace the handler that returned dummy JSON in Chapter 1 with actual inference.
```cpp
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// Parse JSON (3rd arg `false`: don't throw on failure, check with `is_discarded()`)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
// Validate required fields
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja"); // Default is Japanese
// Build the prompt and run inference
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
```
`llm.chat()` can throw exceptions during inference (for example, when the context length is exceeded). By catching them with `try/catch` and returning the error as JSON, we prevent the server from crashing.
## 2.3 Complete Code
Here is the finished code with all the changes so far.
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Load the model downloaded in Chapter 1
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
// LLM inference takes time, so set a longer timeout (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
// Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// Parse JSON (3rd arg `false`: don't throw on failure, check with `is_discarded()`)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
// Validate required fields
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja"); // Default is Japanese
// Build the prompt and run inference
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// Dummy implementations to be replaced with real ones in later chapters
svr.Get("/models",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
// Start the server (blocks until `stop()` is called)
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 2.4 Testing It Out
Rebuild and start the server, then verify that it now returns actual translation results.
```bash
cmake --build build -j
./build/translate-server
```
```bash
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
# => {"translation":"去年の春に東京を訪れた。桜が綺麗だった。"}
```
In Chapter 1 the response was `"TODO"`, but now you get an actual translation back.
## Next Chapter
The REST API we built in this chapter waits for the entire translation to complete before sending the response, so for long texts the user has to wait with no indication of progress.
In the next chapter, we use SSE (Server-Sent Events) to stream tokens back in real time as they are generated.
**Next:** [Adding Token Streaming with SSE](../ch03-sse-streaming)

---
title: "3. Adding Token Streaming with SSE"
order: 3
---
The `/translate` endpoint from Chapter 2 returned the entire translation at once after completion. This is fine for short sentences, but for longer text the user has to wait several seconds with nothing displayed.
In this chapter, we add a `/translate/stream` endpoint that uses SSE (Server-Sent Events) to return tokens in real time as they are generated. This is the same approach used by the ChatGPT and Claude APIs.
## 3.1 What is SSE?
SSE is a way to send HTTP responses as a stream. When a client sends a request, the server keeps the connection open and gradually returns events. The format is simple text.
```text
data: "去年の"
data: "春に"
data: "東京を"
data: [DONE]
```
Each line starts with `data:` and events are separated by blank lines. The Content-Type is `text/event-stream`. Tokens are sent as escaped JSON strings, so they appear enclosed in double quotes (we implement this in Section 3.3).
## 3.2 Streaming with cpp-httplib
In cpp-httplib, you can use `set_chunked_content_provider` to send responses incrementally. Each time you write to `sink.os` inside the callback, data is sent to the client.
```cpp
res.set_chunked_content_provider(
"text/event-stream",
[](size_t offset, httplib::DataSink &sink) {
sink.os << "data: hello\n\n";
sink.done();
return true;
});
```
Calling `sink.done()` ends the stream. If the client disconnects mid-stream, writing to `sink.os` will fail and `sink.os.fail()` will return `true`. You can use this to detect disconnection and abort unnecessary inference.
## 3.3 The `/translate/stream` Handler
JSON parsing and validation are the same as the `/translate` endpoint from Chapter 2. The only difference is how the response is returned. We combine the streaming callback of `llm.chat()` with `set_chunked_content_provider`.
```cpp
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
// ... JSON parsing and validation same as /translate ...
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
```
A few key points:
- When you pass a callback to `llm.chat()`, it is called each time a token is generated. If the callback returns `false`, generation is aborted
- After writing to `sink.os`, you can check whether the client is still connected with `sink.os.good()`. If the client has disconnected, it returns `false` to stop inference
- Each token is escaped as a JSON string using `json(token).dump()` before sending. This is safe even for tokens containing newlines or quotes
- The first three arguments of `dump(-1, ' ', false, ...)` are the defaults. What matters is the fourth argument, `json::error_handler_t::replace`. Since the LLM returns tokens at the subword level, multi-byte characters (such as Japanese) can be split mid-character across tokens. Passing an incomplete UTF-8 byte sequence directly to `dump()` would throw an exception, so `replace` safely substitutes them. The browser reassembles the bytes on its end, so everything displays correctly
- The entire lambda is wrapped in `try/catch`. `llm.chat()` can throw exceptions for reasons such as exceeding the context window. If an exception goes uncaught inside the lambda, the server will crash, so we return the error as an SSE event instead
- `data: [DONE]` follows the OpenAI API convention to signal the end of the stream to the client
## 3.4 Complete Code
Here is the complete code with the `/translate/stream` endpoint added to the code from Chapter 2.
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Load the GGUF model
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
// LLM inference takes time, so set a longer timeout (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
// Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// Standard translation endpoint from Chapter 2
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// JSON parsing and validation (see Chapter 2 for details)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// SSE streaming translation endpoint
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
// JSON parsing and validation (same as /translate)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
// Dummy implementations to be replaced in later chapters
svr.Get("/models",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
// Start the server (blocks until `stop()` is called)
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 3.5 Testing It Out
Build and start the server.
```bash
cmake --build build -j
./build/translate-server
```
Using curl's `-N` option to disable buffering, you can see tokens displayed in real time as they arrive.
```bash
curl -N -X POST http://localhost:8080/translate/stream \
-H "Content-Type: application/json" \
-d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
```
```text
data: "去年の"
data: "春に"
data: "東京を"
data: "訪れた"
data: "。"
data: "桜が"
data: "綺麗だった"
data: "。"
data: [DONE]
```
You should see tokens streaming in one by one. The `/translate` endpoint from Chapter 2 continues to work as well.
## Next Chapter
The server's translation functionality is now complete. In the next chapter, we use cpp-httplib's client functionality to add the ability to fetch and manage models from Hugging Face.
**Next:** [Adding Model Download and Management](../ch04-model-management)

---
title: "4. Adding Model Download and Management"
order: 4
---
By the end of Chapter 3, the server's translation functionality was fully in place. However, the only model file available is the one we manually downloaded in Chapter 1. In this chapter, we'll use cpp-httplib's **client functionality** to enable downloading and switching Hugging Face models from within the app.
Once complete, you'll be able to manage models with requests like these:
```bash
# Get the list of available models
curl http://localhost:8080/models
```
```json
{
"models": [
{"name": "gemma-2-2b-it", "params": "2B", "size": "1.6 GB", "downloaded": true, "selected": true},
{"name": "gemma-2-9b-it", "params": "9B", "size": "5.8 GB", "downloaded": false, "selected": false},
{"name": "Llama-3.1-8B-Instruct", "params": "8B", "size": "4.9 GB", "downloaded": false, "selected": false}
]
}
```
```bash
# Select a different model (automatically downloads if not yet available)
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-9b-it"}'
```
```text
data: {"status":"downloading","progress":0}
data: {"status":"downloading","progress":12}
...
data: {"status":"downloading","progress":100}
data: {"status":"loading"}
data: {"status":"ready"}
```
## 4.1 httplib::Client Basics
So far we've only used `httplib::Server`, but cpp-httplib also provides client functionality. Since Hugging Face uses HTTPS, we need a TLS-capable client.
```cpp
#include <httplib.h>
// Including the URL scheme automatically uses SSLClient
httplib::Client cli("https://huggingface.co");
// Automatically follow redirects (Hugging Face redirects to a CDN)
cli.set_follow_location(true);
auto res = cli.Get("/api/models");
if (res && res->status == 200) {
std::cout << res->body << std::endl;
}
```
To use HTTPS, you need to enable OpenSSL at build time. Add the following to your `CMakeLists.txt`:
```cmake
find_package(OpenSSL REQUIRED)
target_link_libraries(translate-server PRIVATE OpenSSL::SSL OpenSSL::Crypto)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)
# macOS: required for loading system certificates
if(APPLE)
target_link_libraries(translate-server PRIVATE "-framework CoreFoundation" "-framework Security")
endif()
```
Defining `CPPHTTPLIB_OPENSSL_SUPPORT` enables `httplib::Client("https://...")` to make TLS connections. On macOS, you also need to link the CoreFoundation and Security frameworks to access the system certificate store. See Section 4.8 for the complete `CMakeLists.txt`.
## 4.2 Defining the Model List
Let's define the list of models that the app can handle. Here are three models we've verified for translation tasks.
```cpp
struct ModelInfo {
std::string name; // Display name
std::string params; // Parameter count
std::string size; // GGUF Q4 size
std::string repo; // Hugging Face repository
std::string filename; // GGUF filename
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
```
## 4.3 Model Storage Location
Up through Chapter 3, we stored models in the `models/` directory within the project. However, when managing multiple models, a dedicated app directory makes more sense. On macOS/Linux we use `~/.translate-app/models/`, and on Windows we use `%APPDATA%\translate-app\models\`.
```cpp
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
```
If the environment variable isn't set, it falls back to the current directory. The app creates this directory at startup (`create_directories` won't error even if it already exists).
## 4.4 Rewriting Model Initialization
We rewrite the model initialization at the beginning of `main()`. In Chapter 1 we hardcoded the path, but from here on we support model switching. We track the currently loaded filename in `selected_model` and load the first entry in `MODELS` at startup. The `GET /models` and `POST /models/select` handlers reference and update this variable.
Since cpp-httplib runs handlers concurrently on a thread pool, reassigning `llm` while another thread is calling `llm.chat()` would crash. We add a `std::mutex` to protect against this.
```cpp
int main() {
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
// Automatically download the default model if not yet present
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
std::mutex llm_mutex; // Protect access during model switching
// ...
}
```
This ensures that users don't need to manually download models with curl on first launch. It uses the `download_model` function from Section 4.6 and displays progress on the console.
## 4.5 The `GET /models` Handler
This returns the model list with information about whether each model has been downloaded and whether it's currently selected.
```cpp
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = get_models_dir() / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
```
## 4.6 Downloading Large Files
GGUF models are several gigabytes, so we can't load the entire file into memory. By passing callbacks to `httplib::Client::Get`, we can receive data chunk by chunk.
```cpp
// content_receiver: callback that receives data chunks
// progress: download progress callback
cli.Get(url,
[&](const char *data, size_t len) { // content_receiver
ofs.write(data, len);
return true; // returning false aborts the download
},
[&](size_t current, size_t total) { // progress
int pct = total ? (int)(current * 100 / total) : 0;
std::cout << pct << "%" << std::endl;
return true; // returning false aborts the download
});
```
Let's use this to create a function that downloads models from Hugging Face.
```cpp
#include <filesystem>
#include <fstream>
// Download a model and report progress via progress_cb.
// If progress_cb returns false, the download is aborted.
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true);
cli.set_read_timeout(std::chrono::hours(1));
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
[&](size_t current, size_t total) {
return progress_cb(total ? (int)(current * 100 / total) : 0);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
// Write to .tmp first, then rename, so that an incomplete file
// is never mistaken for a usable model if the download is interrupted
std::filesystem::rename(tmp_path, path);
return true;
}
```
## 4.7 The `/models/select` Handler
This handles model selection requests. We always respond with SSE, reporting status in sequence: download progress, loading, and ready.
```cpp
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
// Find the model in the list
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
// Always respond with SSE (same format whether already downloaded or not)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
// SSE event sending helper
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
// Download if not yet present (report progress via SSE)
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
return sink.os.good(); // Abort download on client disconnect
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
// Load and switch to the model
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
```
A few notes:
- We send SSE events directly from the `download_model` progress callback. This is an application of `set_chunked_content_provider` + `sink.os` from Chapter 3
- Since the callback returns `sink.os.good()`, the download stops if the client disconnects. The cancel button we add in Chapter 5 uses this
- When we update `selected_model`, it's reflected in the `selected` flag of `GET /models`
- The `llm` reassignment is protected by `llm_mutex`. The `/translate` and `/translate/stream` handlers also lock the same mutex, so inference can't run during a model switch (see the complete code)
## 4.8 Complete Code
Here is the complete code with model management added to the Chapter 3 code.
<details>
<summary data-file="CMakeLists.txt">Complete code (CMakeLists.txt)</summary>
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-server CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
find_package(OpenSSL REQUIRED)
add_executable(translate-server src/main.cpp)
target_link_libraries(translate-server PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
OpenSSL::SSL OpenSSL::Crypto
)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)
if(APPLE)
target_link_libraries(translate-server PRIVATE
"-framework CoreFoundation"
"-framework Security"
)
endif()
```
</details>
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <algorithm>
#include <csignal>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mutex>
using json = nlohmann::json;
// -------------------------------------------------------------------------
// Model definitions
// -------------------------------------------------------------------------
struct ModelInfo {
std::string name;
std::string params;
std::string size;
std::string repo;
std::string filename;
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
// -------------------------------------------------------------------------
// Model storage directory
// -------------------------------------------------------------------------
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
// -------------------------------------------------------------------------
// Model download
// -------------------------------------------------------------------------
// If progress_cb returns false, the download is aborted
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true); // Hugging Face redirects to a CDN
cli.set_read_timeout(std::chrono::hours(1)); // Set a long timeout for large models
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
// content_receiver: receive data chunk by chunk and write to file
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
// progress: report download progress (returning false aborts)
[&, last_pct = -1](size_t current, size_t total) mutable {
int pct = total ? (int)(current * 100 / total) : 0;
if (pct == last_pct) return true; // Skip if same value
last_pct = pct;
return progress_cb(pct);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
// Rename after download completes
std::filesystem::rename(tmp_path, path);
return true;
}
// -------------------------------------------------------------------------
// Server
// -------------------------------------------------------------------------
httplib::Server svr;
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Create the model storage directory
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
// Automatically download the default model if not yet present
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
std::mutex llm_mutex; // Protect access during model switching
// Set a long timeout since LLM inference takes time (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// --- Translation endpoint (Chapter 2) ------------------------------------
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// JSON parsing and validation (see Chapter 2 for details)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
std::lock_guard<std::mutex> lock(llm_mutex);
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// --- SSE streaming translation (Chapter 3) -------------------------------
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
std::lock_guard<std::mutex> lock(llm_mutex);
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
// --- Model list (Chapter 4) ----------------------------------------------
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto models_dir = get_models_dir();
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = models_dir / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
// --- Model selection (Chapter 4) -----------------------------------------
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
// Always respond with SSE (same format whether already downloaded or not)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
// SSE event sending helper
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
// Download if not yet present (report progress via SSE)
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
return sink.os.good(); // Abort download on client disconnect
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
// Load and switch to the model
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 4.9 Testing
Since we added OpenSSL configuration to CMakeLists.txt, we need to re-run CMake before building.
```bash
cmake -B build
cmake --build build -j
./build/translate-server
```
### Checking the Model List
```bash
curl http://localhost:8080/models
```
The gemma-2-2b-it model downloaded in Chapter 1 should show `downloaded: true` and `selected: true`.
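The response should look roughly like this (pretty-printed and with keys reordered for readability; the `downloaded` flags reflect whatever is actually in your models directory):
```json
{
  "models": [
    {"name": "gemma-2-2b-it", "params": "2B", "size": "1.6 GB",
     "downloaded": true, "selected": true},
    {"name": "gemma-2-9b-it", "params": "9B", "size": "5.8 GB",
     "downloaded": false, "selected": false},
    {"name": "Llama-3.1-8B-Instruct", "params": "8B", "size": "4.9 GB",
     "downloaded": false, "selected": false}
  ]
}
```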
### Switching to a Different Model
```bash
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-9b-it"}'
```
Download progress streams via SSE, and `"ready"` appears when it's done.
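Each event is a single `data:` line. The stream looks roughly like this (progress values will differ; nlohmann/json serializes object keys in alphabetical order):
```
data: {"progress":1,"status":"downloading"}
data: {"progress":2,"status":"downloading"}
...
data: {"progress":100,"status":"downloading"}
data: {"status":"loading"}
data: {"status":"ready"}
```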
### Comparing Translations Across Models
Let's translate the same sentence with different models.
```bash
# Translate with gemma-2-9b-it (the model we just switched to)
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
# Switch back to gemma-2-2b-it
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-2b-it"}'
# Translate the same sentence
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
```
Translation results vary depending on the model, even with the same code and the same prompt. Since cpp-llamalib automatically applies the appropriate chat template for each model, no code changes are needed.
## Next Chapter
The server's main features are now complete: REST API, SSE streaming, and model download and switching. In the next chapter, we'll add static file serving and build a Web UI you can use from a browser.
**Next:** [Adding a Web UI](../ch05-web-ui)

---
title: "6. Turning It into a Desktop App with WebView"
order: 6
---
In Chapter 5, we completed a translation app you can use from a browser. But every time you want to use it, you have to start the server and then open the URL in a browser. Wouldn't it be nice to just double-click and start using it, like a normal app?
In this chapter, we'll do two things:
1. **WebView integration** — Use [webview/webview](https://github.com/webview/webview) to turn it into a desktop app that runs without a browser
2. **Single binary packaging** — Use [cpp-embedlib](https://github.com/yhirose/cpp-embedlib) to embed HTML/CSS/JS into the binary, making the distributable a single file
When finished, you'll be able to just run `./translate-app` to open a window and start translating.
![Desktop App](../app.png#large-center)
The model downloads automatically on first launch, so the only thing you need to give users is the single binary.
## 6.1 Introducing webview/webview
[webview/webview](https://github.com/webview/webview) is a library that lets you use the OS's native WebView component (WKWebView on macOS, WebKitGTK on Linux, WebView2 on Windows) from C/C++. Unlike Electron, it doesn't bundle its own browser, so the impact on binary size is negligible.
We'll fetch it with CMake. Add the following to your `CMakeLists.txt`:
```cmake
# webview/webview
FetchContent_Declare(webview
GIT_REPOSITORY https://github.com/webview/webview
GIT_TAG master
)
FetchContent_MakeAvailable(webview)
```
This makes the `webview::core` CMake target available. When you link it with `target_link_libraries`, it automatically sets up include paths and platform-specific frameworks.
> **macOS**: No additional dependencies are needed. WKWebView is built into the system.
>
> **Linux**: WebKitGTK is required. Install it with `sudo apt install libwebkit2gtk-4.1-dev`.
>
> **Windows**: The WebView2 runtime is required. It comes pre-installed on Windows 11. For Windows 10, download it from the [official Microsoft website](https://developer.microsoft.com/en-us/microsoft-edge/webview2/).
## 6.2 Running the Server on a Background Thread
Up through Chapter 5, the server's `listen()` was blocking the main thread. To use WebView, we need to run the server on a separate thread and run the WebView event loop on the main thread.
```cpp
#include "webview/webview.h"
#include <thread>
int main() {
// ... (server setup is the same as Chapter 5) ...
// Start the server on a background thread
auto port = svr.bind_to_any_port("127.0.0.1");
std::thread server_thread([&]() { svr.listen_after_bind(); });
std::cout << "Listening on http://127.0.0.1:" << port << std::endl;
// Display the UI with WebView
webview::webview w(false, nullptr);
w.set_title("Translate App");
w.set_size(1024, 768, WEBVIEW_HINT_NONE);
w.navigate("http://127.0.0.1:" + std::to_string(port));
w.run(); // Block until the window is closed
// Stop the server when the window is closed
svr.stop();
server_thread.join();
}
```
Let's look at the key points:
- **`bind_to_any_port`** — Instead of `listen("127.0.0.1", 8080)`, we let the OS choose an available port. Since desktop apps can be launched multiple times, using a fixed port would cause conflicts
- **`listen_after_bind`** — Starts accepting requests on the port reserved by `bind_to_any_port`. While `listen()` does bind and listen in one call, we need to know the port number first, so we split the operations
- **Shutdown order** — When the WebView window is closed, we stop the server with `svr.stop()` and wait for the thread to finish with `server_thread.join()`. If we called `join()` before `stop()`, the program would hang: `listen_after_bind()` doesn't return until the server is stopped, so the thread would never finish
The `signal_handler` from Chapter 5 is no longer needed. In a desktop app, closing the window means terminating the application.
## 6.3 Embedding Static Files with cpp-embedlib
In Chapter 5, we served files from the `public/` directory, so you'd need to distribute `public/` alongside the binary. With [cpp-embedlib](https://github.com/yhirose/cpp-embedlib), you can embed HTML, CSS, and JavaScript into the binary, packaging the distributable into a single file.
### CMakeLists.txt
Fetch cpp-embedlib and embed `public/`:
```cmake
# cpp-embedlib
FetchContent_Declare(cpp-embedlib
GIT_REPOSITORY https://github.com/yhirose/cpp-embedlib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp-embedlib)
# Embed the public/ directory into the binary
cpp_embedlib_add(WebAssets
FOLDER ${CMAKE_CURRENT_SOURCE_DIR}/public
NAMESPACE Web
)
target_link_libraries(translate-app PRIVATE
WebAssets # Embedded files
cpp-embedlib-httplib # cpp-httplib integration
)
```
`cpp_embedlib_add` converts the files under `public/` into binary data at compile time and creates a static library called `WebAssets`. When linked, you can access the embedded files through a `Web::FS` object. `cpp-embedlib-httplib` is a helper library that provides the `httplib::mount()` function.
### Replacing set_mount_point with httplib::mount
Simply replace Chapter 5's `set_mount_point` with cpp-embedlib's `httplib::mount`:
```cpp
#include <cpp-embedlib-httplib.h>
#include "WebAssets.h"
// Chapter 5:
// svr.set_mount_point("/", "./public");
// Chapter 6:
httplib::mount(svr, Web::FS);
```
`httplib::mount` registers handlers that serve the files embedded in `Web::FS` over HTTP. MIME types are automatically determined from file extensions, so there's no need to manually set `Content-Type`.
The file contents are directly mapped to the binary's data segment, so no memory copies or heap allocations occur.
## 6.4 macOS: Adding the Edit Menu
If you try to paste text into the input field with `Cmd+V`, you'll find it doesn't work. On macOS, keyboard shortcuts like `Cmd+V` (paste) and `Cmd+C` (copy) are routed through the application's menu bar. Since webview/webview doesn't create one, these shortcuts never reach the WebView. We need to add a macOS Edit menu using the Objective-C runtime:
```cpp
#ifdef __APPLE__
#include <objc/objc-runtime.h>
void setup_macos_edit_menu() {
auto cls = [](const char *n) { return (id)objc_getClass(n); };
auto sel = sel_registerName;
auto msg = reinterpret_cast<id (*)(id, SEL)>(objc_msgSend);
auto msg_s = reinterpret_cast<id (*)(id, SEL, const char *)>(objc_msgSend);
auto msg_id = reinterpret_cast<id (*)(id, SEL, id)>(objc_msgSend);
auto msg_v = reinterpret_cast<void (*)(id, SEL, id)>(objc_msgSend);
auto msg_mi = reinterpret_cast<id (*)(id, SEL, id, SEL, id)>(objc_msgSend);
auto str = [&](const char *s) {
return msg_s(cls("NSString"), sel("stringWithUTF8String:"), s);
};
id app = msg(cls("NSApplication"), sel("sharedApplication"));
id mainMenu = msg(msg(cls("NSMenu"), sel("alloc")), sel("init"));
id editItem = msg(msg(cls("NSMenuItem"), sel("alloc")), sel("init"));
id editMenu = msg_id(msg(cls("NSMenu"), sel("alloc")),
sel("initWithTitle:"), str("Edit"));
struct { const char *title; const char *action; const char *key; } items[] = {
{"Undo", "undo:", "z"},
{"Redo", "redo:", "Z"},
{"Cut", "cut:", "x"},
{"Copy", "copy:", "c"},
{"Paste", "paste:", "v"},
{"Select All", "selectAll:", "a"},
};
for (auto &[title, action, key] : items) {
id mi = msg_mi(msg(cls("NSMenuItem"), sel("alloc")),
sel("initWithTitle:action:keyEquivalent:"),
str(title), sel(action), str(key));
msg_v(editMenu, sel("addItem:"), mi);
}
msg_v(editItem, sel("setSubmenu:"), editMenu);
msg_v(mainMenu, sel("addItem:"), editItem);
msg_v(app, sel("setMainMenu:"), mainMenu);
}
#endif
```
Call this before `w.run()`:
```cpp
#ifdef __APPLE__
setup_macos_edit_menu();
#endif
w.run();
```
On Windows and Linux, keyboard shortcuts are delivered directly to the focused control without going through the menu bar, so this workaround is macOS-specific.
## 6.5 Complete Code
<details>
<summary data-file="CMakeLists.txt">Complete code (CMakeLists.txt)</summary>
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-app CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
# webview/webview
FetchContent_Declare(webview
GIT_REPOSITORY https://github.com/webview/webview
GIT_TAG master
)
FetchContent_MakeAvailable(webview)
# cpp-embedlib
FetchContent_Declare(cpp-embedlib
GIT_REPOSITORY https://github.com/yhirose/cpp-embedlib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp-embedlib)
# Embed the public/ directory into the binary
cpp_embedlib_add(WebAssets
FOLDER ${CMAKE_CURRENT_SOURCE_DIR}/public
NAMESPACE Web
)
find_package(OpenSSL REQUIRED)
add_executable(translate-app src/main.cpp)
target_link_libraries(translate-app PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
OpenSSL::SSL OpenSSL::Crypto
WebAssets
cpp-embedlib-httplib
webview::core
)
if(APPLE)
target_link_libraries(translate-app PRIVATE
"-framework CoreFoundation"
"-framework Security"
)
endif()
target_compile_definitions(translate-app PRIVATE
CPPHTTPLIB_OPENSSL_SUPPORT
)
```
</details>
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <cpp-embedlib-httplib.h>
#include "WebAssets.h"
#include "webview/webview.h"
#ifdef __APPLE__
#include <objc/objc-runtime.h>
#endif
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mutex>
#include <thread>
using json = nlohmann::json;
// -------------------------------------------------------------------------
// macOS Edit menu (Cmd+C/V/X/A require an Edit menu on macOS)
// -------------------------------------------------------------------------
#ifdef __APPLE__
void setup_macos_edit_menu() {
auto cls = [](const char *n) { return (id)objc_getClass(n); };
auto sel = sel_registerName;
auto msg = reinterpret_cast<id (*)(id, SEL)>(objc_msgSend);
auto msg_s = reinterpret_cast<id (*)(id, SEL, const char *)>(objc_msgSend);
auto msg_id = reinterpret_cast<id (*)(id, SEL, id)>(objc_msgSend);
auto msg_v = reinterpret_cast<void (*)(id, SEL, id)>(objc_msgSend);
auto msg_mi = reinterpret_cast<id (*)(id, SEL, id, SEL, id)>(objc_msgSend);
auto str = [&](const char *s) {
return msg_s(cls("NSString"), sel("stringWithUTF8String:"), s);
};
id app = msg(cls("NSApplication"), sel("sharedApplication"));
id mainMenu = msg(msg(cls("NSMenu"), sel("alloc")), sel("init"));
id editItem = msg(msg(cls("NSMenuItem"), sel("alloc")), sel("init"));
id editMenu = msg_id(msg(cls("NSMenu"), sel("alloc")),
sel("initWithTitle:"), str("Edit"));
struct { const char *title; const char *action; const char *key; } items[] = {
{"Undo", "undo:", "z"},
{"Redo", "redo:", "Z"},
{"Cut", "cut:", "x"},
{"Copy", "copy:", "c"},
{"Paste", "paste:", "v"},
{"Select All", "selectAll:", "a"},
};
for (auto &[title, action, key] : items) {
id mi = msg_mi(msg(cls("NSMenuItem"), sel("alloc")),
sel("initWithTitle:action:keyEquivalent:"),
str(title), sel(action), str(key));
msg_v(editMenu, sel("addItem:"), mi);
}
msg_v(editItem, sel("setSubmenu:"), editMenu);
msg_v(mainMenu, sel("addItem:"), editItem);
msg_v(app, sel("setMainMenu:"), mainMenu);
}
#endif
// -------------------------------------------------------------------------
// Model definitions
// -------------------------------------------------------------------------
struct ModelInfo {
std::string name;
std::string params;
std::string size;
std::string repo;
std::string filename;
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
// -------------------------------------------------------------------------
// Model storage directory
// -------------------------------------------------------------------------
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
// -------------------------------------------------------------------------
// Model download
// -------------------------------------------------------------------------
// Abort the download if progress_cb returns false
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true); // Hugging Face redirects to a CDN
cli.set_read_timeout(std::chrono::hours(1)); // Long timeout for large models
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
// content_receiver: Receive data chunk by chunk and write to file
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
// progress: Report download progress (return false to abort)
[&, last_pct = -1](size_t current, size_t total) mutable {
int pct = total ? (int)(current * 100 / total) : 0;
if (pct == last_pct) return true; // Skip if the value hasn't changed
last_pct = pct;
return progress_cb(pct);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
// Rename after download completes
std::filesystem::rename(tmp_path, path);
return true;
}
// -------------------------------------------------------------------------
// Server
// -------------------------------------------------------------------------
int main() {
httplib::Server svr;
// Create the model storage directory
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
// Auto-download the default model if not already present
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
std::mutex llm_mutex; // Protect access during model switching
// Set a long timeout since LLM inference takes time (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// --- Translation endpoint (Chapter 2) ------------------------------------
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
std::lock_guard<std::mutex> lock(llm_mutex);
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// --- SSE streaming translation (Chapter 3) -------------------------------
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
std::lock_guard<std::mutex> lock(llm_mutex);
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
// --- Model list (Chapter 4) ----------------------------------------------
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto models_dir = get_models_dir();
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = models_dir / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
// --- Model selection (Chapter 4) -----------------------------------------
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
// Always respond with SSE (same format whether downloaded or not)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
// SSE event sending helper
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
// Download if not yet downloaded (report progress via SSE)
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
return sink.os.good(); // Abort download on client disconnect
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
// Load and switch to the model
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
// --- Embedded file serving (Chapter 6) ------------------------------------
// Chapter 5: svr.set_mount_point("/", "./public");
httplib::mount(svr, Web::FS);
// Start the server on a background thread
auto port = svr.bind_to_any_port("127.0.0.1");
std::thread server_thread([&]() { svr.listen_after_bind(); });
std::cout << "Listening on http://127.0.0.1:" << port << std::endl;
// Display the UI with WebView
webview::webview w(false, nullptr);
w.set_title("Translate App");
w.set_size(1024, 768, WEBVIEW_HINT_NONE);
w.navigate("http://127.0.0.1:" + std::to_string(port));
#ifdef __APPLE__
setup_macos_edit_menu();
#endif
w.run(); // Block until the window is closed
// Stop the server when the window is closed
svr.stop();
server_thread.join();
}
```
</details>
To summarize the changes from Chapter 5:
- `#include <csignal>` replaced with `#include <thread>`, `<cpp-embedlib-httplib.h>`, `"WebAssets.h"`, `"webview/webview.h"`
- Removed the `signal_handler` function
- `svr.set_mount_point("/", "./public")` replaced with `httplib::mount(svr, Web::FS)`
- `svr.listen("127.0.0.1", 8080)` replaced with `bind_to_any_port` + `listen_after_bind` + WebView event loop
Not a single line of handler code has changed. The REST API, SSE streaming, and model management built through Chapter 5 all work as-is.
## 6.6 Building and Testing
```bash
cmake -B build
cmake --build build -j
```
Launch the app:
```bash
./build/translate-app
```
No browser is needed. A window opens automatically. The same UI from Chapter 5 appears as-is, and translation and model switching all work just the same.
When you close the window, the server shuts down automatically. There's no need for `Ctrl+C`.
### What Needs to Be Distributed
You only need to distribute:
- The single `translate-app` binary
That's it. You don't need the `public/` directory. HTML, CSS, and JavaScript are embedded in the binary. Model files download automatically on first launch, so there's no need to ask users to prepare anything in advance.
## Next Chapter
Congratulations! 🎉
In Chapter 1, `/health` just returned `{"status":"ok"}`. Now we have a desktop app where you type text and translations stream in real time, pick a different model from a dropdown and it downloads automatically, and closing the window cleanly shuts everything down — all in a single distributable binary.
What we changed in this chapter was just the static file serving and the server startup. Not a single line of handler code changed. The REST API, SSE streaming, and model management we built through Chapter 5 all work as a desktop app, as-is.
In the next chapter, we'll shift perspective and read through the code of llama.cpp's own `llama-server`. Let's compare our simple server with a production-quality one and see what design decisions differ and why.
**Next:** [Reading the llama.cpp Server Source Code](../ch07-code-reading)

@@ -0,0 +1,154 @@
---
title: "7. Reading the llama.cpp Server Source Code"
order: 7
---
Over the course of six chapters, we built a translation desktop app from scratch. We have a working product, but it's ultimately a "learning-oriented" implementation. So how does "production-quality" code differ? Let's read the source code of `llama-server`, the official server bundled with llama.cpp, and compare.
`llama-server` is located at `llama.cpp/tools/server/`. It uses the same cpp-httplib, so you can read the code the same way as in the previous chapters.
## 7.1 Source Code Location
```ascii
llama.cpp/tools/server/
├── server.cpp # Main server implementation
├── httplib.h # cpp-httplib (bundled version)
└── ...
```
The code is contained in a single `server.cpp`. It runs to several thousand lines, but once you understand the structure, you can narrow down the parts worth reading.
## 7.2 OpenAI-Compatible API
The biggest difference between the server we built and `llama-server` is the API design.
**Our API:**
```text
POST /translate → {"translation": "..."}
POST /translate/stream → SSE: data: "token"
```
**llama-server's API:**
```text
POST /v1/chat/completions → OpenAI-compatible JSON
POST /v1/completions → OpenAI-compatible JSON
POST /v1/embeddings → Text embedding vectors
```
`llama-server` conforms to [OpenAI's API specification](https://platform.openai.com/docs/api-reference). This means OpenAI's official client libraries (such as the Python `openai` package) work out of the box.
```python
# Example of connecting to llama-server with the OpenAI client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}]
)
```
Compatibility with existing tools and libraries is a big design decision. We designed a simple translation-specific API, but if you're building a general-purpose server, OpenAI compatibility has become the de facto standard.
## 7.3 Concurrent Request Handling
Our server processes requests one at a time. If another request arrives while a translation is in progress, it waits until the previous inference finishes. This is fine for a desktop app used by one person, but it becomes a problem for a server shared by multiple users.
`llama-server` handles concurrent requests through a mechanism called **slots**.
![llama-server's slot management](../slots.svg#half)
The key point is that tokens from each slot are not inferred **one by one in sequence**, but rather **all at once in a single batch**. GPUs excel at parallel processing, so processing two users simultaneously takes almost the same time as processing one. This is called "continuous batching."
In our server, cpp-httplib's thread pool assigns one thread per request, but the inference itself runs single-threaded inside `llm.chat()`. `llama-server` consolidates this inference step into a shared batch processing loop.
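The loop can be sketched in a few lines. This is an illustrative simulation (the `Slot` struct and token flow are made up for this sketch, not llama-server's actual types): each active slot contributes one position to a shared batch, and one decode step advances every slot at once.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a "slot" holds the tokens one request still has
// to emit, plus the tokens emitted so far (hypothetical structure).
struct Slot {
    std::vector<int> pending;   // tokens this request still has to emit
    std::vector<int> generated; // tokens emitted so far
    bool active() const { return !pending.empty(); }
};

// One "decode" step: every active slot advances by exactly one token.
// In llama-server this corresponds to a single decode over a shared
// batch, so N users cost roughly the same wall time as one.
inline std::size_t batch_step(std::vector<Slot> &slots) {
    std::size_t batched = 0;
    for (auto &s : slots) {
        if (!s.active()) continue;          // idle slots are skipped
        s.generated.push_back(s.pending.front());
        s.pending.erase(s.pending.begin());
        ++batched;                          // one position in the batch
    }
    return batched;                         // batch size this iteration
}
```

Calling `batch_step` in a loop until it returns 0 drains all slots; the return value corresponds to the batch size a single decode call would process.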
## 7.4 Differences in SSE Format
The streaming mechanism itself is the same (`set_chunked_content_provider` + SSE), but the data format differs.
**Our format:**
```text
data: "去年の"
data: "春に"
data: [DONE]
```
**llama-server (OpenAI-compatible):**
```text
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"}}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"}}]}
data: [DONE]
```
Our format simply sends the tokens. Because `llama-server` follows the OpenAI specification, even a single token comes wrapped in JSON. It may look verbose, but it includes useful information for clients, like an `id` to identify the request and a `finish_reason` to indicate why generation stopped.
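Whichever format the server chooses, a client consumes the stream the same way: buffer the bytes, split on the blank line that terminates each event, and strip the `data: ` prefix. A minimal, library-free sketch of that framing step (our own helper for illustration, not part of cpp-httplib):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a received SSE buffer into event payloads.
// Each event is "data: <payload>\n\n"; this strips the framing and
// returns the raw payloads ("[DONE]" included) in order.
inline std::vector<std::string> parse_sse(const std::string &buf) {
    std::vector<std::string> payloads;
    std::size_t pos = 0;
    while (pos < buf.size()) {
        auto end = buf.find("\n\n", pos);
        if (end == std::string::npos) break;  // incomplete event: wait for more bytes
        std::string event = buf.substr(pos, end - pos);
        const std::string prefix = "data: ";
        if (event.rfind(prefix, 0) == 0)      // keep only "data: " lines
            payloads.push_back(event.substr(prefix.size()));
        pos = end + 2;
    }
    return payloads;
}
```

With our format the payload is a bare JSON string; with llama-server it is a full `chat.completion.chunk` object, but the framing layer above is identical.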
## 7.5 KV Cache Reuse
In our server, we process the entire prompt from scratch on every request. Our translation app's prompt is short ("Translate the following text to ja..." + input text), so this isn't a problem.
`llama-server` reuses the KV cache for the prefix portion when a request shares a common prompt prefix with a previous request.
![KV cache reuse](../kv-cache.svg#half)
For chatbots that send a long system prompt and few-shot examples with every request, this alone dramatically reduces response time. The difference is night and day: processing several thousand tokens of system prompt every time versus reading them from cache in an instant.
For our translation app, where the system prompt is just a single sentence, the benefit is limited. However, it's an optimization worth keeping in mind when applying this to your own applications.
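The mechanism boils down to one question: how many leading tokens of the new prompt match what is already cached? A sketch of that comparison (illustrative only, not llama.cpp's actual implementation):

```cpp
#include <cstddef>
#include <vector>

// How many leading tokens of the new prompt match the cached one?
// Those positions keep their KV-cache entries; only the remainder
// needs a fresh forward pass.
inline std::size_t common_prefix(const std::vector<int> &cached,
                                 const std::vector<int> &prompt) {
    std::size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n])
        ++n;
    return n;
}
```

Everything before the returned index keeps its KV entries; only the tokens after it are recomputed, which is why a long shared system prompt costs almost nothing on the second request.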
## 7.6 Structured Output
Since our translation API returns plain text, there was no need to constrain the output format. But what if you want the LLM to respond in JSON?
```text
Prompt: Analyze the sentiment of the following text and return it as JSON.
LLM output (expected): {"sentiment": "positive", "score": 0.8}
LLM output (reality): Here are the results of the sentiment analysis. {"sentiment": ...
```
LLMs sometimes ignore instructions and add extraneous text. `llama-server` solves this problem with **grammar constraints**.
```bash
curl http://localhost:8080/v1/chat/completions \
-d '{
"messages": [{"role": "user", "content": "Analyze sentiment..."}],
"json_schema": {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"score": {"type": "number"}
},
"required": ["sentiment", "score"]
}
}'
```
When you specify `json_schema`, tokens that don't conform to the grammar are excluded during token generation. This guarantees that the output is always valid JSON, so there's no need to worry about `json::parse` failing.
When embedding LLMs into applications, whether you can reliably parse the output directly impacts reliability. Grammar constraints are unnecessary for free-text output like translation, but they're essential for use cases where you need to return structured data as an API response.
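Conceptually, a grammar constraint is a filter applied at sampling time: before a token is chosen, every candidate that would violate the grammar is masked out. A toy sketch of that idea (real llama.cpp grammars are compiled from GBNF into state machines; the `allowed` predicate stands in for that machinery):

```cpp
#include <functional>
#include <vector>

// Toy sketch: pick the highest-scoring token among those the grammar
// still allows. Disallowed tokens can never be emitted, which is how
// the output is guaranteed to stay inside the schema.
inline int sample_constrained(const std::vector<float> &logits,
                              const std::function<bool(int)> &allowed) {
    int best = -1;
    for (int t = 0; t < static_cast<int>(logits.size()); ++t) {
        if (!allowed(t)) continue;              // grammar masks this token out
        if (best < 0 || logits[t] > logits[best]) best = t;
    }
    return best;                                // -1 if nothing is allowed
}
```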
## 7.7 Summary
Let's organize the differences we've covered.
| Aspect | Our Server | llama-server |
|------|-------------|--------------|
| API design | Translation-specific | OpenAI-compatible |
| Concurrent requests | Sequential processing | Slots + continuous batching |
| SSE format | Tokens only | OpenAI-compatible JSON |
| KV cache | Cleared each time | Prefix reuse |
| Structured output | None | JSON Schema / grammar constraints |
| Code size | ~200 lines | Several thousand lines |
Our code is simple because of the assumption that "one person uses it as a desktop app." If you're building a server for multiple users or one that integrates with the existing ecosystem, `llama-server`'s design serves as a valuable reference.
Conversely, even 200 lines of code is enough to make a fully functional translation app. I hope this code reading exercise has also conveyed the value of "building only what you need."
## Next Chapter
In the next chapter, we'll cover the key points for swapping in your own library and customizing the app to make it truly yours.
**Next:** [Making It Your Own](../ch08-customization)

@@ -0,0 +1,120 @@
---
title: "8. Making It Your Own"
order: 8
---
Through Chapter 7, we've built a translation desktop app and studied how production-quality code differs. In this chapter, let's go over the key points for **turning this app into something entirely your own**.
The translation app was just a vehicle. Replace llama.cpp with your own library, and the same architecture works for any application.
## 8.1 Swapping Out the Build Configuration
First, replace the llama.cpp-related `FetchContent` entries in `CMakeLists.txt` with your own library.
```cmake
# Remove: llama.cpp and cpp-llamalib FetchContent
# Add: your own library
FetchContent_Declare(my_lib
GIT_REPOSITORY https://github.com/yourname/my-lib
GIT_TAG main
)
FetchContent_MakeAvailable(my_lib)
target_link_libraries(my-app PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
my_lib # Your library instead of cpp-llamalib
# ...
)
```
If your library doesn't support CMake, you can place the header and source files directly in `src/` and add them to `add_executable`. Keep cpp-httplib, nlohmann/json, and webview as they are.
## 8.2 Adapting the API to Your Task
Change the translation API's endpoints and parameters to match your task.
| Translation app | Your app (e.g., image processing) |
|---|---|
| `POST /translate` | `POST /process` |
| `{"text": "...", "target_lang": "ja"}` | `{"image": "base64...", "filter": "blur"}` |
| `POST /translate/stream` | `POST /process/stream` |
| `GET /models` | `GET /filters` or `GET /presets` |
Then update each handler's implementation. For example, just replace the `llm.chat()` calls with your own library's API.
```cpp
// Before: LLM translation
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(), "application/json");
// After: e.g., an image processing library
auto result = my_lib::process(input_image, options);
res.set_content(json{{"result", result}}.dump(), "application/json");
```
The same goes for SSE streaming. If your library has a function that reports progress via a callback, you can use the exact same pattern from Chapter 3 to send incremental responses. SSE isn't limited to LLMs — it's useful for any time-consuming task: image processing progress, data conversion steps, long-running computations.
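As a sketch of that reuse, here is a tiny adapter that turns a callback-reporting task into SSE-formatted events. The task signature and JSON field names are made up for illustration; in a real handler each event would be written to `sink.os` instead of collected into a string.

```cpp
#include <functional>
#include <string>

// Run a long task that reports progress via callback, formatting each
// report as an SSE event. The progress-percentage callback shape is a
// stand-in for whatever your library provides.
inline std::string run_with_sse(
    const std::function<void(const std::function<void(int)> &)> &task) {
    std::string out;
    task([&](int pct) {
        out += "data: {\"progress\": " + std::to_string(pct) + "}\n\n";
    });
    out += "data: [DONE]\n\n";   // same terminator the translation API uses
    return out;
}
```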
## 8.3 Design Considerations
### Libraries with Expensive Initialization
In this book, we load the LLM model at the top of `main()` and keep it in a variable. This is intentional. Loading the model on every request would take several seconds, so we load it once at startup and reuse it. If your library has expensive initialization (loading large data files, acquiring GPU resources, etc.), the same approach works well.
### Thread Safety
cpp-httplib processes requests concurrently using a thread pool. In Chapter 4 we protected the `llm` object with a `std::mutex` to prevent crashes during model switching. The same pattern applies when integrating your own library. If your library isn't thread-safe or you need to swap objects at runtime, protect access with a `std::mutex`.
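The pattern generalizes to any non-thread-safe, swappable resource. A minimal sketch, with a hypothetical `Engine` standing in for your library's object:

```cpp
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a non-thread-safe library object.
struct Engine {
    std::string model;
    std::string run(const std::string &in) { return model + ":" + in; }
};

// Guard both "use" and "swap" with the same mutex, exactly as the
// /translate and /models/select handlers share llm_mutex.
class GuardedEngine {
    std::mutex mtx_;
    Engine engine_;
public:
    explicit GuardedEngine(Engine e) : engine_(std::move(e)) {}
    std::string run(const std::string &in) {
        std::lock_guard<std::mutex> lock(mtx_);  // request handlers take this
        return engine_.run(in);
    }
    void swap(Engine e) {
        std::lock_guard<std::mutex> lock(mtx_);  // runtime switch takes it too
        engine_ = std::move(e);
    }
};
```

Both the request path (`run`) and the switch path (`swap`) take the same mutex, so a swap can never happen while another thread is mid-call.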
## 8.4 Customizing the UI
Edit the three files in `public/`.
- **`index.html`** — Change the input form layout. Swap `<textarea>` for `<input type="file">`, add parameter fields, etc.
- **`style.css`** — Adjust the layout and colors. Keep the two-column design or switch to a single column
- **`script.js`** — Update the `fetch()` target URLs, request bodies, and how responses are displayed
Even without changing any server code, just swapping the HTML makes the app look completely different. Since these are static files, you can iterate quickly — just reload the browser without restarting the server.
This book used plain HTML, CSS, and JavaScript, but combining them with a frontend framework like Vue or React, or a CSS framework, would let you build an even more polished app.
## 8.5 Distribution Considerations
### Licenses
Check the licenses of the libraries you're using. cpp-httplib (MIT), nlohmann/json (MIT), and webview (MIT) all allow commercial use. Don't forget to check the license of your own library and its dependencies too.
### Models and Data Files
The download mechanism we built in Chapter 4 isn't limited to LLM models. If your app needs large data files, the same pattern lets you auto-download them on first launch, keeping the binary small while sparing users the manual setup.
If the data is small, you can embed it directly into the binary with cpp-embedlib.
### Cross-Platform Builds
webview supports macOS, Linux, and Windows. When building for each platform:
- **macOS** — No additional dependencies
- **Linux** — Requires `libwebkit2gtk-4.1-dev`
- **Windows** — Requires the WebView2 runtime (pre-installed on Windows 11)
Consider setting up cross-platform builds in CI (e.g., GitHub Actions) too.
## Closing
Thank you so much for reading to the end. 🙏
This book started with `/health` returning `{"status":"ok"}` in Chapter 1. From there we built a REST API, added SSE streaming, downloaded models from Hugging Face, created a browser-based Web UI, and packaged it all into a single-binary desktop app. In Chapter 7 we read through `llama-server`'s code and learned how production-quality servers differ in their design. It's been quite a journey, and I'm truly grateful you stuck with it all the way through.
Looking back, we used several key cpp-httplib features hands-on:
- **Server**: routing, JSON responses, SSE streaming with `set_chunked_content_provider`, static file serving with `set_mount_point`
- **Client**: HTTPS connections, redirect following, large downloads with content receivers, progress callbacks
- **WebView integration**: `bind_to_any_port` + `listen_after_bind` for background threading
cpp-httplib offers many more features beyond what we covered here, including multipart file uploads, authentication, timeout control, compression, and range requests. See [A Tour of cpp-httplib](../../tour/) for details.
These patterns aren't limited to a translation app. If you want to add a web API to your C++ library, give it a browser UI, or ship it as an easy-to-distribute desktop app — I hope this book serves as a useful reference.
Take your own library, build your own app, and have fun with it. Happy hacking! 🚀

@@ -1,23 +1,26 @@
---
title: "Building a Desktop LLM App with cpp-httplib"
order: 0
status: "draft"
---
Build an LLM-powered translation desktop app step by step, learning both the server and client sides of cpp-httplib along the way. Translation is just an example — swap it out to build your own summarizer, code generator, chatbot, or any other LLM application.
Have you ever wanted to add a web API to your own C++ library, or quickly build an Electron-like desktop app? In Rust you might reach for "Tauri + axum," but in C++ it always seemed out of reach.
With [cpp-httplib](https://github.com/yhirose/cpp-httplib), [webview/webview](https://github.com/webview/webview), and [cpp-embedlib](https://github.com/yhirose/cpp-embedlib), you can take the same approach in pure C++ — and produce a small, easy-to-distribute single binary.
In this tutorial we build an LLM-powered translation app using [llama.cpp](https://github.com/ggml-org/llama.cpp), progressing step by step from "REST API" to "SSE streaming" to "Web UI" to "desktop app." Translation is just the vehicle — replace llama.cpp with your own library and the same architecture works for any application.
![Desktop App](app.png#large-center)
If you know basic C++17 and understand the basics of HTTP / REST APIs, you're ready to start.
## Dependencies
- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine
- [nlohmann/json](https://github.com/nlohmann/json) — JSON parser (header-only)
- [webview/webview](https://github.com/webview/webview) — WebView wrapper (header-only)
- [cpp-httplib](https://github.com/yhirose/cpp-httplib) — HTTP server/client (header-only)
## Chapters
1. **[Set up the project](ch01-setup)** — Fetch dependencies, configure the build, write scaffold code
2. **[Embed llama.cpp and create a REST API](ch02-rest-api)** — Return translation results as JSON
3. **[Add token streaming with SSE](ch03-sse-streaming)** — Stream responses token by token
4. **[Add model discovery and management](ch04-model-management)** — Download and switch models from Hugging Face
5. **[Add a Web UI](ch05-web-ui)** — A browser-based translation interface
6. **[Turn it into a desktop app with WebView](ch06-desktop-app)** — A single-binary desktop application
7. **[Reading the llama.cpp server source code](ch07-code-reading)** — Compare with production-quality code
8. **[Making it your own](ch08-customization)** — Swap in your own library and customize

@@ -0,0 +1,36 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 426 160" font-family="system-ui, sans-serif" font-size="14">
<rect x="0" y="0" width="426" height="160" rx="8" fill="#f5f3ef"/>
<defs>
<marker id="arrowhead" markerWidth="7" markerHeight="5" refX="7" refY="2.5" orient="auto">
<polygon points="0,0 7,2.5 0,5" fill="#198754"/>
</marker>
</defs>
<!-- request 1 (y=16) -->
<text x="94" y="35" fill="#333" font-weight="bold" text-anchor="end">Request 1:</text>
<rect x="106" y="16" width="138" height="30" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="175" y="36" fill="#333" text-anchor="middle">System prompt</text>
<text x="256" y="36" fill="#333" text-anchor="middle">+</text>
<rect x="270" y="16" width="138" height="30" rx="4" fill="#cfe2ff" stroke="#0d6efd" stroke-width="1"/>
<text x="339" y="36" fill="#333" text-anchor="middle">User question A</text>
<!-- annotation: cache save -->
<text x="175" y="64" fill="#198754" font-size="11" text-anchor="middle">Saved to KV cache</text>
<!-- arrow -->
<line x1="175" y1="70" x2="175" y2="90" stroke="#198754" stroke-width="1.2" marker-end="url(#arrowhead)"/>
<text x="188" y="85" fill="#198754" font-size="11">Reuse</text>
<!-- request 2 (y=96) -->
<text x="94" y="115" fill="#333" font-weight="bold" text-anchor="end">Request 2:</text>
<rect x="106" y="96" width="138" height="30" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1" stroke-dasharray="6,3"/>
<text x="175" y="116" fill="#333" text-anchor="middle">System prompt</text>
<text x="256" y="116" fill="#333" text-anchor="middle">+</text>
<rect x="270" y="96" width="138" height="30" rx="4" fill="#cfe2ff" stroke="#0d6efd" stroke-width="1"/>
<text x="339" y="116" fill="#333" text-anchor="middle">User question B</text>
<!-- bottom labels -->
<text x="175" y="144" fill="#198754" font-size="11" text-anchor="middle">No recomputation</text>
<text x="339" y="144" fill="#0d6efd" font-size="11" text-anchor="middle">Only this is computed</text>
</svg>

@@ -0,0 +1,24 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 440 240" font-family="system-ui, sans-serif" font-size="14">
<!-- outer box -->
<rect x="0" y="0" width="440" height="240" rx="8" fill="#f5f3ef"/>
<text x="20" y="28" font-weight="bold" font-size="16" fill="#333">llama-server</text>
<!-- slot 0 -->
<rect x="20" y="46" width="400" height="32" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="32" y="67" fill="#333">Slot 0: User A's request</text>
<!-- slot 1 -->
<rect x="20" y="86" width="400" height="32" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="32" y="107" fill="#333">Slot 1: User B's request</text>
<!-- slot 2 -->
<rect x="20" y="126" width="400" height="32" rx="4" fill="#e9ecef" stroke="#adb5bd" stroke-width="1"/>
<text x="32" y="147" fill="#999">Slot 2: (idle)</text>
<!-- slot 3 -->
<rect x="20" y="166" width="400" height="32" rx="4" fill="#e9ecef" stroke="#adb5bd" stroke-width="1"/>
<text x="32" y="187" fill="#999">Slot 3: (idle)</text>
<!-- arrow + label -->
<text x="20" y="224" fill="#333" font-size="13">→ Active slots are inferred together in a single batch</text>
</svg>

Binary file not shown.

@@ -18,8 +18,8 @@ HTTPS is also available. Just link OpenSSL or mbedTLS, and the serv
## Documentation
- [A Tour of cpp-httplib](tour/) — A step-by-step tutorial covering the basics. Start here if you're new
- [Building a Desktop LLM App](llm-app/) — A hands-on guide to building a desktop app with llama.cpp, step by step
## Stay Tuned
- [Cookbook](cookbook/) — A collection of recipes organized by topic. Jump to whatever you need
- [Building a Desktop LLM App](llm-app/) — A hands-on guide to building a desktop app with llama.cpp, step by step

@@ -0,0 +1,236 @@
---
title: "1. Setting Up the Project Environment"
order: 1
---
Let's incrementally build a text translation REST API server using llama.cpp as the inference engine. By the end, a request like this will return a translation result.
```bash
curl -X POST http://localhost:8080/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "The weather is nice today. Shall we go for a walk?", "target_lang": "ja"}'
```
```json
{
  "translation": "今日はいい天気ですね。散歩に行きましょうか?"
}
```
The "Translation API" is just one example. By swapping out the prompt, you can adapt this to any LLM application you like, such as summarization, code generation, or a chatbot.
Here's the full list of APIs the server will provide.
| Method | Path | Description | Chapter |
| -------- | ---- | ---- | -- |
| `GET` | `/health` | Returns server status | 1 |
| `POST` | `/translate` | Translates text and returns JSON | 2 |
| `POST` | `/translate/stream` | SSE streaming on a per-token basis | 3 |
| `GET` | `/models` | Model list (available / downloaded / selected) | 4 |
| `POST` | `/models/select` | Select a model (automatically downloads if not yet downloaded) | 4 |
In this chapter, we'll get the project environment in order: fetching the dependencies, laying out the directories, configuring the build, and obtaining a model file, so that we can start writing code right away in the next chapter.
## Prerequisites
- A C++20-capable compiler (GCC 10+, Clang 10+, MSVC 2019 16.8+)
- CMake 3.20 or later
- OpenSSL (used for the HTTPS client in Chapter 4; macOS: `brew install openssl`, Ubuntu: `sudo apt install libssl-dev`)
- Enough disk space (model files run to several GB)
## 1.1 What We'll Use
Here are the libraries we'll use.
| Library | Role |
| ----------- | ------ |
| [cpp-httplib](https://github.com/yhirose/cpp-httplib) | HTTP server/client |
| [nlohmann/json](https://github.com/nlohmann/json) | JSON parser |
| [cpp-llamalib](https://github.com/yhirose/cpp-llamalib) | llama.cpp wrapper |
| [llama.cpp](https://github.com/ggml-org/llama.cpp) | LLM inference engine |
| [webview/webview](https://github.com/webview/webview) | Desktop WebView (used in Chapter 6) |
cpp-httplib, nlohmann/json, and cpp-llamalib are header-only libraries. You could just download each header file with `curl` and `#include` it, but in this book we fetch them automatically with CMake's `FetchContent`. Declare them in `CMakeLists.txt`, and every library is downloaded and built automatically at `cmake -B build` time, cutting out the manual steps. `webview` isn't used until Chapter 6, so don't worry about it for now.
## 1.2 Directory Layout
Here's the final layout.
```ascii
translate-app/
├── CMakeLists.txt
├── models/
│   └── (GGUF files)
└── src/
    └── main.cpp
```
Library source code isn't part of the project. CMake's `FetchContent` fetches it automatically at build time, so the only code you need is your own.
Create the project directory and turn it into a git repository.
```bash
mkdir translate-app && cd translate-app
mkdir src models
git init
```
## 1.3 Obtaining a GGUF Model File
LLM inference requires a model file. GGUF is the model format llama.cpp uses, and plenty of them are available on Hugging Face.
Let's start with a small model. A quantized build of Google's Gemma 2 2B (about 1.6 GB) is a good choice: it's lightweight yet multilingual, which also suits translation tasks.
```bash
curl -L -o models/gemma-2-2b-it-Q4_K_M.gguf \
  https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_M.gguf
```
In Chapter 4, we'll make the app perform this download itself using cpp-httplib's client features.
## 1.4 CMakeLists.txt
Create a `CMakeLists.txt` in the project root. Declare the dependencies with `FetchContent`, and CMake downloads and builds them automatically.
<!-- data-file="CMakeLists.txt" -->
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-server CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp (LLM inference engine)
FetchContent_Declare(llama
  GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
  GIT_TAG master
  GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib (HTTP server/client)
FetchContent_Declare(httplib
  GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
  GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json (JSON parser)
FetchContent_Declare(json
  URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib (header-only llama.cpp wrapper)
FetchContent_Declare(cpp_llamalib
  GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
  GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
add_executable(translate-server src/main.cpp)
target_link_libraries(translate-server PRIVATE
  httplib::httplib
  nlohmann_json::nlohmann_json
  cpp-llamalib
)
```
`FetchContent_Declare` declares where each library's source comes from, and `FetchContent_MakeAvailable` actually fetches and builds it. The first `cmake -B build` takes a while because it downloads every library and builds llama.cpp, but later runs hit the cache.
Just link with `target_link_libraries`, and each library's own CMake sets up the include paths and build settings automatically.
## 1.5 雛形コードの作成
この雛形コードをベースに、章ごとに機能を追加していきます。
<!-- data-file="main.cpp" -->
```cpp
// src/main.cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// `Ctrl+C`でgraceful shutdown
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// リクエストとレスポンスをログに記録
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
// ヘルスチェック
svr.Get("/health", [](const auto &, auto &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// 各エンドポイントのダミー実装(以降の章で本物に差し替えていく)
svr.Post("/translate",
[](const auto &req, auto &res) {
res.set_content(json{{"translation", "TODO"}}.dump(), "application/json");
});
svr.Post("/translate/stream",
[](const auto &req, auto &res) {
res.set_content("data: \"TODO\"\n\ndata: [DONE]\n\n", "text/event-stream");
});
svr.Get("/models",
[](const auto &req, auto &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const auto &req, auto &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
// `Ctrl+C` (`SIGINT`)や`kill` (`SIGTERM`)でサーバーを停止できるようにする
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
// サーバー起動
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
## 1.6 ビルドと動作確認
ビルドしてサーバーを起動し、curlでリクエストが通るか確かめます。
```bash
cmake -B build
cmake --build build -j
./build/translate-server
```
別のターミナルからcurlで確認してみましょう。
```bash
curl http://localhost:8080/health
# => {"status":"ok"}
```
If JSON comes back, the environment setup is complete.
## Next Chapter
With the environment ready, the next chapter implements the translation REST API on top of this skeleton: llama.cpp performs the inference, and cpp-httplib exposes it as an HTTP endpoint.
**Next:** [Integrating llama.cpp to Build the REST API](../ch02-rest-api)

---
title: "2. Integrating llama.cpp to Build the REST API"
order: 2
---
In Chapter 1's skeleton, `/translate` just returned `"TODO"`. This chapter wires in llama.cpp inference to make it return real translations.
Using the llama.cpp API directly gets verbose, so we'll use a thin wrapper library, [cpp-llamalib](https://github.com/yhirose/cpp-llamalib). Loading a model and running inference take only a few lines, which lets us focus on how to use cpp-httplib.
## 2.1 Initializing the LLM
Just pass the model file path to `llamalib::Llama` and it takes care of loading the model, creating the context, and configuring the sampler. If you downloaded a different model in Chapter 1, adjust the path to match.
```cpp
#include <cpp-llamalib.h>
int main() {
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
  // LLM inference can take a long time, so set generous timeouts (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
  // ... build and start the HTTP server ...
}
```
To change settings such as the number of GPU layers or the context length, use `llamalib::Options`.
```cpp
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf", {
    .n_gpu_layers = 0, // CPU only
.n_ctx = 4096,
}};
```
## 2.2 The `/translate` Handler
Replace the handler that returned dummy JSON in Chapter 1 with real inference.
```cpp
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
      // Parse JSON (third argument `false`: don't throw on failure; check with `is_discarded()`)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
      // Validate required fields
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
      auto target_lang = input.value("target_lang", "ja"); // default to Japanese
      // Build the prompt and run inference
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
```
`llm.chat()` may throw during inference (for example, when the context length is exceeded). Catching the exception with `try/catch` and returning the error as JSON keeps the server from crashing.
## 2.3 Full Code
Here is the complete version with all the changes so far.
<details>
<summary data-file="main.cpp">Full code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
  // Load the model downloaded in Chapter 1
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
  // LLM inference can take a long time, so set generous timeouts (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
  // Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
        // Parse JSON (third argument `false`: don't throw on failure; check with `is_discarded()`)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
        // Validate required fields
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
        auto target_lang = input.value("target_lang", "ja"); // default to Japanese
        // Build the prompt and run inference
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
  // Dummy implementations to be replaced in later chapters
svr.Get("/models",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
  // Allow stopping the server with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
  // Start the server (blocks until `stop()` is called)
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 2.4 Testing
Rebuild, restart the server, and this time check that a real translation comes back.
```bash
cmake --build build -j
./build/translate-server
```
```bash
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
# => {"translation":"去年の春に東京を訪れた。桜が綺麗だった。"}
```
Where Chapter 1 returned `"TODO"`, you now get a real translation.
## Next Chapter
The REST API built in this chapter waits for the whole translation to finish, so with long texts the user waits with no sense of progress.
The next chapter uses SSE (Server-Sent Events) to return each token in real time as it is generated.
**Next:** [Adding Token Streaming with SSE](../ch03-sse-streaming)

---
title: "3. Adding Token Streaming with SSE"
order: 3
---
Chapter 2's `/translate` returned its result only after the whole translation finished. That's fine for short sentences, but for long ones the user waits several seconds with nothing on screen.
This chapter adds a `/translate/stream` endpoint that uses SSE (Server-Sent Events) to return each token in real time as it is generated, the approach familiar from the ChatGPT and Claude APIs.
## 3.1 What Is SSE?
SSE is a mechanism for streaming an HTTP response: the client sends a request, and the server keeps the connection open while returning events bit by bit. The format is simple text.
```text
data: "去年の"
data: "春に"
data: "東京を"
data: [DONE]
```
Each line starts with `data:`, and events are separated by a blank line. The Content-Type is `text/event-stream`. Tokens are escaped as JSON strings before sending, which is why they appear wrapped in double quotes (implemented in Section 3.3).
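As a side note, here is a minimal, self-contained sketch (not part of the tutorial's server code) of how a client might extract the `data:` payloads from such a stream, assuming the whole stream is already buffered in memory:

```cpp
#include <string>
#include <vector>

// Splits an SSE stream into the payloads of its `data:` lines.
// A real client would parse incrementally as chunks arrive; this
// sketch assumes the complete stream is already in memory.
std::vector<std::string> parse_sse(const std::string &stream) {
  std::vector<std::string> events;
  size_t pos = 0;
  while (pos < stream.size()) {
    auto eol = stream.find('\n', pos);
    if (eol == std::string::npos) eol = stream.size();
    auto line = stream.substr(pos, eol - pos);
    if (line.rfind("data: ", 0) == 0) { // line starts with "data: "
      events.push_back(line.substr(6)); // keep only the payload
    }
    pos = eol + 1;
  }
  return events;
}
```

A production client would also treat blank lines as event boundaries and handle multi-line `data:` events; this is just enough to illustrate the wire format.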
## 3.2 Streaming with cpp-httplib
In cpp-httplib, `set_chunked_content_provider` lets you send a response piece by piece. Each write to `sink.os` inside the callback is delivered to the client.
```cpp
res.set_chunked_content_provider(
"text/event-stream",
[](size_t offset, httplib::DataSink &sink) {
sink.os << "data: hello\n\n";
sink.done();
return true;
});
```
Calling `sink.done()` ends the stream. If the client disconnects midway, writes to `sink.os` fail and `sink.os.fail()` becomes `true`. This lets you detect the disconnect and abort inference that is no longer needed.
## 3.3 The `/translate/stream` Handler
JSON parsing and validation are the same as Chapter 2's `/translate`. Only the way the response is returned differs: combine `llm.chat()`'s streaming callback with `set_chunked_content_provider`.
```cpp
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
      // ... JSON parsing and validation, same as /translate ...
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
          return sink.os.good(); // false once disconnected, aborting inference
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
```
A few key points.
- Passing a callback to `llm.chat()` makes it fire each time a token is generated. Returning `false` from the callback aborts generation
- After writing to `sink.os`, `sink.os.good()` tells you whether the client is still connected. If it has disconnected, return `false` to stop inference
- Each token is escaped as a JSON string with `json(token).dump()` before sending, so tokens containing newlines or quotes are safe
- The first three arguments of `dump(-1, ' ', false, ...)` are the defaults; the one that matters is the fourth, `json::error_handler_t::replace`. LLMs emit tokens at subword granularity, so a token can end partway through a multibyte character (Japanese and so on). Passing an incomplete UTF-8 byte sequence straight to `dump()` throws, so `replace` substitutes the invalid bytes safely. The browser reassembles the fragments, so nothing looks wrong on screen
- The whole lambda is wrapped in `try/catch`. `llm.chat()` can throw, for example when the context window is exceeded, and an uncaught exception inside the lambda would crash the server, so errors are returned as an SSE event
- `data: [DONE]` follows the same convention as the OpenAI API, telling the client the stream has ended
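To make the multibyte issue concrete: UTF-8 encodes most Japanese characters as three bytes, so a subword token can end after only one or two of them. The following self-contained sketch (illustrative only; in the tutorial, nlohmann/json's `replace` handler does this work for us) detects a trailing incomplete sequence:

```cpp
#include <string>

// Returns how many bytes at the end of `s` form the start of an
// incomplete UTF-8 sequence (0 if the string ends on a character
// boundary). Continuation bytes look like 10xxxxxx.
size_t incomplete_utf8_tail(const std::string &s) {
  size_t n = s.size();
  size_t i = n;
  // Scan back over up to 3 trailing continuation bytes
  while (i > 0 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80 &&
         n - i < 3) {
    --i;
  }
  if (i == 0) return 0;
  unsigned char lead = static_cast<unsigned char>(s[i - 1]);
  size_t expected = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3
                    : (lead >= 0xC0) ? 2 : 1;
  size_t have = n - i + 1; // lead byte plus continuation bytes seen
  return have < expected ? have : 0;
}
```

A streaming client could use a check like this to buffer a dangling partial character until the next token arrives, instead of replacing it.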
## 3.4 Full Code
Here is the complete version: Chapter 2's code plus `/translate/stream`.
<details>
<summary data-file="main.cpp">Full code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
  // Load the GGUF model
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
  // LLM inference can take a long time, so set generous timeouts (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
  // Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
  // The plain translation endpoint from Chapter 2
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
        // JSON parsing and validation (see Chapter 2 for details)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
  // SSE streaming translation endpoint
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
        // JSON parsing and validation (same as /translate)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
            return sink.os.good(); // abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
  // Dummy implementations to be replaced in later chapters
svr.Get("/models",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
  // Allow stopping the server with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
  // Start the server (blocks until `stop()` is called)
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 3.5 Testing
Build and start the server.
```bash
cmake --build build -j
./build/translate-server
```
curl's `-N` option disables output buffering, so tokens are displayed in real time as they arrive.
```bash
curl -N -X POST http://localhost:8080/translate/stream \
-H "Content-Type: application/json" \
-d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
```
```text
data: "去年の"
data: "春に"
data: "東京を"
data: "訪れた"
data: "。"
data: "桜が"
data: "綺麗だった"
data: "。"
data: [DONE]
```
You should see the tokens arrive one by one. Chapter 2's `/translate` still works as before.
## Next Chapter
The server's translation features are now complete. The next chapter uses cpp-httplib's client side to fetch and manage models from Hugging Face.
**Next:** [Adding Model Download and Management](../ch04-model-management)

---
title: "4. Adding Model Download and Management"
order: 4
---
Chapter 3 rounded out the server's translation features, but the only model file is the one you downloaded by hand in Chapter 1. This chapter uses cpp-httplib's **client side** to download and switch Hugging Face models from within the app.
When finished, you'll be able to manage models with requests like these.
```bash
# List the available models
curl http://localhost:8080/models
```
```json
{
"models": [
{"name": "gemma-2-2b-it", "params": "2B", "size": "1.6 GB", "downloaded": true, "selected": true},
{"name": "gemma-2-9b-it", "params": "9B", "size": "5.8 GB", "downloaded": false, "selected": false},
{"name": "Llama-3.1-8B-Instruct", "params": "8B", "size": "4.9 GB", "downloaded": false, "selected": false}
]
}
```
```bash
# Select another model (auto-downloads it if not present)
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-9b-it"}'
```
```text
data: {"status":"downloading","progress":0}
data: {"status":"downloading","progress":12}
...
data: {"status":"downloading","progress":100}
data: {"status":"loading"}
data: {"status":"ready"}
```
## 4.1 httplib::Client Basics
So far we've only used `httplib::Server`, but cpp-httplib provides client functionality too. Hugging Face serves over HTTPS, so a TLS-capable client is required.
```cpp
#include <httplib.h>
// Including the URL scheme makes it use SSLClient automatically
httplib::Client cli("https://huggingface.co");
// Follow redirects automatically (Hugging Face redirects to a CDN)
cli.set_follow_location(true);
auto res = cli.Get("/api/models");
if (res && res->status == 200) {
std::cout << res->body << std::endl;
}
```
Using HTTPS requires enabling OpenSSL at build time. Add the following to `CMakeLists.txt`.
```cmake
find_package(OpenSSL REQUIRED)
target_link_libraries(translate-server PRIVATE OpenSSL::SSL OpenSSL::Crypto)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)
# macOS: needed to load the system certificates
if(APPLE)
target_link_libraries(translate-server PRIVATE "-framework CoreFoundation" "-framework Security")
endif()
```
Defining `CPPHTTPLIB_OPENSSL_SUPPORT` makes `httplib::Client("https://...")` establish TLS connections. On macOS, the CoreFoundation and Security frameworks must also be linked so the system certificate store can be accessed. The complete `CMakeLists.txt` is in Section 4.8.
## 4.2 Defining the Model List
Define the list of models the app can handle. Here are three models verified to work for the translation task.
```cpp
struct ModelInfo {
  std::string name;     // display name
  std::string params;   // parameter count
  std::string size;     // GGUF Q4 file size
  std::string repo;     // Hugging Face repository
  std::string filename; // GGUF filename
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
```
## 4.3 Where Models Are Stored
Up to Chapter 3, models lived in the project's `models/` directory. For managing multiple models, an app-specific directory is more appropriate: `~/.translate-app/models/` on macOS/Linux and `%APPDATA%\translate-app\models\` on Windows.
```cpp
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
```
If the environment variable is unset, we fall back to the current directory. The app creates this directory automatically at startup (`create_directories` does not fail if it already exists).
## 4.4 Rewriting Model Initialization
Rewrite the model initialization at the top of `main()`. Chapter 1 hardcoded the path; from here on the app supports switching models. The filename of the currently loaded model is tracked in a `selected_model` variable, which starts out as the first entry of `MODELS`. The `GET /models` and `POST /models/select` handlers read and update it.
cpp-httplib runs handlers concurrently on a thread pool, so the server would crash if `llm.chat()` ran on another thread while the model is being switched (while `llm` is overwritten). A `std::mutex` provides the necessary exclusion.
```cpp
int main() {
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
  // If the default model hasn't been downloaded, fetch it at startup
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
  std::mutex llm_mutex; // protects access while the model is being switched
// ...
}
```
This spares first-time users from downloading the model manually with `curl`. It uses the `download_model` function from Section 4.6 and prints progress to the console.
## 4.5 The `GET /models` Handler
Return the model list annotated with whether each model is downloaded and which one is currently selected.
```cpp
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = get_models_dir() / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
```
## 4.6 Downloading Large Files
GGUF models are several gigabytes, so loading a whole file into memory is out of the question. Passing callbacks to `httplib::Client::Get` delivers the data chunk by chunk.
```cpp
// content_receiver: callback that receives each data chunk
// progress: download progress callback
cli.Get(url,
[&](const char *data, size_t len) { // content_receiver
ofs.write(data, len);
      return true; // return false to abort
},
[&](size_t current, size_t total) { // progress
int pct = total ? (int)(current * 100 / total) : 0;
std::cout << pct << "%" << std::endl;
      return true; // return false to abort
});
```
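One practical wrinkle: the progress callback fires for every received chunk, which for a multi-gigabyte file means thousands of calls per percentage point. A self-contained sketch (function name is illustrative) of wrapping a callback so it only fires when the integer percentage actually changes — the full code in Section 4.8 applies the same idea inline:

```cpp
#include <functional>

// Wraps a percentage callback so it fires only when the integer
// percentage changes, instead of once per received chunk.
std::function<bool(size_t, size_t)>
dedup_progress(std::function<bool(int)> progress_cb) {
  return [progress_cb, last_pct = -1](size_t current, size_t total) mutable {
    int pct = total ? static_cast<int>(current * 100 / total) : 0;
    if (pct == last_pct) return true; // same value: skip the notification
    last_pct = pct;
    return progress_cb(pct);
  };
}
```

The mutable init-capture `last_pct` keeps the last reported value between calls, so the wrapped callback stays a drop-in replacement for the raw one.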
Using this, let's write a function that downloads a model from Hugging Face.
```cpp
#include <filesystem>
#include <fstream>
// Downloads the model, reporting progress via progress_cb
// Returning false from progress_cb aborts the download
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true);
cli.set_read_timeout(std::chrono::hours(1));
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
[&](size_t current, size_t total) {
return progress_cb(total ? (int)(current * 100 / total) : 0);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
  // Writing to .tmp and renaming on success keeps a half-downloaded
  // file from ever being picked up as a usable model
std::filesystem::rename(tmp_path, path);
return true;
}
```
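The write-to-`.tmp`-then-rename trick generalizes beyond downloads. A self-contained sketch of the same pattern using only the standard library (the function name and file names here are illustrative, not part of the tutorial's code):

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Writes `content` to `path` atomically: the data first goes to a
// `.tmp` sibling, which is renamed over `path` only on success, so
// readers never observe a half-written file.
bool write_file_atomically(const fs::path &path, const std::string &content) {
  auto tmp = fs::path(path).concat(".tmp");
  {
    std::ofstream ofs(tmp, std::ios::binary);
    if (!ofs) return false;
    ofs.write(content.data(), static_cast<std::streamsize>(content.size()));
    if (!ofs.good()) {
      ofs.close();
      fs::remove(tmp);
      return false;
    }
  } // ofs is closed (and flushed) before the rename
  std::error_code ec;
  fs::rename(tmp, path, ec); // atomic within one filesystem on POSIX
  return !ec;
}
```

On POSIX filesystems a same-directory `rename` is atomic, so readers see either the old file or the complete new one, never a partial write.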
## 4.7 The `/models/select` Handler
Handle model selection requests. The response is always SSE, reporting download progress, then loading, then completion, in that order.
```cpp
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
      // Look it up in the model list
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
      // Always respond with SSE (same format whether or not the model is downloaded)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
        // Helper for sending SSE events
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
        // If the model isn't downloaded yet, stream download progress over SSE
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
            return sink.os.good(); // abort the download if the client disconnects
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
        // Load the model and switch to it
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
```
A few additional notes.
- The `download_model` progress callback sends SSE events directly, an application of Chapter 3's `set_chunked_content_provider` + `sink.os`
- Because the callback returns `sink.os.good()`, the download is aborted as soon as the client disconnects. The cancel button added in Chapter 5 relies on this
- Updating `selected_model` is reflected in the `selected` flag returned by `GET /models`
- Overwriting `llm` is guarded by `llm_mutex`. The `/translate` and `/translate/stream` handlers lock the same `mutex`, so inference never runs while the model is being switched (see the full code)
## 4.8 Full Code
Here is the complete version: Chapter 3's code plus model management.
<details>
<summary data-file="CMakeLists.txt">Full code (CMakeLists.txt)</summary>
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-server CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
find_package(OpenSSL REQUIRED)
add_executable(translate-server src/main.cpp)
target_link_libraries(translate-server PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
OpenSSL::SSL OpenSSL::Crypto
)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)
if(APPLE)
target_link_libraries(translate-server PRIVATE
"-framework CoreFoundation"
"-framework Security"
)
endif()
```
</details>
<details>
<summary data-file="main.cpp">Full code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <algorithm>
#include <csignal>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mutex>
using json = nlohmann::json;
// -------------------------------------------------------------------------
// Model definitions
// -------------------------------------------------------------------------
struct ModelInfo {
std::string name;
std::string params;
std::string size;
std::string repo;
std::string filename;
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
// -------------------------------------------------------------------------
// Model storage directory
// -------------------------------------------------------------------------
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
// -------------------------------------------------------------------------
// Model download
// -------------------------------------------------------------------------
// Returning false from progress_cb aborts the download
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
  cli.set_follow_location(true); // Hugging Face redirects to a CDN
  cli.set_read_timeout(std::chrono::hours(1)); // generous, for large models
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
      // content_receiver: receive each chunk and write it to the file
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
      // progress: report download progress (return false to abort)
[&, last_pct = -1](size_t current, size_t total) mutable {
int pct = total ? (int)(current * 100 / total) : 0;
        if (pct == last_pct) return true; // skip duplicate values
last_pct = pct;
return progress_cb(pct);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
  // Rename once the download is complete
std::filesystem::rename(tmp_path, path);
return true;
}
// -------------------------------------------------------------------------
// Server
// -------------------------------------------------------------------------
httplib::Server svr;
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
  // Create the model storage directory
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
  // Auto-download the default model if it hasn't been fetched yet
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
  std::mutex llm_mutex; // protects access while the model is being switched
  // LLM inference can take a long time, so set generous timeouts (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
  // --- Translation endpoint (Chapter 2) -----------------------------------
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
        // JSON parsing and validation (see Chapter 2 for details)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
std::lock_guard<std::mutex> lock(llm_mutex);
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
  // --- SSE streaming translation (Chapter 3) ------------------------------
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
std::lock_guard<std::mutex> lock(llm_mutex);
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
            return sink.os.good(); // abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
  // --- Model list (Chapter 4) ---------------------------------------------
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto models_dir = get_models_dir();
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = models_dir / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
  // --- Model selection (Chapter 4) ----------------------------------------
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
      // Always respond with SSE (same format whether or not the model is downloaded)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
        // Helper for sending SSE events
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
        // If the model isn't downloaded yet, stream download progress over SSE
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
            return sink.os.good(); // abort the download if the client disconnects
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
        // Load the model and switch to it
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
  // Allow stopping the server with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 4.9 Testing
Since we added the OpenSSL settings to CMakeLists.txt, re-run CMake before building.
```bash
cmake -B build
cmake --build build -j
./build/translate-server
```
### Checking the Model List
```bash
curl http://localhost:8080/models
```
The `gemma-2-2b-it` model downloaded in Chapter 1 should show `downloaded: true` and `selected: true`.
### Switching to Another Model
```bash
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-9b-it"}'
```
Download progress streams in over SSE, and `"ready"` is returned once it completes.
### Comparing Translations Across Models
Let's translate the same sentence with different models.
```bash
# Translate with gemma-2-9b-it (the model we just switched to)
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
# Switch back to gemma-2-2b-it
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-2b-it"}'
# Translate the same sentence
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
```
You can see that the same code and the same prompt yield different translations depending on the model. cpp-llamalib automatically applies each model's chat template, so no code changes are needed.
## Next Chapter
The server's main features are now in place: a REST API, SSE streaming, and model download and switching. In the next chapter we'll add static file serving and build a Web UI you can operate from a browser.
**Next:** [Adding a Web UI](../ch05-web-ui)

---
title: "6. Turning It into a Desktop App with WebView"
order: 6
---
In Chapter 5 we finished a translation app you can use from a browser. But every time you use it, you have to start the server and open the URL in a browser. Wouldn't it be nice if, like a normal app, it launched with a double-click and was ready to go?
This chapter does two things.
1. **WebView** — use [webview/webview](https://github.com/webview/webview) to turn it into a desktop app that runs without a browser
2. **Single binary** — use [cpp-embedlib](https://github.com/yhirose/cpp-embedlib) to embed the HTML/CSS/JS into the binary, making the distributable a single file
Once it's done, just running `./translate-app` opens a window with the translator ready to use.
![Desktop App](../app.png#large-center)
The model is downloaded automatically on first launch, so all you hand your users is a single binary.
## 6.1 Adding webview/webview
[webview/webview](https://github.com/webview/webview) is a library that lets you use the OS's native WebView component (WKWebView on macOS, WebKitGTK on Linux, WebView2 on Windows) from C/C++. Unlike Electron, it doesn't bundle its own browser, so the impact on binary size is negligible.
Fetch it with CMake. Add the following to `CMakeLists.txt`.
```cmake
# webview/webview
FetchContent_Declare(webview
GIT_REPOSITORY https://github.com/webview/webview
GIT_TAG master
)
FetchContent_MakeAvailable(webview)
```
This makes the CMake target `webview::core` available. Linking it with `target_link_libraries` automatically sets up the include paths and platform-specific frameworks.
> **macOS**: No extra dependencies. WKWebView is built into the system.
>
> **Linux**: WebKitGTK is required. Install it with `sudo apt install libwebkit2gtk-4.1-dev`.
>
> **Windows**: The WebView2 runtime is required. It ships with Windows 11; on Windows 10, get it from the [official Microsoft site](https://developer.microsoft.com/en-us/microsoft-edge/webview2/).
## 6.2 Running the Server on a Background Thread
Up through Chapter 5, the server's `listen()` blocked the main thread. To use WebView, we need to run the server on a separate thread and run the WebView event loop on the main thread.
```cpp
#include "webview/webview.h"
#include <thread>
int main() {
// ... (server setup is the same as in Chapter 5) ...
// Start the server on a background thread
auto port = svr.bind_to_any_port("127.0.0.1");
std::thread server_thread([&]() { svr.listen_after_bind(); });
std::cout << "Listening on http://127.0.0.1:" << port << std::endl;
// Show the UI in a WebView
webview::webview w(false, nullptr);
w.set_title("Translate App");
w.set_size(1024, 768, WEBVIEW_HINT_NONE);
w.navigate("http://127.0.0.1:" + std::to_string(port));
w.run(); // blocks until the window is closed
// Once the window closes, stop the server too
svr.stop();
server_thread.join();
}
```
Let's look at the key points.
- **`bind_to_any_port`** — instead of `listen("127.0.0.1", 8080)`, let the OS pick a free port. Desktop apps can be launched multiple times, so a fixed port would collide
- **`listen_after_bind`** — start accepting requests on the port reserved by `bind_to_any_port`. `listen()` does bind and listen in one call, but we need to know the port number first, so they're split
- **Shutdown order** — once the WebView window closes, stop the server with `svr.stop()` and wait for the thread to finish with `server_thread.join()`. In the reverse order, the WebView would lose access to the server
The `signal_handler` from Chapter 5 is no longer needed: in a desktop app, closing the window is how the app exits.
## 6.3 Embedding Static Files with cpp-embedlib
In Chapter 5 we served files from the `public/` directory, which means you'd have to ship `public/` alongside the binary. With [cpp-embedlib](https://github.com/yhirose/cpp-embedlib), you can embed the HTML, CSS, and JavaScript into the binary and distribute a single file.
### CMakeLists.txt
Fetch cpp-embedlib and embed `public/`.
```cmake
# cpp-embedlib
FetchContent_Declare(cpp-embedlib
GIT_REPOSITORY https://github.com/yhirose/cpp-embedlib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp-embedlib)
# Embed the public/ directory into the binary
cpp_embedlib_add(WebAssets
FOLDER ${CMAKE_CURRENT_SOURCE_DIR}/public
NAMESPACE Web
)
target_link_libraries(translate-app PRIVATE
WebAssets # embedded files
cpp-embedlib-httplib # cpp-httplib integration
)
```
`cpp_embedlib_add` converts the files under `public/` into a static library named `WebAssets` at compile time. After linking, the embedded files are accessible through an object called `Web::FS`. `cpp-embedlib-httplib` is a helper library that provides the `httplib::mount()` function.
### Replacing set_mount_point with httplib::mount
Just replace Chapter 5's `set_mount_point` with cpp-embedlib's `httplib::mount`.
```cpp
#include <cpp-embedlib-httplib.h>
#include "WebAssets.h"
// Chapter 5:
// svr.set_mount_point("/", "./public");
// Chapter 6:
httplib::mount(svr, Web::FS);
```
`httplib::mount` registers a handler that serves the files embedded in `Web::FS` over HTTP. MIME types are detected automatically from file extensions, so there's no need to set `Content-Type` by hand.
File contents are mapped directly from the binary's data segment, so there are no memory copies and no heap allocations.
## 6.4 macOS: Adding an Edit Menu
If you try to paste text into the input field with `Cmd+V`, you'll notice it doesn't work. On macOS, keyboard shortcuts such as `Cmd+V` (paste) and `Cmd+C` (copy) reach the WebView via the application's menu bar. webview/webview doesn't create a menu bar, so these shortcuts have no effect. We need to add an Edit menu using the Objective-C runtime API.
```cpp
#ifdef __APPLE__
#include <objc/objc-runtime.h>
void setup_macos_edit_menu() {
auto cls = [](const char *n) { return (id)objc_getClass(n); };
auto sel = sel_registerName;
auto msg = reinterpret_cast<id (*)(id, SEL)>(objc_msgSend);
auto msg_s = reinterpret_cast<id (*)(id, SEL, const char *)>(objc_msgSend);
auto msg_id = reinterpret_cast<id (*)(id, SEL, id)>(objc_msgSend);
auto msg_v = reinterpret_cast<void (*)(id, SEL, id)>(objc_msgSend);
auto msg_mi = reinterpret_cast<id (*)(id, SEL, id, SEL, id)>(objc_msgSend);
auto str = [&](const char *s) {
return msg_s(cls("NSString"), sel("stringWithUTF8String:"), s);
};
id app = msg(cls("NSApplication"), sel("sharedApplication"));
id mainMenu = msg(msg(cls("NSMenu"), sel("alloc")), sel("init"));
id editItem = msg(msg(cls("NSMenuItem"), sel("alloc")), sel("init"));
id editMenu = msg_id(msg(cls("NSMenu"), sel("alloc")),
sel("initWithTitle:"), str("Edit"));
struct { const char *title; const char *action; const char *key; } items[] = {
{"Undo", "undo:", "z"},
{"Redo", "redo:", "Z"},
{"Cut", "cut:", "x"},
{"Copy", "copy:", "c"},
{"Paste", "paste:", "v"},
{"Select All", "selectAll:", "a"},
};
for (auto &[title, action, key] : items) {
id mi = msg_mi(msg(cls("NSMenuItem"), sel("alloc")),
sel("initWithTitle:action:keyEquivalent:"),
str(title), sel(action), str(key));
msg_v(editMenu, sel("addItem:"), mi);
}
msg_v(editItem, sel("setSubmenu:"), editMenu);
msg_v(mainMenu, sel("addItem:"), editItem);
msg_v(app, sel("setMainMenu:"), mainMenu);
}
#endif
```
Call it before `w.run()`.
```cpp
#ifdef __APPLE__
setup_macos_edit_menu();
#endif
w.run();
```
On Windows and Linux, keyboard shortcuts go directly to the focused control without passing through a menu bar, so this workaround is macOS-specific.
## 6.5 The Complete Code
<details>
<summary data-file="CMakeLists.txt">Complete code (CMakeLists.txt)</summary>
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-app CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
# webview/webview
FetchContent_Declare(webview
GIT_REPOSITORY https://github.com/webview/webview
GIT_TAG master
)
FetchContent_MakeAvailable(webview)
# cpp-embedlib
FetchContent_Declare(cpp-embedlib
GIT_REPOSITORY https://github.com/yhirose/cpp-embedlib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp-embedlib)
# Embed the public/ directory into the binary
cpp_embedlib_add(WebAssets
FOLDER ${CMAKE_CURRENT_SOURCE_DIR}/public
NAMESPACE Web
)
find_package(OpenSSL REQUIRED)
add_executable(translate-app src/main.cpp)
target_link_libraries(translate-app PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
OpenSSL::SSL OpenSSL::Crypto
WebAssets
cpp-embedlib-httplib
webview::core
)
if(APPLE)
target_link_libraries(translate-app PRIVATE
"-framework CoreFoundation"
"-framework Security"
)
endif()
target_compile_definitions(translate-app PRIVATE
CPPHTTPLIB_OPENSSL_SUPPORT
)
```
</details>
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <cpp-embedlib-httplib.h>
#include "WebAssets.h"
#include "webview/webview.h"
#ifdef __APPLE__
#include <objc/objc-runtime.h>
#endif
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mutex>
#include <thread>
using json = nlohmann::json;
// -------------------------------------------------------------------------
// macOS Edit menu (Cmd+C/V/X/A requires an Edit menu)
// -------------------------------------------------------------------------
#ifdef __APPLE__
void setup_macos_edit_menu() {
auto cls = [](const char *n) { return (id)objc_getClass(n); };
auto sel = sel_registerName;
auto msg = reinterpret_cast<id (*)(id, SEL)>(objc_msgSend);
auto msg_s = reinterpret_cast<id (*)(id, SEL, const char *)>(objc_msgSend);
auto msg_id = reinterpret_cast<id (*)(id, SEL, id)>(objc_msgSend);
auto msg_v = reinterpret_cast<void (*)(id, SEL, id)>(objc_msgSend);
auto msg_mi = reinterpret_cast<id (*)(id, SEL, id, SEL, id)>(objc_msgSend);
auto str = [&](const char *s) {
return msg_s(cls("NSString"), sel("stringWithUTF8String:"), s);
};
id app = msg(cls("NSApplication"), sel("sharedApplication"));
id mainMenu = msg(msg(cls("NSMenu"), sel("alloc")), sel("init"));
id editItem = msg(msg(cls("NSMenuItem"), sel("alloc")), sel("init"));
id editMenu = msg_id(msg(cls("NSMenu"), sel("alloc")),
sel("initWithTitle:"), str("Edit"));
struct { const char *title; const char *action; const char *key; } items[] = {
{"Undo", "undo:", "z"},
{"Redo", "redo:", "Z"},
{"Cut", "cut:", "x"},
{"Copy", "copy:", "c"},
{"Paste", "paste:", "v"},
{"Select All", "selectAll:", "a"},
};
for (auto &[title, action, key] : items) {
id mi = msg_mi(msg(cls("NSMenuItem"), sel("alloc")),
sel("initWithTitle:action:keyEquivalent:"),
str(title), sel(action), str(key));
msg_v(editMenu, sel("addItem:"), mi);
}
msg_v(editItem, sel("setSubmenu:"), editMenu);
msg_v(mainMenu, sel("addItem:"), editItem);
msg_v(app, sel("setMainMenu:"), mainMenu);
}
#endif
// -------------------------------------------------------------------------
// Model definitions
// -------------------------------------------------------------------------
struct ModelInfo {
std::string name;
std::string params;
std::string size;
std::string repo;
std::string filename;
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
// -------------------------------------------------------------------------
// Model storage directory
// -------------------------------------------------------------------------
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
// -------------------------------------------------------------------------
// Model download
// -------------------------------------------------------------------------
// Abort the download if progress_cb returns false
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true); // Hugging Face redirects to a CDN
cli.set_read_timeout(std::chrono::hours(1)); // generous timeout for large models
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
// content_receiver: receive each chunk and write it to the file
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
// progress: report download progress (returning false aborts)
[&, last_pct = -1](size_t current, size_t total) mutable {
int pct = total ? (int)(current * 100 / total) : 0;
if (pct == last_pct) return true; // skip duplicate notifications
last_pct = pct;
return progress_cb(pct);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
// Rename once the download completes
std::filesystem::rename(tmp_path, path);
return true;
}
// -------------------------------------------------------------------------
// Server
// -------------------------------------------------------------------------
int main() {
httplib::Server svr;
// Create the model storage directory
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
// Fetch the default model automatically if not downloaded yet
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
std::mutex llm_mutex; // protects access while switching models
// LLM inference is slow, so use generous timeouts (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// --- Translation endpoint (Chapter 2) -----------------------------------
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
std::lock_guard<std::mutex> lock(llm_mutex);
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// --- SSE streaming translation (Chapter 3) ------------------------------
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
std::lock_guard<std::mutex> lock(llm_mutex);
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // abort inference if the client disconnects
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
// --- Model list (Chapter 4) ---------------------------------------------
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto models_dir = get_models_dir();
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = models_dir / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
// --- Model selection (Chapter 4) ----------------------------------------
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
// Always respond with SSE (same format whether downloaded or not)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
// Helper for sending SSE events
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
// If not downloaded yet, report download progress via SSE
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
return sink.os.good(); // abort the download if the client disconnects
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
// Load the model and switch over
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
// --- Embedded file serving (Chapter 6) ----------------------------------
// Chapter 5: svr.set_mount_point("/", "./public");
httplib::mount(svr, Web::FS);
// Start the server on a background thread
auto port = svr.bind_to_any_port("127.0.0.1");
std::thread server_thread([&]() { svr.listen_after_bind(); });
std::cout << "Listening on http://127.0.0.1:" << port << std::endl;
// Show the UI in a WebView
webview::webview w(false, nullptr);
w.set_title("Translate App");
w.set_size(1024, 768, WEBVIEW_HINT_NONE);
w.navigate("http://127.0.0.1:" + std::to_string(port));
#ifdef __APPLE__
setup_macos_edit_menu();
#endif
w.run(); // blocks until the window is closed
// Once the window closes, stop the server too
svr.stop();
server_thread.join();
}
```
</details>
To summarize the changes from Chapter 5:
- `#include <csignal>` → `#include <thread>`, `<cpp-embedlib-httplib.h>`, `"WebAssets.h"`, `"webview/webview.h"`
- The `signal_handler` function is gone
- `svr.set_mount_point("/", "./public")` → `httplib::mount(svr, Web::FS)`
- `svr.listen("127.0.0.1", 8080)` → `bind_to_any_port` + `listen_after_bind` + the WebView event loop
Not a single line of handler code changed. The REST API, SSE streaming, and model management built up through Chapter 5 work as-is.
## 6.6 Building and Trying It Out
```bash
cmake -B build
cmake --build build -j
```
Launch it.
```bash
./build/translate-app
```
No browser needed: a window opens automatically. The same UI as in Chapter 5 appears, and translation and model switching all work exactly as before.
Closing the window shuts the server down too. No `Ctrl+C` required.
### What you need to distribute
All you need to ship is:
- `translate-app` (a single binary)
That's it. No `public/` directory: the HTML, CSS, and JavaScript are embedded in the binary. Model files download automatically on first launch, so users need no setup either.
## Next Chapter
Well done! 🎉
In Chapter 1, `/health` merely returned `{"status":"ok"}`. Now, type some text and a translation streams in real time; pick a model from the dropdown and it downloads automatically; close the window and the server shuts down with it. And it all ships as a single binary.
All Chapter 6 changed was how static files are served and how the server starts. Not a single line of handler code changed. The REST API, SSE streaming, and model management built up through Chapter 5 now run as a desktop app.
In the next chapter we'll shift perspective and read the code of llama.cpp's own `llama-server`. Let's compare this book's simple server with a production-quality one and learn how the design decisions differ, and why.
**Next:** [Reading the Source of llama.cpp's Official Server](../ch07-code-reading)

---
title: "7. llama.cpp本家のサーバー実装をコードリーディング"
order: 7
---
Over six chapters we built a translation desktop app from scratch. It works, but it's strictly a learning implementation. So how does "production-quality" code differ? Let's read the source of `llama-server`, the official server bundled with llama.cpp, and compare.
`llama-server` lives in `llama.cpp/tools/server/`. It uses the same cpp-httplib, so reading its code works just like in the previous chapters.
## 7.1 Where the Source Lives
```ascii
llama.cpp/tools/server/
├── server.cpp # the main server implementation
├── httplib.h # bundled copy of cpp-httplib
└── ...
```
Everything lives in a single `server.cpp`. It runs to several thousand lines, but knowing the structure narrows down what you actually need to read.
## 7.2 OpenAI-Compatible API
The biggest difference between the server we've built and `llama-server` is the API design.
**Our API:**
```text
POST /translate → {"translation": "..."}
POST /translate/stream → SSE: data: "token"
```
**llama-server's API:**
```text
POST /v1/chat/completions → OpenAI-compatible JSON
POST /v1/completions → OpenAI-compatible JSON
POST /v1/embeddings → text embedding vectors
```
`llama-server` follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference). That means OpenAI's official client libraries (such as the Python `openai` package) work out of the box.
```python
# Example: connecting to llama-server with the OpenAI client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}]
)
```
Whether to stay compatible with existing tools and libraries is a major design decision. We kept our translation-only API simple, but for a general-purpose server, OpenAI compatibility has become the de facto standard.
## 7.3 Handling Concurrent Requests
Our server processes requests one at a time. If another request arrives mid-translation, it waits until the previous inference finishes. That's fine for a single-user desktop app, but a problem for a server shared by several people.
`llama-server` handles concurrent requests with a mechanism called **slots**.
![llama-server slot management](../slots.svg#half)
The key point is that tokens from the slots are inferred **together in a single batch**, not **one slot at a time**. GPUs excel at parallel work, so handling two users at once takes barely longer than handling one. This is called "continuous batching".
In our server, cpp-httplib's thread pool assigns one thread per request, but the inference itself runs single-threaded inside `llm.chat()`. `llama-server` funnels that inference into a shared batch-processing loop.
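The scheduling idea can be reduced to a toy sketch. This is only an illustration of the concept, not llama-server's actual code; `Slot` and `run_batches` are invented names:

```cpp
#include <string>
#include <vector>

// A toy slot: the tokens one request still has to generate.
struct Slot {
    std::vector<std::string> pending;
    size_t pos = 0;
    bool active() const { return pos < pending.size(); }
};

// Each loop iteration takes one token from every active slot and
// processes them together as a single batch (in a real server, one
// iteration would correspond to one decode call). Returns the number
// of batch iterations needed to drain all slots.
int run_batches(std::vector<Slot> &slots) {
    int batches = 0;
    for (;;) {
        std::vector<std::string> batch;
        for (auto &s : slots) {
            if (s.active()) batch.push_back(s.pending[s.pos++]);
        }
        if (batch.empty()) break;
        ++batches;
    }
    return batches;
}
```

Two requests of 3 and 5 tokens finish in 5 batch iterations rather than 8: the shorter request rides along with the longer one instead of queuing behind it.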
## 7.4 SSE Format Differences
The streaming mechanism itself is the same (`set_chunked_content_provider` + SSE), but the data format differs.
**Our format:**
```text
data: "去年の"
data: "春に"
data: [DONE]
```
**llama-server (OpenAI-compatible):**
```text
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"}}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"}}]}
data: [DONE]
```
Our format sends bare tokens and nothing more. `llama-server`, being OpenAI-compatible, wraps even a single token in JSON. It looks verbose, but it carries information clients find useful: an `id` to identify the request, a `finish_reason` to explain why generation stopped, and so on.
## 7.5 KV Cache Reuse
Our server processes the entire prompt from scratch on every request. The translation app's prompt is short ("Translate the following text to ja..." + the input text), so that's fine.
`llama-server`, when a prompt shares a prefix with the previous request, reuses the KV cache for that part.
![KV cache reuse](../kv-cache.svg#half)
For a chatbot that sends a long system prompt or few-shot examples every time, this alone cuts response time dramatically. Reprocessing a system prompt of thousands of tokens on every request versus reading it instantly from cache feels completely different.
The translation app's system prompt is a single sentence, so the benefit is limited, but it's an optimization worth keeping in mind when adapting this to your own app.
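The reuse decision itself boils down to finding the longest shared token prefix. A conceptual sketch (`common_prefix_len` is an invented name for illustration, not llama-server's API):

```cpp
#include <cstddef>
#include <vector>

// Length of the shared token prefix between the previous request's
// cached prompt and the new prompt. KV cache entries for tokens
// [0, n) can be reused; only the remainder needs to be evaluated.
size_t common_prefix_len(const std::vector<int> &cached,
                         const std::vector<int> &prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}
```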
## 7.6 Structured Output
The translation API returns plain text, so we never needed to constrain the output format. But what if you want the LLM to reply in JSON?
```text
Prompt: Analyze the sentiment of the following sentence and return JSON.
LLM output (expected): {"sentiment": "positive", "score": 0.8}
LLM output (reality): Here are the sentiment analysis results. {"sentiment": ...
```
LLMs sometimes ignore instructions and tack on extra text. `llama-server` solves this with **grammar constraints**.
```bash
curl http://localhost:8080/v1/chat/completions \
-d '{
"messages": [{"role": "user", "content": "Analyze sentiment..."}],
"json_schema": {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"score": {"type": "number"}
},
"required": ["sentiment", "score"]
}
}'
```
When `json_schema` is specified, tokens that don't fit the grammar are excluded during generation. The output is guaranteed to be valid JSON, so there's no worry about `json::parse` failing.
When embedding an LLM in an app, whether you can reliably parse its output bears directly on reliability. It's unnecessary for free-text output like translation, but essential when the response must be structured data.
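The core idea behind constrained generation can be sketched in a few lines. This is a toy illustration with invented names (`Candidate`, `pick_constrained`), not llama.cpp's API: before choosing the next token, drop every candidate the grammar would reject, then pick the best-scoring survivor.

```cpp
#include <functional>
#include <string>
#include <vector>

// A candidate next token with the model's score for it.
struct Candidate {
    std::string token;
    double score;
};

// Pick the best-scoring token among those the grammar allows.
std::string pick_constrained(
    const std::vector<Candidate> &candidates,
    const std::function<bool(const std::string &)> &allowed) {
    const Candidate *best = nullptr;
    for (const auto &c : candidates) {
        if (!allowed(c.token)) continue; // excluded by the grammar
        if (!best || c.score > best->score) best = &c;
    }
    return best ? best->token : "";
}
```

Even if the model's favorite token is chatty filler, a grammar that only admits `{` at the start of a JSON object forces the output onto the valid path.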
## 7.7 Summary
Let's line up the differences.
| Aspect | Our server | llama-server |
|------|-------------|--------------|
| API design | Translation-only | OpenAI-compatible |
| Concurrent requests | One at a time | Slots + continuous batching |
| SSE format | Tokens only | OpenAI-compatible JSON |
| KV cache | Cleared every time | Prefix reuse |
| Structured output | None | JSON Schema / grammar constraints |
| Lines of code | ~200 | Several thousand |
Our code is simple because of its premise: one person using a desktop app. If you're building a server for multiple users, or one that plugs into the existing ecosystem, `llama-server`'s design is a good reference.
Conversely, 200 lines is plenty for a working translation app. If this code reading also conveyed the value of building only what you need, all the better.
## Next Chapter
The next chapter covers how to swap in your own library and customize the app we've built.
**Next:** [Making It Your Own App](../ch08-customization)

---
title: "8. 自分だけのアプリにカスタマイズする"
order: 8
---
By Chapter 7 the translation desktop app was complete, and we'd seen how it differs from production-quality code. This chapter summarizes how to **turn the app into one of your own**.
The translation app is just a vehicle. Swap llama.cpp for your own library, and the same structure builds any app you like.
## 8.1 Swapping the Build Configuration
First, in `CMakeLists.txt`, replace the llama.cpp-related `FetchContent` entries with your own library.
```cmake
# Removed: the llama.cpp and cpp-llamalib FetchContent entries
# Added: your own library
FetchContent_Declare(my_lib
GIT_REPOSITORY https://github.com/yourname/my-lib
GIT_TAG main
)
FetchContent_MakeAvailable(my_lib)
target_link_libraries(my-app PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
my_lib # your library, in place of cpp-llamalib
# ...
)
```
If your library doesn't support CMake, just drop its headers and sources into `src/` and add them to `add_executable`. cpp-httplib, nlohmann/json, and webview stay as they are.
## 8.2 Adapting the API to Your Task
Change the translation API's endpoints and parameters to match your own task.
| Translation app | Your app (e.g., image processing) |
|---|---|
| `POST /translate` | `POST /process` |
| `{"text": "...", "target_lang": "ja"}` | `{"image": "base64...", "filter": "blur"}` |
| `POST /translate/stream` | `POST /process/stream` |
| `GET /models` | `GET /filters` or `GET /presets` |
Rewrite the handler bodies too. For example, wherever we called `llm.chat()`, just substitute your library's API.
```cpp
// Before: LLM translation
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(), "application/json");
// After: e.g., with an image processing library
auto result = my_lib::process(input_image, options);
res.set_content(json{{"result", result}}.dump(), "application/json");
```
SSE streaming works the same way. If you have a function that reports progress through a callback, the pattern from Chapter 3 gives you incremental responses. This isn't limited to LLMs: any long-running task qualifies, whether it's image processing progress, data conversion steps, or the result of a long computation.
## 8.3 Design Considerations
### Libraries with high initialization cost
This book loads the LLM model at the top of `main()` and keeps it in a variable. That's deliberate: loading the model per request would take seconds, so we load it once at startup and reuse it. The same approach works for any library with heavy initialization, such as loading large data files or acquiring GPU resources.
### Thread safety
cpp-httplib handles requests concurrently on a thread pool. In Chapter 4 we used a `std::mutex` to guard against the `llm` object being overwritten during a model switch. The same pattern applies when embedding your own library: if it isn't thread-safe, or if objects get swapped out, protect them with a `std::mutex`.
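Reduced to its essentials, the pattern looks like this. `Engine` here is a hypothetical stand-in for whatever object your library hands you, not a real API:

```cpp
#include <mutex>
#include <string>

// Hypothetical stand-in for your library's heavyweight object.
struct Engine {
    std::string name;
    std::string run(const std::string &in) const { return name + ":" + in; }
};

Engine engine{"small"};
std::mutex engine_mutex;

// Every request handler locks before touching the shared object...
std::string handle_request(const std::string &text) {
    std::lock_guard<std::mutex> lock(engine_mutex);
    return engine.run(text);
}

// ...and so does the endpoint that swaps it out.
void switch_engine(const std::string &name) {
    std::lock_guard<std::mutex> lock(engine_mutex);
    engine = Engine{name};
}
```

Because both the handlers and the switch endpoint take the same lock, a request never observes the object mid-replacement.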
## 8.4 Customizing the UI
Edit the three files in `public/`.
- **`index.html`** — restructure the input form: swap the `<textarea>` for an `<input type="file">`, add parameter fields, and so on
- **`style.css`** — change the layout and colors; keep the two-column layout or switch to one
- **`script.js`** — rewrite the `fetch()` target URL, the request body, and how responses are displayed
Without touching the server code, swapping the HTML makes it look like an entirely different app. Because these are static files, you can iterate quickly: just reload the browser, no server restart needed.
This book used plain HTML, CSS, and JavaScript, but combining a front-end framework such as Vue or React, or a CSS framework, can make the app considerably more polished.
## 8.5 Distribution Notes
### Licenses
Check the licenses of the libraries you use. cpp-httplib (MIT), nlohmann/json (MIT), and webview (MIT) all permit commercial use. Don't forget your own library and anything it depends on.
### Models and data files
The download feature from Chapter 4 isn't limited to LLM models. If your app needs large data files, the same pattern of auto-downloading on first launch keeps the binary small while sparing users the setup.
If the data is small, embedding it in the binary with cpp-embedlib is another option.
### Cross-platform builds
webview supports macOS, Linux, and Windows. When building for each platform:
- **macOS** — no extra dependencies
- **Linux** — requires `libwebkit2gtk-4.1-dev`
- **Windows** — requires the WebView2 runtime (bundled with Windows 11)
Automating cross-platform builds with CI (such as GitHub Actions) is also recommended.
## Closing Words
Thank you for reading to the end. 🙏
This book started in Chapter 1 with `/health` returning `{"status":"ok"}`. From there: a REST API, SSE streaming, model downloads from Hugging Face, a Web UI in the browser, and finally a single-binary desktop app. In Chapter 7 we read `llama-server`'s code and studied how its design differs from a production-quality server. It's been a long road, and I'm sincerely grateful you stayed for all of it.
Looking back, we put several of cpp-httplib's major features to real use.
- **Server**: routing, JSON responses, SSE streaming with `set_chunked_content_provider`, static file serving with `set_mount_point`
- **Client**: HTTPS connections, redirect following, large downloads via content receivers, progress callbacks
- **WebView integration**: `bind_to_any_port` + `listen_after_bind` for background-thread serving
cpp-httplib offers plenty more: multipart file uploads, authentication, timeout control, compression, range requests, and so on. See [A Tour of cpp-httplib](../../tour/) for details.
These patterns aren't specific to a translation app. If you ever want to put a Web API on your own C++ library, drive it from a browser UI, or ship it as an easy-to-distribute desktop app, I hope this book serves as your reference.
Go build your own app with your own library. Happy hacking! 🚀

---
title: "Building a Desktop LLM App with cpp-httplib"
order: 0
status: "draft"
---
Ever wished you could add a Web API to your own C++ library, or whip up an Electron-style desktop app? In Rust there's the "Tauri + axum" option, but have you written it off as too hard in C++?
Combine [cpp-httplib](https://github.com/yhirose/cpp-httplib), [webview/webview](https://github.com/webview/webview), and [cpp-embedlib](https://github.com/yhirose/cpp-embedlib), and the same approach works in pure C++. Better yet, you get a small, single-binary application that's easy to distribute.
In this guide we'll use an LLM translation app built on [llama.cpp](https://github.com/ggml-org/llama.cpp) as the working example, building it up in stages: REST API → SSE streaming → Web UI → desktop app. Translation is just the example, of course. Replace llama.cpp with your own library, and the same structure gives you an app of your own.
![Desktop App](app.png#large-center)
If you know basic C++17 and the fundamentals of HTTP and REST APIs, you're ready to start. 🚀
## Table of Contents
1. **[Setting Up the Project Environment](ch01-setup)** — fetching dependencies, build configuration, scaffold code
2. **[Building a REST API with llama.cpp](ch02-rest-api)** — an API that returns translation results as JSON
3. **[Adding Token Streaming with SSE](ch03-sse-streaming)** — token-by-token incremental responses
4. **[Adding Model Download and Management](ch04-model-management)** — downloading from Hugging Face and switching models
5. **[Adding a Web UI](ch05-web-ui)** — a translation screen you can drive from a browser
6. **[Turning It into a Desktop App with WebView](ch06-desktop-app)** — a single-binary desktop app
7. **[Reading the Source of llama.cpp's Official Server](ch07-code-reading)** — comparison with production-quality code
8. **[Making It Your Own App](ch08-customization)** — swapping in your own library and beyond

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 426 160" font-family="system-ui, sans-serif" font-size="14">
<rect x="0" y="0" width="426" height="160" rx="8" fill="#f5f3ef"/>
<defs>
<marker id="arrowhead" markerWidth="7" markerHeight="5" refX="7" refY="2.5" orient="auto">
<polygon points="0,0 7,2.5 0,5" fill="#198754"/>
</marker>
</defs>
<!-- request 1 (y=16) -->
<text x="94" y="35" fill="#333" font-weight="bold" text-anchor="end">Request 1:</text>
<rect x="106" y="16" width="138" height="30" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="175" y="36" fill="#333" text-anchor="middle">System prompt</text>
<text x="256" y="36" fill="#333" text-anchor="middle">+</text>
<rect x="270" y="16" width="138" height="30" rx="4" fill="#cfe2ff" stroke="#0d6efd" stroke-width="1"/>
<text x="339" y="36" fill="#333" text-anchor="middle">User question A</text>
<!-- annotation: cache save -->
<text x="175" y="64" fill="#198754" font-size="11" text-anchor="middle">Saved to KV cache</text>
<!-- arrow -->
<line x1="175" y1="70" x2="175" y2="90" stroke="#198754" stroke-width="1.2" marker-end="url(#arrowhead)"/>
<text x="188" y="85" fill="#198754" font-size="11">Reused</text>
<!-- request 2 (y=96) -->
<text x="94" y="115" fill="#333" font-weight="bold" text-anchor="end">Request 2:</text>
<rect x="106" y="96" width="138" height="30" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1" stroke-dasharray="6,3"/>
<text x="175" y="116" fill="#333" text-anchor="middle">System prompt</text>
<text x="256" y="116" fill="#333" text-anchor="middle">+</text>
<rect x="270" y="96" width="138" height="30" rx="4" fill="#cfe2ff" stroke="#0d6efd" stroke-width="1"/>
<text x="339" y="116" fill="#333" text-anchor="middle">User question B</text>
<!-- bottom labels -->
<text x="175" y="144" fill="#198754" font-size="11" text-anchor="middle">Not recomputed</text>
<text x="339" y="144" fill="#0d6efd" font-size="11" text-anchor="middle">Only this is computed</text>
</svg>


View File

@@ -0,0 +1,24 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 440 240" font-family="system-ui, sans-serif" font-size="14">
<!-- outer box -->
<rect x="0" y="0" width="440" height="240" rx="8" fill="#f5f3ef"/>
<text x="20" y="28" font-weight="bold" font-size="16" fill="#333">llama-server</text>
<!-- slot 0 -->
<rect x="20" y="46" width="400" height="32" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="32" y="67" fill="#333">Slot 0: request from user A</text>
<!-- slot 1 -->
<rect x="20" y="86" width="400" height="32" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="32" y="107" fill="#333">Slot 1: request from user B</text>
<!-- slot 2 -->
<rect x="20" y="126" width="400" height="32" rx="4" fill="#e9ecef" stroke="#adb5bd" stroke-width="1"/>
<text x="32" y="147" fill="#999">Slot 2: (free)</text>
<!-- slot 3 -->
<rect x="20" y="166" width="400" height="32" rx="4" fill="#e9ecef" stroke="#adb5bd" stroke-width="1"/>
<text x="32" y="187" fill="#999">Slot 3: (free)</text>
<!-- arrow + label -->
<text x="20" y="224" fill="#333" font-size="13">→ Active slots are batched into a single inference pass</text>
</svg>


View File

@@ -0,0 +1,33 @@
#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
OUT_DIR="$SCRIPT_DIR/build/desktop-app"
DOCS_DIR="$SCRIPT_DIR/../../docs-src/pages/ja/llm-app"
source "$SCRIPT_DIR/extract_code.sh"
echo "=== Setting up Desktop App (Chapter 6) ==="
mkdir -p "$OUT_DIR"/{src,public}
cd "$OUT_DIR"
# --- Extract source files from book ---
echo "Extracting source from book..."
CH05="$DOCS_DIR/ch05-web-ui.md"
CH06="$DOCS_DIR/ch06-desktop-app.md"
extract_code "$CH06" "CMakeLists.txt" > CMakeLists.txt
extract_code "$CH06" "main.cpp" > src/main.cpp
extract_code "$CH05" "index.html" > public/index.html
extract_code "$CH05" "style.css" > public/style.css
extract_code "$CH05" "script.js" > public/script.js
# --- Build ---
echo "Building..."
cmake -B build 2>&1 | tail -1
cmake --build build -j 2>&1 | tail -1
echo ""
echo "=== Done ==="
echo "Run: cd $OUT_DIR && ./build/translate-app"

View File

@@ -0,0 +1,33 @@
#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
OUT_DIR="$SCRIPT_DIR/build/web-app"
DOCS_DIR="$SCRIPT_DIR/../../docs-src/pages/ja/llm-app"
source "$SCRIPT_DIR/extract_code.sh"
echo "=== Setting up Web App (Chapter 5) ==="
mkdir -p "$OUT_DIR"/{src,public}
cd "$OUT_DIR"
# --- Extract source files from book ---
echo "Extracting source from book..."
CH04="$DOCS_DIR/ch04-model-management.md"
CH05="$DOCS_DIR/ch05-web-ui.md"
extract_code "$CH04" "CMakeLists.txt" > CMakeLists.txt
extract_code "$CH05" "main.cpp" > src/main.cpp
extract_code "$CH05" "index.html" > public/index.html
extract_code "$CH05" "style.css" > public/style.css
extract_code "$CH05" "script.js" > public/script.js
# --- Build ---
echo "Building..."
cmake -B build 2>&1 | tail -1
cmake --build build -j 2>&1 | tail -1
echo ""
echo "=== Done ==="
echo "Run: cd $OUT_DIR && ./build/translate-server"

View File

@@ -0,0 +1,18 @@
# Extract code block from a <details> section identified by data-file attribute.
# Usage: extract_code <file> <data-file>
# Example: extract_code ch01.md "main.cpp"
extract_code() {
local file="$1" name="$2"
local output
output=$(awk -v name="$name" '
$0 ~ "data-file=\"" name "\"" { found=1; next }
found && /^```/ && !inside { inside=1; next }
inside && /^```/ { exit }
inside { print }
' "$file")
if [ -z "$output" ]; then
echo "ERROR: extract_code: no match for data-file=\"$name\" in $file" >&2
return 1
fi
printf '%s\n' "$output"
}

View File
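The awk pipeline inside `extract_code` can be exercised in isolation. A minimal sketch against a throwaway markdown file (the `main.cpp` marker and its contents are made up for the demo; the fence is built from a variable only so this snippet nests cleanly):

```shell
# Build a tiny markdown file containing a data-file marker followed by a
# fenced code block, then apply the same awk program extract_code uses:
# skip to the data-file line, then print the first fenced block's body.
md=$(mktemp)
fence='```'
printf '%s\n' \
  '<details data-file="main.cpp">' \
  '' \
  "${fence}cpp" \
  'int main() { return 0; }' \
  "$fence" \
  '' \
  '</details>' > "$md"
result=$(awk -v name="main.cpp" '
$0 ~ "data-file=\"" name "\"" { found=1; next }
found && /^```/ && !inside { inside=1; next }
inside && /^```/ { exit }
inside { print }
' "$md")
rm -f "$md"
printf '%s\n' "$result"
```

Against a real chapter file, the same logic is what `extract_code ch01-setup.md "main.cpp"` runs to print that chapter's full `main.cpp`.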

@@ -0,0 +1,38 @@
#!/bin/bash
# Generate the desktop app project by extracting source from the cpp-httplib book.
# Usage: generate_desktop_app_project.sh <output-dir>
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
OUT_DIR="${1:?Usage: $0 <output-dir>}"
BASE_URL="https://raw.githubusercontent.com/yhirose/cpp-httplib/master/docs-src/pages/ja/llm-app"
CACHE_DIR="$SCRIPT_DIR/.cache"
source "$SCRIPT_DIR/extract_code.sh"
# --- Helper: download markdown files (always fetch latest) ---
fetch_md() {
local name="$1"
local path="$CACHE_DIR/$name"
curl -sfL "$BASE_URL/$name" -o "$path" || { echo "ERROR: Failed to download $name" >&2; return 1; }
echo "$path"
}
# --- Main ---
echo "=== Generating desktop app project ==="
mkdir -p "$CACHE_DIR" "$OUT_DIR/src" "$OUT_DIR/public"
CH05=$(fetch_md "ch05-web-ui.md")
CH06=$(fetch_md "ch06-desktop-app.md")
echo "Extracting source files..."
extract_code "$CH06" "CMakeLists.txt" > "$OUT_DIR/CMakeLists.txt"
extract_code "$CH06" "main.cpp" > "$OUT_DIR/src/main.cpp"
extract_code "$CH05" "index.html" > "$OUT_DIR/public/index.html"
extract_code "$CH05" "style.css" > "$OUT_DIR/public/style.css"
extract_code "$CH05" "script.js" > "$OUT_DIR/public/script.js"
echo "=== Done ==="
echo "Generated files in: $OUT_DIR"

View File

@@ -0,0 +1,28 @@
# List available targets
default:
@just --list
# Remove build artifacts
clean:
rm -rf build
# Test the Book
test-book:
bash test_book.sh
# Build Web App (Chapter 5)
build-web-app:
bash build_web_app.sh
# Stop any running server, then run Web App
run-web-app: build-web-app
-lsof -ti :8080 | xargs kill 2>/dev/null
cd build/web-app && ./build/translate-server
# Build Desktop App (Chapter 6)
build-desktop-app:
bash build_desktop_app.sh
# Run Desktop App
run-desktop-app: build-desktop-app
cd build/desktop-app && ./build/translate-app

docs-util/llm-app/test_book.sh Executable file
View File
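The harness below patches extracted sources with `sed` before building, rewriting the hard-coded port and model path. The two substitutions in isolation (the sample source lines and the `example.gguf` name are illustrative):

```shell
# Mirrors the harness's patch_port / patch_model sed expressions,
# applied to representative lines of extracted source code.
patched=$(printf '%s\n' \
  'svr.listen("127.0.0.1", 8080);' \
  'ctx.load("models/gemma-2-2b-it-Q4_K_M.gguf");' |
  sed -e 's/127\.0\.0\.1", 8080/127.0.0.1", 18080/g' \
      -e 's|models/gemma-2-2b-it-Q4_K_M\.gguf|models/example.gguf|g')
printf '%s\n' "$patched"
```

The first line comes out listening on port 18080 and the second pointing at `models/example.gguf`, which is exactly how the harness retargets the book's code at its own test port and model file.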

@@ -0,0 +1,561 @@
#!/usr/bin/env bash
set -euo pipefail
# =============================================================================
# test_book.sh — LLM App Tutorial (Ch1-Ch5) E2E Test
#
# Code is extracted from the doc markdown files (via the data-file markers
# handled by extract_code.sh), so tests always stay in sync with the docs.
# =============================================================================
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
DOCS_DIR="$PROJECT_ROOT/docs-src/pages/ja/llm-app"
WORKDIR=$(mktemp -d)
MODEL_NAME="gemma-2-2b-it-Q4_K_M.gguf"
MODEL_URL="https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/${MODEL_NAME}"
PORT=18080
GECKODRIVER_PORT=4444
SERVER_PID=""
GECKODRIVER_PID=""
PASS_COUNT=0
FAIL_COUNT=0
# ---------------------------------------------------------------------------
# Cleanup
# ---------------------------------------------------------------------------
cleanup() {
if [[ -n "$SERVER_PID" ]]; then
kill "$SERVER_PID" 2>/dev/null || true
wait "$SERVER_PID" 2>/dev/null || true
fi
if [[ -n "$GECKODRIVER_PID" ]]; then
kill "$GECKODRIVER_PID" 2>/dev/null || true
wait "$GECKODRIVER_PID" 2>/dev/null || true
fi
rm -rf "$WORKDIR"
}
trap cleanup EXIT
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
log() { echo "=== $*"; }
pass() { echo " PASS: $*"; PASS_COUNT=$((PASS_COUNT + 1)); }
fail() { echo " FAIL: $*"; FAIL_COUNT=$((FAIL_COUNT + 1)); }
source "$SCRIPT_DIR/extract_code.sh"
wait_for_server() {
local max_wait=30
local i=0
while ! curl -s -o /dev/null "http://127.0.0.1:${PORT}/" 2>/dev/null; do
sleep 1
i=$((i + 1))
if [[ $i -ge $max_wait ]]; then
fail "Server did not start within ${max_wait}s"
return 1
fi
done
}
stop_server() {
if [[ -n "$SERVER_PID" ]]; then
kill "$SERVER_PID" 2>/dev/null || true
wait "$SERVER_PID" 2>/dev/null || true
SERVER_PID=""
fi
}
# Make an HTTP request and capture status + body
# Usage: http_request METHOD PATH [DATA]
# Sets: HTTP_STATUS, HTTP_BODY
http_request() {
local method="$1" path="$2" data="${3:-}"
local tmp
tmp=$(mktemp)
if [[ -n "$data" ]]; then
HTTP_STATUS=$(curl -s -o "$tmp" -w '%{http_code}' \
-X "$method" "http://127.0.0.1:${PORT}${path}" \
-H "Content-Type: application/json" \
-d "$data")
else
HTTP_STATUS=$(curl -s -o "$tmp" -w '%{http_code}' \
-X "$method" "http://127.0.0.1:${PORT}${path}")
fi
HTTP_BODY=$(cat "$tmp")
rm -f "$tmp"
}
# Make an SSE request and capture the raw stream
# Usage: http_sse PATH DATA
# Sets: HTTP_STATUS, HTTP_BODY
http_sse() {
local path="$1" data="$2"
HTTP_SSE_FILE=$(mktemp)
HTTP_STATUS=$(curl -s -N -o "$HTTP_SSE_FILE" -w '%{http_code}' \
-X POST "http://127.0.0.1:${PORT}${path}" \
-H "Content-Type: application/json" \
-d "$data")
HTTP_BODY=$(cat "$HTTP_SSE_FILE")
rm -f "$HTTP_SSE_FILE"
}
assert_status() {
local expected="$1" label="$2"
if [[ "$HTTP_STATUS" == "$expected" ]]; then
pass "$label (status=$HTTP_STATUS)"
else
fail "$label (expected=$expected, got=$HTTP_STATUS)"
echo " body: $HTTP_BODY"
fi
}
assert_json_field() {
local field="$1" label="$2"
if echo "$HTTP_BODY" | python3 -c "import sys,json; d=json.load(sys.stdin); assert '$field' in d" 2>/dev/null; then
pass "$label (field '$field' exists)"
else
fail "$label (field '$field' missing in response)"
echo " body: $HTTP_BODY"
fi
}
assert_json_value() {
local field="$1" expected="$2" label="$3"
local actual
actual=$(echo "$HTTP_BODY" | python3 -c "import sys,json; print(json.load(sys.stdin)['$field'])" 2>/dev/null || echo "")
if [[ "$actual" == "$expected" ]]; then
pass "$label ($field='$actual')"
else
fail "$label (expected $field='$expected', got='$actual')"
fi
}
assert_json_nonempty() {
local field="$1" label="$2"
local val
val=$(echo "$HTTP_BODY" | python3 -c "import sys,json; v=json.load(sys.stdin)['$field']; assert len(str(v))>0; print(v)" 2>/dev/null || echo "")
if [[ -n "$val" ]]; then
pass "$label ($field is non-empty)"
else
fail "$label ($field is empty or missing)"
echo " body: $HTTP_BODY"
fi
}
# Patch port number in extracted source code (8080 -> test port)
patch_port() {
sed "s/127\.0\.0\.1\", 8080/127.0.0.1\", ${PORT}/g; s/127\.0\.0\.1:8080/127.0.0.1:${PORT}/g"
}
# Patch model path in extracted source code
patch_model() {
sed "s|models/gemma-2-2b-it-Q4_K_M.gguf|models/${MODEL_NAME}|g"
}
# =============================================================================
# Ch1: Skeleton Server
# =============================================================================
test_ch1() {
log "Ch1: Project Setup & Skeleton Server"
local APP_DIR="$WORKDIR/translate-app"
mkdir -p "$APP_DIR/src" "$APP_DIR/models"
cd "$APP_DIR"
# Copy httplib.h from project root (test current version)
cp "$PROJECT_ROOT/httplib.h" .
# Download json.hpp into nlohmann/ directory to match #include <nlohmann/json.hpp>
mkdir -p nlohmann
curl -sL -o nlohmann/json.hpp \
https://github.com/nlohmann/json/releases/latest/download/json.hpp
# CMakeLists.txt — ch1 doesn't need llama.cpp, so use a minimal version
# (the doc's cmake includes llama.cpp which isn't cloned yet in ch1)
cat > CMakeLists.txt << 'CMAKE_EOF'
cmake_minimum_required(VERSION 3.16)
project(translate-server LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
add_executable(translate-server src/main.cpp)
target_include_directories(translate-server PRIVATE ${CMAKE_SOURCE_DIR})
CMAKE_EOF
# Extract main.cpp from ch1 doc and patch port
extract_code "$DOCS_DIR/ch01-setup.md" "main.cpp" | patch_port > src/main.cpp
# Build
log "Ch1: Building..."
cmake -B build -DCMAKE_BUILD_TYPE=Release 2>&1 | tail -1
cmake --build build -j 2>&1 | tail -3
# Start server
./build/translate-server &
SERVER_PID=$!
wait_for_server
# Tests
http_request POST /translate '{"text":"hello","target_lang":"ja"}'
assert_status 200 "Ch1 POST /translate"
assert_json_value translation "TODO" "Ch1 POST /translate returns TODO"
http_request GET /models
assert_status 200 "Ch1 GET /models"
assert_json_field models "Ch1 GET /models"
http_request POST /models/select '{"model":"test"}'
assert_status 200 "Ch1 POST /models/select"
assert_json_value status "TODO" "Ch1 POST /models/select returns TODO"
stop_server
log "Ch1: Done"
}
# =============================================================================
# Ch2: REST API with llama.cpp
# =============================================================================
test_ch2() {
log "Ch2: REST API with llama.cpp"
local APP_DIR="$WORKDIR/translate-app"
cd "$APP_DIR"
# Clone llama.cpp
if [[ ! -d llama.cpp ]]; then
log "Ch2: Cloning llama.cpp..."
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git 2>&1 | tail -1
fi
# Download cpp-llamalib.h
if [[ ! -f cpp-llamalib.h ]]; then
curl -sL -o cpp-llamalib.h \
https://raw.githubusercontent.com/yhirose/cpp-llamalib/main/cpp-llamalib.h
fi
# Download model
if [[ ! -f "models/$MODEL_NAME" ]]; then
log "Ch2: Downloading model ${MODEL_NAME} (~1.6GB)..."
curl -L -o "models/$MODEL_NAME" "$MODEL_URL"
fi
# CMakeLists.txt from ch1 doc (includes llama.cpp)
extract_code "$DOCS_DIR/ch01-setup.md" "CMakeLists.txt" > CMakeLists.txt
# Extract main.cpp from ch2 doc and patch port + model path
extract_code "$DOCS_DIR/ch02-rest-api.md" "main.cpp" | patch_port | patch_model > src/main.cpp
# Build (clean rebuild needed — cmake config changed)
log "Ch2: Building (this may take a while for llama.cpp)..."
rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release 2>&1 | tail -1
cmake --build build -j 2>&1 | tail -3
# Start server
./build/translate-server &
SERVER_PID=$!
wait_for_server
# Tests — normal request
http_request POST /translate \
'{"text":"I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.","target_lang":"ja"}'
assert_status 200 "Ch2 POST /translate normal"
assert_json_nonempty translation "Ch2 POST /translate has translation"
# Tests — invalid JSON
http_request POST /translate 'not json'
assert_status 400 "Ch2 POST /translate invalid JSON"
# Tests — missing text
http_request POST /translate '{"target_lang":"ja"}'
assert_status 400 "Ch2 POST /translate missing text"
# Tests — empty text
http_request POST /translate '{"text":""}'
assert_status 400 "Ch2 POST /translate empty text"
stop_server
log "Ch2: Done"
}
# =============================================================================
# Ch3: SSE Streaming
# =============================================================================
test_ch3() {
log "Ch3: SSE Streaming"
local APP_DIR="$WORKDIR/translate-app"
cd "$APP_DIR"
# Extract main.cpp from ch3 doc and patch port + model path
extract_code "$DOCS_DIR/ch03-sse-streaming.md" "main.cpp" | patch_port | patch_model > src/main.cpp
# Build (incremental — only main.cpp changed)
log "Ch3: Building..."
cmake --build build -j 2>&1 | tail -3
# Start server
./build/translate-server &
SERVER_PID=$!
wait_for_server
# Tests — /translate still works
http_request POST /translate \
'{"text":"Hello world","target_lang":"ja"}'
assert_status 200 "Ch3 POST /translate still works"
# Tests — SSE streaming
http_sse /translate/stream \
'{"text":"I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.","target_lang":"ja"}'
assert_status 200 "Ch3 POST /translate/stream status"
# Check SSE format: has data: lines and ends with [DONE]
local data_lines
data_lines=$(echo "$HTTP_BODY" | grep -c '^data: ' || true)
if [[ $data_lines -ge 2 ]]; then
pass "Ch3 SSE has multiple data: lines ($data_lines)"
else
fail "Ch3 SSE expected multiple data: lines, got $data_lines"
echo " body: $HTTP_BODY"
fi
if echo "$HTTP_BODY" | grep -q 'data: \[DONE\]'; then
pass "Ch3 SSE ends with data: [DONE]"
else
fail "Ch3 SSE missing data: [DONE]"
echo " body: $HTTP_BODY"
fi
# Tests — SSE invalid JSON
http_sse /translate/stream 'not json'
assert_status 400 "Ch3 POST /translate/stream invalid JSON"
stop_server
log "Ch3: Done"
}
# =============================================================================
# Ch4: Model Management
# =============================================================================
test_ch4() {
log "Ch4: Model Management"
local APP_DIR="$WORKDIR/translate-app"
cd "$APP_DIR"
# Ch4+ uses ~/.translate-app/models/ — symlink model there
local MODELS_HOME="$HOME/.translate-app/models"
mkdir -p "$MODELS_HOME"
ln -sf "$APP_DIR/models/$MODEL_NAME" "$MODELS_HOME/$MODEL_NAME"
# CMakeLists.txt from ch4 (adds OpenSSL)
extract_code "$DOCS_DIR/ch04-model-management.md" "CMakeLists.txt" > CMakeLists.txt
# Extract main.cpp from ch4 doc
extract_code "$DOCS_DIR/ch04-model-management.md" "main.cpp" | patch_port > src/main.cpp
# Build (reconfigure for OpenSSL, incremental — reuses llama.cpp objects)
log "Ch4: Building..."
cmake -B build -DCMAKE_BUILD_TYPE=Release 2>&1 | tail -1
cmake --build build -j 2>&1 | tail -3
# Start server
./build/translate-server &
SERVER_PID=$!
wait_for_server
# Tests — GET /models
http_request GET /models
assert_status 200 "Ch4 GET /models"
assert_json_field models "Ch4 GET /models has models array"
# The default model must be downloaded and selected
local selected
selected=$(echo "$HTTP_BODY" | python3 -c "
import sys, json
models = json.load(sys.stdin)['models']
sel = [m for m in models if m['selected']]
print(sel[0]['downloaded'] if sel else '')
" 2>/dev/null || echo "")
if [[ "$selected" == "True" ]]; then
pass "Ch4 GET /models default model is downloaded and selected"
else
fail "Ch4 GET /models default model state unexpected"
echo " body: $HTTP_BODY"
fi
# Tests — POST /models/select with already-downloaded model (SSE)
http_sse /models/select '{"model": "gemma-2-2b-it"}'
assert_status 200 "Ch4 POST /models/select already downloaded"
if echo "$HTTP_BODY" | grep -q '"ready"'; then
pass "Ch4 POST /models/select returns ready"
else
fail "Ch4 POST /models/select missing ready status"
echo " body: $HTTP_BODY"
fi
# Tests — POST /models/select unknown model
http_request POST /models/select '{"model": "nonexistent"}'
assert_status 404 "Ch4 POST /models/select unknown model"
# Tests — POST /models/select missing model field
http_request POST /models/select '{"foo": "bar"}'
assert_status 400 "Ch4 POST /models/select missing model field"
# Tests — /translate still works after model select
http_request POST /translate '{"text": "Hello", "target_lang": "ja"}'
assert_status 200 "Ch4 POST /translate still works"
assert_json_nonempty translation "Ch4 POST /translate has translation"
# Tests — switch model via symlink (avoids downloading a second model)
# Place a symlink so the server sees Llama-3.1-8B-Instruct as "downloaded"
ln -sf "$MODELS_HOME/$MODEL_NAME" "$MODELS_HOME/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
http_sse /models/select '{"model": "Llama-3.1-8B-Instruct"}'
assert_status 200 "Ch4 POST /models/select switch model"
if echo "$HTTP_BODY" | grep -q '"ready"'; then
pass "Ch4 model switch returns ready"
else
fail "Ch4 model switch missing ready"
echo " body: $HTTP_BODY"
fi
# Translate with the switched model
http_request POST /translate \
'{"text": "The weather is nice today.", "target_lang": "ja"}'
assert_status 200 "Ch4 POST /translate after model switch"
assert_json_nonempty translation "Ch4 POST /translate switched model has translation"
# Verify model list reflects the switch
http_request GET /models
local new_selected
new_selected=$(echo "$HTTP_BODY" | python3 -c "
import sys, json
models = json.load(sys.stdin)['models']
sel = [m for m in models if m['selected']]
print(sel[0]['name'] if sel else '')
" 2>/dev/null || echo "")
if [[ "$new_selected" == "Llama-3.1-8B-Instruct" ]]; then
pass "Ch4 GET /models reflects model switch"
else
fail "Ch4 GET /models expected Llama-3.1-8B-Instruct selected, got '$new_selected'"
fi
stop_server
log "Ch4: Done"
}
# =============================================================================
# Ch5: Web UI (browser tests via geckodriver + webdriver.h)
# =============================================================================
start_geckodriver() {
geckodriver --port "$GECKODRIVER_PORT" &>/dev/null &
GECKODRIVER_PID=$!
# Wait for geckodriver to be ready
local i=0
while ! curl -s -o /dev/null "http://127.0.0.1:${GECKODRIVER_PORT}/status" 2>/dev/null; do
sleep 0.5
i=$((i + 1))
if [[ $i -ge 20 ]]; then
fail "geckodriver did not start within 10s"
return 1
fi
done
}
stop_geckodriver() {
if [[ -n "$GECKODRIVER_PID" ]]; then
kill "$GECKODRIVER_PID" 2>/dev/null || true
wait "$GECKODRIVER_PID" 2>/dev/null || true
GECKODRIVER_PID=""
fi
}
test_ch5() {
log "Ch5: Web UI (browser tests)"
# Check for geckodriver
if ! command -v geckodriver &>/dev/null; then
log "Ch5: Skipping browser tests (geckodriver not found)"
log "Ch5: Install with: brew install geckodriver"
return 0
fi
local APP_DIR="$WORKDIR/translate-app"
cd "$APP_DIR"
# Extract source files from ch05
extract_code "$DOCS_DIR/ch05-web-ui.md" "main.cpp" \
| patch_port > src/main.cpp
mkdir -p public
extract_code "$DOCS_DIR/ch05-web-ui.md" "index.html" > public/index.html
extract_code "$DOCS_DIR/ch05-web-ui.md" "style.css" > public/style.css
extract_code "$DOCS_DIR/ch05-web-ui.md" "script.js" > public/script.js
# Build (incremental — only main.cpp changed)
log "Ch5: Building server..."
cmake --build build -j 2>&1 | tail -3
# Build browser test program
log "Ch5: Building browser test..."
g++ -std=c++17 \
-I"$APP_DIR" \
-I"$SCRIPT_DIR" \
-o "$APP_DIR/build/test_webui" \
"$SCRIPT_DIR/test_webui.cpp" \
-pthread
# Start server
./build/translate-server &
SERVER_PID=$!
wait_for_server
# Start geckodriver
start_geckodriver
# Run browser tests
log "Ch5: Running browser tests..."
local test_exit=0
"$APP_DIR/build/test_webui" "$PORT" || test_exit=$?
# Parse pass/fail from test output and add to totals
# (test_webui prints its own pass/fail, but we track via exit code)
if [[ $test_exit -ne 0 ]]; then
fail "Ch5 browser tests had failures"
else
pass "Ch5 browser tests all passed"
fi
stop_geckodriver
stop_server
log "Ch5: Done"
}
# =============================================================================
# Main
# =============================================================================
log "LLM App Tutorial E2E Test"
log "Working directory: $WORKDIR"
echo ""
test_ch1
echo ""
test_ch2
echo ""
test_ch3
echo ""
test_ch4
echo ""
test_ch5
log "Results: $PASS_COUNT passed, $FAIL_COUNT failed"
if [[ $FAIL_COUNT -gt 0 ]]; then
exit 1
fi

View File
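The Ch3 checks in the harness above only assert the SSE surface shape: at least two `data:` lines, terminated by `data: [DONE]`. A sketch of actually consuming such a stream, with a hypothetical payload (the real per-token JSON shape is whatever the chapter's server emits):

```shell
# Walk an SSE body line by line: each "data: <payload>" line is one
# event; "data: [DONE]" is the end-of-stream sentinel.
stream='data: {"token":"Hello"}
data: {"token":" world"}
data: [DONE]'
chunks=0
done_seen=0
while IFS= read -r line; do
  case "$line" in
    'data: [DONE]') done_seen=1 ;;
    'data: '*) chunks=$((chunks + 1)) ;;
  esac
done <<EOF
$stream
EOF
echo "chunks=$chunks done=$done_seen"
```

For this payload the loop counts two token events and sees the terminator, mirroring what `grep -c '^data: '` and the `[DONE]` check verify in the harness.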

@@ -0,0 +1,300 @@
// test_webui.cpp — Browser-based E2E tests for Ch5 Web UI.
// Uses webdriver.h (cpp-httplib + json.hpp) to control headless Firefox.
//
// Usage: test_webui <port>
// port: the translate-server port (e.g. 18080)
#include "webdriver.h"
#include <cstdlib>
#include <iostream>
#include <string>
// ─── Test framework (minimal) ────────────────────────────────
static int pass_count = 0;
static int fail_count = 0;
#define PASS(label) \
do { \
std::cout << " PASS: " << (label) << "\n"; \
++pass_count; \
} while (0)
#define FAIL(label, detail) \
do { \
std::cout << " FAIL: " << (label) << "\n"; \
std::cout << " " << (detail) << "\n"; \
++fail_count; \
} while (0)
#define ASSERT_TRUE(cond, label) \
do { \
if (cond) { \
PASS(label); \
} else { \
FAIL(label, "condition was false"); \
} \
} while (0)
#define ASSERT_CONTAINS(haystack, needle, label) \
do { \
if (std::string(haystack).find(needle) != std::string::npos) { \
PASS(label); \
} else { \
FAIL(label, "'" + std::string(haystack) + "' does not contain '" + \
std::string(needle) + "'"); \
} \
} while (0)
#define ASSERT_ELEMENT_EXISTS(session, selector) \
do { \
try { \
(session).css(selector); \
PASS("Element " selector " exists"); \
} catch (...) { FAIL("Element " selector " exists", "not found"); } \
} while (0)
// ─── Helpers ─────────────────────────────────────────────────
static std::string base_url;
void navigate_and_wait_for_models(webdriver::Session &session) {
session.navigate(base_url);
session.wait_until(
"return document.querySelectorAll('#model-select option').length > 0",
5000);
}
void test_page_loads(webdriver::Session &session) {
std::cout << "=== TC1: Page loads with correct structure\n";
session.navigate(base_url);
auto title = session.title();
ASSERT_CONTAINS(title, "Translate", "Page title contains 'Translate'");
// Verify main DOM elements exist
ASSERT_ELEMENT_EXISTS(session, "#model-select");
ASSERT_ELEMENT_EXISTS(session, "#input-text");
ASSERT_ELEMENT_EXISTS(session, "#output-text");
ASSERT_ELEMENT_EXISTS(session, "#target-lang");
}
void test_model_dropdown(webdriver::Session &session) {
std::cout << "=== TC2: Model dropdown is populated\n";
navigate_and_wait_for_models(session);
// Note: WebDriver findElements cannot find <option> elements directly
// in geckodriver/Firefox, so we use JS to count them.
auto option_count = session.execute_script(
"return document.querySelectorAll('#model-select option').length");
ASSERT_TRUE(option_count != "0" && option_count != "null",
"Model dropdown has options (count=" + option_count + ")");
// Check that at least one option has a selected attribute
auto selected_val = session.execute_script(
"return document.querySelector('#model-select').value");
ASSERT_TRUE(selected_val != "null" && !selected_val.empty(),
"A model is selected (value='" + selected_val + "')");
}
void test_translation_sse(webdriver::Session &session) {
std::cout << "=== TC3: Translation with SSE streaming\n";
navigate_and_wait_for_models(session);
// Clear and type input — debounce auto-translate triggers after 300ms
auto input = session.css("#input-text");
input.clear();
input.send_keys("Hello world");
// Wait for output to appear (debounce 300ms + LLM inference)
bool has_output = session.wait_until(
"return document.querySelector('#output-text').textContent.length > 0",
120000);
ASSERT_TRUE(has_output, "Translation output appeared");
auto output_text = session.execute_script(
"return document.querySelector('#output-text').textContent");
ASSERT_TRUE(!output_text.empty() && output_text != "null",
"Output text is non-empty ('" + output_text.substr(0, 50) +
"...')");
// Wait for busy state to be cleared after completion
bool busy_cleared = session.wait_until(
"return !document.body.classList.contains('busy')", 120000);
ASSERT_TRUE(busy_cleared, "Busy state cleared after translation");
}
void test_busy_state(webdriver::Session &session) {
std::cout << "=== TC4: Busy state during translation\n";
navigate_and_wait_for_models(session);
auto input = session.css("#input-text");
input.clear();
// Clear previous output
session.execute_script(
"document.querySelector('#output-text').textContent = ''");
input.send_keys(
"I had a great time visiting Tokyo last spring. "
"The cherry blossoms were beautiful and the food was amazing.");
// Check busy state (debounce 300ms then translation starts)
bool went_busy = session.wait_until(
"return document.body.classList.contains('busy')", 5000);
ASSERT_TRUE(went_busy, "Body gets 'busy' class during translation");
// Wait for completion
session.wait_until("return !document.body.classList.contains('busy')",
120000);
PASS("Busy class removed after completion");
}
void test_empty_input(webdriver::Session &session) {
std::cout << "=== TC5: Empty input does nothing\n";
navigate_and_wait_for_models(session);
// Clear input and output
auto input = session.css("#input-text");
input.clear();
session.execute_script(
"document.querySelector('#output-text').textContent = ''");
// Trigger input event on empty textarea
session.execute_script("document.querySelector('#input-text').dispatchEvent("
" new Event('input'));");
// Wait longer than debounce (300ms) — nothing should happen
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
auto output_text = session.execute_script(
"return document.querySelector('#output-text').textContent");
ASSERT_TRUE(output_text.empty() || output_text == "null",
"No output for empty input");
}
void test_target_lang_selector(webdriver::Session &session) {
std::cout << "=== TC6: Target language selector\n";
navigate_and_wait_for_models(session);
// Check available language options (use JS — WebDriver can't find <option>)
auto lang_count = session.execute_script(
"return document.querySelectorAll('#target-lang option').length");
ASSERT_TRUE(lang_count != "0" && lang_count != "null",
"Language selector has multiple options (count=" + lang_count +
")");
// Switch to English and translate
session.execute_script("document.querySelector('#target-lang').value = 'en';"
"document.querySelector('#target-lang').dispatchEvent("
" new Event('change'));");
// Clear output, then type — debounce auto-translate triggers
session.execute_script(
"document.querySelector('#output-text').textContent = ''");
auto input = session.css("#input-text");
input.clear();
input.send_keys("こんにちは");
bool has_output = session.wait_until(
"return document.querySelector('#output-text').textContent.length > 0",
120000);
ASSERT_TRUE(has_output, "Translation with target_lang=en produced output");
}
void test_model_switch(webdriver::Session &session) {
std::cout << "=== TC7: Model switching\n";
navigate_and_wait_for_models(session);
auto options = session.css_all("#model-select option");
if (options.size() < 2) {
PASS("Model switch skipped (only 1 model available)");
return;
}
// Get current model
auto current = session.execute_script(
"return document.querySelector('#model-select').value");
// Switch to a different model (pick the second option's value)
auto other_value = options[1].attribute("value");
if (other_value == current && options.size() > 2) {
other_value = options[2].attribute("value");
}
session.execute_script(
"document.querySelector('#model-select').value = '" + other_value +
"';"
"document.querySelector('#model-select').dispatchEvent("
" new Event('change'));");
// Wait for model switch to complete (SSE: downloading → loading → ready)
bool ready = session.wait_until(
"return !document.body.classList.contains('busy')", 120000);
ASSERT_TRUE(ready, "Model switch completed");
auto new_value = session.execute_script(
"return document.querySelector('#model-select').value");
ASSERT_TRUE(new_value == other_value,
"Model changed to '" + other_value + "'");
}
void test_download_dialog_structure(webdriver::Session &session) {
std::cout << "=== TC8: Download dialog DOM structure\n";
session.navigate(base_url);
ASSERT_ELEMENT_EXISTS(session, "#download-dialog");
ASSERT_ELEMENT_EXISTS(session, "#download-progress");
ASSERT_ELEMENT_EXISTS(session, "#download-status");
ASSERT_ELEMENT_EXISTS(session, "#download-cancel");
}
// ─── Main ────────────────────────────────────────────────────
int main(int argc, char *argv[]) {
if (argc < 2) {
std::cerr << "Usage: test_webui <server-port>\n";
return 1;
}
int port = std::atoi(argv[1]);
base_url = "http://127.0.0.1:" + std::to_string(port);
std::cout << "=== Ch5 Web UI Browser Tests\n";
std::cout << "=== Server: " << base_url << "\n\n";
try {
webdriver::Session session;
test_page_loads(session);
test_model_dropdown(session);
test_translation_sse(session);
test_busy_state(session);
test_empty_input(session);
test_target_lang_selector(session);
test_model_switch(session);
test_download_dialog_structure(session);
} catch (const webdriver::Error &e) {
std::cerr << "WebDriver error: " << e.what() << "\n";
++fail_count;
} catch (const std::exception &e) {
std::cerr << "Error: " << e.what() << "\n";
++fail_count;
}
std::cout << "\n=== Results: " << pass_count << " passed, " << fail_count
<< " failed\n";
return fail_count > 0 ? 1 : 0;
}

View File

@@ -0,0 +1,278 @@
// webdriver.h — Thin W3C WebDriver client using cpp-httplib + nlohmann/json.
// SPDX-License-Identifier: MIT
//
// Usage:
// webdriver::Session session; // starts headless Firefox via geckodriver
// session.navigate("http://localhost:8080");
// auto el = session.css("h1");
// assert(el.text() == "Hello!");
// // session destructor closes the browser
#pragma once
#include "httplib.h"
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>
namespace webdriver {
using json = nlohmann::json;
// ─── Errors ──────────────────────────────────────────────────
class Error : public std::runtime_error {
public:
using std::runtime_error::runtime_error;
};
// ─── Forward declarations ────────────────────────────────────
class Session;
// ─── Element ─────────────────────────────────────────────────
class Element {
friend class Session;
httplib::Client *cli_;
std::string session_id_;
std::string element_id_;
public:
Element(httplib::Client *cli, const std::string &session_id,
const std::string &element_id)
: cli_(cli), session_id_(session_id), element_id_(element_id) {}
std::string url(const std::string &suffix = "") const {
return "/session/" + session_id_ + "/element/" + element_id_ + suffix;
}
public:
std::string text() const {
auto res = cli_->Get(url("/text"));
if (!res || res->status != 200) {
throw Error("Failed to get element text");
}
return json::parse(res->body)["value"].get<std::string>();
}
// Returns "" if the request fails or the attribute is absent.
std::string attribute(const std::string &name) const {
auto res = cli_->Get(url("/attribute/" + name));
if (!res || res->status != 200) { return ""; }
auto val = json::parse(res->body)["value"];
return val.is_null() ? "" : val.get<std::string>();
}
// Like attribute(), but reads the live DOM property instead.
std::string property(const std::string &name) const {
auto res = cli_->Get(url("/property/" + name));
if (!res || res->status != 200) { return ""; }
auto val = json::parse(res->body)["value"];
return val.is_null() ? "" : val.get<std::string>();
}
void click() const {
auto res = cli_->Post(url("/click"), "{}", "application/json");
if (!res || res->status != 200) { throw Error("Failed to click element"); }
}
void send_keys(const std::string &keys) const {
json body = {{"text", keys}};
auto res = cli_->Post(url("/value"), body.dump(), "application/json");
if (!res || res->status != 200) {
throw Error("Failed to send keys to element");
}
}
void clear() const {
auto res = cli_->Post(url("/clear"), "{}", "application/json");
if (!res || res->status != 200) { throw Error("Failed to clear element"); }
}
std::string tag_name() const {
auto res = cli_->Get(url("/name"));
if (!res || res->status != 200) { throw Error("Failed to get tag name"); }
return json::parse(res->body)["value"].get<std::string>();
}
bool is_displayed() const {
auto res = cli_->Get(url("/displayed"));
if (!res || res->status != 200) { return false; }
return json::parse(res->body)["value"].get<bool>();
}
};
// ─── Session ─────────────────────────────────────────────────
class Session {
httplib::Client cli_;
std::string session_id_;
// W3C WebDriver uses this key for element references
static constexpr const char *ELEMENT_KEY =
"element-6066-11e4-a52e-4f735466cecf";
std::string extract_element_id(const json &value) const {
if (value.contains(ELEMENT_KEY)) {
return value[ELEMENT_KEY].get<std::string>();
}
// Fallback: try "ELEMENT" (older protocol)
if (value.contains("ELEMENT")) {
return value["ELEMENT"].get<std::string>();
}
throw Error("No element identifier in response: " + value.dump());
}
std::string url(const std::string &suffix) const {
return "/session/" + session_id_ + suffix;
}
public:
explicit Session(const std::string &host = "127.0.0.1", int port = 4444)
: cli_(host, port) {
cli_.set_read_timeout(std::chrono::seconds(30));
cli_.set_connection_timeout(std::chrono::seconds(5));
json caps = {
{"capabilities",
{{"alwaysMatch",
{{"moz:firefoxOptions", {{"args", json::array({"-headless"})}}}}}}}};
auto res = cli_.Post("/session", caps.dump(), "application/json");
if (!res) { throw Error("Cannot connect to geckodriver"); }
if (res->status != 200) {
throw Error("Failed to create session: " + res->body);
}
auto body = json::parse(res->body);
session_id_ = body["value"]["sessionId"].get<std::string>();
}
~Session() {
try {
cli_.Delete(url(""));
} catch (...) {}
}
// Non-copyable, non-movable (owns a session)
Session(const Session &) = delete;
Session &operator=(const Session &) = delete;
// ─── Navigation ──────────────────────────────────────────
void navigate(const std::string &nav_url) {
json body = {{"url", nav_url}};
auto res = cli_.Post(url("/url"), body.dump(), "application/json");
if (!res || res->status != 200) {
throw Error("Failed to navigate to: " + nav_url);
}
}
std::string title() {
auto res = cli_.Get(url("/title"));
if (!res || res->status != 200) { throw Error("Failed to get title"); }
return json::parse(res->body)["value"].get<std::string>();
}
std::string current_url() {
auto res = cli_.Get(url("/url"));
if (!res || res->status != 200) {
throw Error("Failed to get current URL");
}
return json::parse(res->body)["value"].get<std::string>();
}
// ─── Find elements ──────────────────────────────────────
Element find(const std::string &using_, const std::string &value) {
json body = {{"using", using_}, {"value", value}};
auto res = cli_.Post(url("/element"), body.dump(), "application/json");
if (!res || res->status != 200) {
throw Error("Element not found: " + using_ + "=" + value);
}
auto eid = extract_element_id(json::parse(res->body)["value"]);
return Element(&cli_, session_id_, eid);
}
std::vector<Element> find_all(const std::string &using_,
const std::string &value) {
json body = {{"using", using_}, {"value", value}};
auto res = cli_.Post(url("/elements"), body.dump(), "application/json");
if (!res || res->status != 200) { return {}; }
std::vector<Element> elements;
for (auto &v : json::parse(res->body)["value"]) {
elements.emplace_back(&cli_, session_id_, extract_element_id(v));
}
return elements;
}
// Convenience: find by CSS selector
Element css(const std::string &selector) {
return find("css selector", selector);
}
std::vector<Element> css_all(const std::string &selector) {
return find_all("css selector", selector);
}
// ─── Wait ────────────────────────────────────────────────
// Poll for an element until it appears or the timeout expires
Element wait_for(const std::string &selector, int timeout_ms = 5000) {
auto deadline = std::chrono::steady_clock::now() +
std::chrono::milliseconds(timeout_ms);
while (std::chrono::steady_clock::now() < deadline) {
try {
return css(selector);
} catch (...) {
std::this_thread::sleep_for(std::chrono::milliseconds(200));
}
}
throw Error("Timeout waiting for element: " + selector);
}
// Poll until a JS expression evaluates to a truthy value
bool wait_until(const std::string &script, int timeout_ms = 5000) {
auto deadline = std::chrono::steady_clock::now() +
std::chrono::milliseconds(timeout_ms);
while (std::chrono::steady_clock::now() < deadline) {
auto result = execute_script(script);
if (result != "null" && result != "false" && result != "" &&
result != "0" && result != "undefined") {
return true;
}
std::this_thread::sleep_for(std::chrono::milliseconds(200));
}
return false;
}
// ─── Execute script ─────────────────────────────────────
std::string execute_script(const std::string &script,
const json &args = json::array()) {
json body = {{"script", script}, {"args", args}};
auto res = cli_.Post(url("/execute/sync"), body.dump(), "application/json");
if (!res || res->status != 200) {
throw Error("Failed to execute script: " + script);
}
auto val = json::parse(res->body)["value"];
if (val.is_null()) { return "null"; }
if (val.is_string()) { return val.get<std::string>(); }
return val.dump();
}
// ─── Page source ────────────────────────────────────────
std::string page_source() {
auto res = cli_.Get(url("/source"));
if (!res || res->status != 200) {
throw Error("Failed to get page source");
}
return json::parse(res->body)["value"].get<std::string>();
}
};
} // namespace webdriver

@@ -46,11 +46,8 @@ build:
bench:
@(cd benchmark && make bench-all)
docs:
@cargo build --release --manifest-path docs-gen/Cargo.toml
@./docs-gen/target/release/docs-gen build docs-src --out docs
docs-serve:
@cargo build --release --manifest-path docs-gen/Cargo.toml
-@./docs-gen/target/release/docs-gen serve docs-src --open
-@docs-gen serve docs-src --open
docs-check:
@docs-gen check docs-src