"Building a Desktop LLM App with cpp-httplib" (#2403)

This commit is contained in:
yhirose
2026-03-21 23:31:55 -04:00
committed by GitHub
parent c2bdb1c5c1
commit 7178f451a4
35 changed files with 8889 additions and 35 deletions

---
title: "1. Setting Up the Project Environment"
order: 1
---
Let's incrementally build a text translation REST API server using llama.cpp as the inference engine. By the end, a request like this will return a translation result.
```bash
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The weather is nice today. Shall we go for a walk?", "target_lang": "ja"}'
```
```json
{
"translation": "今日はいい天気ですね。散歩に行きましょうか?"
}
```
The "Translation API" is just one example. By swapping out the prompt, you can adapt this to any LLM application you like, such as summarization, code generation, or a chatbot.
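For instance, a summarization variant would only need a different prompt builder; everything else (JSON handling, inference) stays the same. A hypothetical sketch (the `build_summary_prompt` name is ours, not part of the book's code):

```cpp
#include <string>

// Hypothetical prompt builder for a summarization endpoint.
// Swap this in where the translation prompt is built; the rest
// of the server code is unchanged.
std::string build_summary_prompt(const std::string &text) {
  return "Summarize the following text in three sentences or fewer. "
         "Output only the summary, nothing else.\n\n" + text;
}
```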
Here's the full list of APIs the server will provide.
| Method | Path | Description | Chapter |
| -------- | ---- | ---- | -- |
| `GET` | `/health` | Returns server status | 1 |
| `POST` | `/translate` | Translates text and returns JSON | 2 |
| `POST` | `/translate/stream` | SSE streaming on a per-token basis | 3 |
| `GET` | `/models` | Model list (available / downloaded / selected) | 4 |
| `POST` | `/models/select` | Select a model (automatically downloads if not yet downloaded) | 4 |
In this chapter, let's set up the project environment. We'll fetch the dependency libraries, create the directory structure, configure the build settings, and grab the model file, so that we're ready to start writing code in the next chapter.
## Prerequisites
- A C++20-compatible compiler (GCC 10+, Clang 10+, MSVC 2019 16.8+)
- CMake 3.20 or later
- OpenSSL (used for the HTTPS client in Chapter 4. macOS: `brew install openssl`, Ubuntu: `sudo apt install libssl-dev`)
- Sufficient disk space (model files can be several GB)
## 1.1 What We Will Use
Here are the libraries we'll use.
| Library | Role |
| ----------- | ------ |
| [cpp-httplib](https://github.com/yhirose/cpp-httplib) | HTTP server/client |
| [nlohmann/json](https://github.com/nlohmann/json) | JSON parser |
| [cpp-llamalib](https://github.com/yhirose/cpp-llamalib) | llama.cpp wrapper |
| [llama.cpp](https://github.com/ggml-org/llama.cpp) | LLM inference engine |
| [webview/webview](https://github.com/webview/webview) | Desktop WebView (used in Chapter 6) |
cpp-httplib, nlohmann/json, and cpp-llamalib are header-only libraries. You could just download a single header file with `curl` and `#include` it, but in this book we use CMake's `FetchContent` to fetch them automatically. Declare them in `CMakeLists.txt`, and `cmake -B build` downloads and builds everything for you. webview is used in Chapter 6, so you don't need to worry about it for now.
## 1.2 Directory Structure
The final structure will look like this.
```ascii
translate-app/
├── CMakeLists.txt
├── models/
│ └── (GGUF files)
└── src/
└── main.cpp
```
We don't include library source code in the project. CMake's `FetchContent` fetches them automatically at build time, so all you need is your own code.
Let's create the project directory and initialize a git repository.
```bash
mkdir translate-app && cd translate-app
mkdir src models
git init
```
## 1.3 Obtaining the GGUF Model File
You need a model file for LLM inference. GGUF is the model format used by llama.cpp, and you can find many models on Hugging Face.
Let's start by trying a small model. The quantized version of Google's Gemma 2 2B (~1.6 GB) is a good starting point. It's lightweight but supports multiple languages and works well for translation tasks.
```bash
curl -L -o models/gemma-2-2b-it-Q4_K_M.gguf \
https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_M.gguf
```
In Chapter 4, we'll add the ability to download models from within the app using cpp-httplib's client functionality.
## 1.4 CMakeLists.txt
Create a `CMakeLists.txt` in the project root. By declaring dependencies with `FetchContent`, CMake will automatically download and build them for you.
<!-- data-file="CMakeLists.txt" -->
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-server CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp (LLM inference engine)
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib (HTTP server/client)
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json (JSON parser)
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib (header-only llama.cpp wrapper)
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
add_executable(translate-server src/main.cpp)
target_link_libraries(translate-server PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
)
```
`FetchContent_Declare` tells CMake where to find each library, and `FetchContent_MakeAvailable` fetches and builds them. The first `cmake -B build` will take some time because it downloads all libraries and builds llama.cpp, but subsequent runs will use the cache.
Just link with `target_link_libraries`, and each library's CMake configuration sets up include paths and build settings for you.
## 1.5 Creating the Skeleton Code
We'll use this skeleton code as a base and add functionality chapter by chapter.
<!-- data-file="main.cpp" -->
```cpp
// src/main.cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
// Health check
svr.Get("/health", [](const auto &, auto &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// Stub implementations for each endpoint (replaced with real ones in later chapters)
svr.Post("/translate",
[](const auto &req, auto &res) {
res.set_content(json{{"translation", "TODO"}}.dump(), "application/json");
});
svr.Post("/translate/stream",
[](const auto &req, auto &res) {
res.set_content("data: \"TODO\"\n\ndata: [DONE]\n\n", "text/event-stream");
});
svr.Get("/models",
[](const auto &req, auto &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const auto &req, auto &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
// Start the server
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
## 1.6 Building and Verifying
Build the project, start the server, and verify that requests work with curl.
```bash
cmake -B build
cmake --build build -j
./build/translate-server
```
From another terminal, try it with curl.
```bash
curl http://localhost:8080/health
# => {"status":"ok"}
```
If you see JSON come back, the setup is complete.
## Next Chapter
Now that the environment is set up, in the next chapter we'll implement the translation REST API on top of this skeleton. We'll run inference with llama.cpp and expose it as an HTTP endpoint with cpp-httplib.
**Next:** [Integrating llama.cpp to Build a REST API](../ch02-rest-api)

---
title: "2. Integrating llama.cpp to Build a REST API"
order: 2
---
In the skeleton from Chapter 1, `/translate` simply returned `"TODO"`. In this chapter we integrate llama.cpp inference and turn it into an API that actually returns translation results.
Calling the llama.cpp API directly makes the code quite long, so we use a thin wrapper library called [cpp-llamalib](https://github.com/yhirose/cpp-llamalib). It lets you load a model and run inference in just a few lines, keeping the focus on cpp-httplib.
## 2.1 Initializing the LLM
Simply pass the path to a model file to `llamalib::Llama`, and model loading, context creation, and sampler configuration are all taken care of. If you downloaded a different model in Chapter 1, adjust the path accordingly.
```cpp
#include <cpp-llamalib.h>
int main() {
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
// LLM inference takes time, so set a longer timeout (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
// ... Build and start the HTTP server ...
}
```
If you want to change the number of GPU layers, context length, or other settings, you can specify them via `llamalib::Options`.
```cpp
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf", {
.n_gpu_layers = 0, // CPU only
.n_ctx = 4096,
}};
```
## 2.2 The `/translate` Handler
We replace the handler that returned dummy JSON in Chapter 1 with actual inference.
```cpp
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// Parse JSON (3rd arg `false`: don't throw on failure, check with `is_discarded()`)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
// Validate required fields
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja"); // Default is Japanese
// Build the prompt and run inference
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
```
`llm.chat()` can throw exceptions during inference (for example, when the context length is exceeded). By catching them with `try/catch` and returning the error as JSON, we prevent the server from crashing.
## 2.3 Complete Code
Here is the finished code with all the changes so far.
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Load the model downloaded in Chapter 1
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
// LLM inference takes time, so set a longer timeout (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
// Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// Parse JSON (3rd arg `false`: don't throw on failure, check with `is_discarded()`)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
// Validate required fields
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja"); // Default is Japanese
// Build the prompt and run inference
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// Dummy implementations to be replaced with real ones in later chapters
svr.Get("/models",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
// Start the server (blocks until `stop()` is called)
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 2.4 Testing It Out
Rebuild and start the server, then verify that it now returns actual translation results.
```bash
cmake --build build -j
./build/translate-server
```
```bash
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
# => {"translation":"去年の春に東京を訪れた。桜が綺麗だった。"}
```
In Chapter 1 the response was `"TODO"`, but now you get an actual translation back.
## Next Chapter
The REST API we built in this chapter waits for the entire translation to complete before sending the response, so for long texts the user has to wait with no indication of progress.
In the next chapter, we use SSE (Server-Sent Events) to stream tokens back in real time as they are generated.
**Next:** [Adding Token Streaming with SSE](../ch03-sse-streaming)

---
title: "3. Adding Token Streaming with SSE"
order: 3
---
The `/translate` endpoint from Chapter 2 returned the entire translation at once after completion. This is fine for short sentences, but for longer text the user has to wait several seconds with nothing displayed.
In this chapter, we add a `/translate/stream` endpoint that uses SSE (Server-Sent Events) to return tokens in real time as they are generated. This is the same approach used by the ChatGPT and Claude APIs.
## 3.1 What is SSE?
SSE is a way to send HTTP responses as a stream. When a client sends a request, the server keeps the connection open and gradually returns events. The format is simple text.
```text
data: "去年の"
data: "春に"
data: "東京を"
data: [DONE]
```
Each line starts with `data:` and events are separated by blank lines. The Content-Type is `text/event-stream`. Tokens are sent as escaped JSON strings, so they appear enclosed in double quotes (we implement this in Section 3.3).
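On the receiving side, parsing this format amounts to splitting the stream on blank lines and stripping the `data: ` prefix. A minimal client-side sketch (ours, not part of the book's code; it handles only the single-`data:`-line events this server emits):

```cpp
#include <string>
#include <vector>

// Split a received SSE buffer into event payloads.
// Each event ends with a blank line; payload lines start with "data: ".
// An incomplete trailing event is left in the buffer for the next read.
std::vector<std::string> parse_sse_events(const std::string &buf) {
  std::vector<std::string> events;
  const std::string prefix = "data: ";
  size_t pos = 0;
  while (pos < buf.size()) {
    auto end = buf.find("\n\n", pos);
    if (end == std::string::npos) break;  // incomplete event; wait for more data
    std::string chunk = buf.substr(pos, end - pos);
    if (chunk.compare(0, prefix.size(), prefix) == 0) {
      events.push_back(chunk.substr(prefix.size()));
    }
    pos = end + 2;
  }
  return events;
}
```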
## 3.2 Streaming with cpp-httplib
In cpp-httplib, you can use `set_chunked_content_provider` to send responses incrementally. Each time you write to `sink.os` inside the callback, data is sent to the client.
```cpp
res.set_chunked_content_provider(
"text/event-stream",
[](size_t offset, httplib::DataSink &sink) {
sink.os << "data: hello\n\n";
sink.done();
return true;
});
```
Calling `sink.done()` ends the stream. If the client disconnects mid-stream, writing to `sink.os` will fail and `sink.os.fail()` will return `true`. You can use this to detect disconnection and abort unnecessary inference.
## 3.3 The `/translate/stream` Handler
JSON parsing and validation are the same as the `/translate` endpoint from Chapter 2. The only difference is how the response is returned. We combine the streaming callback of `llm.chat()` with `set_chunked_content_provider`.
```cpp
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
// ... JSON parsing and validation same as /translate ...
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
```
A few key points:
- When you pass a callback to `llm.chat()`, it is called each time a token is generated. If the callback returns `false`, generation is aborted
- After writing to `sink.os`, you can check whether the client is still connected with `sink.os.good()`. If the client has disconnected, it returns `false` to stop inference
- Each token is escaped as a JSON string using `json(token).dump()` before sending. This is safe even for tokens containing newlines or quotes
- The first three arguments of `dump(-1, ' ', false, ...)` are the defaults. What matters is the fourth argument, `json::error_handler_t::replace`. Since the LLM returns tokens at the subword level, multi-byte characters (such as Japanese) can be split mid-character across tokens. Passing an incomplete UTF-8 byte sequence directly to `dump()` would throw an exception, so `replace` safely substitutes them. The browser reassembles the bytes on its end, so everything displays correctly
- The entire lambda is wrapped in `try/catch`. `llm.chat()` can throw exceptions for reasons such as exceeding the context window. If an exception goes uncaught inside the lambda, the server will crash, so we return the error as an SSE event instead
- `data: [DONE]` follows the OpenAI API convention to signal the end of the stream to the client
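To see why `replace` is needed, consider how a token boundary can fall inside a multi-byte UTF-8 sequence. The sketch below (ours, plain stdlib, independent of nlohmann/json) detects exactly the kind of truncated string that `dump()` would reject without an error handler:

```cpp
#include <string>

// Returns true if `s` ends in the middle of a multi-byte UTF-8 character,
// i.e. the last lead byte announces more continuation bytes than are present.
bool ends_mid_utf8(const std::string &s) {
  // Walk back over trailing continuation bytes (10xxxxxx).
  size_t i = s.size();
  size_t cont = 0;
  while (i > 0 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
    --i;
    ++cont;
  }
  if (i == 0) return cont > 0;  // nothing but continuation bytes
  unsigned char lead = static_cast<unsigned char>(s[i - 1]);
  size_t expected = 0;
  if ((lead & 0x80) == 0x00) expected = 0;       // ASCII
  else if ((lead & 0xE0) == 0xC0) expected = 1;  // 2-byte sequence
  else if ((lead & 0xF0) == 0xE0) expected = 2;  // 3-byte sequence
  else if ((lead & 0xF8) == 0xF0) expected = 3;  // 4-byte sequence
  return cont < expected;
}
```

For example, the character "こ" is the three bytes `E3 81 93`; a token that ends after `E3 81` is valid text in progress but invalid UTF-8 on its own, which is precisely the case `error_handler_t::replace` papers over.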
## 3.4 Complete Code
Here is the complete code with the `/translate/stream` endpoint added to the code from Chapter 2.
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <csignal>
#include <iostream>
using json = nlohmann::json;
httplib::Server svr;
// Graceful shutdown on `Ctrl+C`
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Load the GGUF model
auto llm = llamalib::Llama{"models/gemma-2-2b-it-Q4_K_M.gguf"};
// LLM inference takes time, so set a longer timeout (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
// Log requests and responses
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// Standard translation endpoint from Chapter 2
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// JSON parsing and validation (see Chapter 2 for details)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// SSE streaming translation endpoint
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
// JSON parsing and validation (same as /translate)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
// Dummy implementations to be replaced in later chapters
svr.Get("/models",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"models", json::array()}}.dump(), "application/json");
});
svr.Post("/models/select",
[](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "TODO"}}.dump(), "application/json");
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
// Start the server (blocks until `stop()` is called)
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 3.5 Testing It Out
Build and start the server.
```bash
cmake --build build -j
./build/translate-server
```
Using curl's `-N` option to disable buffering, you can see tokens displayed in real time as they arrive.
```bash
curl -N -X POST http://localhost:8080/translate/stream \
-H "Content-Type: application/json" \
-d '{"text": "I had a great time visiting Tokyo last spring. The cherry blossoms were beautiful.", "target_lang": "ja"}'
```
```text
data: "去年の"
data: "春に"
data: "東京を"
data: "訪れた"
data: "。"
data: "桜が"
data: "綺麗だった"
data: "。"
data: [DONE]
```
You should see tokens streaming in one by one. The `/translate` endpoint from Chapter 2 continues to work as well.
## Next Chapter
The server's translation functionality is now complete. In the next chapter, we use cpp-httplib's client functionality to add the ability to fetch and manage models from Hugging Face.
**Next:** [Adding Model Download and Management](../ch04-model-management)

---
title: "4. Adding Model Download and Management"
order: 4
---
By the end of Chapter 3, the server's translation functionality was fully in place. However, the only model file available is the one we manually downloaded in Chapter 1. In this chapter, we'll use cpp-httplib's **client functionality** to enable downloading and switching Hugging Face models from within the app.
Once complete, you'll be able to manage models with requests like these:
```bash
# Get the list of available models
curl http://localhost:8080/models
```
```json
{
"models": [
{"name": "gemma-2-2b-it", "params": "2B", "size": "1.6 GB", "downloaded": true, "selected": true},
{"name": "gemma-2-9b-it", "params": "9B", "size": "5.8 GB", "downloaded": false, "selected": false},
{"name": "Llama-3.1-8B-Instruct", "params": "8B", "size": "4.9 GB", "downloaded": false, "selected": false}
]
}
```
```bash
# Select a different model (automatically downloads if not yet available)
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-9b-it"}'
```
```text
data: {"status":"downloading","progress":0}
data: {"status":"downloading","progress":12}
...
data: {"status":"downloading","progress":100}
data: {"status":"loading"}
data: {"status":"ready"}
```
## 4.1 httplib::Client Basics
So far we've only used `httplib::Server`, but cpp-httplib also provides client functionality. Since Hugging Face uses HTTPS, we need a TLS-capable client.
```cpp
#include <httplib.h>
// Including the URL scheme automatically uses SSLClient
httplib::Client cli("https://huggingface.co");
// Automatically follow redirects (Hugging Face redirects to a CDN)
cli.set_follow_location(true);
auto res = cli.Get("/api/models");
if (res && res->status == 200) {
std::cout << res->body << std::endl;
}
```
To use HTTPS, you need to enable OpenSSL at build time. Add the following to your `CMakeLists.txt`:
```cmake
find_package(OpenSSL REQUIRED)
target_link_libraries(translate-server PRIVATE OpenSSL::SSL OpenSSL::Crypto)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)
# macOS: required for loading system certificates
if(APPLE)
target_link_libraries(translate-server PRIVATE "-framework CoreFoundation" "-framework Security")
endif()
```
Defining `CPPHTTPLIB_OPENSSL_SUPPORT` enables `httplib::Client("https://...")` to make TLS connections. On macOS, you also need to link the CoreFoundation and Security frameworks to access the system certificate store. See Section 4.8 for the complete `CMakeLists.txt`.
## 4.2 Defining the Model List
Let's define the list of models that the app can handle. Here are three models we've verified for translation tasks.
```cpp
struct ModelInfo {
std::string name; // Display name
std::string params; // Parameter count
std::string size; // GGUF Q4 size
std::string repo; // Hugging Face repository
std::string filename; // GGUF filename
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
```
## 4.3 Model Storage Location
Up through Chapter 3, we stored models in the `models/` directory within the project. However, when managing multiple models, a dedicated app directory makes more sense. On macOS/Linux we use `~/.translate-app/models/`, and on Windows we use `%APPDATA%\translate-app\models\`.
```cpp
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
```
If the environment variable isn't set, it falls back to the current directory. The app creates this directory at startup (`create_directories` won't error even if it already exists).
## 4.4 Rewriting Model Initialization
We rewrite the model initialization at the beginning of `main()`. In Chapter 1 we hardcoded the path, but from here on we support model switching. We track the currently loaded filename in `selected_model` and load the first entry in `MODELS` at startup. The `GET /models` and `POST /models/select` handlers reference and update this variable.
Since cpp-httplib runs handlers concurrently on a thread pool, reassigning `llm` while another thread is calling `llm.chat()` would crash. We add a `std::mutex` to protect against this.
```cpp
int main() {
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
// Automatically download the default model if not yet present
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
std::mutex llm_mutex; // Protect access during model switching
// ...
}
```
This ensures that users don't need to manually download models with curl on first launch. It uses the `download_model` function from Section 4.6 and displays progress on the console.
## 4.5 The `GET /models` Handler
This returns the model list with information about whether each model has been downloaded and whether it's currently selected.
```cpp
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = get_models_dir() / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
```
## 4.6 Downloading Large Files
GGUF models are several gigabytes, so we can't load the entire file into memory. By passing callbacks to `httplib::Client::Get`, we can receive data chunk by chunk.
```cpp
// content_receiver: callback that receives data chunks
// progress: download progress callback
cli.Get(url,
[&](const char *data, size_t len) { // content_receiver
ofs.write(data, len);
return true; // returning false aborts the download
},
[&](size_t current, size_t total) { // progress
int pct = total ? (int)(current * 100 / total) : 0;
std::cout << pct << "%" << std::endl;
return true; // returning false aborts the download
});
```
Let's use this to create a function that downloads models from Hugging Face.
```cpp
#include <filesystem>
#include <fstream>
// Download a model and report progress via progress_cb.
// If progress_cb returns false, the download is aborted.
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true);
cli.set_read_timeout(std::chrono::hours(1));
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
[&](size_t current, size_t total) {
return progress_cb(total ? (int)(current * 100 / total) : 0);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
// Write to .tmp first, then rename, so that an incomplete file
// is never mistaken for a usable model if the download is interrupted
std::filesystem::rename(tmp_path, path);
return true;
}
```
## 4.7 The `/models/select` Handler
This handles model selection requests. We always respond with SSE, reporting status in sequence: download progress, loading, and ready.
```cpp
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
// Find the model in the list
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
// Always respond with SSE (same format whether already downloaded or not)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
// SSE event sending helper
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
// Download if not yet present (report progress via SSE)
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
return sink.os.good(); // Abort download on client disconnect
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
// Load and switch to the model
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
```
A few notes:
- We send SSE events directly from the `download_model` progress callback. This is an application of `set_chunked_content_provider` + `sink.os` from Chapter 3
- Since the callback returns `sink.os.good()`, the download stops if the client disconnects. The cancel button we add in Chapter 5 uses this
- When we update `selected_model`, it's reflected in the `selected` flag of `GET /models`
- The `llm` reassignment is protected by `llm_mutex`. The `/translate` and `/translate/stream` handlers also lock the same mutex, so inference can't run during a model switch (see the complete code)
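On the client side, each event from this handler arrives as a single `data: {...}` line followed by a blank line. A rough sketch of what a consumer has to do to split the stream back into events (the `parse_sse_data` helper is hypothetical; it ignores other SSE field names like `event:` and `id:`, since this server only emits single-line `data:` fields):

```cpp
#include <string>
#include <string_view>
#include <vector>

// Collect the payload of every "data: ..." line in an SSE stream.
// Multi-line data fields are not handled, because the server above
// emits exactly one line per event.
std::vector<std::string> parse_sse_data(std::string_view stream) {
  std::vector<std::string> events;
  size_t pos = 0;
  while (pos < stream.size()) {
    size_t eol = stream.find('\n', pos);
    if (eol == std::string_view::npos) eol = stream.size();
    std::string_view line = stream.substr(pos, eol - pos);
    if (line.substr(0, 6) == "data: ") {
      events.emplace_back(line.substr(6)); // keep the JSON payload only
    }
    pos = eol + 1;
  }
  return events;
}
```

The browser's `EventSource` API does this parsing for you, but `EventSource` only supports `GET`, which is why the Web UI in Chapter 5 reads the `POST` response body manually.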
## 4.8 Complete Code
Here is the complete code with model management added to the Chapter 3 code.
<details>
<summary data-file="CMakeLists.txt">Complete code (CMakeLists.txt)</summary>
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-server CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
find_package(OpenSSL REQUIRED)
add_executable(translate-server src/main.cpp)
target_link_libraries(translate-server PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
OpenSSL::SSL OpenSSL::Crypto
)
target_compile_definitions(translate-server PRIVATE CPPHTTPLIB_OPENSSL_SUPPORT)
if(APPLE)
target_link_libraries(translate-server PRIVATE
"-framework CoreFoundation"
"-framework Security"
)
endif()
```
</details>
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <algorithm>
#include <csignal>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mutex>
using json = nlohmann::json;
// -------------------------------------------------------------------------
// Model definitions
// -------------------------------------------------------------------------
struct ModelInfo {
std::string name;
std::string params;
std::string size;
std::string repo;
std::string filename;
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
// -------------------------------------------------------------------------
// Model storage directory
// -------------------------------------------------------------------------
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
// -------------------------------------------------------------------------
// Model download
// -------------------------------------------------------------------------
// If progress_cb returns false, the download is aborted
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true); // Hugging Face redirects to a CDN
cli.set_read_timeout(std::chrono::hours(1)); // Set a long timeout for large models
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
// content_receiver: receive data chunk by chunk and write to file
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
// progress: report download progress (returning false aborts)
[&, last_pct = -1](size_t current, size_t total) mutable {
int pct = total ? (int)(current * 100 / total) : 0;
if (pct == last_pct) return true; // Skip if same value
last_pct = pct;
return progress_cb(pct);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
// Rename after download completes
std::filesystem::rename(tmp_path, path);
return true;
}
// -------------------------------------------------------------------------
// Server
// -------------------------------------------------------------------------
httplib::Server svr;
void signal_handler(int sig) {
if (sig == SIGINT || sig == SIGTERM) {
std::cout << "\nReceived signal, shutting down gracefully...\n";
svr.stop();
}
}
int main() {
// Create the model storage directory
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
// Automatically download the default model if not yet present
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
std::mutex llm_mutex; // Protect access during model switching
// Set a long timeout since LLM inference takes time (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// --- Translation endpoint (Chapter 2) ------------------------------------
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
// JSON parsing and validation (see Chapter 2 for details)
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
std::lock_guard<std::mutex> lock(llm_mutex);
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// --- SSE streaming translation (Chapter 3) -------------------------------
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
std::lock_guard<std::mutex> lock(llm_mutex);
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
// --- Model list (Chapter 4) ----------------------------------------------
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto models_dir = get_models_dir();
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = models_dir / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
// --- Model selection (Chapter 4) -----------------------------------------
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
// Always respond with SSE (same format whether already downloaded or not)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
// SSE event sending helper
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
// Download if not yet present (report progress via SSE)
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
return sink.os.good(); // Abort download on client disconnect
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
// Load and switch to the model
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
// Allow the server to be stopped with `Ctrl+C` (`SIGINT`) or `kill` (`SIGTERM`)
signal(SIGINT, signal_handler);
signal(SIGTERM, signal_handler);
std::cout << "Listening on http://127.0.0.1:8080" << std::endl;
svr.listen("127.0.0.1", 8080);
}
```
</details>
## 4.9 Testing
Since we added OpenSSL configuration to CMakeLists.txt, we need to re-run CMake before building.
```bash
cmake -B build
cmake --build build -j
./build/translate-server
```
### Checking the Model List
```bash
curl http://localhost:8080/models
```
The gemma-2-2b-it model downloaded in Chapter 1 should show `downloaded: true` and `selected: true`.
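With the defaults from this chapter, the response looks roughly like this (whitespace added for readability; `downloaded` flags depend on what's already on disk, and nlohmann/json emits object keys in sorted order, so the actual key order differs):

```json
{
  "models": [
    {"name": "gemma-2-2b-it", "params": "2B", "size": "1.6 GB",
     "downloaded": true, "selected": true},
    {"name": "gemma-2-9b-it", "params": "9B", "size": "5.8 GB",
     "downloaded": false, "selected": false},
    {"name": "Llama-3.1-8B-Instruct", "params": "8B", "size": "4.9 GB",
     "downloaded": false, "selected": false}
  ]
}
```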
### Switching to a Different Model
```bash
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-9b-it"}'
```
Download progress streams via SSE, and `"ready"` appears when it's done.
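For a model that hasn't been downloaded yet, the stream looks something like this (progress values will vary; note that nlohmann/json sorts object keys, so `progress` comes before `status`):

```
data: {"progress":1,"status":"downloading"}

data: {"progress":2,"status":"downloading"}

...

data: {"progress":100,"status":"downloading"}

data: {"status":"loading"}

data: {"status":"ready"}
```

If the model is already on disk, only the `loading` and `ready` events appear, which is why the client can treat both cases uniformly.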
### Comparing Translations Across Models
Let's translate the same sentence with different models.
```bash
# Translate with gemma-2-9b-it (the model we just switched to)
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
# Switch back to gemma-2-2b-it
curl -N -X POST http://localhost:8080/models/select \
-H "Content-Type: application/json" \
-d '{"model": "gemma-2-2b-it"}'
# Translate the same sentence
curl -X POST http://localhost:8080/translate \
-H "Content-Type: application/json" \
-d '{"text": "The quick brown fox jumps over the lazy dog.", "target_lang": "ja"}'
```
Translation results vary depending on the model, even with the same code and the same prompt. Since cpp-llamalib automatically applies the appropriate chat template for each model, no code changes are needed.
## Next Chapter
The server's main features are now complete: REST API, SSE streaming, and model download and switching. In the next chapter, we'll add static file serving and build a Web UI you can use from a browser.
**Next:** [Adding a Web UI](../ch05-web-ui)
---
title: "6. Turning It into a Desktop App with WebView"
order: 6
---
In Chapter 5, we completed a translation app you can use from a browser. But every time you want to use it, you have to start the server and open the URL in a browser. Wouldn't it be nice to just double-click and start using it, like a normal app?
In this chapter, we'll do two things:
1. **WebView integration** — Use [webview/webview](https://github.com/webview/webview) to turn it into a desktop app that runs without a browser
2. **Single binary packaging** — Use [cpp-embedlib](https://github.com/yhirose/cpp-embedlib) to embed HTML/CSS/JS into the binary, making the distributable a single file
When finished, you'll be able to just run `./translate-app` to open a window and start translating.
![Desktop App](../app.png#large-center)
The model downloads automatically on first launch, so the only thing you need to give users is the single binary.
## 6.1 Introducing webview/webview
[webview/webview](https://github.com/webview/webview) is a library that lets you use the OS's native WebView component (WKWebView on macOS, WebKitGTK on Linux, WebView2 on Windows) from C/C++. Unlike Electron, it doesn't bundle its own browser, so the impact on binary size is negligible.
We'll fetch it with CMake. Add the following to your `CMakeLists.txt`:
```cmake
# webview/webview
FetchContent_Declare(webview
GIT_REPOSITORY https://github.com/webview/webview
GIT_TAG master
)
FetchContent_MakeAvailable(webview)
```
This makes the `webview::core` CMake target available. When you link it with `target_link_libraries`, it automatically sets up include paths and platform-specific frameworks.
> **macOS**: No additional dependencies are needed. WKWebView is built into the system.
>
> **Linux**: WebKitGTK is required. Install it with `sudo apt install libwebkit2gtk-4.1-dev`.
>
> **Windows**: The WebView2 runtime is required. It comes pre-installed on Windows 11. For Windows 10, download it from the [official Microsoft website](https://developer.microsoft.com/en-us/microsoft-edge/webview2/).
## 6.2 Running the Server on a Background Thread
Up through Chapter 5, the server's `listen()` was blocking the main thread. To use WebView, we need to run the server on a separate thread and run the WebView event loop on the main thread.
```cpp
#include "webview/webview.h"
#include <thread>
int main() {
// ... (server setup is the same as Chapter 5) ...
// Start the server on a background thread
auto port = svr.bind_to_any_port("127.0.0.1");
std::thread server_thread([&]() { svr.listen_after_bind(); });
std::cout << "Listening on http://127.0.0.1:" << port << std::endl;
// Display the UI with WebView
webview::webview w(false, nullptr);
w.set_title("Translate App");
w.set_size(1024, 768, WEBVIEW_HINT_NONE);
w.navigate("http://127.0.0.1:" + std::to_string(port));
w.run(); // Block until the window is closed
// Stop the server when the window is closed
svr.stop();
server_thread.join();
}
```
Let's look at the key points:
- **`bind_to_any_port`** — Instead of `listen("127.0.0.1", 8080)`, we let the OS choose an available port. Since desktop apps can be launched multiple times, using a fixed port would cause conflicts
- **`listen_after_bind`** — Starts accepting requests on the port reserved by `bind_to_any_port`. While `listen()` does bind and listen in one call, we need to know the port number first, so we split the operations
- **Shutdown order** — After `w.run()` returns (i.e., the window was closed), we stop the server with `svr.stop()` and then wait for the thread with `server_thread.join()`. The order matters: `listen_after_bind()` doesn't return until `stop()` is called, so calling `join()` first would block forever
The `signal_handler` from Chapter 5 is no longer needed. In a desktop app, closing the window means terminating the application.
## 6.3 Embedding Static Files with cpp-embedlib
In Chapter 5, we served files from the `public/` directory, so you'd need to distribute `public/` alongside the binary. With [cpp-embedlib](https://github.com/yhirose/cpp-embedlib), you can embed HTML, CSS, and JavaScript into the binary, packaging the distributable into a single file.
### CMakeLists.txt
Fetch cpp-embedlib and embed `public/`:
```cmake
# cpp-embedlib
FetchContent_Declare(cpp-embedlib
GIT_REPOSITORY https://github.com/yhirose/cpp-embedlib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp-embedlib)
# Embed the public/ directory into the binary
cpp_embedlib_add(WebAssets
FOLDER ${CMAKE_CURRENT_SOURCE_DIR}/public
NAMESPACE Web
)
target_link_libraries(translate-app PRIVATE
WebAssets # Embedded files
cpp-embedlib-httplib # cpp-httplib integration
)
```
`cpp_embedlib_add` converts the files under `public/` into binary data at compile time and creates a static library called `WebAssets`. When linked, you can access the embedded files through a `Web::FS` object. `cpp-embedlib-httplib` is a helper library that provides the `httplib::mount()` function.
### Replacing set_mount_point with httplib::mount
Simply replace Chapter 5's `set_mount_point` with cpp-embedlib's `httplib::mount`:
```cpp
#include <cpp-embedlib-httplib.h>
#include "WebAssets.h"
// Chapter 5:
// svr.set_mount_point("/", "./public");
// Chapter 6:
httplib::mount(svr, Web::FS);
```
`httplib::mount` registers handlers that serve the files embedded in `Web::FS` over HTTP. MIME types are automatically determined from file extensions, so there's no need to manually set `Content-Type`.
The file contents are directly mapped to the binary's data segment, so no memory copies or heap allocations occur.
## 6.4 macOS: Adding the Edit Menu
If you try to paste text into the input field with `Cmd+V`, you'll find it doesn't work. On macOS, keyboard shortcuts like `Cmd+V` (paste) and `Cmd+C` (copy) are routed through the application's menu bar. Since webview/webview doesn't create one, these shortcuts never reach the WebView. We need to add a macOS Edit menu using the Objective-C runtime:
```cpp
#ifdef __APPLE__
#include <objc/objc-runtime.h>
void setup_macos_edit_menu() {
auto cls = [](const char *n) { return (id)objc_getClass(n); };
auto sel = sel_registerName;
auto msg = reinterpret_cast<id (*)(id, SEL)>(objc_msgSend);
auto msg_s = reinterpret_cast<id (*)(id, SEL, const char *)>(objc_msgSend);
auto msg_id = reinterpret_cast<id (*)(id, SEL, id)>(objc_msgSend);
auto msg_v = reinterpret_cast<void (*)(id, SEL, id)>(objc_msgSend);
auto msg_mi = reinterpret_cast<id (*)(id, SEL, id, SEL, id)>(objc_msgSend);
auto str = [&](const char *s) {
return msg_s(cls("NSString"), sel("stringWithUTF8String:"), s);
};
id app = msg(cls("NSApplication"), sel("sharedApplication"));
id mainMenu = msg(msg(cls("NSMenu"), sel("alloc")), sel("init"));
id editItem = msg(msg(cls("NSMenuItem"), sel("alloc")), sel("init"));
id editMenu = msg_id(msg(cls("NSMenu"), sel("alloc")),
sel("initWithTitle:"), str("Edit"));
struct { const char *title; const char *action; const char *key; } items[] = {
{"Undo", "undo:", "z"},
{"Redo", "redo:", "Z"},
{"Cut", "cut:", "x"},
{"Copy", "copy:", "c"},
{"Paste", "paste:", "v"},
{"Select All", "selectAll:", "a"},
};
for (auto &[title, action, key] : items) {
id mi = msg_mi(msg(cls("NSMenuItem"), sel("alloc")),
sel("initWithTitle:action:keyEquivalent:"),
str(title), sel(action), str(key));
msg_v(editMenu, sel("addItem:"), mi);
}
msg_v(editItem, sel("setSubmenu:"), editMenu);
msg_v(mainMenu, sel("addItem:"), editItem);
msg_v(app, sel("setMainMenu:"), mainMenu);
}
#endif
```
Call this before `w.run()`:
```cpp
#ifdef __APPLE__
setup_macos_edit_menu();
#endif
w.run();
```
On Windows and Linux, keyboard shortcuts are delivered directly to the focused control without going through the menu bar, so this workaround is macOS-specific.
## 6.5 Complete Code
<details>
<summary data-file="CMakeLists.txt">Complete code (CMakeLists.txt)</summary>
```cmake
cmake_minimum_required(VERSION 3.20)
project(translate-app CXX)
set(CMAKE_CXX_STANDARD 20)
include(FetchContent)
# llama.cpp
FetchContent_Declare(llama
GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
GIT_TAG master
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(llama)
# cpp-httplib
FetchContent_Declare(httplib
GIT_REPOSITORY https://github.com/yhirose/cpp-httplib
GIT_TAG master
)
FetchContent_MakeAvailable(httplib)
# nlohmann/json
FetchContent_Declare(json
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
)
FetchContent_MakeAvailable(json)
# cpp-llamalib
FetchContent_Declare(cpp_llamalib
GIT_REPOSITORY https://github.com/yhirose/cpp-llamalib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp_llamalib)
# webview/webview
FetchContent_Declare(webview
GIT_REPOSITORY https://github.com/webview/webview
GIT_TAG master
)
FetchContent_MakeAvailable(webview)
# cpp-embedlib
FetchContent_Declare(cpp-embedlib
GIT_REPOSITORY https://github.com/yhirose/cpp-embedlib
GIT_TAG main
)
FetchContent_MakeAvailable(cpp-embedlib)
# Embed the public/ directory into the binary
cpp_embedlib_add(WebAssets
FOLDER ${CMAKE_CURRENT_SOURCE_DIR}/public
NAMESPACE Web
)
find_package(OpenSSL REQUIRED)
add_executable(translate-app src/main.cpp)
target_link_libraries(translate-app PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
cpp-llamalib
OpenSSL::SSL OpenSSL::Crypto
WebAssets
cpp-embedlib-httplib
webview::core
)
if(APPLE)
target_link_libraries(translate-app PRIVATE
"-framework CoreFoundation"
"-framework Security"
)
endif()
target_compile_definitions(translate-app PRIVATE
CPPHTTPLIB_OPENSSL_SUPPORT
)
```
</details>
<details>
<summary data-file="main.cpp">Complete code (main.cpp)</summary>
```cpp
#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cpp-llamalib.h>
#include <cpp-embedlib-httplib.h>
#include "WebAssets.h"
#include "webview/webview.h"
#ifdef __APPLE__
#include <objc/objc-runtime.h>
#endif
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mutex>
#include <thread>
using json = nlohmann::json;
// -------------------------------------------------------------------------
// macOS Edit menu (Cmd+C/V/X/A require an Edit menu on macOS)
// -------------------------------------------------------------------------
#ifdef __APPLE__
void setup_macos_edit_menu() {
auto cls = [](const char *n) { return (id)objc_getClass(n); };
auto sel = sel_registerName;
auto msg = reinterpret_cast<id (*)(id, SEL)>(objc_msgSend);
auto msg_s = reinterpret_cast<id (*)(id, SEL, const char *)>(objc_msgSend);
auto msg_id = reinterpret_cast<id (*)(id, SEL, id)>(objc_msgSend);
auto msg_v = reinterpret_cast<void (*)(id, SEL, id)>(objc_msgSend);
auto msg_mi = reinterpret_cast<id (*)(id, SEL, id, SEL, id)>(objc_msgSend);
auto str = [&](const char *s) {
return msg_s(cls("NSString"), sel("stringWithUTF8String:"), s);
};
id app = msg(cls("NSApplication"), sel("sharedApplication"));
id mainMenu = msg(msg(cls("NSMenu"), sel("alloc")), sel("init"));
id editItem = msg(msg(cls("NSMenuItem"), sel("alloc")), sel("init"));
id editMenu = msg_id(msg(cls("NSMenu"), sel("alloc")),
sel("initWithTitle:"), str("Edit"));
struct { const char *title; const char *action; const char *key; } items[] = {
{"Undo", "undo:", "z"},
{"Redo", "redo:", "Z"},
{"Cut", "cut:", "x"},
{"Copy", "copy:", "c"},
{"Paste", "paste:", "v"},
{"Select All", "selectAll:", "a"},
};
for (auto &[title, action, key] : items) {
id mi = msg_mi(msg(cls("NSMenuItem"), sel("alloc")),
sel("initWithTitle:action:keyEquivalent:"),
str(title), sel(action), str(key));
msg_v(editMenu, sel("addItem:"), mi);
}
msg_v(editItem, sel("setSubmenu:"), editMenu);
msg_v(mainMenu, sel("addItem:"), editItem);
msg_v(app, sel("setMainMenu:"), mainMenu);
}
#endif
// -------------------------------------------------------------------------
// Model definitions
// -------------------------------------------------------------------------
struct ModelInfo {
std::string name;
std::string params;
std::string size;
std::string repo;
std::string filename;
};
const std::vector<ModelInfo> MODELS = {
{
.name = "gemma-2-2b-it",
.params = "2B",
.size = "1.6 GB",
.repo = "bartowski/gemma-2-2b-it-GGUF",
.filename = "gemma-2-2b-it-Q4_K_M.gguf",
},
{
.name = "gemma-2-9b-it",
.params = "9B",
.size = "5.8 GB",
.repo = "bartowski/gemma-2-9b-it-GGUF",
.filename = "gemma-2-9b-it-Q4_K_M.gguf",
},
{
.name = "Llama-3.1-8B-Instruct",
.params = "8B",
.size = "4.9 GB",
.repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
.filename = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
},
};
// -------------------------------------------------------------------------
// Model storage directory
// -------------------------------------------------------------------------
std::filesystem::path get_models_dir() {
#ifdef _WIN32
auto env = std::getenv("APPDATA");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / "translate-app" / "models";
#else
auto env = std::getenv("HOME");
auto base = env ? std::filesystem::path(env) : std::filesystem::path(".");
return base / ".translate-app" / "models";
#endif
}
// -------------------------------------------------------------------------
// Model download
// -------------------------------------------------------------------------
// Abort the download if progress_cb returns false
bool download_model(const ModelInfo &model,
std::function<bool(int)> progress_cb) {
httplib::Client cli("https://huggingface.co");
cli.set_follow_location(true); // Hugging Face redirects to a CDN
cli.set_read_timeout(std::chrono::hours(1)); // Long timeout for large models
auto url = "/" + model.repo + "/resolve/main/" + model.filename;
auto path = get_models_dir() / model.filename;
auto tmp_path = std::filesystem::path(path).concat(".tmp");
std::ofstream ofs(tmp_path, std::ios::binary);
if (!ofs) { return false; }
auto res = cli.Get(url,
// content_receiver: Receive data chunk by chunk and write to file
[&](const char *data, size_t len) {
ofs.write(data, len);
return ofs.good();
},
// progress: Report download progress (return false to abort)
[&, last_pct = -1](size_t current, size_t total) mutable {
int pct = total ? (int)(current * 100 / total) : 0;
if (pct == last_pct) return true; // Skip if the value hasn't changed
last_pct = pct;
return progress_cb(pct);
});
ofs.close();
if (!res || res->status != 200) {
std::filesystem::remove(tmp_path);
return false;
}
// Rename after download completes
std::filesystem::rename(tmp_path, path);
return true;
}
// -------------------------------------------------------------------------
// Server
// -------------------------------------------------------------------------
int main() {
httplib::Server svr;
// Create the model storage directory
auto models_dir = get_models_dir();
std::filesystem::create_directories(models_dir);
// Auto-download the default model if not already present
std::string selected_model = MODELS[0].filename;
auto path = models_dir / selected_model;
if (!std::filesystem::exists(path)) {
std::cout << "Downloading " << selected_model << "..." << std::endl;
if (!download_model(MODELS[0], [](int pct) {
std::cout << "\r" << pct << "%" << std::flush;
return true;
})) {
std::cerr << "\nFailed to download model." << std::endl;
return 1;
}
std::cout << std::endl;
}
auto llm = llamalib::Llama{path};
std::mutex llm_mutex; // Protect access during model switching
// Set a long timeout since LLM inference takes time (default is 5 seconds)
svr.set_read_timeout(300);
svr.set_write_timeout(300);
svr.set_logger([](const auto &req, const auto &res) {
std::cout << req.method << " " << req.path << " -> " << res.status
<< std::endl;
});
svr.Get("/health", [](const httplib::Request &, httplib::Response &res) {
res.set_content(json{{"status", "ok"}}.dump(), "application/json");
});
// --- Translation endpoint (Chapter 2) ------------------------------------
svr.Post("/translate",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
try {
std::lock_guard<std::mutex> lock(llm_mutex);
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(),
"application/json");
} catch (const std::exception &e) {
res.status = 500;
res.set_content(json{{"error", e.what()}}.dump(), "application/json");
}
});
// --- SSE streaming translation (Chapter 3) -------------------------------
svr.Post("/translate/stream",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded()) {
res.status = 400;
res.set_content(json{{"error", "Invalid JSON"}}.dump(),
"application/json");
return;
}
if (!input.contains("text") || !input["text"].is_string() ||
input["text"].get<std::string>().empty()) {
res.status = 400;
res.set_content(json{{"error", "'text' is required"}}.dump(),
"application/json");
return;
}
auto text = input["text"].get<std::string>();
auto target_lang = input.value("target_lang", "ja");
auto prompt = "Translate the following text to " + target_lang +
". Output only the translation, nothing else.\n\n" + text;
res.set_chunked_content_provider(
"text/event-stream",
[&, prompt](size_t, httplib::DataSink &sink) {
std::lock_guard<std::mutex> lock(llm_mutex);
try {
llm.chat(prompt, [&](std::string_view token) {
sink.os << "data: "
<< json(std::string(token)).dump(
-1, ' ', false, json::error_handler_t::replace)
<< "\n\n";
return sink.os.good(); // Abort inference on disconnect
});
sink.os << "data: [DONE]\n\n";
} catch (const std::exception &e) {
sink.os << "data: " << json({{"error", e.what()}}).dump() << "\n\n";
}
sink.done();
return true;
});
});
// --- Model list (Chapter 4) ----------------------------------------------
svr.Get("/models",
[&](const httplib::Request &, httplib::Response &res) {
auto models_dir = get_models_dir();
auto arr = json::array();
for (const auto &m : MODELS) {
auto path = models_dir / m.filename;
arr.push_back({
{"name", m.name},
{"params", m.params},
{"size", m.size},
{"downloaded", std::filesystem::exists(path)},
{"selected", m.filename == selected_model},
});
}
res.set_content(json{{"models", arr}}.dump(), "application/json");
});
// --- Model selection (Chapter 4) -----------------------------------------
svr.Post("/models/select",
[&](const httplib::Request &req, httplib::Response &res) {
auto input = json::parse(req.body, nullptr, false);
if (input.is_discarded() || !input.contains("model")) {
res.status = 400;
res.set_content(json{{"error", "'model' is required"}}.dump(),
"application/json");
return;
}
auto name = input["model"].get<std::string>();
auto it = std::find_if(MODELS.begin(), MODELS.end(),
[&](const ModelInfo &m) { return m.name == name; });
if (it == MODELS.end()) {
res.status = 404;
res.set_content(json{{"error", "Unknown model"}}.dump(),
"application/json");
return;
}
const auto &model = *it;
// Always respond with SSE (same format whether downloaded or not)
res.set_chunked_content_provider(
"text/event-stream",
[&, model](size_t, httplib::DataSink &sink) {
// SSE event sending helper
auto send = [&](const json &event) {
sink.os << "data: " << event.dump() << "\n\n";
};
// Download if not yet downloaded (report progress via SSE)
auto path = get_models_dir() / model.filename;
if (!std::filesystem::exists(path)) {
bool ok = download_model(model, [&](int pct) {
send({{"status", "downloading"}, {"progress", pct}});
return sink.os.good(); // Abort download on client disconnect
});
if (!ok) {
send({{"status", "error"}, {"message", "Download failed"}});
sink.done();
return true;
}
}
// Load and switch to the model
send({{"status", "loading"}});
{
std::lock_guard<std::mutex> lock(llm_mutex);
llm = llamalib::Llama{path};
selected_model = model.filename;
}
send({{"status", "ready"}});
sink.done();
return true;
});
});
// --- Embedded file serving (Chapter 6) ------------------------------------
// Chapter 5: svr.set_mount_point("/", "./public");
httplib::mount(svr, Web::FS);
// Start the server on a background thread
auto port = svr.bind_to_any_port("127.0.0.1");
std::thread server_thread([&]() { svr.listen_after_bind(); });
std::cout << "Listening on http://127.0.0.1:" << port << std::endl;
// Display the UI with WebView
webview::webview w(false, nullptr);
w.set_title("Translate App");
w.set_size(1024, 768, WEBVIEW_HINT_NONE);
w.navigate("http://127.0.0.1:" + std::to_string(port));
#ifdef __APPLE__
setup_macos_edit_menu();
#endif
w.run(); // Block until the window is closed
// Stop the server when the window is closed
svr.stop();
server_thread.join();
}
```
</details>
To summarize the changes from Chapter 5:
- Replaced `#include <csignal>` with `<thread>`, and added `<cpp-embedlib-httplib.h>`, `"WebAssets.h"`, and `"webview/webview.h"`
- Removed the `signal_handler` function
- `svr.set_mount_point("/", "./public")` replaced with `httplib::mount(svr, Web::FS)`
- `svr.listen("127.0.0.1", 8080)` replaced with `bind_to_any_port` + `listen_after_bind` + WebView event loop
Not a single line of handler code has changed. The REST API, SSE streaming, and model management built through Chapter 5 all work as-is.
## 6.6 Building and Testing
```bash
cmake -B build
cmake --build build -j
```
Launch the app:
```bash
./build/translate-app
```
No browser is needed. A window opens automatically. The same UI from Chapter 5 appears as-is, and translation and model switching all work just the same.
When you close the window, the server shuts down automatically. There's no need for `Ctrl+C`.
### What Needs to Be Distributed
You only need to distribute:
- The single `translate-app` binary
That's it. You don't need the `public/` directory. HTML, CSS, and JavaScript are embedded in the binary. Model files download automatically on first launch, so there's no need to ask users to prepare anything in advance.
## Next Chapter
Congratulations! 🎉
In Chapter 1, `/health` just returned `{"status":"ok"}`. Now we have a desktop app where you type text and translations stream in real time, pick a different model from a dropdown and it downloads automatically, and closing the window cleanly shuts everything down — all in a single distributable binary.
What we changed in this chapter was just the static file serving and the server startup; the handlers were untouched, so everything built through Chapter 5 now runs as a desktop app.
In the next chapter, we'll shift perspective and read through the code of llama.cpp's own `llama-server`. Let's compare our simple server with a production-quality one and see what design decisions differ and why.
**Next:** [Reading the llama.cpp Server Source Code](../ch07-code-reading)

@@ -0,0 +1,154 @@
---
title: "7. Reading the llama.cpp Server Source Code"
order: 7
---
Over the course of six chapters, we built a translation desktop app from scratch. We have a working product, but it's ultimately a "learning-oriented" implementation. So how does "production-quality" code differ? Let's read the source code of `llama-server`, the official server bundled with llama.cpp, and compare.
`llama-server` is located at `llama.cpp/tools/server/`. It uses the same cpp-httplib, so you can read the code the same way as in the previous chapters.
## 7.1 Source Code Location
```ascii
llama.cpp/tools/server/
├── server.cpp # Main server implementation
├── httplib.h # cpp-httplib (bundled version)
└── ...
```
The code is contained in a single `server.cpp`. It runs to several thousand lines, but once you understand the structure, you can narrow down the parts worth reading.
## 7.2 OpenAI-Compatible API
The biggest difference between the server we built and `llama-server` is the API design.
**Our API:**
```text
POST /translate → {"translation": "..."}
POST /translate/stream → SSE: data: "token"
```
**llama-server's API:**
```text
POST /v1/chat/completions → OpenAI-compatible JSON
POST /v1/completions → OpenAI-compatible JSON
POST /v1/embeddings → Text embedding vectors
```
`llama-server` conforms to [OpenAI's API specification](https://platform.openai.com/docs/api-reference). This means OpenAI's official client libraries (such as the Python `openai` package) work out of the box.
```python
# Example of connecting to llama-server with the OpenAI client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}]
)
```
Compatibility with existing tools and libraries is a big design decision. We designed a simple translation-specific API, but if you're building a general-purpose server, OpenAI compatibility has become the de facto standard.
## 7.3 Concurrent Request Handling
Our server processes requests one at a time. If another request arrives while a translation is in progress, it waits until the previous inference finishes. This is fine for a desktop app used by one person, but it becomes a problem for a server shared by multiple users.
`llama-server` handles concurrent requests through a mechanism called **slots**.
![llama-server's slot management](../slots.svg#half)
The key point is that tokens from each slot are not inferred **one by one in sequence**, but rather **all at once in a single batch**. GPUs excel at parallel processing, so processing two users simultaneously takes almost the same time as processing one. This is called "continuous batching."
In our server, cpp-httplib's thread pool assigns one thread per request, but `llm_mutex` serializes inference, so only one `llm.chat()` call runs at a time. `llama-server` instead consolidates inference into a shared batch-processing loop.
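The idea can be sketched in a few lines of toy code. This is an illustration only, not llama-server's actual implementation: `Slot` and `step` are invented names, and a single character stands in for a token.

```cpp
#include <string>
#include <vector>

// Toy model of slots + continuous batching. Each slot holds one in-flight
// request; each call to step() takes one pending token from every active
// slot and "decodes" them together, the way llama-server batches slots.
struct Slot {
    bool active = false;
    std::string pending;  // tokens still to generate (stand-in for inference)
    std::string output;   // tokens produced so far
};

void step(std::vector<Slot> &slots) {
    for (auto &s : slots) {
        if (s.active && !s.pending.empty()) {
            s.output += s.pending.front();  // one token per slot, one shared batch
            s.pending.erase(0, 1);
            if (s.pending.empty()) s.active = false;  // request done; slot idles
        }
    }
}
```

The payoff: two active slots finish in max(len_A, len_B) iterations rather than len_A + len_B, because each iteration advances every request at once.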
## 7.4 Differences in SSE Format
The streaming mechanism itself is the same (`set_chunked_content_provider` + SSE), but the data format differs.
**Our format:**
```text
data: "去年の"
data: "春に"
data: [DONE]
```
**llama-server (OpenAI-compatible):**
```text
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"}}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"}}]}
data: [DONE]
```
Our format simply sends the tokens. Because `llama-server` follows the OpenAI specification, even a single token comes wrapped in JSON. It may look verbose, but it includes useful information for clients, like an `id` to identify the request and a `finish_reason` to indicate why generation stopped.
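As a rough sketch of what producing one such chunk looks like, here's a string-concatenation-only helper. A real server would build this with a JSON library and escape `content` properly, and ids in the `chatcmpl-xxx` style are generated per request; both are assumptions here.

```cpp
#include <string>

// Format a single OpenAI-style SSE chunk line. Sketch only: assumes the
// content contains no characters that need JSON escaping.
std::string openai_chunk(const std::string &id, const std::string &content) {
    return "data: {\"id\":\"" + id +
           "\",\"object\":\"chat.completion.chunk\","
           "\"choices\":[{\"delta\":{\"content\":\"" + content + "\"}}]}\n\n";
}
```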
## 7.5 KV Cache Reuse
In our server, we process the entire prompt from scratch on every request. Our translation app's prompt is short ("Translate the following text to ja..." + input text), so this isn't a problem.
`llama-server` reuses the KV cache for the prefix portion when a request shares a common prompt prefix with a previous request.
![KV cache reuse](../kv-cache.svg#half)
For chatbots that send a long system prompt and few-shot examples with every request, this alone dramatically reduces response time. The difference is night and day: processing several thousand tokens of system prompt every time versus reading them from cache in an instant.
For our translation app, where the system prompt is just a single sentence, the benefit is limited. However, it's an optimization worth keeping in mind when applying this to your own applications.
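The mechanics behind prefix reuse reduce to a comparison like this hypothetical helper: measure how far the new prompt's token ids match the cached ones, and only evaluate the remainder.

```cpp
#include <cstddef>
#include <vector>

// How many leading tokens of the new prompt are already covered by the
// cache? Everything before this index can be served from the KV cache;
// only the tokens from this index onward need to be evaluated.
size_t common_prefix(const std::vector<int> &cached,
                     const std::vector<int> &prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}
```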
## 7.6 Structured Output
Since our translation API returns plain text, there was no need to constrain the output format. But what if you want the LLM to respond in JSON?
```text
Prompt: Analyze the sentiment of the following text and return it as JSON.
LLM output (expected): {"sentiment": "positive", "score": 0.8}
LLM output (reality): Here are the results of the sentiment analysis. {"sentiment": ...
```
LLMs sometimes ignore instructions and add extraneous text. `llama-server` solves this problem with **grammar constraints**.
```bash
curl http://localhost:8080/v1/chat/completions \
-d '{
"messages": [{"role": "user", "content": "Analyze sentiment..."}],
"json_schema": {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"score": {"type": "number"}
},
"required": ["sentiment", "score"]
}
}'
```
When you specify `json_schema`, tokens that don't conform to the grammar are excluded during token generation. This guarantees that the output is always valid JSON, so there's no need to worry about `json::parse` failing.
When embedding LLMs into applications, whether you can reliably parse the output directly impacts reliability. Grammar constraints are unnecessary for free-text output like translation, but they're essential for use cases where you need to return structured data as an API response.
## 7.7 Summary
Let's organize the differences we've covered.
| Aspect | Our Server | llama-server |
|------|-------------|--------------|
| API design | Translation-specific | OpenAI-compatible |
| Concurrent requests | Sequential processing | Slots + continuous batching |
| SSE format | Tokens only | OpenAI-compatible JSON |
| KV cache | Cleared each time | Prefix reuse |
| Structured output | None | JSON Schema / grammar constraints |
| Code size | ~200 lines | Several thousand lines |
Our code is simple because of the assumption that "one person uses it as a desktop app." If you're building a server for multiple users or one that integrates with the existing ecosystem, `llama-server`'s design serves as a valuable reference.
Conversely, even 200 lines of code is enough to make a fully functional translation app. I hope this code reading exercise has also conveyed the value of "building only what you need."
## Next Chapter
In the next chapter, we'll cover the key points for swapping in your own library and customizing the app to make it truly yours.
**Next:** [Making It Your Own](../ch08-customization)

@@ -0,0 +1,120 @@
---
title: "8. Making It Your Own"
order: 8
---
Through Chapter 7, we've built a translation desktop app and studied how production-quality code differs. In this chapter, let's go over the key points for **turning this app into something entirely your own**.
The translation app was just a vehicle. Replace llama.cpp with your own library, and the same architecture works for any application.
## 8.1 Swapping Out the Build Configuration
First, replace the llama.cpp-related `FetchContent` entries in `CMakeLists.txt` with your own library.
```cmake
# Remove: llama.cpp and cpp-llamalib FetchContent
# Add: your own library
FetchContent_Declare(my_lib
GIT_REPOSITORY https://github.com/yourname/my-lib
GIT_TAG main
)
FetchContent_MakeAvailable(my_lib)
target_link_libraries(my-app PRIVATE
httplib::httplib
nlohmann_json::nlohmann_json
my_lib # Your library instead of cpp-llamalib
# ...
)
```
If your library doesn't support CMake, you can place the header and source files directly in `src/` and add them to `add_executable`. Keep cpp-httplib, nlohmann/json, and webview as they are.
## 8.2 Adapting the API to Your Task
Change the translation API's endpoints and parameters to match your task.
| Translation app | Your app (e.g., image processing) |
|---|---|
| `POST /translate` | `POST /process` |
| `{"text": "...", "target_lang": "ja"}` | `{"image": "base64...", "filter": "blur"}` |
| `POST /translate/stream` | `POST /process/stream` |
| `GET /models` | `GET /filters` or `GET /presets` |
Then update each handler's implementation. For example, just replace the `llm.chat()` calls with your own library's API.
```cpp
// Before: LLM translation
auto translation = llm.chat(prompt);
res.set_content(json{{"translation", translation}}.dump(), "application/json");
// After: e.g., an image processing library
auto result = my_lib::process(input_image, options);
res.set_content(json{{"result", result}}.dump(), "application/json");
```
The same goes for SSE streaming. If your library has a function that reports progress via a callback, you can use the exact same pattern from Chapter 3 to send incremental responses. SSE isn't limited to LLMs — it's useful for any time-consuming task: image processing progress, data conversion steps, long-running computations.
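As a sketch of that mapping, suppose your library exposes a progress callback. `long_task` below is an invented stand-in for such an API, and `std::ostringstream` plays the role of `sink.os` so the sketch stays self-contained.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Invented stand-in for a library call that reports percent progress.
// Returning false from the callback cancels the work, mirroring the
// "abort on disconnect" pattern from Chapter 3.
void long_task(const std::function<bool(int)> &on_progress) {
    for (int pct = 0; pct <= 100; pct += 25) {
        if (!on_progress(pct)) return;  // stop early if the client went away
    }
}

// Adapt the callback to SSE "data:" lines, the same shape as Chapter 3.
std::string run_as_sse() {
    std::ostringstream os;  // in the real handler this would be sink.os
    long_task([&](int pct) {
        os << "data: {\"progress\": " << pct << "}\n\n";
        return true;
    });
    os << "data: [DONE]\n\n";
    return os.str();
}
```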
## 8.3 Design Considerations
### Libraries with Expensive Initialization
In this book, we load the LLM model at the top of `main()` and keep it in a variable. This is intentional. Loading the model on every request would take several seconds, so we load it once at startup and reuse it. If your library has expensive initialization (loading large data files, acquiring GPU resources, etc.), the same approach works well.
### Thread Safety
cpp-httplib processes requests concurrently using a thread pool. In Chapter 4 we protected the `llm` object with a `std::mutex` to prevent crashes during model switching. The same pattern applies when integrating your own library. If your library isn't thread-safe or you need to swap objects at runtime, protect access with a `std::mutex`.
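A minimal sketch of that pattern, with `MyEngine` standing in for whatever non-thread-safe, swappable object your library provides:

```cpp
#include <mutex>
#include <string>
#include <utility>

// Placeholder for your library's object (not thread-safe, swappable at runtime).
struct MyEngine {
    std::string name;
    std::string run(const std::string &in) { return name + ":" + in; }
};

std::mutex engine_mutex;
MyEngine engine{"v1"};

// Every handler takes the lock before touching the engine...
std::string handle_request(const std::string &input) {
    std::lock_guard<std::mutex> lock(engine_mutex);
    return engine.run(input);
}

// ...and so does the code that replaces it, so a request in flight can
// never race with a runtime swap (the same role llm_mutex played in Chapter 4).
void swap_engine(MyEngine next) {
    std::lock_guard<std::mutex> lock(engine_mutex);
    engine = std::move(next);
}
```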
## 8.4 Customizing the UI
Edit the three files in `public/`.
- **`index.html`** — Change the input form layout. Swap `<textarea>` for `<input type="file">`, add parameter fields, etc.
- **`style.css`** — Adjust the layout and colors. Keep the two-column design or switch to a single column
- **`script.js`** — Update the `fetch()` target URLs, request bodies, and how responses are displayed
Even without changing any server code, just swapping the HTML makes the app look completely different. Since these are static files, you can iterate quickly — just reload the browser without restarting the server.
This book used plain HTML, CSS, and JavaScript, but combining them with a frontend framework like Vue or React, or a CSS framework, would let you build an even more polished app.
## 8.5 Distribution Considerations
### Licenses
Check the licenses of the libraries you're using. cpp-httplib (MIT), nlohmann/json (MIT), and webview (MIT) all allow commercial use. Don't forget to check the license of your own library and its dependencies too.
### Models and Data Files
The download mechanism we built in Chapter 4 isn't limited to LLM models. If your app needs large data files, the same pattern lets you auto-download them on first launch, keeping the binary small while sparing users the manual setup.
If the data is small, you can embed it directly into the binary with cpp-embedlib.
### Cross-Platform Builds
webview supports macOS, Linux, and Windows. When building for each platform:
- **macOS** — No additional dependencies
- **Linux** — Requires `libwebkit2gtk-4.1-dev`
- **Windows** — Requires the WebView2 runtime (pre-installed on Windows 11)
Consider setting up cross-platform builds in CI (e.g., GitHub Actions) too.
## Closing
Thank you so much for reading to the end. 🙏
This book started with `/health` returning `{"status":"ok"}` in Chapter 1. From there we built a REST API, added SSE streaming, downloaded models from Hugging Face, created a browser-based Web UI, and packaged it all into a single-binary desktop app. In Chapter 7 we read through `llama-server`'s code and learned how production-quality servers differ in their design. It's been quite a journey, and I'm truly grateful you stuck with it all the way through.
Looking back, we used several key cpp-httplib features hands-on:
- **Server**: routing, JSON responses, SSE streaming with `set_chunked_content_provider`, static file serving with `set_mount_point`
- **Client**: HTTPS connections, redirect following, large downloads with content receivers, progress callbacks
- **WebView integration**: `bind_to_any_port` + `listen_after_bind` for background threading
cpp-httplib offers many more features beyond what we covered here, including multipart file uploads, authentication, timeout control, compression, and range requests. See [A Tour of cpp-httplib](../../tour/) for details.
These patterns aren't limited to a translation app. If you want to add a web API to your C++ library, give it a browser UI, or ship it as an easy-to-distribute desktop app — I hope this book serves as a useful reference.
Take your own library, build your own app, and have fun with it. Happy hacking! 🚀

@@ -1,23 +1,26 @@
---
title: "Building a Desktop LLM App with cpp-httplib"
order: 0
status: "draft"
---
Have you ever wanted to add a web API to your own C++ library, or quickly build an Electron-like desktop app? In Rust you might reach for "Tauri + axum," but in C++ it always seemed out of reach.
With [cpp-httplib](https://github.com/yhirose/cpp-httplib), [webview/webview](https://github.com/webview/webview), and [cpp-embedlib](https://github.com/yhirose/cpp-embedlib), you can take the same approach in pure C++ — and produce a small, easy-to-distribute single binary.
In this tutorial we build an LLM-powered translation app using [llama.cpp](https://github.com/ggml-org/llama.cpp), progressing step by step from "REST API" to "SSE streaming" to "Web UI" to "desktop app." Translation is just the vehicle — replace llama.cpp with your own library and the same architecture works for any application.
![Desktop App](app.png#large-center)
If you know basic C++17 and understand the basics of HTTP / REST APIs, you're ready to start.
## Dependencies
- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine
- [nlohmann/json](https://github.com/nlohmann/json) — JSON parser (header-only)
- [webview/webview](https://github.com/webview/webview) — WebView wrapper (header-only)
- [cpp-httplib](https://github.com/yhirose/cpp-httplib) — HTTP server/client (header-only)
## Chapters
1. **[Set up the project](ch01-setup)** — Fetch dependencies, configure the build, write scaffold code
2. **[Embed llama.cpp and create a REST API](ch02-rest-api)** — Return translation results as JSON
3. **[Add token streaming with SSE](ch03-sse-streaming)** — Stream responses token by token
4. **[Add model discovery and management](ch04-model-management)** — Download and switch models from Hugging Face
5. **[Add a Web UI](ch05-web-ui)** — A browser-based translation interface
6. **[Turn it into a desktop app with WebView](ch06-desktop-app)** — A single-binary desktop application
7. **[Reading the llama.cpp server source code](ch07-code-reading)** — Compare with production-quality code
8. **[Making it your own](ch08-customization)** — Swap in your own library and customize

@@ -0,0 +1,36 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 426 160" font-family="system-ui, sans-serif" font-size="14">
<rect x="0" y="0" width="426" height="160" rx="8" fill="#f5f3ef"/>
<defs>
<marker id="arrowhead" markerWidth="7" markerHeight="5" refX="7" refY="2.5" orient="auto">
<polygon points="0,0 7,2.5 0,5" fill="#198754"/>
</marker>
</defs>
<!-- request 1 (y=16) -->
<text x="94" y="35" fill="#333" font-weight="bold" text-anchor="end">Request 1:</text>
<rect x="106" y="16" width="138" height="30" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="175" y="36" fill="#333" text-anchor="middle">System prompt</text>
<text x="256" y="36" fill="#333" text-anchor="middle">+</text>
<rect x="270" y="16" width="138" height="30" rx="4" fill="#cfe2ff" stroke="#0d6efd" stroke-width="1"/>
<text x="339" y="36" fill="#333" text-anchor="middle">User question A</text>
<!-- annotation: cache save -->
<text x="175" y="64" fill="#198754" font-size="11" text-anchor="middle">Saved to KV cache</text>
<!-- arrow -->
<line x1="175" y1="70" x2="175" y2="90" stroke="#198754" stroke-width="1.2" marker-end="url(#arrowhead)"/>
<text x="188" y="85" fill="#198754" font-size="11">Reuse</text>
<!-- request 2 (y=96) -->
<text x="94" y="115" fill="#333" font-weight="bold" text-anchor="end">Request 2:</text>
<rect x="106" y="96" width="138" height="30" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1" stroke-dasharray="6,3"/>
<text x="175" y="116" fill="#333" text-anchor="middle">System prompt</text>
<text x="256" y="116" fill="#333" text-anchor="middle">+</text>
<rect x="270" y="96" width="138" height="30" rx="4" fill="#cfe2ff" stroke="#0d6efd" stroke-width="1"/>
<text x="339" y="116" fill="#333" text-anchor="middle">User question B</text>
<!-- bottom labels -->
<text x="175" y="144" fill="#198754" font-size="11" text-anchor="middle">No recomputation</text>
<text x="339" y="144" fill="#0d6efd" font-size="11" text-anchor="middle">Only this is computed</text>
</svg>


@@ -0,0 +1,24 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 440 240" font-family="system-ui, sans-serif" font-size="14">
<!-- outer box -->
<rect x="0" y="0" width="440" height="240" rx="8" fill="#f5f3ef"/>
<text x="20" y="28" font-weight="bold" font-size="16" fill="#333">llama-server</text>
<!-- slot 0 -->
<rect x="20" y="46" width="400" height="32" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="32" y="67" fill="#333">Slot 0: User A's request</text>
<!-- slot 1 -->
<rect x="20" y="86" width="400" height="32" rx="4" fill="#d1e7dd" stroke="#198754" stroke-width="1"/>
<text x="32" y="107" fill="#333">Slot 1: User B's request</text>
<!-- slot 2 -->
<rect x="20" y="126" width="400" height="32" rx="4" fill="#e9ecef" stroke="#adb5bd" stroke-width="1"/>
<text x="32" y="147" fill="#999">Slot 2: (idle)</text>
<!-- slot 3 -->
<rect x="20" y="166" width="400" height="32" rx="4" fill="#e9ecef" stroke="#adb5bd" stroke-width="1"/>
<text x="32" y="187" fill="#999">Slot 3: (idle)</text>
<!-- arrow + label -->
<text x="20" y="224" fill="#333" font-size="13">→ Active slots are inferred together in a single batch</text>
</svg>


Binary file not shown.
