---
title: "7. Reading the llama.cpp Server Source Code"
order: 7
---

Over the course of six chapters, we built a translation desktop app from scratch. We have a working product, but it's ultimately a "learning-oriented" implementation. So how does "production-quality" code differ? Let's read the source code of `llama-server`, the official server bundled with llama.cpp, and compare.

`llama-server` is located at `llama.cpp/tools/server/`. It uses the same cpp-httplib, so you can read the code the same way as in the previous chapters.

## 7.1 Source Code Location

```ascii
llama.cpp/tools/server/
├── server.cpp      # Main server implementation
├── httplib.h       # cpp-httplib (bundled version)
└── ...
```

The code is contained in a single `server.cpp`. It runs to several thousand lines, but once you understand the structure, you can narrow down the parts worth reading.

## 7.2 OpenAI-Compatible API

The biggest difference between the server we built and `llama-server` is the API design.

**Our API:**

```text
POST /translate              → {"translation": "..."}
POST /translate/stream       → SSE: data: "token"
```

**llama-server's API:**

```text
POST /v1/chat/completions    → OpenAI-compatible JSON
POST /v1/completions         → OpenAI-compatible JSON
POST /v1/embeddings          → Text embedding vectors
```

`llama-server` implements endpoints compatible with [OpenAI's API specification](https://platform.openai.com/docs/api-reference). This means OpenAI's official client libraries (such as the Python `openai` package) can talk to it simply by pointing their base URL at the local server.

```python
# Example of connecting to llama-server with the OpenAI client
from openai import OpenAI

# api_key is a placeholder; the client library requires some value
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
)

# The generated reply is in the first choice
print(response.choices[0].message.content)
```

Compatibility with existing tools and libraries is a major design decision. We designed a simple translation-specific API, but if you're building a general-purpose server, OpenAI compatibility has become the de facto standard.

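To see how our translation-specific endpoint would map onto the OpenAI-style API, here is a minimal sketch. The helper names `build_chat_request` and `extract_translation` and the prompt wording are assumptions made for illustration. Only the `messages` request field and the `choices[0].message.content` response path come from the OpenAI chat completions format.

```python
import json

def build_chat_request(text, target_lang="ja"):
    """Wrap a translation request in an OpenAI-style chat completions body."""
    return {
        "model": "local-model",  # placeholder name, as in the client example above
        "messages": [
            {"role": "system", "content": f"Translate the following text to {target_lang}."},
            {"role": "user", "content": text},
        ],
    }

def extract_translation(response_body):
    """Pull the generated text out of a chat completions response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]

# A hand-written response in the OpenAI shape, standing in for a real server reply
sample_response = json.dumps(
    {"choices": [{"index": 0, "message": {"role": "assistant", "content": "こんにちは"}}]}
)

print(build_chat_request("Hello")["messages"][1]["content"])  # Hello
print(extract_translation(sample_response))                   # こんにちは
```

The trade-off is visible even in this small sketch: the request and response are more verbose than `{"translation": "..."}`, but any OpenAI-compatible client can produce and consume them.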
## 7.3 Concurrent Request Handling

Our server processes requests one at a time. If another request arrives while a translation is in progress, it waits until the previous inference finishes. This is fine for a desktop app used by one person, but it becomes a problem for a server shared by multiple users.

`llama-server` handles concurrent requests through a mechanism called **slots**.

![llama-server's slot management](../slots.svg#half)

The key point is that the tokens for the slots are not inferred **one by one in sequence**, but **all at once in a single batch**. GPUs excel at parallel processing, so processing two users simultaneously takes almost the same time as processing one. Because requests can join and leave the batch between decoding steps instead of waiting for the whole batch to finish, this is called "continuous batching."

In our server, cpp-httplib's thread pool assigns one thread per request, but the inference itself runs one request at a time inside `llm.chat()`. `llama-server` consolidates this inference step into a shared batch processing loop.

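The scheduling idea can be sketched as a toy simulation. This is not how `llama-server` is implemented. It idealizes a batched decode step as costing the same as a single-request step, ignores prompt processing, and uses made-up names (`run_sequential`, `run_continuous_batching`) and made-up token counts.

```python
from collections import deque

def run_sequential(requests):
    """One request at a time: every generated token costs one decode step."""
    return sum(n_tokens for _, n_tokens in requests)

def run_continuous_batching(requests, n_slots):
    """Toy scheduler: each decode step yields one token for every busy slot.

    A waiting request takes over a slot as soon as it frees up,
    without waiting for the rest of the batch to finish.
    """
    queue = deque(requests)
    slots = {}      # slot index -> [request name, tokens left]
    steps = 0
    finished = []   # (request name, step at which it completed)
    while queue or slots:
        # Fill free slots from the queue before the next decode step
        for i in range(n_slots):
            if i not in slots and queue:
                name, n_tokens = queue.popleft()
                slots[i] = [name, n_tokens]
        steps += 1  # one batched decode step for all busy slots
        for i in list(slots):
            slots[i][1] -= 1
            if slots[i][1] <= 0:
                finished.append((slots[i][0], steps))
                del slots[i]
    return steps, finished

requests = [("A", 5), ("B", 2), ("C", 3)]  # (name, tokens to generate)
print(run_sequential(requests))              # 10
print(run_continuous_batching(requests, 2))  # (5, [('B', 2), ('A', 5), ('C', 5)])
```

Request C starts as soon as B finishes at step 2, while A is still generating. That hand-off between decode steps is what distinguishes continuous batching from waiting for a whole batch to drain.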
## 7.4 Differences in SSE Format

The streaming mechanism itself is the same (`set_chunked_content_provider` + SSE), but the data format differs.

**Our format:**

```text
data: "去年の"
data: "春に"
data: [DONE]
```

**llama-server (OpenAI-compatible):**

```text
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"}}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"}}]}
data: [DONE]
```

Our format simply sends the tokens. Because `llama-server` follows the OpenAI specification, even a single token comes wrapped in JSON. It may look verbose, but it includes useful information for clients, like an `id` to identify the request and a `finish_reason` to indicate why generation stopped (not shown in the abbreviated example above; it stays `null` until the final chunk).

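On the client side, consuming the OpenAI-style stream comes down to parsing each `data:` line as JSON and concatenating the `delta.content` pieces. The following is a minimal sketch that works on a list of already-received lines. The function name `collect_stream` and the sample chunks are illustrative, and a real client would read the lines from the HTTP response instead.

```python
import json

def collect_stream(lines):
    """Join the text deltas of an OpenAI-style SSE stream.

    Returns the full text and the last non-null finish_reason seen.
    """
    parts = []
    finish_reason = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break     # end-of-stream sentinel
        choice = json.loads(payload)["choices"][0]
        parts.append(choice.get("delta", {}).get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason

# Hand-written sample chunks in the OpenAI shape
sample = [
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

print(collect_stream(sample))  # ('去年の春に', 'stop')
```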
## 7.5 KV Cache Reuse

In our server, we process the entire prompt from scratch on every request. Our translation app's prompt is short ("Translate the following text to ja..." + input text), so this isn't a problem.

`llama-server` reuses the KV cache for the prefix portion when a request shares a common prompt prefix with a previous request.

![KV cache reuse](../kv-cache.svg#half)

For chatbots that send a long system prompt and few-shot examples with every request, this alone dramatically reduces response time. The difference is night and day: processing several thousand tokens of system prompt every time versus reading them from cache in an instant.

For our translation app, where the system prompt is just a single sentence, the benefit is limited. However, it's an optimization worth keeping in mind when applying this to your own applications.

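The core of prefix reuse is a comparison between the tokens already in the cache and the tokens of the new prompt. Here is a minimal sketch of that comparison, using made-up token ids and hypothetical helper names (`common_prefix_len`, `tokens_to_process`). It illustrates the idea only and does not reproduce `llama-server`'s cache management.

```python
def common_prefix_len(cached, new):
    """Count how many leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

def tokens_to_process(cached, new):
    """Only the part after the shared prefix has to be evaluated again."""
    return new[common_prefix_len(cached, new):]

system_prompt = [101, 7, 42, 42, 9]     # made-up token ids for a shared system prompt
previous = system_prompt + [55, 56]     # tokens of the previous request
current = system_prompt + [77, 78, 79]  # tokens of the new request

print(common_prefix_len(previous, current))  # 5
print(tokens_to_process(previous, current))  # [77, 78, 79]
```

With a system prompt of several thousand tokens instead of five, the skipped portion dominates the prompt, which is where the savings come from.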
## 7.6 Structured Output

Since our translation API returns plain text, there was no need to constrain the output format. But what if you want the LLM to respond in JSON?

```text
Prompt: Analyze the sentiment of the following text and return it as JSON.
LLM output (expected): {"sentiment": "positive", "score": 0.8}
LLM output (reality): Here are the results of the sentiment analysis. {"sentiment": ...
```

LLMs sometimes ignore instructions and add extraneous text. `llama-server` solves this problem with **grammar constraints**.

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Analyze sentiment..."}],
    "json_schema": {
      "type": "object",
      "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number"}
      },
      "required": ["sentiment", "score"]
    }
  }'
```

When you specify `json_schema`, tokens that don't conform to the grammar are excluded during token generation. As long as generation runs to completion rather than being cut off by a token limit, the output is guaranteed to be valid JSON that matches the schema, so there's no need to worry about `json::parse` failing.

When embedding LLMs into applications, whether you can parse the output every time directly determines how robust the application is. Grammar constraints are unnecessary for free-text output like translation, but they're essential for use cases where you need to return structured data as an API response.

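The principle of excluding non-conforming tokens can be shown with a toy example. This sketch is heavily simplified: instead of a grammar, the constraint is a short whitelist of complete outputs, and the "model" is a hypothetical stand-in (`chatty_model`) that always ranks chatter first. A real grammar describes far more outputs than a whitelist can, but the filtering step has the same shape.

```python
import json

ALLOWED = ['{"sentiment": "positive"}', '{"sentiment": "negative"}']

def chatty_model(text):
    """Hypothetical stand-in for an LLM: candidate tokens, best-ranked first."""
    return ["Here are", '{"sentiment": "', "positive", "negative", '"}', " the results"]

def constrained_pick(ranked_tokens, generated, allowed_outputs):
    """Return the best-ranked token that keeps the text a prefix of an allowed output."""
    for token in ranked_tokens:
        candidate = generated + token
        if token and any(out.startswith(candidate) for out in allowed_outputs):
            return token
    raise ValueError("no candidate token satisfies the constraint")

def generate(rank_fn, allowed_outputs):
    """Generate token by token, discarding tokens that would violate the constraint."""
    text = ""
    while text not in allowed_outputs:
        text += constrained_pick(rank_fn(text), text, allowed_outputs)
    return text

output = generate(chatty_model, ALLOWED)
print(output)              # {"sentiment": "positive"}
print(json.loads(output))  # {'sentiment': 'positive'}
```

Even though the stand-in model prefers to begin with "Here are", that token is rejected at the first step because no allowed output starts with it.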
## 7.7 Summary

Let's organize the differences we've covered.

| Aspect | Our Server | llama-server |
|------|-------------|--------------|
| API design | Translation-specific | OpenAI-compatible |
| Concurrent requests | Sequential processing | Slots + continuous batching |
| SSE format | Tokens only | OpenAI-compatible JSON |
| KV cache | Cleared each time | Prefix reuse |
| Structured output | None | JSON Schema / grammar constraints |
| Code size | ~200 lines | Several thousand lines |

Our code is simple because of the assumption that "one person uses it as a desktop app." If you're building a server for multiple users or one that integrates with the existing ecosystem, `llama-server`'s design serves as a valuable reference.

Conversely, even 200 lines of code is enough to make a fully functional translation app. I hope this code reading exercise has also conveyed the value of "building only what you need."

## Next Chapter

In the next chapter, we'll cover the key points for swapping in your own library and customizing the app to make it truly yours.

**Next:** [Making It Your Own](../ch08-customization)