---
title: "7. Reading the llama.cpp Server Source Code"
order: 7

---

Over the course of six chapters, we built a translation desktop app from scratch. We have a working product, but it's ultimately a "learning-oriented" implementation. So how does "production-quality" code differ? Let's read the source code of `llama-server`, the official server bundled with llama.cpp, and compare.

`llama-server` is located at `llama.cpp/tools/server/`. It uses the same cpp-httplib, so you can read the code the same way as in the previous chapters.

## 7.1 Source Code Location

```ascii
llama.cpp/tools/server/
├── server.cpp           # Main server implementation
├── httplib.h            # cpp-httplib (bundled version)
└── ...
```

The code is contained in a single `server.cpp`. It runs to several thousand lines, but once you understand the structure, you can narrow down the parts worth reading.

## 7.2 OpenAI-Compatible API

The biggest difference between the server we built and `llama-server` is the API design.

**Our API:**

```text
POST /translate          → {"translation": "..."}
POST /translate/stream   → SSE: data: "token"
```

**llama-server's API:**

```text
POST /v1/chat/completions  → OpenAI-compatible JSON
POST /v1/completions       → OpenAI-compatible JSON
POST /v1/embeddings        → Text embedding vectors
```

`llama-server` conforms to [OpenAI's API specification](https://platform.openai.com/docs/api-reference). This means OpenAI's official client libraries (such as the Python `openai` package) work out of the box.

```python
# Example of connecting to llama-server with the OpenAI client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Compatibility with existing tools and libraries is a major design consideration. We designed a simple translation-specific API, but for a general-purpose server, OpenAI compatibility has become the de facto standard.
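To make the wire format concrete, here is a minimal sketch of the same exchange using only the Python standard library. The helper names `build_chat_request` and `extract_reply` are hypothetical. The request fields and the `choices[0].message.content` path follow the OpenAI chat completions format, and the sample response is hand-written for illustration, not captured from a real server.

```python
import json

def build_chat_request(user_text, model="local-model"):
    """Return the JSON body for POST /v1/chat/completions as bytes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return json.dumps(payload).encode("utf-8")

def extract_reply(response_body):
    """Pull the assistant's text out of a non-streaming chat completion."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]

# Hand-written sample response (illustrative only, not real server output).
sample_response = json.dumps({
    "id": "chatcmpl-xxx",
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hi there!"},
        "finish_reason": "stop",
    }],
})

request_body = build_chat_request("Hello!")
reply = extract_reply(sample_response)
print(reply)
```

Because every OpenAI-compatible server agrees on this shape, a client written against one of them needs little more than a different `base_url` to talk to another.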

## 7.3 Concurrent Request Handling

Our server processes requests one at a time. If another request arrives while a translation is in progress, it waits until the previous inference finishes. This is fine for a desktop app used by one person, but it becomes a problem for a server shared by multiple users.

`llama-server` handles concurrent requests through a mechanism called **slots**.

![llama-server's slot management](../slots.svg#half)

The key point is that the tokens for the active slots are not inferred **one slot after another**, but **together in a single batch**. GPUs excel at parallel processing, so processing two users simultaneously takes almost the same time as processing one. Because requests can join and leave the batch between decoding steps as slots free up, this technique is called "continuous batching."

In our server, cpp-httplib's thread pool assigns one thread per request, but the inference itself is serialized inside `llm.chat()`, so only one request is generating at a time. `llama-server` consolidates this inference step into a shared batch processing loop.
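The scheduling idea can be illustrated with a toy simulation. This is a sketch under a strong simplifying assumption, namely that one batched decode step costs the same as a single-request step. It is not how `llama-server` is implemented, and the function names and token counts are made up.

```python
from collections import deque

def run_sequential(requests):
    """One request at a time: each token costs one decode step."""
    return sum(requests)

def run_batched(requests, n_slots):
    """Toy continuous batching: every active slot advances one token per step."""
    pending = deque(requests)   # tokens still to generate, per waiting request
    slots = [0] * n_slots       # remaining tokens per slot; 0 means idle
    steps = 0
    while pending or any(slots):
        # Idle slots pick up waiting requests between decode steps.
        for i in range(n_slots):
            if slots[i] == 0 and pending:
                slots[i] = pending.popleft()
        # One shared decode step produces one token for every active slot.
        slots = [max(0, remaining - 1) for remaining in slots]
        steps += 1
    return steps

requests = [5, 3, 4]  # tokens each request needs (made-up numbers)
sequential_steps = run_sequential(requests)
batched_steps = run_batched(requests, n_slots=2)
print(sequential_steps, batched_steps)  # prints "12 7"
```

Note that the third request starts as soon as the second one finishes, without waiting for the first. That is the "continuous" part: the batch is refilled between steps instead of being drained completely first.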

## 7.4 Differences in SSE Format

The streaming mechanism itself is the same (`set_chunked_content_provider` + SSE), but the data format differs.

**Our format:**

```text
data: "去年の"
data: "春に"
data: [DONE]
```

**llama-server (OpenAI-compatible):**

```text
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"}}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"}}]}
data: [DONE]
```

Our format simply sends the tokens. Because `llama-server` follows the OpenAI specification, even a single token comes wrapped in JSON. It may look verbose, but it carries information that is useful to clients, such as an `id` that identifies the request and a `finish_reason` (omitted from the abbreviated example above) that indicates why generation stopped.
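On the client side, consuming the OpenAI-style stream means parsing each `data:` line as JSON and reading `choices[0].delta.content`. The following is a minimal sketch. The function name `collect_stream_text` is hypothetical, and the sample lines are hand-written, including a final chunk with an empty `delta` and `"finish_reason":"stop"`, which is how the OpenAI format signals a normal end of generation.

```python
import json

def collect_stream_text(lines):
    """Concatenate delta.content from OpenAI-style SSE lines until [DONE]."""
    parts = []
    finish_reason = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        # delta may omit "content" (for example, in the final chunk).
        parts.append(choice.get("delta", {}).get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason

# Hand-written sample stream (illustrative only).
sample_lines = [
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"去年の"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{"content":"春に"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

text, reason = collect_stream_text(sample_lines)
print(text, reason)
```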

## 7.5 KV Cache Reuse

In our server, we process the entire prompt from scratch on every request. Our translation app's prompt is short ("Translate the following text to ja..." + input text), so this isn't a problem.

`llama-server` reuses the KV cache for the prefix portion when a request shares a common prompt prefix with a previous request.

![KV cache reuse](../kv-cache.svg#half)

For chatbots that send a long system prompt and few-shot examples with every request, this alone dramatically reduces response time. The difference is night and day: processing several thousand tokens of system prompt every time versus reading them from cache in an instant.

For our translation app, where the system prompt is just a single sentence, the benefit is limited. However, it's an optimization worth keeping in mind when applying this to your own applications.
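The core of prefix reuse is a comparison between the tokens already in the cache and the tokens of the new prompt. The sketch below shows only that comparison. The function name `common_prefix_len` and the token IDs are made up, and the actual cache management in `llama-server` involves much more than this.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by the cached and the new prompt."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Made-up token IDs: a shared system prompt followed by different user input.
system_prompt = [101, 7, 42, 42, 9, 300]
previous_request = system_prompt + [11, 12, 13]
current_request = system_prompt + [21, 22]

reused = common_prefix_len(previous_request, current_request)
to_process = current_request[reused:]  # only these tokens need prompt processing
print(reused, to_process)  # prints "6 [21, 22]"
```

With a system prompt of several thousand tokens and a short user message, almost the whole prompt falls into the reused part.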

## 7.6 Structured Output

Since our translation API returns plain text, there was no need to constrain the output format. But what if you want the LLM to respond in JSON?

```text
Prompt: Analyze the sentiment of the following text and return it as JSON.
LLM output (expected): {"sentiment": "positive", "score": 0.8}
LLM output (reality): Here are the results of the sentiment analysis. {"sentiment": ...
```

LLMs sometimes ignore instructions and add extraneous text. `llama-server` solves this problem with **grammar constraints**.

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Analyze sentiment..."}],
    "json_schema": {
      "type": "object",
      "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number"}
      },
      "required": ["sentiment", "score"]
    }
  }'
```

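When you specify `json_schema`, the schema is converted into a grammar, and tokens that would violate that grammar are excluded during sampling. As long as generation runs to completion rather than being cut off by a token limit, the output is valid JSON that matches the schema, so you no longer need to worry about `json::parse` failing on stray text.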
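The filtering idea can be shown with a toy decoder. This is a sketch, not llama.cpp's grammar engine: a fixed list of allowed strings stands in for the grammar, a fixed ranking stands in for the model's token preferences, and all names are hypothetical.

```python
import json

# Stand-in for a grammar: the complete outputs we are willing to accept.
ALLOWED_OUTPUTS = [
    '{"sentiment":"positive"}',
    '{"sentiment":"negative"}',
    '{"sentiment":"neutral"}',
]

def is_allowed(text_so_far, token):
    """A token is valid if the extended text can still grow into an allowed output."""
    candidate = text_so_far + token
    return any(out.startswith(candidate) for out in ALLOWED_OUTPUTS)

def constrained_generate(ranked_tokens):
    """Greedy decoding that skips tokens the toy grammar rejects."""
    text = ""
    while text not in ALLOWED_OUTPUTS:
        valid = [t for t in ranked_tokens if is_allowed(text, t)]
        text += valid[0]  # highest-ranked token that survives the filter
    return text

# Stand-in for model preferences, best first. The "model" wants to open
# with chatty prose, but the filter removes that option at every step.
ranked_tokens = ["Here are the results. ", '{"', "sentiment", '":"',
                 "positive", "negative", "neutral", '"}']

output = constrained_generate(ranked_tokens)
parsed = json.loads(output)
print(output)
```

Even though the chatty token is ranked first, it is never emitted, because no allowed output starts with it. The constraint works by removing options, not by asking the model to behave.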

When embedding LLMs into applications, whether you can parse the output reliably directly affects the robustness of the whole application. Grammar constraints are unnecessary for free-text output like translation, but they're essential for use cases where you need to return structured data as an API response.

## 7.7 Summary

Let's organize the differences we've covered.

| Aspect | Our Server | llama-server |
|------|-------------|--------------|
| API design | Translation-specific | OpenAI-compatible |
| Concurrent requests | Sequential processing | Slots + continuous batching |
| SSE format | Tokens only | OpenAI-compatible JSON |
| KV cache | Cleared each time | Prefix reuse |
| Structured output | None | JSON Schema / grammar constraints |
| Code size | ~200 lines | Several thousand lines |

Our code is simple because of the assumption that "one person uses it as a desktop app." If you're building a server for multiple users or one that integrates with the existing ecosystem, `llama-server`'s design serves as a valuable reference.

Conversely, even 200 lines of code is enough to make a fully functional translation app. I hope this code reading exercise has also conveyed the value of "building only what you need."

## Next Chapter

In the next chapter, we'll cover the key points for swapping in your own library and customizing the app to make it truly yours.

**Next:** [Making It Your Own](../ch08-customization)