llama.cpp
`server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command R7B & DeepSeek R1) unless `--reasoning-format none`
#11607
Merged


ochafik merged 94 commits into ggml-org:master from ochafik:r1-toolcall
ochafik
ochafik101 days ago (edited 92 days ago)👍 4❤ 2👀 1
  • Fixes tool call support of DeepSeek-R1-Distill-Qwen-7B & 32B (follow up to #9639)
  • Adds a --reasoning-format FORMAT flag. With the default deepseek format, thinking traces are extracted into message.reasoning_content, using the native <think> tags for DeepSeek R1 and <|START_THINKING|>...<|END_THINKING|> for Command R7B; with none, thinking traces are left as-is in message.content
  • Updates the Command R7B parser to handle more cases, and folds the tool_plan field added temporarily in #11585 into the new reasoning_content

(Non-streaming API only for now.)

Usage

  • Get and build this PR's branch
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/r1-toolcall
    cmake -B build -DLLAMA_CURL=1
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run with (add --verbose to inspect prompt / grammars used):

    # For DeepSeek only, optional chat template override (please report results w/ and w/o)
    llama-server --jinja -hf bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q6_K_L \
      --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja
    llama-server --jinja -hf bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q6_K_L \
      --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja
    
    # Command R7B also has native think tokens:
    llama-server --jinja -fa -hf bartowski/c4ai-command-r7b-12-2024-GGUF:Q6_K_L \
      --chat-template-file <( python scripts/get_chat_template.py CohereForAI/c4ai-command-r7b-12-2024 tool_use )
    
    # Try adding '--reasoning-format none' to disable thoughts extraction
  • Call the API and profit

    curl http://localhost:8080/v1/chat/completions -d '{
      "messages": [
        {
          "role": "system",
          "content": "You are a tool calling agent."
        },
        {
          "role": "user",
          "content": "scrape https://ochafik.com and tell me the number of times it says Olivier"
        }
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": [
                "code"
              ]
            }
          }
        }
      ],
      "temperature": 0.0,
      "top_k": 1,
      "top_p": 1.0
    }' | jq
  • Result w/ `DeepSeek-R1-Distill-Qwen-32B-GGUF:Q6_K_L`:
    {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "message": {
            "role": "assistant",
            "reasoning_content": "Okay, so the user wants me to scrape the website https://ochafik.com and find out how many times the name \"Olivier\" appears. Hmm, I need to figure out how to approach this. \n\nFirst, I remember that web scraping can be done using Python libraries like requests and BeautifulSoup. Requests can fetch the webpage content, and BeautifulSoup can parse the HTML to extract text. \n\nI should start by writing a Python script that sends a GET request to the URL. Then, I'll check if the request was successful. If it is, I'll use BeautifulSoup to parse the content. \n\nNext, I'll extract all the text from the parsed HTML. Once I have the text, I can split it into words and count how many times \"Olivier\" appears. But wait, I should make sure the case matches. Maybe the name is in different cases, but the user specifically asked for \"Olivier,\" so I'll keep it case-sensitive.\n\nI also need to handle any potential errors, like if the website doesn't respond or if there's an issue parsing the HTML. But since I'm using a tool that runs the code, I can let it handle those exceptions and just report the result.\n\nPutting it all together, the code will fetch the webpage, parse it, extract the text, split it into words, and count the occurrences of \"Olivier.\" Then, it'll print the result. I think that should do it.",
            "content": null,
            "tool_calls": [
              {
                "type": "function",
                "function": {
                  "name": "python",
                  "arguments": "{\"code\":\"import requests\\nfrom bs4 import BeautifulSoup\\n\\nurl = 'https://ochafik.com'\\nresponse = requests.get(url)\\n\\nif response.status_code == 200:\\n    soup = BeautifulSoup(response.text, 'html.parser')\\n    text = soup.get_text()\\n    words = text.split()\\n    count = words.count('Olivier')\\n    print(f'The name \\\"Olivier\\\" appears {count} times.')\\nelse:\\n    print('Failed to retrieve the webpage.')\"}"
                },
                "id": ""
              }
            ]
          }
        }
      ]
    }

    Which is this code:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://ochafik.com'
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text()
        words = text.split()
        count = words.count('Olivier')
        print(f'The name "Olivier" appears {count} times.')
    else:
        print('Failed to retrieve the webpage.')

    Not too bad, but it didn't lower-case the text and its word splitting is a bit crude.

    Trying again w/ the following extra args to make the sampling greedy:

    curl ... {
      ...
      "temperature": 0.0,
      "top_k": 1,
      "top_p": 1.0
    }
    

    We have a winner:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://ochafik.com'
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text()
        count = text.lower().count('olivier')
        print(f'The name "Olivier" appears {count} times.')
    else:
        print('Failed to retrieve the webpage.')

    And the thoughts:

    Okay, so the user wants me to scrape the website https://ochafik.com and find out how many times it mentions "Olivier." Hmm, I need to figure out how to do that. Since I can use the Python function tool, I can write a script that does this.

    First, I should think about how to scrape a website. I remember that using libraries like requests and BeautifulSoup can help. Requests can fetch the webpage content, and BeautifulSoup can parse the HTML to extract text.

    So, the steps would be: send a GET request to the URL, check if the request was successful, parse the HTML content, extract all the text, and then count the occurrences of the word "Olivier."

    Wait, but sometimes the word might be part of a larger word, like "Olivier's" or "Oliviering." The user probably wants exact matches, so I should make sure to count only whole words. Maybe using a regular expression with word boundaries would be better.

    Also, I should handle any potential errors, like if the website doesn't respond or if there's an issue with parsing. But since the user didn't specify handling errors, maybe I can just proceed with the basic script.

    Putting it all together, the Python code would import requests and BeautifulSoup, send the request, parse the content, extract the text, and then count the occurrences. I'll write the code accordingly and use the Python tool to execute it.

Implementation notes

  • Had to work around DeepSeek R1's official jinja template:
    • It doesn't describe the available tools, and the backfill done by Minja wasn't phrased well enough (for the 7B model), so I've added autogenerated tool call examples to minja's revamped "polyfill" behaviour (using a delta template eval).
      #11641
    • It ignores message.tool_calls if message.content is not null; updated / tested the server output accordingly (better OAI compliance)
    • After a tool result, it leaves the prompt hanging on a <|tool▁output▁end|> or <|tool▁call▁end|> (the list of outputs / calls needs to be closed with the plural <|tool▁outputs▁end|> / <|tool▁calls▁end|>, respectively, and the end-of-sentence token + optional add_generation_prompt are missing)
      • Hacked a workaround so the default template now works well with this branch
      • Added / documented better template (models/templates/llama-cpp-deepseek-r1.jinja)
  • DeepSeek R1 Distill Qwen 7B & 32B models seem to take liberties with their tool call start tag, so the grammar accepts variations of the syntax (which then triggers the lazy grammar / full compliance). I'd avoid the 7B (or its lower quants) if possible, as it sometimes tries to skip the tool outputs tag.
  • NEW --reasoning-format flag, which controls output of reasoning_content in the API (see test_thoughts)
    • Sourced from native <think>...</think> tags for DeepSeek R1 and <|START_THINKING|>...<|END_THINKING|> for Command R7B.
      • Retired tool_plan field / now flowing into reasoning_content (was added in #11585)
  • Updated tests:
    • Added the Q4_K_M quant to some (but not all) slow server tests (had to tell it not to overthink, bit... ironic).
    • Added slow tool result server test test_calc_result (checking models make some use of tool call results, which some struggle a bit with)
    • More corner cases for parsing, esp. DeepSeek R1 & Command R7B

TODOs:

  • Test that content is null for tool calls
  • Test changes in Minja repo (add backfill goldens, maybe start adding control over backfill behaviour?)
  • Test prompting w/ tool results in server tests
  • send --jinja --chat-template chatml support separately (#11616)
    • send double BOS fix separately
  • send opportunistic fixes for Command R7B separately (#11608)

Possible follow ups

  • Document differences between stream & non-stream modes (thought & tool_plan not sent in stream)

  • look at the Llama distill more closely (see #11591)

  • Reintroduce forced thinking in generic handler under some --reasoning flag (+ explore @ngxson's idea to support a disabled value that biases thinking tags)

    ```typescript
    // ResponseSchema is json_schema if set, otherwise string
    
    type SchemaToolRequired     =  {thoughts: string} & ToolCallSchema
    type Schema                 = ({thoughts: string} & ToolCallSchema) | {thoughts: string, response: ResponseSchema}
    
    type ToolCallSchema         = SingleToolCallSchema | ParallelToolCallSchema
    type SingleToolCallSchema   = {tool_call: ToolCall}
    type ParallelToolCallSchema = {tool_calls: ToolCall[]} // If parallel_tool_calls is true
    
    // Note: id only defined if parallel_tool_calls is true
    type ToolCall =
          {name: 'tool_1', arguments: Parameters1Schema, id?: string} |
          {name: 'tool_2', arguments: Parameters2Schema, id?: string} |
          ...
    ```
    
minja: enhance backfill of templates w/o tools description (use examp…
d3b60b8a
pass vocab to common_chat_params_init
87de852b
DeepSeek R1: parse thoughts / return in separate field in API (non st…
130ca222
Avoid double bos w/ jinja
04d511b5
server/oai: ensure content is null when there are tool calls
28345877
update logs
c80cb309
rename tests
08716281
tool-call: allow `--jinja --chat-template chatml`
73d08d49
tool-call: fix command-r7b parsing when response is multiline
04be723b
tool-calls: add DeepSeek R1 Qwen 7B to server test_hello_world
ae9d5812
tell DS R1 not to overthink (weather test)
19bea4ec
github-actions added the testing, examples, python and server labels
add deepseek models to server tool call section in readme
5e6f2a21
tool-call: allow `--jinja --chat-template chatml`
1e9acd2d
Update test_tool_call.py
77ae97e7
minimize diffs
a76073cf
fix typo
cf83623a
fix double bos issue (drop bos/eos tokens from jinja template)
5d18d76b
fix bad merge
aa98e590
fix build / rm diff
2b3c4829
Merge branch 'jinja-chatml' into r1-toolcall
4cb0e1d8
add missing try catch around jinja parsing to default to chatml
b2dd4909
Merge branch 'jinja-chatml' into r1-toolcall
08271b55
tool-calls: r1: add missing <|tool▁calls▁end|> to grammar!
df3474e2
tweak delta logic
c397bd1f
tool-calls: accommodate variety of wrong tool call opening tags both …
569610ee
Simplify default chatml logic
d73448de
Merge branch 'jinja-chatml' into r1-toolcall
0be7f652
tool-calls: add deepseek r1 template + accommodate broken official te…
7dc271fb
rm unneeded vocab
c6214ee9
simpler hacky fixes for original broken template (+ fix minja example…
1c302e18
sync: minja https://github.com/google/minja/pull/46
108da907
Merge branch 'master' into r1-toolcall
bc6d910f
actually we want eos_token in the template to infer tool call example…
11c1f0c7
update to minja's new api
30ea3591
sync: minja
bbd45bf6
simplify hack to fix original template's backfill from minja
bff549de
tool-call: r1: add one more trigger approx "<|tool calls begin|>"
ce28224d
r1: fix inadvertent newline in grammar before <|tool▁call▁end|>
e84ee88f
tool-call: r1: fix grammar
18a11f43
move trigger_words init inside non-llguidance branch
9a6847c8
fix / test parsing of r1 parser
a682d121
Fix / test models/templates/llama-cpp-deepseek-r1.jinja
f0154a64
update test_calc_result
326e7002
fix test_calc_result
78b47bb0
fix spaces
86994db6
`sync`: minja
09caa634
Update test-chat.cpp
b1527292
fix mistral chat test: need empty tokens
56a14ddc
Update chat.cpp
f12e3507
Merge branch 'sync-minja-4' into r1-toolcall
d43e4f6c
server: check that content is null when we get tool_calls
812544ab
tool-call: ensure we don't return content when there are tool calls /…
d44eb95c
fix mistral expectation
b6e14a41
ensure deepseek r1 thoughts parsed even w/o tool calls
1f5ec598
fix test-chat
438ce0b8
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distill (WIP) to `tool-call`: fix DeepSeek R1 Qwen distill 100 days ago
ochafik marked this pull request as ready for review 100 days ago
ochafik requested a review from ngxson 100 days ago
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distill to `tool-call`: fix DeepSeek R1 Qwen distills 100 days ago
Update chat.cpp
21f20715
Merge branch 'sync-minja-4' into r1-toolcall
b5b117fa
Fix r1 grammar since we made <|tool▁calls▁begin|> optional (triggerin…
0db98812
r1: revert making <|tool▁calls▁begin|> optional as somehow sampling t…
d1b66910
ochafik requested a review from ggerganov 99 days ago
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills to `tool-call`: fix DeepSeek R1 Qwen distills + return thoughts in separate field 99 days ago
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills + return thoughts in separate field to `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return thoughts in separate field 99 days ago
Mushoz
Mushoz99 days ago👍 1❤ 1

Looking at the deepseek documentation: https://api-docs.deepseek.com/guides/reasoning_model

We have the reasoning_content and content as the two output fields. Might be better to follow the same naming to maintain compatibility with tools that parse this output, instead of the current thoughts field you are using.

return thoughts in reasoning_content field
39c1d816
ochafik
ochafik99 days ago👍 1

Looking at the deepseek documentation: https://api-docs.deepseek.com/guides/reasoning_model

We have the reasoning_content and content as the two output fields. Might be better to follow the same naming to maintain compatibility with tools that parse this output, instead of the current thoughts field you are using.

Aaaabsolutely, thanks, done! (somehow failed to find this bit in the doc, and also I don't have an API key yet 🤦‍♂️)

ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return thoughts in separate field to `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return reasoning_content in API 99 days ago
update readme section about common model tool call formats
b2d17287
Merge branch 'master' into r1-toolcall
933f7a18
Update test_tool_call.py
5d60cebb
Mushoz
Mushoz99 days ago

I noticed that these r1 distill models also suck with tool calling in Cline. Will this help with that? Or will that need a cline specific template?

ngxson
ngxson99 days ago (edited 99 days ago)👍 1

@ochafik This is an example response from the official deepseek API (stream + non-stream): https://gist.github.com/ngxson/89a568d22e02b9a93f845abdfd8427a6

ngxson
ngxson99 days ago (edited 99 days ago)👍 2

Btw, I think it could be better to only enable this feature via a flag. For example, only return a separate reasoning_content if the request body contains "reasoning_format": "deepseek":

{
  "messages": [
    {"role": "user", "content": "hello"}
  ],
  "reasoning_format": "deepseek",
  "stream": false,
  "temperature": 0.2
}

This is because this feature is non-OAI-compatible, so doing this just in case OAI implements it differently in the future (which, btw, they definitely will, due to some childish political conflicts)


Edit: and/or, we can also rely on "model": "deepseek-reasoner", which users will definitely set if they're using the deepseek API, so we just provide a drop-in replacement for the deepseek API here.

MoonRide303
MoonRide30399 days ago (edited 99 days ago)

Btw, I think it could be better to only enable this feature via a flag. For example, only return a separate reasoning_content if the request body contains "reasoning_format": "deepseek":

{
  "messages": [
    {"role": "user", "content": "hello"}
  ],
  "reasoning_format": "deepseek",
  "stream": false,
  "temperature": 0.2
}

This is because this feature is non-OAI-compatible, so doing this just in case OAI implements it differently in the future (which, btw, they definitely will, due to some childish political conflicts)

Edit: and/or, we can also rely on "model": "deepseek-reasoner", which users will definitely set if they're using the deepseek API, so we just provide a drop-in replacement for the deepseek API here.

It's also available as "deepseek/deepseek-r1" using the standard openai client on OpenRouter - it just returns the content (no thinking part) without having to specify any extra parameters. Distilled versions like "deepseek/deepseek-r1-distill-llama-70b" work the same way. Local gguf files might be named like "DeepSeek-R1-Distill-Qwen-32B-IQ3_XS.gguf" or "FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview-IQ3_XS.gguf", so the simplest and most universal solution for now (compatible with how people use those models via API) could be simply checking if lowercase([model_name]) contains "deepseek-reasoner", "deepseek-r1" or "deepseekr1", and then activating content splitting.

Being able to enforce this format manually (so it would work with models that behave like R1 but use completely different names) sounds like a good idea, too (and later on, other popular thinking models could be added to the auto-detection list - it might be stuff like Open-R1, etc.).
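As a rough illustration of that name-based heuristic (hypothetical helper, not code from this PR), the check could look like:

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: decide whether to split out reasoning_content based on the
// model name alone (case-insensitive substring match on known R1-style names).
static bool model_name_suggests_r1_reasoning(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(), ::tolower);
    for (const char * needle : { "deepseek-reasoner", "deepseek-r1", "deepseekr1" }) {
        if (name.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```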

ochafik Merge branch 'master' into r1-toolcall
1f1f06aa
--think to force any model to return reasoning_content (or just parse…
9d7c3cc5
Merge branch 'r1-toolcall' of github.com:ochafik/llama.cpp into r1-to…
d20c2ce4
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return reasoning_content in API to `tool-call`: fix DeepSeek R1 Qwen distills tool-call, --think to return reasoning_content w/ any model 98 days ago
fix test_thoughts
f3e9f8b6
ochafik
ochafik98 days ago (edited 98 days ago)👍 2

Btw, I think it could be better to only enable this feature via a flag

So... I've fenced the parsing of DeepSeek R1's <think> tags behind a new experimental --think flag.

Then I thought... what about other models? Don't they deserve to think too? (seems weird that a flag would only affect a single model). So I added forced thoughts output to the generic chat handler when that flag is set (similar to what I was doing in #6389 ).

Marked as experimental / will probably need customisability (w/ templates??), but turns any model into a thinking model.

WDYT?

ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills tool-call, --think to return reasoning_content w/ any model to `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model 98 days ago
fix compiler warning about parens
3841a163
align Command R7B w/ --think / reasoning_content behaviour
e6d9b524
ochafik
ochafik98 days ago (edited 98 days ago)

Also aligned Command R7B with this new thinking behaviour (removed the tool_plan field added temporarily in #11585).

Updated instructions, tests & README; --jinja --think (and possibly the right template override) is all one needs to get universal tool calls and thinking.

Update README.md
39b50c37
fix --think arg env
0917e0a8
disable some failing chatml tests
098629df
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model to `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) 98 days ago
Update README.md
33efcb3c
use existing string_strip
994301da
revert tool example backfill change - command 7rb just needs the righ…
d1a06407
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) to `server`: fix tool-call of DeepSeek R1 Qwen, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) 98 days ago
ngxson
ngxson commented on 2025-02-06
common/arg.cpp
1983 [](common_params & params) {
1984 params.think = true;
1985 }
1986     ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_THINK"));
ngxson97 days ago (edited 97 days ago)👍 1

IMO the --think flag is not very intuitive, it should be something like --reasoning-format, but personally I still prefer to do it on a per-request basis.

Also, to be future-proof, we should make this flag accept a param. For example, --reasoning-format deepseek will return it as reasoning_content. Again, this is because we are pretty sure that openai will break the whole thing in the near future.

HanClinto97 days ago

From a usability standpoint, --think feels a bit more intuitive / memorable to me about what it does.

Other alternatives might be --separate-thinking or --extract-reasoning or --format-reasoning or...?

ngxson97 days ago👍 1

I mean, what's non-intuitive about it is that the model will still think even without --think being added. This flag is supposed to force the model to start the reasoning process, not to enable/disable it completely.

ochafik97 days ago (edited 97 days ago)

Tbh I'm not really a fan of a query parameter yet, mostly because:

  • I see thinking mode as a static property of the model and its orchestration (much like the template)
  • It's a new non-standard API to define (more naming headaches)
  • It makes llama-server less compatible w/ DeepSeek's API, not more.
  • We can always decide to add it later if we start w/ a flag

Re/ flags, I think there may be room for two: one that controls the reasoning behaviour (extract, leave inline, force), and one for the return format. For the first one, how about:

  • --reasoning=extract → Parses DeepSeek R1 & Command R7B thoughts (default)
  • --reasoning=inline → Leaves thoughts inline in the content (format is model-specific)
  • --reasoning=force → Coerces non-thinking models to think (edit: maybe force-experimental for now)

As for the format, there are already two APIs out there:

  • DeepSeek's reasoning_content
  • Cohere's message.tool_plan, alas only as a prelude to tool calls (but it seems easy to trick R7B into thinking more generally, just needs a tweaked template)

I'd favour just sticking to reasoning_content for now until OpenAI announces their own format (and when they do, default to OpenAI's format and offer --reasoning-format=deepseek for backwards compatibility). OR decide to create our own format now / default --reasoning-format=llama.cpp that returns thoughts in the message.thoughts field, for instance.

WDYT?

ngxson96 days ago👍 1

Tbh I'm not really a fan of a query parameter yet

Seems ok to me, but eventually I think someone is gonna add this as a per-request param.

Re. having 2 flags for behavior / format, this seems more reasonable to me. From a functional programming perspective, it can be expressed as response = format(behavior(original_generated_content))

But I think your idea still mixes these 2 layers.

For --reasoning extract|inline|force:

  • I'm not sure what force is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that maintaining prompts can be a burden
  • For inline, does that mean reasoning can appear in the middle of generation? Example: content..think..content..think. Please lmk if I understand correctly.

For --reasoning-format I don't get why we want to invent a new --reasoning-format llama.cpp that puts things inside message.thoughts. IMO we should keep things simple until openai drops their format. Probably we can have --reasoning-format deepseek|none and set deepseek as the default for now, then change the default to oai once we have that openai format.

ngxson96 days ago (edited 96 days ago)👍 1

But I think your idea is still mixed between these 2 layers.

IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar / logits bias). So, it should have 3 values:

  • enabled: the model behaves as usual
  • disabled: we never allow model to use <think> token ==> control via logits bias
  • force: force the model to think, maybe use grammar or prompt engineering?

Then for the --reasoning-format, it should only "transform" the result into the desired format (a.k.a. a pure function); we can have 3 values:

  • deepseek: put content inside reasoning_content
  • none: do not format, simply forward all the generated tokens to user
  • and then oai can be added in the future
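A minimal sketch of that two-layer split (type and function names are illustrative, not the actual llama.cpp API; `json` is nlohmann's json as used elsewhere in the codebase):

```cpp
#include <string>
#include <nlohmann/json.hpp>
using json = nlohmann::ordered_json;

// Illustrative names only. The behavior/parsing step produces this:
struct parsed_output {
    std::string content;            // text outside the <think>...</think> block
    std::string reasoning_content;  // text inside it (empty if the model didn't think)
};

enum reasoning_format { REASONING_FORMAT_NONE, REASONING_FORMAT_DEEPSEEK };

// Pure "format" step: shapes the API message from what was parsed,
// without affecting generation at all.
static json format_message(const parsed_output & out, reasoning_format fmt) {
    json msg = { {"role", "assistant"} };
    if (fmt == REASONING_FORMAT_DEEPSEEK) {
        msg["content"]           = out.content;
        msg["reasoning_content"] = out.reasoning_content;
    } else {
        // none: forward everything as-is inside content, tags included
        msg["content"] = out.reasoning_content.empty()
            ? out.content
            : "<think>" + out.reasoning_content + "</think>" + out.content;
    }
    return msg;
}
```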
ochafik95 days ago

IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar / logits bias). So, it should have 3 values:

  • enabled: the model behaves as usual
  • disabled: we never allow model to use <think> token ==> control via logits bias
  • force: force the model to think, maybe use grammar or prompt engineering?

@ngxson Makes sense, removed the forced thinking from this PR / will explore again as a follow-up (also, I now see more of a case for this option as a query param, while reasoning-format now has stronger flag vibes).

I'm not sure what force is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that maintaining prompts can be a burden

Good point. In earlier experimentations I tried controlling the entire tool call process (even grammar generation) from Jinja templates, might play with this again.

Then for the --reasoning-format, it should only "transform" the result into the desired format (a.k.a a pure function), we can have 3 values:
deepseek: put content inside reasoning_content
none: do not format, simply forward all the generated tokens to user
and then oai can be added in the future

Updated the code (defaulting to deepseek), thanks!

ochafik92 days ago (edited 92 days ago)

Note that I’ve updated the code to the latest DeepSeek template changes (they added a trailing <think> 😅; updated minja accordingly: #11774 (comment) )

ochafik91 days ago

Hey @ngxson, any more concerns / suggestions about this PR?

ngxson91 days ago👍 1

Sorry I'm a bit busy recently, will do a review later today or tomorrow

ggerganov removed review request from ggerganov 97 days ago
ggerganov
ggerganov97 days ago (edited 97 days ago)👍 2

Tag me for review after @ngxson approves. I will do a quick pass after that, as I am not very familiar with the tool calling functionality yet and can't provide much meaningful feedback.

Sherlock-Holo
Sherlock-Holo96 days ago

just curious

why non-streaming api only for now? how much additional work is needed to support the streaming api?

ochafik
ochafik96 days ago👍 1

just curious

why non-streaming api only for now? how much additional work is needed to support the streaming api?

@Sherlock-Holo That’s next on my list, wanted to get non-streamed logic working well first, then will need to revamp the parsers to accept eg unclosed json list of “parallel” tool calls and stream them back one by one (bit of delta book keeping to do, tool call deltas give updates to the arguments for the current tool call, then move to the next, etc). Medium amount of work but probably gnarly haha.
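For context, a sketch of the per-slot state such delta bookkeeping implies (the struct is illustrative and simply mirrors OpenAI's chunked tool_calls format, where each delta either appends to the arguments of the call at a given index or opens a new call):

```cpp
#include <string>
#include <vector>

// Illustrative only: what a streaming tool-call parser would need to remember
// between chunks for one generation slot.
struct tool_call_stream_state {
    struct partial_call {
        std::string id;
        std::string name;
        std::string arguments; // accumulated JSON fragment, possibly still unclosed
    };
    std::vector<partial_call> calls; // calls seen so far, in order
    size_t current = 0;              // index of the call whose arguments are still streaming
    std::string pending_text;        // raw text not yet classified as content vs tool call
};
```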

WangxuP
WangxuP96 days ago👍 1

In streaming mode, the output data does not separate 'content' and 'reasoning content'
like this:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<think>"}}],"created":1739000016,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1739000017,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
Merge remote-tracking branch 'origin/master' into r1-toolcall
cc2c712c
Use --reasoning-format, remove forced thinking for now
c0f972bb
return reasoning_content before content
af638860
update model template / format mapping
a59fde29
ochafik changed the title from `server`: fix tool-call of DeepSeek R1 Qwen, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) to `server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless `--reasoning-format none` 95 days ago
fix test-chat
b829cab7
rm thoughts from generic parser
95cddfd8
sync: minja (https://github.com/google/minja/pull/52)
e598e7aa
tool-calls: allow r1 output to miss <think> opening tag (since latest…
91542ca2
sync: minja (https://github.com/ggerganov/llama.cpp/pull/11774)
8d82be90
rm wrong warning in command-r parser (when normal text)
30dcfaa5
update deepseek r1 templates (+ put update commands in ./scripts/get_…
e1bff8f6
fix server test_tool_calls.py
a29dc921
add models/templates/README.md
ea2f41e0
github-actions github-actions added script
fix test_calc_result & test_thoughts
8409bf18
fix test-chat (update delta to latest r1 template change)
01db4291
Merge remote-tracking branch 'origin/master' into r1-toolcall
37a4bb25
ochafik
ochafik91 days ago👍 1

In streaming mode, the output data does not separate 'content' and 'reasoning content' like this:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<think>"}}],"created":1739000016,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1739000017,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}

@WangxuP Based on my (limited) understanding of the delta format used by OpenAI (incl. for tool calls), the "correct" way to stream thoughts back would be to hold off on anything that might be an opening <think> tag, then send it as a reasoning_content delta. Hope we get to see how OpenAI streams their own thoughts in the near future (I have a few more things to crunch on before implementing streaming anyway).
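A minimal sketch of that hold-back idea (not this PR's implementation): keep back any tail of the generated text that could still turn into the opening tag, and only emit it once it can't.

```cpp
#include <algorithm>
#include <string>

// Returns how many trailing characters of `text` are a (strict) prefix of `tag`
// and should therefore be withheld from the streamed content delta for now.
static size_t held_back_suffix_len(const std::string & text, const std::string & tag) {
    size_t max_len = std::min(text.size(), tag.size() - 1);
    for (size_t len = max_len; len > 0; len--) {
        if (text.compare(text.size() - len, len, tag, 0, len) == 0) {
            return len; // e.g. text ending in "<thi" when tag is "<think>"
        }
    }
    return 0;
}
```

E.g. with tag "<think>", a delta ending in "<th" would be held until more tokens arrive; if the full tag then materialises, subsequent text would go to reasoning_content instead of content.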

ngxson
ngxson approved these changes on 2025-02-12
examples/server/README.md
 | Template | Format |
 |----------|--------|
-| CohereForAI-c4ai-command-r-plus-default.jinja | generic tool calls |
-| CohereForAI-c4ai-command-r-plus-rag.jinja | generic tool calls |
-| CohereForAI-c4ai-command-r-plus-tool_use.jinja | generic tool calls |
-| MiniMaxAI-MiniMax-Text-01.jinja | generic tool calls |
-| NexaAIDev-Octopus-v2.jinja | generic tool calls |
-| NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja | generic tool calls |
-| NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja | hermes 2 pro tool calls |
-| NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja | generic tool calls |
-| NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja | hermes 2 pro tool calls |
-| NousResearch-Hermes-3-Llama-3.1-70B-default.jinja | generic tool calls |
-| NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja | hermes 2 pro tool calls |
-| OrionStarAI-Orion-14B-Chat.jinja | generic tool calls |
-| Qwen-QwQ-32B-Preview.jinja | hermes 2 pro tool calls |
-| Qwen-Qwen2-7B-Instruct.jinja | generic tool calls |
-| Qwen-Qwen2-VL-7B-Instruct.jinja | generic tool calls |
-| Qwen-Qwen2.5-7B-Instruct.jinja | hermes 2 pro tool calls |
-| Qwen-Qwen2.5-Math-7B-Instruct.jinja | hermes 2 pro tool calls |
-| TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja | generic tool calls |
-| abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja | generic tool calls |
-| bofenghuang-vigogne-2-70b-chat.jinja | generic tool calls |
-| databricks-dbrx-instruct.jinja | generic tool calls |
-| deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja | generic tool calls |
-| deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja | deepseek r1 tool calls |
-| deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja | deepseek r1 tool calls |
-| deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja | deepseek r1 tool calls |
-| deepseek-ai-DeepSeek-V2.5.jinja | deepseek r1 tool calls |
-| deepseek-ai-deepseek-coder-33b-instruct.jinja | generic tool calls |
-| google-gemma-2-2b-it.jinja | generic tool calls |
-| google-gemma-7b-it.jinja | generic tool calls |
-| indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja | generic tool calls |
-| mattshumer-Reflection-Llama-3.1-70B.jinja | generic tool calls |
-| meetkai-functionary-medium-v3.2.jinja | functionary v3.2 tool calls |
-| meta-llama-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
-| meta-llama-Llama-3.2-3B-Instruct.jinja | llama 3.x tool calls |
-| meta-llama-Llama-3.3-70B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
-| meta-llama-Meta-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
-| microsoft-Phi-3-medium-4k-instruct.jinja | generic tool calls |
-| microsoft-Phi-3-mini-4k-instruct.jinja | generic tool calls |
-| microsoft-Phi-3-small-8k-instruct.jinja | generic tool calls |
-| microsoft-Phi-3.5-mini-instruct.jinja | generic tool calls |
-| microsoft-Phi-3.5-vision-instruct.jinja | generic tool calls |
-| mistralai-Mistral-7B-Instruct-v0.2.jinja | generic tool calls |
-| mistralai-Mistral-Large-Instruct-2407.jinja | mistral nemo tool calls |
-| mistralai-Mistral-Large-Instruct-2411.jinja | generic tool calls |
-| mistralai-Mistral-Nemo-Instruct-2407.jinja | mistral nemo tool calls |
-| mistralai-Mixtral-8x7B-Instruct-v0.1.jinja | generic tool calls |
-| mlabonne-AlphaMonarch-7B.jinja | generic tool calls |
-| nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja | llama 3.x tool calls (w/ builtin tools) |
-| openchat-openchat-3.5-0106.jinja | generic tool calls |
-| teknium-OpenHermes-2.5-Mistral-7B.jinja | generic tool calls |
+| Almawave-Velvet-14B.jinja | Hermes 2 Pro |
ngxson91 days ago❤ 1

Just noting here (no need to take any actions right now), but this README file is now too long and hard to follow for new users. I'm planning to break this into small files (like what we did with docs directory). Potentially we will end up with a main API docs, tool-calling docs and development docs.

ochafik91 days ago

Sounds great!! happy to help with this (if only reviewing)

common/chat.cpp
596 // Distill Qwen 7B & 32B models seem confused re/ syntax of their tool call opening tag,
597 // so we accept common variants (then it's all constrained)
598 builder.add_rule("root",
599     "( \"<|tool▁calls▁begin|>\" | \"<|tool_calls_begin|>\" | \"<|tool calls begin|>\" | \"<|tool\\\\_calls\\\\_begin|>\" ) "
ngxson91 days ago

Small nit: if you're doing multiple string concatenations, it's better to use std::ostringstream to reduce the number of copies.
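i.e., roughly (illustrative snippet, reusing the rule text quoted above):

```cpp
#include <sstream>
#include <string>

static std::string build_root_rule() {
    std::ostringstream root_rule;
    root_rule << "( \"<|tool▁calls▁begin|>\" | \"<|tool_calls_begin|>\" "
              << "| \"<|tool calls begin|>\" | \"<|tool\\\\_calls\\\\_begin|>\" ) ";
    // ... keep streaming the remaining pieces of the rule, then:
    return root_rule.str();
}
```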

ochafik91 days ago

Fair point, for now I've been favouring readability but will keep this in mind when doing an optimization pass (depending on how much this all ends up costing, we might want to cache various bits and/or create a grammar DSL that would bypass the string stage altogether; JSON schema conversion has lots of room for optimization & I'd also like to take the llguidance stuff into account: exciting prospects!)

common/chat.cpp
585 data.grammar = build_grammar([&](const common_grammar_builder & builder) {
586 std::vector<std::string> tool_rules;
587 foreach_function(inputs.tools, [&](const json & tool) {
588     const auto & function = tool["function"];
ngxson91 days ago❤ 1

Better to use .at() instead of operator[] when it's possible, as explained in https://github.com/nlohmann/json

In function from_json, use function at() to access the object values rather than operator[]. In case a key does not exist, at throws an exception that you can handle, whereas operator[] exhibits undefined behavior
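i.e., something along these lines (illustrative function, not the PR's code):

```cpp
#include <string>
#include <nlohmann/json.hpp>
using json = nlohmann::ordered_json;

static std::string get_function_name(const json & tool) {
    try {
        // at() throws if a key is missing; operator[] on a const json with a
        // missing key is undefined behaviour.
        return tool.at("function").at("name").get<std::string>();
    } catch (const json::out_of_range &) {
        return ""; // malformed tool definition: handle it instead of crashing
    }
}
```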

ochafik91 days ago

Updated, thanks!

ngxson requested a review from ggerganov 91 days ago
prefer json::at to operator[] in chat.cpp
d52579a9
Merge remote-tracking branch 'origin/master' into r1-toolcall
47002452
ggerganov
ggerganov approved these changes on 2025-02-13
examples/server/server.cpp
731 if (!msg.reasoning_content.empty()) {
732 message["reasoning_content"] = msg.reasoning_content;
733 }
734     if (msg.content == "" && !msg.tool_calls.empty()) {
ggerganov91 days ago
Suggested change
if (msg.content == "" && !msg.tool_calls.empty()) {
if (msg.content.empty() && !msg.tool_calls.empty()) {
common/chat.cpp
982 throw std::runtime_error("Cannot specify grammar with tools");
983 }
984 if (caps.supports_tool_calls && !caps.supports_tools) {
985     LOG_WRN("Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.");
ggerganov91 days ago
Suggested change
LOG_WRN("Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.");
LOG_WRN("Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.\n");
common/chat.cpp
127
128 if (!result.tool_calls.empty()) {
129 if (!string_strip(result.content).empty()) {
130     LOG_WRN("Content found with tool calls: %s", result.content.c_str());
ggerganov91 days ago
Suggested change
LOG_WRN("Content found with tool calls: %s", result.content.c_str());
LOG_WRN("Content found with tool calls: %s\n", result.content.c_str());
ochafik Apply suggestions from code review
043cb99f
ochafik merged c7f460ab into master 90 days ago
ngxson
ngxson90 days ago (edited 90 days ago)❤ 2👀 2

That’s next on my list, wanted to get non-streamed logic working well first, then will need to revamp the parsers to accept eg unclosed json list of “parallel” tool calls and stream them back one by one (bit of delta book keeping to do, tool call deltas give updates to the arguments for the current tool call, then move to the next, etc). Medium amount of work but probably gnarly haha.

Yes, I would say this will be a hard approach, especially because each model has its own format, so we can't really rely on regex much in stream mode.

Indeed, I assume that most of the complication will be about moving away from regex and using some kind of "state machine" to keep track of the generated text. From this perspective, I'm wondering: is it worth inventing our own implementation of regex? Regex is just a state machine under the hood, so by doing this we can fully manage the state on our own.

Written as (pseudo-) code, my idea looks like:

struct chat_regex tool_regex;
tool_regex.add_literal("<tool_name>")
tool_regex.add_string()
tool_regex.add_literal("</tool_name><tool_data>")
tool_regex.add_json()
tool_regex.add_literal("</tool_data>")
tool_regex.end()

Which will be compiled into:

flowchart LR

F@{ shape: dbl-circ, label: "end" }

A --<tool_name>--> B
B --string--> C
C --</tool_name><tool_data>--> D
D --json--> E
E --</tool_data>--> F

We create a "state" object each time we want to use it (i.e. store it into the server slot):

slot.chat_parser_state = chat_parser_state(tool_regex); // initial state A
slot.chat_parser_state << slot.generated_text; // with "generated_text" is the "delta" generated content
slot.chat_parser_state.get_matched(); // not sure yet what it should return
ochafik
ochafik90 days ago

From this perspective, I'm wondering, is it worth inventing our own implementation of regex? Regex is just a state machine under the hood, so by doing this we can fully manage the state on our own.

@ngxson Yesss!! 🫡🫡🫡🫡🫡

So, my original dream was to write a recursive descent / backtracking parser based on the existing GBNF grammars, and use a crude naming convention to extract data out of rules:

  • *-tool-calls-N
  • then nested *-tool-call-N
  • then nested *-tool-call-name-N & *-tool-call-arguments-N

(the N is there to allow alternative tool call syntaxes to be extracted).

A bit adhoc and magic, but very limited modifications needed in the tool call code (just stick to a consistent naming, and delete all the regexp code) and a pretty simple parser to implement (can add some hooks to make it generic wrt/ the naming convention to extract any kind of data).

It would also make it trivial to support partial extractions / streaming by memorizing the parsing stack (& extracted variables) that consumed the longest text (when parsing fails).

(and +1 to keeping the state in the slot, although TBD whether that should be a parsing stack state - first stack that failed because of an EOF? - or just the JSON tree of the last response returned, doing a React-style full-DOM diff at every step; much slower but might be safer, to be investigated)

If we agree to explore this route, I might start by refactoring the grammar parsing code to output an easier intermediate grammar AST that can then be used directly by the recursive descent parser (and be trivially converted to the pointer-heavy sampling grammar structure).

ochafik
ochafik90 days ago (edited 90 days ago)

Written as (pseudo-) code, my idea looks like:

struct chat_regex tool_regex;
tool_regex.add_literal("<tool_name>")
tool_regex.add_string()
tool_regex.add_literal("</tool_name><tool_data>")
tool_regex.add_json()
tool_regex.add_literal("</tool_data>")
tool_regex.end()

@ngxson we could also explore this kind of syntax to build a DSL to create the dual-use GBNF grammar (possibly also llguidance grammar)

cc/ @mmoskal @HanClinto

ngxson
ngxson90 days ago (edited 90 days ago)

(and +1 to keeping the state in the slot, although TBD whether that should be a parsing stack state - first stack that failed because of an EOF? - or just the JSON tree of the last response returned, doing a React-style full-DOM diff at every step; much slower but might be safer, to be investigated)

I don't really understand the second part of your phrase about "first stack that failed because of an EOF", but IMO storing the parsing stack is fine. The React-style diff may sound intuitive/safer, but I think nlohmann::json is not performant enough to do that efficiently. I even suspect we may end up with an implementation slower than the javascript version used by react.

we could also explore this kind of syntax to build a DSL to create the dual-use GBNF grammar (possibly also llguidance grammar)

I had a quick look at all of the regexes you're currently using in chat.cpp, but I think a DSL is not really needed at the moment because most of your regexes can be expressed in a more intuitive way using my pseudo-code above. Furthermore, the maintenance cost may be high, given that we're only gonna use it internally.

Most of your regex(es) use [\\s\\n\\r], [^something]+, ([\\s\\S\\r\\n]*?), which can be expressed as cpp functions like maybe_space(), match_until(...)

And to make it look even nicer, we can use cpp operator overloading, for example with operator->:

tool_regex -> literal("<tool_name>") -> string() -> literal("</tool_name>");

Another benefit of this approach is that some expressions can also be optimized during compile time.


Edit: on second thought, using -> could be a bad idea because it can be confused with pointer dereference. >> or << would be a better choice. Or maybe just chain call tool_regex.literal(...).string().literal(...) for simplicity
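A minimal sketch of the chain-call variant (hypothetical builder, not existing llama.cpp code):

```cpp
#include <string>
#include <vector>

// Hypothetical builder: each call appends a matching step and returns *this so calls chain.
struct chat_regex {
    enum kind { LITERAL, STRING, JSON_VALUE };
    struct step { kind k; std::string lit; };
    std::vector<step> steps;

    chat_regex & literal(const std::string & s) { steps.push_back({LITERAL,    s }); return *this; }
    chat_regex & string()                       { steps.push_back({STRING,     ""}); return *this; }
    chat_regex & json()                         { steps.push_back({JSON_VALUE, ""}); return *this; }
};

// Usage, mirroring the pseudo-code earlier in the thread:
//   chat_regex tool_regex;
//   tool_regex.literal("<tool_name>").string()
//             .literal("</tool_name><tool_data>").json()
//             .literal("</tool_data>");
```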

mmoskal
mmoskal90 days ago

@ochafik right! llguidance already does support streaming and emitting capture groups for subgrammars; it will even know to only emit "foo" when the tokens so far are "foo<", but then emit "<text" when "text" is sampled (and not "tool").

There is also some code in there to support general stop regexes using a (lazy) state machine.

Note that as people develop the tool calling more in models, they are likely to use special tokens for tool calling, JSON mode etc. Not sure gbnf handles that particularly well (that is the difference between "<|foo|>" and "<|" "foo" "|>").

ggerganov
ggerganov89 days ago👍 1

If we agree to explore this route, I might start by refactoring the grammar parsing code to output an easier intermediate grammar AST that can then be used directly by the recursive descent parser (and be trivially converted to the pointer-heavy sampling grammar structure).

Most of the grammar and tool functionalities are way over my head and I cannot provide very meaningful feedback. But overall I think the llama-grammar module could use some deeper refactoring and maintenance. The main things I would look out for are keeping the implementation simple, no extra dependencies and good performance. The general approach has been to prototype stuff in libcommon and when it becomes mature enough, to move it to libllama.

One more thought is that long-term we can also think about moving some core functionality about grammars to ggml. At some point I was considering it because I wanted to reuse grammar functionality in whisper.cpp. So it's something to think about, but very low-prio atm.

mmoskal
mmoskal88 days ago❤ 1👀 1

@ochafik here's how the lazy matching is handled in llguidance, see also docs

import llguidance
import huggingface_hub
import json

lark_grm = """

start: "<tool_name>" name "<tool_data>" data "</tool_data>"
name[capture, suffix="</tool_name>"]: /.*/
data[capture]: %json {
    "properties": {
        "foo": { "type": "string" }
    },
    "required": ["foo"]
}

"""


def main():
    tok_name = huggingface_hub.hf_hub_download(
        "microsoft/Phi-3.5-mini-instruct", "tokenizer.json"
    )
    with open(tok_name, "r") as f:
        text = f.read()
    tok = llguidance.LLTokenizer(text)

    interp = llguidance.LLInterpreter(
        tok,
        json.dumps({"grammars": [{"lark_grammar": lark_grm}]}),
        enable_ff_tokens=False,
        enable_backtrack=False,
        log_level=1,
    )
    interp.start_without_prompt()

    toks = tok.tokenize_str("<tool_name>foo<bar></tool_name><tool_data>{\"foo\": \"bar\"}</tool_data>")
    for t in toks:
        mask, r = interp.compute_mask()
        obj = json.loads(r)
        for p in obj["progress"]:
            if p["object"] != "text":
                print("\n  ", end="")
                print(p)
        # feeding token now
        print(tok.dbg_tokens([t]), end=" ")
        interp.commit_token(t)
    print("\n")

if __name__ == "__main__":
    main()

When you run it, you get:

⟦<⟧ ⟦tool⟧ ⟦_⟧ ⟦name⟧ ⟦>⟧ ⟦foo⟧ ⟦<⟧ ⟦bar⟧ ⟦></⟧ ⟦tool⟧ ⟦_⟧ ⟦name⟧ ⟦><⟧ 
  {'object': 'capture', 'name': 'name', 'str': 'foo<bar>', 'hex': '666f6f3c6261723e', 'log_prob': 0.0}
⟦tool⟧ ⟦_⟧ ⟦data⟧ ⟦>{⟧ ⟦"⟧ ⟦foo⟧ ⟦":⟧ ⟦ "⟧ ⟦bar⟧ ⟦"}⟧ 
  {'object': 'capture', 'name': 'data', 'str': '{"foo": "bar"}', 'hex': '7b22666f6f223a2022626172227d', 'log_prob': 0.0}
⟦</⟧ ⟦tool⟧ ⟦_⟧ ⟦data⟧ ⟦>⟧ 

The captures are generated immediately after getting enough tokens.

If the model uses special tokens, you need to write the grammar slightly differently:

start: <|assistant|> name <|end|> /\s*/ data
name[capture]: /.*/
data[capture]: %json {
    "properties": {
        "foo": { "type": "string" }
    },
    "required": ["foo"]
}

Note lack of suffix= on name - it will extend greedily, until it hits the <|end|> special token. Special tokens are never allowed by regular expressions.
