llama.cpp
`server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command R7B & DeepSeek R1) unless `--reasoning-format none`
#11607
Merged


ochafik merged 94 commits into ggml-org:master from ochafik:r1-toolcall
ochafik
ochafik101 days ago (edited 92 days ago)👍 4❤ 2👀 1
  • Fixes tool call support of DeepSeek-R1-Distill-Qwen-7B & 32B (follow up to #9639)
  • Adds a --reasoning-format FORMAT flag. With the default deepseek format, thinking traces are extracted into message.reasoning_content, using the native <think> tags for DeepSeek R1 and <|START_THINKING|>...<|END_THINKING|> for Command R7B; with none, thinking traces are left as-is in message.content
  • Updates the Command R7B parser to handle more cases, and folds the tool_plan field added temporarily in #11585 into the new reasoning_content

(Non-streaming API only for now.)

Usage

  • Get and build this PR's branch
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/r1-toolcall
    cmake -B build -DLLAMA_CURL=1
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run with (add --verbose to inspect prompt / grammars used):

    # For DeepSeek only, optional chat template override (please report results w/ and w/o)
    llama-server --jinja -hf bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q6_K_L \
      --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja
    llama-server --jinja -hf bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q6_K_L \
      --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja
    
    # Command R7B also has native think tokens:
    llama-server --jinja -fa -hf bartowski/c4ai-command-r7b-12-2024-GGUF:Q6_K_L \
      --chat-template-file <( python scripts/get_chat_template.py CohereForAI/c4ai-command-r7b-12-2024 tool_use )
    
    # Try adding '--reasoning-format none' to disable thoughts extraction
  • Call the API and profit

    curl http://localhost:8080/v1/chat/completions -d '{
      "messages": [
        {
          "role": "system",
          "content": "You are a tool calling agent."
        },
        {
          "role": "user",
          "content": "scrape https://ochafik.com and tell me the number of times it says Olivier"
        }
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": [
                "code"
              ]
            }
          }
        }
      ],
      "temperature": 0.0,
      "top_k": 1,
      "top_p": 1.0
    }' | jq
  • Result w/ `DeepSeek-R1-Distill-Qwen-32B-GGUF:Q6_K_L`:
    {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "message": {
            "role": "assistant",
            "reasoning_content": "Okay, so the user wants me to scrape the website https://ochafik.com and find out how many times the name \"Olivier\" appears. Hmm, I need to figure out how to approach this. \n\nFirst, I remember that web scraping can be done using Python libraries like requests and BeautifulSoup. Requests can fetch the webpage content, and BeautifulSoup can parse the HTML to extract text. \n\nI should start by writing a Python script that sends a GET request to the URL. Then, I'll check if the request was successful. If it is, I'll use BeautifulSoup to parse the content. \n\nNext, I'll extract all the text from the parsed HTML. Once I have the text, I can split it into words and count how many times \"Olivier\" appears. But wait, I should make sure the case matches. Maybe the name is in different cases, but the user specifically asked for \"Olivier,\" so I'll keep it case-sensitive.\n\nI also need to handle any potential errors, like if the website doesn't respond or if there's an issue parsing the HTML. But since I'm using a tool that runs the code, I can let it handle those exceptions and just report the result.\n\nPutting it all together, the code will fetch the webpage, parse it, extract the text, split it into words, and count the occurrences of \"Olivier.\" Then, it'll print the result. I think that should do it.",
            "content": null,
            "tool_calls": [
              {
                "type": "function",
                "function": {
                  "name": "python",
                  "arguments": "{\"code\":\"import requests\\nfrom bs4 import BeautifulSoup\\n\\nurl = 'https://ochafik.com'\\nresponse = requests.get(url)\\n\\nif response.status_code == 200:\\n    soup = BeautifulSoup(response.text, 'html.parser')\\n    text = soup.get_text()\\n    words = text.split()\\n    count = words.count('Olivier')\\n    print(f'The name \\\"Olivier\\\" appears {count} times.')\\nelse:\\n    print('Failed to retrieve the webpage.')\"}"
                },
                "id": ""
              }
            ]
          }
        }
      ]
    }

    Which is this code:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://ochafik.com'
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text()
        words = text.split()
        count = words.count('Olivier')
        print(f'The name "Olivier" appears {count} times.')
    else:
        print('Failed to retrieve the webpage.')

    Not too bad, but it didn't lower-case the text and its word splitting is a bit crude.

    Trying again w/ the following extra args to make the sampling greedy:

    curl ... {
      ...
      "temperature": 0.0,
      "top_k": 1,
      "top_p": 1.0
    }
    

    We have a winner:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://ochafik.com'
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text()
        count = text.lower().count('olivier')
        print(f'The name "Olivier" appears {count} times.')
    else:
        print('Failed to retrieve the webpage.')

    And the thoughts:

    Okay, so the user wants me to scrape the website https://ochafik.com and find out how many times it mentions "Olivier." Hmm, I need to figure out how to do that. Since I can use the Python function tool, I can write a script that does this.

    First, I should think about how to scrape a website. I remember that using libraries like requests and BeautifulSoup can help. Requests can fetch the webpage content, and BeautifulSoup can parse the HTML to extract text.

    So, the steps would be: send a GET request to the URL, check if the request was successful, parse the HTML content, extract all the text, and then count the occurrences of the word "Olivier."

    Wait, but sometimes the word might be part of a larger word, like "Olivier's" or "Oliviering." The user probably wants exact matches, so I should make sure to count only whole words. Maybe using a regular expression with word boundaries would be better.

    Also, I should handle any potential errors, like if the website doesn't respond or if there's an issue with parsing. But since the user didn't specify handling errors, maybe I can just proceed with the basic script.

    Putting it all together, the Python code would import requests and BeautifulSoup, send the request, parse the content, extract the text, and then count the occurrences. I'll write the code accordingly and use the Python tool to execute it.

Implementation notes

  • Had to work around DeepSeek R1's official jinja template:
    • It doesn't describe the available tools, and the backfill done by Minja wasn't phrased well enough (for the 7B model), so I've added autogenerated tool call examples to minja's revamped "polyfill" behaviour (using a delta template eval).
      #11641
    • It ignores message.tool_calls if message.content is not null; updated / tested the server output accordingly (better OAI compliance)
    • After a tool result, it leaves the prompt hanging on a <|tool▁output▁end|> or <|tool▁call▁end|> (the list of outputs / calls needs to be closed with the plural <|tool▁outputs▁end|> / <|tool▁calls▁end|>, respectively, and the end-of-sentence token + optional add_generation_prompt are missing)
      • Hacked a workaround so the default template now works well with this branch
      • Added / documented better template (models/templates/llama-cpp-deepseek-r1.jinja)
  • DeepSeek R1 Distill Qwen 7B & 32B models seem to take liberties with their tool call start tag, so the grammar accepts variations of the syntax (which then triggers the lazy grammar / full compliance). I'd avoid the 7B (or its lower quants) if possible, as it sometimes tries to skip the tool outputs tag.
  • NEW --reasoning-format flag, which controls output of reasoning_content in the API (see test_thoughts)
    • Sourced from native <think>...</think> tags for DeepSeek R1 and <|START_THINKING|>...<|END_THINKING|> for Command R7B.
      • Retired tool_plan field / now flowing into reasoning_content (was added in #11585)
  • Updated tests:
    • Added the Q4_K_M quant to some (but not all) slow server tests (had to tell it not to overthink, bit... ironic).
    • Added slow tool result server test test_calc_result (checking models make some use of tool call results, which some struggle a bit with)
    • More corner cases for parsing, esp. DeepSeek R1 & Command R7B

TODOs:

  • Test that content is null for tool calls
  • Test changes in Minja repo (add backfill goldens, maybe start adding control over backfill behaviour?)
  • Test prompting w/ tool results in server tests
  • send --jinja --chat-template chatml support separately (#11616)
    • send double BOS fix separately
  • send opportunistic fixes for Command R7B separately (#11608)

Possible follow ups

  • Document differences between stream & non-stream modes (thought & tool_plan not sent in stream)

  • look at the Llama distill more closely (see #11591)

  • Reintroduce forced thinking in generic handler under some --reasoning flag (+ explore @ngxson's idea to support a disabled value that biases thinking tags)

    ```typescript
    // ResponseSchema is json_schema if set, otherwise string
    
    type SchemaToolRequired     =  {thoughts: string} & ToolCallSchema
    type Schema                 = ({thoughts: string} & ToolCallSchema) | {thoughts: string, response: ResponseSchema}
    
    type ToolCallSchema         = SingleToolCallSchema | ParallelToolCallSchema
    type SingleToolCallSchema   = {tool_call: ToolCall}
    type ParallelToolCallSchema = {tool_calls: ToolCall[]} // If parallel_tool_calls is true
    
    // Note: id only defined if parallel_tool_calls is true
    type ToolCall =
          {name: 'tool_1', arguments: Parameters1Schema, id?: string} |
          {name: 'tool_2', arguments: Parameters2Schema, id?: string} |
          ...
    ```
    
minja: enhance backfill of templates w/o tools description (use examp…
d3b60b8a
pass vocab to common_chat_params_init
87de852b
DeepSeek R1: parse thoughts / return in separate field in API (non st…
130ca222
Avoid double bos w/ jinja
04d511b5
server/oai: ensure content is null when there are tool calls
28345877
update logs
c80cb309
rename tests
08716281
tool-call: allow `--jinja --chat-template chatml`
73d08d49
tool-call: fix command-r7b parsing when response is multiline
04be723b
tool-calls: add DeepSeek R1 Qwen 7B to server test_hello_world
ae9d5812
tell DS R1 not to overthink (weather test)
19bea4ec
github-actions added the testing, examples, python and server labels
add deepseek models to server tool call section in readme
5e6f2a21
tool-call: allow `--jinja --chat-template chatml`
1e9acd2d
Update test_tool_call.py
77ae97e7
minimize diffs
a76073cf
fix typo
cf83623a
fix double bos issue (drop bos/eos tokens from jinja template)
5d18d76b
fix bad merge
aa98e590
fix build / rm diff
2b3c4829
Merge branch 'jinja-chatml' into r1-toolcall
4cb0e1d8
add missing try catch around jinja parsing to default to chatml
b2dd4909
Merge branch 'jinja-chatml' into r1-toolcall
08271b55
tool-calls: r1: add missing <|tool▁calls▁end|> to grammar!
df3474e2
tweak delta logic
c397bd1f
tool-calls: accommodate variety of wrong tool call opening tags both …
569610ee
Simplify default chatml logic
d73448de
Merge branch 'jinja-chatml' into r1-toolcall
0be7f652
tool-calls: add deepseek r1 template + accommodate broken official te…
7dc271fb
rm unneeded vocab
c6214ee9
simpler hacky fixes for original broken template (+ fix minja example…
1c302e18
sync: minja https://github.com/google/minja/pull/46
108da907
Merge branch 'master' into r1-toolcall
bc6d910f
actually we want eos_token in the template to infer tool call example…
11c1f0c7
update to minja's new api
30ea3591
sync: minja
bbd45bf6
simplify hack to fix original template's backfill from minja
bff549de
tool-call: r1: add one more trigger approx "<|tool calls begin|>"
ce28224d
r1: fix inadvertent newline in grammar before <|tool▁call▁end|>
e84ee88f
tool-call: r1: fix grammar
18a11f43
move trigger_words init inside non-llguidance branch
9a6847c8
fix / test parsing of r1 parser
a682d121
Fix / test models/templates/llama-cpp-deepseek-r1.jinja
f0154a64
update test_calc_result
326e7002
fix test_calc_result
78b47bb0
fix spaces
86994db6
`sync`: minja
09caa634
Update test-chat.cpp
b1527292
fix mistral chat test: need empty tokens
56a14ddc
Update chat.cpp
f12e3507
Merge branch 'sync-minja-4' into r1-toolcall
d43e4f6c
server: check that content is null when we get tool_calls
812544ab
tool-call: ensure we don't return content when there are tool calls /…
d44eb95c
fix mistral expectation
b6e14a41
ensure deepseek r1 thoughts parsed even w/o tool calls
1f5ec598
fix test-chat
438ce0b8
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distill (WIP) to `tool-call`: fix DeepSeek R1 Qwen distill 100 days ago
ochafik marked this pull request as ready for review 100 days ago
ochafik requested a review from ngxson 100 days ago
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distill to `tool-call`: fix DeepSeek R1 Qwen distills 100 days ago
Update chat.cpp
21f20715
Merge branch 'sync-minja-4' into r1-toolcall
b5b117fa
Fix r1 grammar since we made <|tool▁calls▁begin|> optional (triggerin…
0db98812
r1: revert making <|tool▁calls▁begin|> optional as somehow sampling t…
d1b66910
ochafik requested a review from ggerganov 99 days ago
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills to `tool-call`: fix DeepSeek R1 Qwen distills + return thoughts in separate field 99 days ago
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills + return thoughts in separate field to `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return thoughts in separate field 99 days ago
Mushoz
Mushoz99 days ago👍 1❤ 1

Looking at the deepseek documentation: https://api-docs.deepseek.com/guides/reasoning_model

We have the reasoning_content and content as the two output fields. Might be better to follow the same naming to maintain compatibility with tools that parse this output, instead of the current thoughts field you are using.

return thoughts in reasoning_content field
39c1d816
ochafik
ochafik99 days ago👍 1

Looking at the deepseek documentation: https://api-docs.deepseek.com/guides/reasoning_model

We have the reasoning_content and content as the two output fields. Might be better to follow the same naming to maintain compatibility with tools that parse this output, instead of the current thoughts field you are using.

Aaaabsolutely, thanks, done! (somehow failed to find this bit in the doc, and also I don't have an API key yet 🤦‍♂️)

ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return thoughts in separate field to `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return reasoning_content in API 99 days ago
update readme section about common model tool call formats
b2d17287
Merge branch 'master' into r1-toolcall
933f7a18
Update test_tool_call.py
5d60cebb
Mushoz
Mushoz99 days ago

I noticed that these r1 distill models also suck with tool calling in Cline. Will this help with that? Or will that need a cline specific template?

ngxson
ngxson99 days ago (edited 99 days ago)👍 1

@ochafik This is an example response from the official deepseek API (stream + non-stream): https://gist.github.com/ngxson/89a568d22e02b9a93f845abdfd8427a6

ngxson
ngxson99 days ago (edited 99 days ago)👍 2

Btw, I think it could be better to only enable this feature via a flag. For example, only return a separate reasoning_content if the request body contains "reasoning_format": "deepseek":

{
  "messages": [
    {"role": "user", "content": "hello"}
  ],
  "reasoning_format": "deepseek",
  "stream": false,
  "temperature": 0.2
}

This is because this feature is non-OAI-compatible, so doing this just in case OAI implements it differently in the future (which, btw, they definitely will, due to some childish political conflicts)


Edit: and/or, we can also rely on "model": "deepseek-reasoner", which users will definitely set if they're using the deepseek API, so we just provide a drop-in replacement for the deepseek API here.

MoonRide303
MoonRide30399 days ago (edited 99 days ago)

Btw, I think it could be better to only enable this feature via a flag. For example, only return a separate reasoning_content if the request body contains "reasoning_format": "deepseek":

{
  "messages": [
    {"role": "user", "content": "hello"}
  ],
  "reasoning_format": "deepseek",
  "stream": false,
  "temperature": 0.2
}

This is because this feature is non-OAI-compatible, so doing this just in case OAI implements it differently in the future (which, btw, they definitely will, due to some childish political conflicts)

Edit: and/or, we can also rely on "model": "deepseek-reasoner", which users will definitely set if they're using the deepseek API, so we just provide a drop-in replacement for the deepseek API here.

It's also available as "deepseek/deepseek-r1" using the standard openai client on OpenRouter - it just returns the content (no thinking part) without having to specify any extra parameters. Distilled versions like "deepseek/deepseek-r1-distill-llama-70b" work the same way. Local gguf files might be named like "DeepSeek-R1-Distill-Qwen-32B-IQ3_XS.gguf" or "FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview-IQ3_XS.gguf", so the simplest and most universal solution for now (compatible with how people use those models via API) could be simply checking if lowercase([model_name]) contains "deepseek-reasoner", "deepseek-r1" or "deepseekr1", and then activating content splitting.

Being able to enforce this format manually (so it would work with models that behave like R1 but use completely different names) sounds like a good idea, too (and later on, other popular thinking models could be added to the auto-detection list - it might be stuff like Open-R1, etc.).
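As a rough illustration of that name-based heuristic (hypothetical helper, not code from this PR), the check could look like:

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: decide whether to split out reasoning_content based on the
// model name alone (case-insensitive substring match on known R1-style names).
static bool model_name_suggests_r1_reasoning(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(), ::tolower);
    for (const char * needle : { "deepseek-reasoner", "deepseek-r1", "deepseekr1" }) {
        if (name.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```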

ochafik Merge branch 'master' into r1-toolcall
1f1f06aa
--think to force any model to return reasoning_content (or just parse…
9d7c3cc5
Merge branch 'r1-toolcall' of github.com:ochafik/llama.cpp into r1-to…
d20c2ce4
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills tool-call + return reasoning_content in API to `tool-call`: fix DeepSeek R1 Qwen distills tool-call, --think to return reasoning_content w/ any model 98 days ago
fix test_thoughts
f3e9f8b6
ochafik
ochafik98 days ago (edited 98 days ago)👍 2

Btw, I think it could be better to only enable this feature via a flag

So... I've fenced the parsing of DeepSeek R1's <think> tags behind a new experimental --think flag.

Then I thought... what about other models? Don't they deserve to think too? (seems weird that a flag would only affect a single model). So I added forced thoughts output to the generic chat handler when that flag is set (similar to what I was doing in #6389 ).

Marked as experimental / will probably need customisability (w/ templates??), but turns any model into a thinking model.

WDYT?

ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills tool-call, --think to return reasoning_content w/ any model to `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model 98 days ago
fix compiler warning about parens
3841a163
align Command R7B w/ --think / reasoning_content behaviour
e6d9b524
ochafik
ochafik98 days ago (edited 98 days ago)

Also aligned Command R7B with this new thinking behaviour (removed the tool_plan field added temporarily in #11585).

Updated instructions, tests & README; --jinja --think (and possibly the right template override) is all one needs to get universal tool calls and thinking.

Update README.md
39b50c37
fix --think arg env
0917e0a8
disable some failing chatml tests
098629df
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model to `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) 98 days ago
Update README.md
33efcb3c
use existing string_strip
994301da
revert tool example backfill change - command 7rb just needs the righ…
d1a06407
ochafik changed the title from `tool-call`: fix DeepSeek R1 Qwen distills, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) to `server`: fix tool-call of DeepSeek R1 Qwen, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) 98 days ago
ngxson
ngxson commented on 2025-02-06
common/arg.cpp
1983 [](common_params & params) {
1984 params.think = true;
1985 }
1986     ).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_THINK"));
ngxson97 days ago (edited 97 days ago)👍 1

IMO the --think flag is not very intuitive, it should be something like --reasoning-format, but personally I still prefer to do it on a per-request basis.

Also, to be future-proof, we should make this flag accept a param. For example, --reasoning-format deepseek will return it as reasoning_content. Again, this is because we are pretty sure that openai will break the whole thing in the near future.

HanClinto97 days ago

From a usability standpoint, --think feels a bit more intuitive / memorable to me about what it does.

Other alternatives might be --separate-thinking or --extract-reasoning or --format-reasoning or...?

ngxson97 days ago👍 1

I mean, what's non-intuitive about it is that the model will still think even without --think being added. This flag is supposed to force the model to start the reasoning process, not to enable/disable it completely.

ochafik97 days ago (edited 97 days ago)

Tbh I'm not really a fan of a query parameter yet, mostly because:

  • I see thinking mode as a static property of the model and its orchestration (much like the template)
  • It's a new non-standard API to define (more naming headaches)
  • It makes llama-server less compatible w/ DeepSeek's API, not more.
  • We can always decide to add it later if we start w/ a flag

Re/ flags, I think there may be room for two: one that controls the reasoning behaviour (extract, leave inline, force), and one for the return format. For the first one, how about:

  • --reasoning=extract → Parses DeepSeek R1 & Command R7B thoughts (default)
  • --reasoning=inline → Leaves thoughts inline in the content (format is model-specific)
  • --reasoning=force → Coerces non-thinking models to think (edit: maybe force-experimental for now)

As for the format, there are already two APIs out there:

  • DeepSeek's reasoning_content
  • Cohere's message.tool_plan, alas only as a prelude to tool calls (but it seems easy to trick R7B into thinking more generally, just needs a tweaked template)

I'd favour just sticking to reasoning_content for now until OpenAI announces their own format (and when they do, default to OpenAI's format and offer --reasoning-format=deepseek for backwards compatibility). OR decide to create our own format now / default --reasoning-format=llama.cpp that returns thoughts in the message.thoughts field, for instance.

WDYT?

ngxson96 days ago👍 1

Tbh I'm not really a fan of a query parameter yet

Seems ok to me, but eventually I think someone is gonna add this as a per-request param.

Re. having 2 flags for behavior / format, this seems more reasonable to me. From a functional programming perspective, it can be expressed as response = format(behavior(original_generated_content))

But I think your idea still mixes these 2 layers.

For --reasoning extract|inline|force:

  • I'm not sure what force is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that maintaining prompts can be a burden
  • For inline, does that mean reasoning can appear in the middle of generation? Example: content..think..content..think. Please lmk if I understand correctly.

For --reasoning-format I don't get why we want to invent a new --reasoning-format llama.cpp that puts things inside message.thoughts. IMO we should keep things simple until openai drops their format. Probably we can have --reasoning-format deepseek|none and set deepseek as the default for now, then change the default to oai once we have that openai format.

ngxson96 days ago (edited 96 days ago)👍 1

But I think your idea is still mixed between these 2 layers.

IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar / logits bias). So, it should have 3 values:

  • enabled: the model behaves as usual
  • disabled: we never allow model to use <think> token ==> control via logits bias
  • force: force the model to think, maybe use grammar or prompt engineering?

Then for the --reasoning-format, it should only "transform" the result into the desired format (a.k.a. a pure function); we can have 3 values:

  • deepseek: put content inside reasoning_content
  • none: do not format, simply forward all the generated tokens to user
  • and then oai can be added in the future
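A minimal sketch of that two-layer split (type and function names are illustrative, not the actual llama.cpp API; `json` is nlohmann's json as used elsewhere in the codebase):

```cpp
#include <string>
#include <nlohmann/json.hpp>
using json = nlohmann::ordered_json;

// Illustrative names only. The behavior/parsing step produces this:
struct parsed_output {
    std::string content;            // text outside the <think>...</think> block
    std::string reasoning_content;  // text inside it (empty if the model didn't think)
};

enum reasoning_format { REASONING_FORMAT_NONE, REASONING_FORMAT_DEEPSEEK };

// Pure "format" step: shapes the API message from what was parsed,
// without affecting generation at all.
static json format_message(const parsed_output & out, reasoning_format fmt) {
    json msg = { {"role", "assistant"} };
    if (fmt == REASONING_FORMAT_DEEPSEEK) {
        msg["content"]           = out.content;
        msg["reasoning_content"] = out.reasoning_content;
    } else {
        // none: forward everything as-is inside content, tags included
        msg["content"] = out.reasoning_content.empty()
            ? out.content
            : "<think>" + out.reasoning_content + "</think>" + out.content;
    }
    return msg;
}
```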
ochafik95 days ago

IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar / logits bias). So, it should have 3 values:

  • enabled: the model behaves as usual
  • disabled: we never allow model to use <think> token ==> control via logits bias
  • force: force the model to think, maybe use grammar or prompt engineering?

@ngxson Makes sense, removed the forced thinking from this PR / will explore again as a follow-up (also, I now see more of a case for this option as a query param, while reasoning-format now has stronger flag vibes).

I'm not sure what force is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that maintaining prompts can be a burden

Good point. In earlier experimentations I tried controlling the entire tool call process (even grammar generation) from Jinja templates, might play with this again.

Then for the --reasoning-format, it should only "transform" the result into the desired format (a.k.a a pure function), we can have 3 values:
deepseek: put content inside reasoning_content
none: do not format, simply forward all the generated tokens to user
and then oai can be added in the future

Updated the code (defaulting to deepseek), thanks!

ochafik92 days ago (edited 92 days ago)

Note that I’ve updated the code to the latest DeepSeek template changes (they added a trailing <think> 😅; updated minja accordingly: #11774 (comment) )

ochafik91 days ago

Hey @ngxson, any more concerns / suggestions about this PR?

ngxson91 days ago👍 1

Sorry I'm a bit busy recently, will do a review later today or tomorrow

ggerganov removed review request from ggerganov 97 days ago
ggerganov
ggerganov97 days ago (edited 97 days ago)👍 2

Tag me for review after @ngxson approves. I will do a quick pass after that, as I am not very familiar with the tool calling functionality yet and can't provide much meaningful feedback.

Sherlock-Holo
Sherlock-Holo96 days ago

just curious

why non-streaming api only for now? how much additional work is needed to support the streaming api?

ochafik
ochafik96 days ago👍 1

just curious

why non-streaming api only for now? how much additional work is needed to support the streaming api?

@Sherlock-Holo That’s next on my list, wanted to get non-streamed logic working well first, then will need to revamp the parsers to accept eg unclosed json list of “parallel” tool calls and stream them back one by one (bit of delta book keeping to do, tool call deltas give updates to the arguments for the current tool call, then move to the next, etc). Medium amount of work but probably gnarly haha.
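For context, a sketch of the per-slot state such delta bookkeeping implies (the struct is illustrative and simply mirrors OpenAI's chunked tool_calls format, where each delta either appends to the arguments of the call at a given index or opens a new call):

```cpp
#include <string>
#include <vector>

// Illustrative only: what a streaming tool-call parser would need to remember
// between chunks for one generation slot.
struct tool_call_stream_state {
    struct partial_call {
        std::string id;
        std::string name;
        std::string arguments; // accumulated JSON fragment, possibly still unclosed
    };
    std::vector<partial_call> calls; // calls seen so far, in order
    size_t current = 0;              // index of the call whose arguments are still streaming
    std::string pending_text;        // raw text not yet classified as content vs tool call
};
```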

WangxuP
WangxuP96 days ago👍 1

In streaming mode, the output data does not separate 'content' and 'reasoning content'
like this:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<think>"}}],"created":1739000016,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1739000017,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
Merge remote-tracking branch 'origin/master' into r1-toolcall
cc2c712c
Use --reasoning-format, remove forced thinking for now
c0f972bb
return reasoning_content before content
af638860
update model template / format mapping
a59fde29
ochafik changed the title from `server`: fix tool-call of DeepSeek R1 Qwen, add `--think` to return reasoning_content w/ any model (native for Command 7RB & DeepSeek R1) to `server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless `--reasoning-format none` 95 days ago
fix test-chat
b829cab7
rm thoughts from generic parser
95cddfd8
sync: minja (https://github.com/google/minja/pull/52)
e598e7aa
tool-calls: allow r1 output to miss <think> opening tag (since latest…
91542ca2
sync: minja (https://github.com/ggerganov/llama.cpp/pull/11774)
8d82be90
rm wrong warning in command-r parser (when normal text)
30dcfaa5
update deepseek r1 templates (+ put update commands in ./scripts/get_…
e1bff8f6
fix server test_tool_calls.py
a29dc921
add models/templates/README.md
ea2f41e0
github-actions github-actions added script
fix test_calc_result & test_thoughts
8409bf18
fix test-chat (update delta to latest r1 template change)
01db4291
Merge remote-tracking branch 'origin/master' into r1-toolcall
37a4bb25
ochafik
ochafik91 days ago👍 1

In streaming mode, the output data does not separate 'content' and 'reasoning content' like this:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<think>"}}],"created":1739000016,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1739000017,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}

@WangxuP Based on my (limited) understanding of the delta format used by OpenAI (incl. for tool calls), the "correct" way to stream thoughts back would be to hold off on anything that might be an opening <think> tag, then send it as a reasoning_content delta. Hope we get to see how OpenAI streams their own thoughts in the near future (I have a few more things to crunch on before implementing streaming anyway).
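A minimal sketch of that hold-back idea (not this PR's implementation): keep back any tail of the generated text that could still turn into the opening tag, and only emit it once it can't.

```cpp
#include <algorithm>
#include <string>

// Returns how many trailing characters of `text` are a (strict) prefix of `tag`
// and should therefore be withheld from the streamed content delta for now.
static size_t held_back_suffix_len(const std::string & text, const std::string & tag) {
    size_t max_len = std::min(text.size(), tag.size() - 1);
    for (size_t len = max_len; len > 0; len--) {
        if (text.compare(text.size() - len, len, tag, 0, len) == 0) {
            return len; // e.g. text ending in "<thi" when tag is "<think>"
        }
    }
    return 0;
}
```

E.g. with tag "<think>", a delta ending in "<th" would be held until more tokens arrive; if the full tag then materialises, subsequent text would go to reasoning_content instead of content.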

ngxson
ngxson approved these changes on 2025-02-12
examples/server/README.md
 | Template | Format |
 |----------|--------|
-| CohereForAI-c4ai-command-r-plus-default.jinja | generic tool calls |
-| CohereForAI-c4ai-command-r-plus-rag.jinja | generic tool calls |
-| CohereForAI-c4ai-command-r-plus-tool_use.jinja | generic tool calls |
-| MiniMaxAI-MiniMax-Text-01.jinja | generic tool calls |
-| NexaAIDev-Octopus-v2.jinja | generic tool calls |
-| NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja | generic tool calls |
-| NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja | hermes 2 pro tool calls |
-| NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja | generic tool calls |
-| NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja | hermes 2 pro tool calls |
-| NousResearch-Hermes-3-Llama-3.1-70B-default.jinja | generic tool calls |
-| NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja | hermes 2 pro tool calls |
-| OrionStarAI-Orion-14B-Chat.jinja | generic tool calls |
-| Qwen-QwQ-32B-Preview.jinja | hermes 2 pro tool calls |
-| Qwen-Qwen2-7B-Instruct.jinja | generic tool calls |
-| Qwen-Qwen2-VL-7B-Instruct.jinja | generic tool calls |
-| Qwen-Qwen2.5-7B-Instruct.jinja | hermes 2 pro tool calls |
-| Qwen-Qwen2.5-Math-7B-Instruct.jinja | hermes 2 pro tool calls |
-| TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja | generic tool calls |
-| abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja | generic tool calls |
-| bofenghuang-vigogne-2-70b-chat.jinja | generic tool calls |
-| databricks-dbrx-instruct.jinja | generic tool calls |
-| deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja | generic tool calls |
-| deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja | deepseek r1 tool calls |
-| deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja | deepseek r1 tool calls |
-| deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja | deepseek r1 tool calls |
-| deepseek-ai-DeepSeek-V2.5.jinja | deepseek r1 tool calls |
-| deepseek-ai-deepseek-coder-33b-instruct.jinja | generic tool calls |
-| google-gemma-2-2b-it.jinja | generic tool calls |
-| google-gemma-7b-it.jinja | generic tool calls |
-| indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja | generic tool calls |
-| mattshumer-Reflection-Llama-3.1-70B.jinja | generic tool calls |
-| meetkai-functionary-medium-v3.2.jinja | functionary v3.2 tool calls |
-| meta-llama-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
-| meta-llama-Llama-3.2-3B-Instruct.jinja | llama 3.x tool calls |
-| meta-llama-Llama-3.3-70B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
-| meta-llama-Meta-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
-| microsoft-Phi-3-medium-4k-instruct.jinja | generic tool calls |
-| microsoft-Phi-3-mini-4k-instruct.jinja | generic tool calls |
-| microsoft-Phi-3-small-8k-instruct.jinja | generic tool calls |
-| microsoft-Phi-3.5-mini-instruct.jinja | generic tool calls |
-| microsoft-Phi-3.5-vision-instruct.jinja | generic tool calls |
-| mistralai-Mistral-7B-Instruct-v0.2.jinja | generic tool calls |
-| mistralai-Mistral-Large-Instruct-2407.jinja | mistral nemo tool calls |
-| mistralai-Mistral-Large-Instruct-2411.jinja | generic tool calls |
-| mistralai-Mistral-Nemo-Instruct-2407.jinja | mistral nemo tool calls |
-| mistralai-Mixtral-8x7B-Instruct-v0.1.jinja | generic tool calls |
-| mlabonne-AlphaMonarch-7B.jinja | generic tool calls |
-| nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja | llama 3.x tool calls (w/ builtin tools) |
-| openchat-openchat-3.5-0106.jinja | generic tool calls |
-| teknium-OpenHermes-2.5-Mistral-7B.jinja | generic tool calls |
+| Almawave-Velvet-14B.jinja | Hermes 2 Pro |
ngxson91 days ago❤ 1

Just noting here (no need to take any actions right now), but this README file is now too long and hard to follow for new users. I'm planning to break this into small files (like what we did with docs directory). Potentially we will end up with a main API docs, tool-calling docs and development docs.

ochafik91 days ago

Sounds great!! happy to help with this (if only reviewing)

common/chat.cpp
596 // Distill Qwen 7B & 32B models seem confused re/ syntax of their tool call opening tag,
597 // so we accept common variants (then it's all constrained)
598 builder.add_rule("root",
599     "( \"<|tool▁calls▁begin|>\" | \"<|tool_calls_begin|>\" | \"<|tool calls begin|>\" | \"<|tool\\\\_calls\\\\_begin|>\" ) "
ngxson91 days ago

Small nit: if you're doing multiple string concatenations, it's better to use std::ostringstream to reduce the number of copies.
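i.e., roughly (illustrative snippet, reusing the rule text quoted above):

```cpp
#include <sstream>
#include <string>

static std::string build_root_rule() {
    std::ostringstream root_rule;
    root_rule << "( \"<|tool▁calls▁begin|>\" | \"<|tool_calls_begin|>\" "
              << "| \"<|tool calls begin|>\" | \"<|tool\\\\_calls\\\\_begin|>\" ) ";
    // ... keep streaming the remaining pieces of the rule, then:
    return root_rule.str();
}
```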

ochafik91 days ago

Fair point, for now I've been favouring readability but will keep this in mind when doing an optimization pass (depending on how much this all ends up costing, we might want to cache various bits and/or create a grammar DSL that would bypass the string stage altogether; JSON schema conversion has lots of room for optimization & I'd also like to take the llguidance stuff into account: exciting prospects!)

common/chat.cpp
585 data.grammar = build_grammar([&](const common_grammar_builder & builder) {
586 std::vector<std::string> tool_rules;
587 foreach_function(inputs.tools, [&](const json & tool) {
588     const auto & function = tool["function"];
ngxson91 days ago❤ 1

Better to use .at() instead of operator[] when it's possible, as explained in https://github.com/nlohmann/json

In function from_json, use function at() to access the object values rather than operator[]. In case a key does not exist, at throws an exception that you can handle, whereas operator[] exhibits undefined behavior
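i.e., something along these lines (illustrative function, not the PR's code):

```cpp
#include <string>
#include <nlohmann/json.hpp>
using json = nlohmann::ordered_json;

static std::string get_function_name(const json & tool) {
    try {
        // at() throws if a key is missing; operator[] on a const json with a
        // missing key is undefined behaviour.
        return tool.at("function").at("name").get<std::string>();
    } catch (const json::out_of_range &) {
        return ""; // malformed tool definition: handle it instead of crashing
    }
}
```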

ochafik91 days ago

Updated, thanks!

ngxson requested a review from ggerganov 91 days ago
prefer json::at to operator[] in chat.cpp
d52579a9
Merge remote-tracking branch 'origin/master' into r1-toolcall
47002452
ggerganov
ggerganov approved these changes on 2025-02-13
examples/server/server.cpp
731 if (!msg.reasoning_content.empty()) {
732 message["reasoning_content"] = msg.reasoning_content;
733 }
734     if (msg.content == "" && !msg.tool_calls.empty()) {
ggerganov91 days ago
Suggested change
if (msg.content == "" && !msg.tool_calls.empty()) {
if (msg.content.empty() && !msg.tool_calls.empty()) {
common/chat.cpp
982 throw std::runtime_error("Cannot specify grammar with tools");
983 }
984 if (caps.supports_tool_calls && !caps.supports_tools) {
985     LOG_WRN("Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.");
ggerganov91 days ago
Suggested change
LOG_WRN("Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.");
LOG_WRN("Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.\n");
common/chat.cpp
127
128 if (!result.tool_calls.empty()) {
129 if (!string_strip(result.content).empty()) {
130     LOG_WRN("Content found with tool calls: %s", result.content.c_str());
ggerganov91 days ago
Suggested change
LOG_WRN("Content found with tool calls: %s", result.content.c_str());
LOG_WRN("Content found with tool calls: %s\n", result.content.c_str());
ochafik Apply suggestions from code review
043cb99f
ochafik merged c7f460ab into master 90 days ago
ngxson
ngxson90 days ago (edited 90 days ago)❤ 2👀 2

That’s next on my list, wanted to get non-streamed logic working well first, then will need to revamp the parsers to accept eg unclosed json list of “parallel” tool calls and stream them back one by one (bit of delta book keeping to do, tool call deltas give updates to the arguments for the current tool call, then move to the next, etc). Medium amount of work but probably gnarly haha.

Yes, I would say this will be a hard approach, especially because each model has its own format, so we can't really rely on regex much in stream mode.

Indeed, I assume that most of the complication will be about moving away from regex and using some kind of "state machine" to keep track of the generated text. From this perspective, I'm wondering: is it worth inventing our own implementation of regex? Regex is just a state machine under the hood, so by doing this we can fully manage the state on our own.

Written as (pseudo-) code, my idea looks like:

struct chat_regex tool_regex;
tool_regex.add_literal("<tool_name>")
tool_regex.add_string()
tool_regex.add_literal("</tool_name><tool_data>")
tool_regex.add_json()
tool_regex.add_literal("</tool_data>")
tool_regex.end()

Which will be compiled into:

flowchart LR

F@{ shape: dbl-circ, label: "end" }

A --<tool_name>--> B
B --string--> C
C --</tool_name><tool_data>--> D
D --json--> E
E --</tool_data>--> F

We create a "state" object each time we want to use it (i.e. store it into the server slot):

slot.chat_parser_state = chat_parser_state(tool_regex); // initial state A
slot.chat_parser_state << slot.generated_text; // with "generated_text" is the "delta" generated content
slot.chat_parser_state.get_matched(); // not sure yet what it should return
ochafik
ochafik90 days ago

From this perspective, I'm wondering, is it worth inventing our own implementation of regex? Regex is just a state machine under the hood, so by doing this we can fully manage the state on our own.

@ngxson Yesss!! 🫡🫡🫡🫡🫡

So, my original dream was to write a recursive descent / backtracking parser based on the existing GBNF grammars, and use a crude naming convention to extract data out of rules:

  • *-tool-calls-N
  • then nested *-tool-call-N
  • then nested *-tool-call-name-N & *-tool-call-arguments-N

(the N is there to allow alternative tool call syntaxes to be extracted).

A bit adhoc and magic, but very limited modifications needed in the tool call code (just stick to a consistent naming, and delete all the regexp code) and a pretty simple parser to implement (can add some hooks to make it generic wrt/ the naming convention to extract any kind of data).

It would also make it trivial to support partial extractions / streaming by memorizing the parsing stack (& extracted variables) that consumed the longest text (when parsing fails).

(and +1 to keeping the state in the slot, although TBD whether that should be a parsing stack state - first stack that failed because of an EOF? - or just the JSON tree of the last response returned, doing a React-style full-DOM diff at every step; much slower but might be safer, to be investigated)

If we agree to explore this route, I might start by refactoring the grammar parsing code to output an easier intermediate grammar AST that can then be used directly by the recursive descent parser (and be trivially converted to the pointer-heavy sampling grammar structure).

ochafik
ochafik90 days ago (edited 90 days ago)

Written as (pseudo-) code, my idea looks like:

struct chat_regex tool_regex;
tool_regex.add_literal("<tool_name>")
tool_regex.add_string()
tool_regex.add_literal("</tool_name><tool_data>")
tool_regex.add_json()
tool_regex.add_literal("</tool_data>")
tool_regex.end()

@ngxson we could also explore this kind of syntax to build a DSL to create the dual-use GBNF grammar (possibly also llguidance grammar)

cc/ @mmoskal @HanClinto

ngxson
ngxson90 days ago (edited 90 days ago)

(and +1 to keeping the state in the slot, although TBD whether that should be a parsing stack state - first stack that failed because of an EOF? - or just the JSON tree of the last response returned, doing a React-style full-DOM diff at every step; much slower but might be safer, to be investigated)

I don't really understand the second part of your phrase about "first stack that failed because of an EOF", but IMO storing the parsing stack is fine. The React-style diff may sound intuitive/safer, but I think nlohmann::json is not performant enough to do that efficiently. I even suspect we may end up with an implementation slower than the javascript version used by react.

we could also explore this kind of syntax to build a DSL to create the dual-use GBNF grammar (possibly also llguidance grammar)

I had a quick look at all of the regexes you're currently using in chat.cpp, but I think a DSL is not really needed at the moment because most of your regexes can be expressed in a more intuitive way using my pseudo-code above. Furthermore, the maintenance cost may be high, given that we're only gonna use it internally.

Most of your regex(es) use [\\s\\n\\r], [^something]+, ([\\s\\S\\r\\n]*?), which can be expressed as cpp functions like maybe_space(), match_until(...)

And to make it look even nicer, we can use cpp operator overloading, for example with operator->:

tool_regex -> literal("<tool_name>") -> string() -> literal("</tool_name>");

Another benefit of this approach is that some expressions can also be optimized during compile time.


Edit: on second thought, using -> could be a bad idea because it can be confused with pointer dereference. >> or << would be a better choice. Or maybe just chain call tool_regex.literal(...).string().literal(...) for simplicity
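A minimal sketch of the chain-call variant (hypothetical builder, not existing llama.cpp code):

```cpp
#include <string>
#include <vector>

// Hypothetical builder: each call appends a matching step and returns *this so calls chain.
struct chat_regex {
    enum kind { LITERAL, STRING, JSON_VALUE };
    struct step { kind k; std::string lit; };
    std::vector<step> steps;

    chat_regex & literal(const std::string & s) { steps.push_back({LITERAL,    s }); return *this; }
    chat_regex & string()                       { steps.push_back({STRING,     ""}); return *this; }
    chat_regex & json()                         { steps.push_back({JSON_VALUE, ""}); return *this; }
};

// Usage, mirroring the pseudo-code earlier in the thread:
//   chat_regex tool_regex;
//   tool_regex.literal("<tool_name>").string()
//             .literal("</tool_name><tool_data>").json()
//             .literal("</tool_data>");
```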

mmoskal
mmoskal90 days ago

@ochafik right! llguidance already does support streaming and emitting capture groups for subgrammars; it will even know to only emit "foo" when the tokens so far are "foo<", but then emit "<text" when "text" is sampled (and not "tool").

There is also some code in there to support general stop regexes using a (lazy) state machine.

Note that as people develop the tool calling more in models, they are likely to use special tokens for tool calling, JSON mode etc. Not sure gbnf handles that particularly well (that is the difference between "<|foo|>" and "<|" "foo" "|>").

ggerganov
ggerganov89 days ago👍 1

If we agree to explore this route, I might start by refactoring the grammar parsing code to output an easier intermediate grammar AST that can then be used directly by the recursive descent parser (and be trivially converted to the pointer-heavy sampling grammar structure).

Most of the grammar and tool functionalities are way over my head and I cannot provide very meaningful feedback. But overall I think the llama-grammar module could use some deeper refactoring and maintenance. The main things I would look out for are keeping the implementation simple, no extra dependencies and good performance. The general approach has been to prototype stuff in libcommon and when it becomes mature enough, to move it to libllama.

One more thought is that long-term we can also think about moving some core functionality about grammars to ggml. At some point I was considering it because I wanted to reuse grammar functionality in whisper.cpp. So it's something to think about, but very low-prio atm.

mmoskal
mmoskal88 days ago❤ 1👀 1

@ochafik here's how the lazy matching is handled in llguidance, see also docs

import llguidance
import huggingface_hub
import json

lark_grm = """

start: "<tool_name>" name "<tool_data>" data "</tool_data>"
name[capture, suffix="</tool_name>"]: /.*/
data[capture]: %json {
    "properties": {
        "foo": { "type": "string" }
    },
    "required": ["foo"]
}

"""


def main():
    tok_name = huggingface_hub.hf_hub_download(
        "microsoft/Phi-3.5-mini-instruct", "tokenizer.json"
    )
    with open(tok_name, "r") as f:
        text = f.read()
    tok = llguidance.LLTokenizer(text)

    interp = llguidance.LLInterpreter(
        tok,
        json.dumps({"grammars": [{"lark_grammar": lark_grm}]}),
        enable_ff_tokens=False,
        enable_backtrack=False,
        log_level=1,
    )
    interp.start_without_prompt()

    toks = tok.tokenize_str("<tool_name>foo<bar></tool_name><tool_data>{\"foo\": \"bar\"}</tool_data>")
    for t in toks:
        mask, r = interp.compute_mask()
        obj = json.loads(r)
        for p in obj["progress"]:
            if p["object"] != "text":
                print("\n  ", end="")
                print(p)
        # feeding token now
        print(tok.dbg_tokens([t]), end=" ")
        interp.commit_token(t)
    print("\n")

if __name__ == "__main__":
    main()

When you run it, you get:

⟦<⟧ ⟦tool⟧ ⟦_⟧ ⟦name⟧ ⟦>⟧ ⟦foo⟧ ⟦<⟧ ⟦bar⟧ ⟦></⟧ ⟦tool⟧ ⟦_⟧ ⟦name⟧ ⟦><⟧ 
  {'object': 'capture', 'name': 'name', 'str': 'foo<bar>', 'hex': '666f6f3c6261723e', 'log_prob': 0.0}
⟦tool⟧ ⟦_⟧ ⟦data⟧ ⟦>{⟧ ⟦"⟧ ⟦foo⟧ ⟦":⟧ ⟦ "⟧ ⟦bar⟧ ⟦"}⟧ 
  {'object': 'capture', 'name': 'data', 'str': '{"foo": "bar"}', 'hex': '7b22666f6f223a2022626172227d', 'log_prob': 0.0}
⟦</⟧ ⟦tool⟧ ⟦_⟧ ⟦data⟧ ⟦>⟧ 

The captures are generated immediately after getting enough tokens.

If the model uses special tokens, you need to write the grammar slightly differently:

start: <|assistant|> name <|end|> /\s*/ data
name[capture]: /.*/
data[capture]: %json {
    "properties": {
        "foo": { "type": "string" }
    },
    "required": ["foo"]
}

Note lack of suffix= on name - it will extend greedily, until it hits the <|end|> special token. Special tokens are never allowed by regular expressions.
