Looking at the DeepSeek documentation (https://api-docs.deepseek.com/guides/reasoning_model), the two output fields are `reasoning_content` and `content`. It might be better to follow the same naming to maintain compatibility with tools that parse this output, instead of the current `thoughts` field you are using.
Aaaabsolutely, thanks, done! (Somehow failed to find this bit in the doc, and I also don't have an API key yet 🤦‍♂️)
I noticed that these R1 distill models also suck with tool calling in Cline. Will this help with that? Or will that need a Cline-specific template?
@ochafik This is an example response from the official DeepSeek API (stream + non-stream): https://gist.github.com/ngxson/89a568d22e02b9a93f845abdfd8427a6
Btw, I think it could be better to only enable this feature via a flag. For example, only return a separate `reasoning_content` if the request contains `"reasoning_format": "deepseek"`:
{
"messages": [
{"role": "user", "content": "hello"}
],
"reasoning_format": "deepseek",
"stream": false,
"temperature": 0.2
}
This is because this feature is not OAI-compatible, so this is just in case OAI implements it differently in the future (which, btw, they definitely will, due to some childish political conflicts).
Edit: and/or we can also rely on `"model": "deepseek-reasoner"`, which users will definitely set if they're using the DeepSeek API, so we'd just provide a drop-in replacement for the DeepSeek API here.
It's also available as "deepseek/deepseek-r1" using the standard OpenAI client on OpenRouter - it just returns the content (no thinking part) without having to specify any extra parameters. Distilled versions like "deepseek/deepseek-r1-distill-llama-70b" work the same way. Local gguf files might be named like "DeepSeek-R1-Distill-Qwen-32B-IQ3_XS.gguf" or "FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview-IQ3_XS.gguf", so the simplest and most universal solution for now (compatible with how people use those models via API) could be simply checking whether lowercase([model_name]) contains "deepseek-reasoner", "deepseek-r1" or "deepseekr1", and then activating content splitting.
Being able to enforce this format manually (so it would work with models that behave like R1 but use completely different names) sounds like a good idea, too (and later on, other popular thinking models could be added to the auto-detection list - it might be stuff like Open-R1, etc.).
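A name-based check along those lines is easy to sketch (hypothetical helper, not code from this PR):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not from this PR): does the model name look like a
// DeepSeek R1-style reasoner? Mirrors the name-based heuristic suggested above.
static bool looks_like_deepseek_reasoner(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return (char) std::tolower(c); });
    for (const char * needle : { "deepseek-reasoner", "deepseek-r1", "deepseekr1" }) {
        if (name.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```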
Btw, I think it could be better to only enable this feature via a flag
So... I've fenced the parsing of DeepSeek R1's `<think>` tags behind a new experimental `--think` flag.
Then I thought... what about other models? Don't they deserve to think too? (It seems weird that a flag would only affect a single model.) So I added forced thoughts output to the generic chat handler when that flag is set (similar to what I was doing in #6389).
Marked as experimental / will probably need customisability (w/ templates??), but it turns any model into a thinking model.
WDYT?
[](common_params & params) {
    params.think = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_THINK"));
IMO the `--think` flag is not very intuitive; it should be something like `--reasoning-format`, but personally I'd still prefer to do it on a per-request basis.
Also, to be future-proof, we should make this flag accept a param. For example, `--reasoning-format deepseek` will return it as `reasoning_content`. Again, this is because we are pretty sure that OpenAI will break the whole thing in the near future.
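Sketched against the `common_arg` helper shown in the diff above, a value-taking flag might look roughly like this (the `params.reasoning_format` field and the enum names are assumptions, not the PR's actual code):

```cpp
// Sketch only: make the flag take a value instead of being a boolean switch.
add_opt(common_arg(
    {"--reasoning-format"}, "FORMAT",
    "controls whether thoughts are returned in a separate reasoning_content field (deepseek) or left inline (none)",
    [](common_params & params, const std::string & value) {
        if (value == "deepseek") {
            params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
        } else if (value == "none") {
            params.reasoning_format = COMMON_REASONING_FORMAT_NONE;
        } else {
            throw std::invalid_argument("unknown reasoning format: " + value);
        }
    }
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_REASONING_FORMAT"));
```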
From a usability standpoint, `--think` feels a bit more intuitive / memorable to me about what it does.
Other alternatives might be `--separate-thinking` or `--extract-reasoning` or `--format-reasoning` or...?
I mean, what's non-intuitive about it is that the model will still think even without `--think` being added. This flag is supposed to force the model to start the reasoning process, not to enable/disable it completely.
Tbh I'm not really a fan of a query parameter yet, mostly because:
Re/ flags, I think there may be room for two: one that controls the reasoning behaviour (extract, leave inline, force), and one for the return format. For the first one, how about:

- `--reasoning=extract` → Parses DeepSeek R1 & Command R7B thoughts (default)
- `--reasoning=inline` → Leaves thoughts inline in the content (format is model-specific)
- `--reasoning=force` → Coerces non-thinking models to think (edit: maybe `force-experimental` for now)

As for the format, there are already two APIs out there:

- `reasoning_content`

I'd favour just sticking to `reasoning_content` for now until OpenAI announces their own format (and when they do, default to OpenAI's format and offer `--reasoning-format=deepseek` for backwards compatibility). OR decide to create our own format now / default to `--reasoning-format=llama.cpp` that returns thoughts in the `message.thoughts` field, for instance.
WDYT?
Tbh I'm not really a fan of a query parameter yet
Seems OK to me, but I think eventually someone is going to add this as a per-request param.
Re. having 2 flags for behavior / format, this seems more reasonable to me. From a functional programming perspective, it can be expressed as response = format(behavior(original_generated_content)).
But I think your idea is still mixed between these 2 layers.
For `--reasoning extract|inline|force`:

- I'm not sure what `force` is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that; maintaining prompts can be a burden
- `inline`: does that mean reasoning can appear in the middle of generation? Example: content..think..content..think. Please lmk if I understand correctly.

For `--reasoning-format`:

I don't get why we want to invent a new `--reasoning-format llama.cpp` that puts things inside `message.thoughts`. IMO we should keep things simple until OpenAI drops their format. Probably we can have `--reasoning-format deepseek|none` and set `deepseek` as the default for now, then change the default to `oai` once we have that OpenAI format.
But I think your idea is still mixed between these 2 layers.
IMO if we want `--reasoning` to control the behavior, then it should affect the generation phase (for example, control grammar / logit bias). So it should have 3 values:

- `enabled`: the model behaves as usual
- `disabled`: we never allow the model to use the `<think>` token ==> control via logit bias
- `force`: force the model to think, maybe using grammar or prompt engineering?

Then for the `--reasoning-format`, it should only "transform" the result into the desired format (a.k.a. a pure function); we can have 3 values:

- `deepseek`: put content inside `reasoning_content`
- `none`: do not format, simply forward all the generated tokens to the user
- `oai` can be added in the future

IMO if we want `--reasoning` to control the behavior, then it should affect the generation phase (for example, control grammar / logit bias). So it should have 3 values:

- `enabled`: the model behaves as usual
- `disabled`: we never allow the model to use the `<think>` token ==> control via logit bias
- `force`: force the model to think, maybe using grammar or prompt engineering?
@ngxson Makes sense, removed the forced thinking from this PR / will explore again as a follow-up (also, I see more of a case for this option as a query param, while reasoning-format now has stronger flag vibes).
I'm not sure what force is supposed to do, but seems like it needs some prompt engineering so I think you should consider that, maintaining prompts can be a burden
Good point. In earlier experimentations I tried controlling the entire tool call process (even grammar generation) from Jinja templates, might play with this again.
Then for the --reasoning-format, it should only "transform" the result into the desired format (a.k.a. a pure function); we can have 3 values:
- deepseek: put content inside reasoning_content
- none: do not format, simply forward all the generated tokens to the user
- and then oai can be added in the future
Updated the code (defaulting to deepseek), thanks!
Note that I've updated the code to the latest DeepSeek template changes (they added a trailing `<think>`; updated minja accordingly: #11774 (comment)).
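To make the "pure transform" idea concrete, here is a simplified sketch of what a reasoning-format pass can look like (illustrative only; the real parser in chat.cpp is template-aware and more involved, and the type names here are assumptions):

```cpp
#include <string>

enum common_reasoning_format {
    COMMON_REASONING_FORMAT_NONE,
    COMMON_REASONING_FORMAT_DEEPSEEK, // put thoughts in reasoning_content
};

struct common_chat_msg {
    std::string content;
    std::string reasoning_content;
};

// Simplified sketch of the "pure" transform: split a leading <think>...</think>
// block out of the generated text. The real implementation also handles other
// syntaxes (e.g. Command R7B's <|START_THINKING|>...<|END_THINKING|>).
static common_chat_msg apply_reasoning_format(const std::string & generated, common_reasoning_format fmt) {
    common_chat_msg msg;
    const std::string open = "<think>", close = "</think>";
    if (fmt == COMMON_REASONING_FORMAT_DEEPSEEK && generated.rfind(open, 0) == 0) {
        const auto end = generated.find(close, open.size());
        if (end != std::string::npos) {
            msg.reasoning_content = generated.substr(open.size(), end - open.size());
            msg.content           = generated.substr(end + close.size());
            return msg;
        }
    }
    msg.content = generated; // none, or no thinking block found: forward as-is
    return msg;
}
```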
Hey @ngxson, any more concerns / suggestions about this PR?
Sorry, I've been a bit busy recently; will do a review later today or tomorrow.
Tag me for review after @ngxson approves. I will do a quick pass after that, as I am not very familiar with the tool calling functionality yet and can't provide much meaningful feedback.
Just curious: why non-streaming API only for now? How much additional work is needed to support the streaming API?
Just curious: why non-streaming API only for now? How much additional work is needed to support the streaming API?
@Sherlock-Holo That's next on my list; I wanted to get the non-streamed logic working well first, then I'll need to revamp the parsers to accept e.g. an unclosed JSON list of "parallel" tool calls and stream them back one by one (a bit of delta bookkeeping to do: tool call deltas give updates to the arguments for the current tool call, then move to the next, etc.). Medium amount of work but probably gnarly haha.
In streaming mode, the output data does not separate 'content' and 'reasoning_content'; it looks like this:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<think>"}}],"created":1739000016,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1739000017,"id":"chatcmpl-QPiD7T4WVir86Qga3YHuhmJ0DO7hNQHK","model":"DeepSeek-R1-UD-IQ1_M","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
@WangxuP Based on my (limited) understanding of the delta format used by OpenAI (incl. for tool calls), the "correct" way to stream thoughts back would be to hold off on anything that might be an opening `<think>` tag, then send it as a reasoning_content delta. I hope we see how OpenAI stream their own thoughts in the near future (I have a few more things to crunch on before implementing streaming anyway).
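To illustrate the kind of buffering that streaming would need, here is a rough sketch (hypothetical helper, not part of this non-streaming PR):

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Rough sketch of streaming "<think>" handling: withhold text that could still turn
// out to be the opening tag, then route deltas to reasoning_content until "</think>".
struct think_stream_splitter {
    std::string pending;   // text withheld because it may be a partial tag
    bool in_thoughts = false;

    // Returns {content_delta, reasoning_delta} for this chunk of generated text.
    std::pair<std::string, std::string> feed(const std::string & delta) {
        pending += delta;
        std::string content, reasoning;
        const std::string open = "<think>", close = "</think>";
        for (;;) {
            const std::string & tag = in_thoughts ? close : open;
            auto pos = pending.find(tag);
            if (pos == std::string::npos) {
                // emit everything except a suffix that could still be a partial tag
                size_t keep = 0;
                for (size_t n = std::min(pending.size(), tag.size() - 1); n > 0; n--) {
                    if (pending.compare(pending.size() - n, n, tag, 0, n) == 0) { keep = n; break; }
                }
                (in_thoughts ? reasoning : content) += pending.substr(0, pending.size() - keep);
                pending.erase(0, pending.size() - keep);
                return {content, reasoning};
            }
            (in_thoughts ? reasoning : content) += pending.substr(0, pos);
            pending.erase(0, pos + tag.size());
            in_thoughts = !in_thoughts;
        }
    }
};
```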
| Template | Format |
|----------|--------|
| CohereForAI-c4ai-command-r-plus-default.jinja | generic tool calls |
| CohereForAI-c4ai-command-r-plus-rag.jinja | generic tool calls |
| CohereForAI-c4ai-command-r-plus-tool_use.jinja | generic tool calls |
| MiniMaxAI-MiniMax-Text-01.jinja | generic tool calls |
| NexaAIDev-Octopus-v2.jinja | generic tool calls |
| NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja | generic tool calls |
| NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja | hermes 2 pro tool calls |
| NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja | generic tool calls |
| NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja | hermes 2 pro tool calls |
| NousResearch-Hermes-3-Llama-3.1-70B-default.jinja | generic tool calls |
| NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja | hermes 2 pro tool calls |
| OrionStarAI-Orion-14B-Chat.jinja | generic tool calls |
| Qwen-QwQ-32B-Preview.jinja | hermes 2 pro tool calls |
| Qwen-Qwen2-7B-Instruct.jinja | generic tool calls |
| Qwen-Qwen2-VL-7B-Instruct.jinja | generic tool calls |
| Qwen-Qwen2.5-7B-Instruct.jinja | hermes 2 pro tool calls |
| Qwen-Qwen2.5-Math-7B-Instruct.jinja | hermes 2 pro tool calls |
| TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja | generic tool calls |
| abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja | generic tool calls |
| bofenghuang-vigogne-2-70b-chat.jinja | generic tool calls |
| databricks-dbrx-instruct.jinja | generic tool calls |
| deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja | generic tool calls |
| deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja | deepseek r1 tool calls |
| deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja | deepseek r1 tool calls |
| deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja | deepseek r1 tool calls |
| deepseek-ai-DeepSeek-V2.5.jinja | deepseek r1 tool calls |
| deepseek-ai-deepseek-coder-33b-instruct.jinja | generic tool calls |
| google-gemma-2-2b-it.jinja | generic tool calls |
| google-gemma-7b-it.jinja | generic tool calls |
| indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja | generic tool calls |
| mattshumer-Reflection-Llama-3.1-70B.jinja | generic tool calls |
| meetkai-functionary-medium-v3.2.jinja | functionary v3.2 tool calls |
| meta-llama-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
| meta-llama-Llama-3.2-3B-Instruct.jinja | llama 3.x tool calls |
| meta-llama-Llama-3.3-70B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
| meta-llama-Meta-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
| microsoft-Phi-3-medium-4k-instruct.jinja | generic tool calls |
| microsoft-Phi-3-mini-4k-instruct.jinja | generic tool calls |
| microsoft-Phi-3-small-8k-instruct.jinja | generic tool calls |
| microsoft-Phi-3.5-mini-instruct.jinja | generic tool calls |
| microsoft-Phi-3.5-vision-instruct.jinja | generic tool calls |
| mistralai-Mistral-7B-Instruct-v0.2.jinja | generic tool calls |
| mistralai-Mistral-Large-Instruct-2407.jinja | mistral nemo tool calls |
| mistralai-Mistral-Large-Instruct-2411.jinja | generic tool calls |
| mistralai-Mistral-Nemo-Instruct-2407.jinja | mistral nemo tool calls |
| mistralai-Mixtral-8x7B-Instruct-v0.1.jinja | generic tool calls |
| mlabonne-AlphaMonarch-7B.jinja | generic tool calls |
| nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja | llama 3.x tool calls (w/ builtin tools) |
| openchat-openchat-3.5-0106.jinja | generic tool calls |
| teknium-OpenHermes-2.5-Mistral-7B.jinja | generic tool calls |
| Almawave-Velvet-14B.jinja | Hermes 2 Pro |
Just noting here (no need to take any action right now), but this README file is now too long and hard to follow for new users. I'm planning to break it into smaller files (like what we did with the docs directory). Potentially we will end up with main API docs, tool-calling docs and development docs.
Sounds great!! Happy to help with this (even if only by reviewing).
// Distill Qwen 7B & 32B models seem confused re/ syntax of their tool call opening tag,
// so we accept common variants (then it's all constrained)
builder.add_rule("root",
    "( \"<｜tool▁calls▁begin｜>\" | \"<｜tool_calls_begin｜>\" | \"<｜tool calls begin｜>\" | \"<｜tool\\\\_calls\\\\_begin｜>\" ) "
Small nit: if you're doing multiple string concatenations, it's better to use std::ostringstream to reduce the number of copies.
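i.e. roughly this pattern (illustrative snippet, not the PR's code):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Accumulate into one ostringstream buffer instead of repeated std::string
// operator+ concatenations (each of which may allocate and copy).
static std::string join_alternatives(const std::vector<std::string> & alts) {
    std::ostringstream out;
    out << "( ";
    for (size_t i = 0; i < alts.size(); i++) {
        if (i > 0) out << " | ";
        out << "\"" << alts[i] << "\"";
    }
    out << " )";
    return out.str();
}
```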
Fair point, for now I've been favouring readability but will keep this in mind when doing an optimization pass (depending on how much this all ends up costing, we might want to cache various bits and/or create a grammar DSL that would bypass the string stage altogether; JSON schema conversion has lots of room for optimization & I'd also like to take the llguidance stuff into account: exciting prospects!)
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
    std::vector<std::string> tool_rules;
    foreach_function(inputs.tools, [&](const json & tool) {
        const auto & function = tool["function"];
Better to use .at() instead of operator[] when possible, as explained in https://github.com/nlohmann/json:
In function from_json, use function at() to access the object values rather than operator[]. In case a key does not exist, at throws an exception that you can handle, whereas operator[] exhibits undefined behavior
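A minimal standalone illustration of the difference (not code from this PR):

```cpp
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    const json tool = json::parse(R"({"function": {"name": "get_weather"}})");
    try {
        // .at() throws json::out_of_range for a missing key, which can be caught:
        std::cout << tool.at("function").at("name") << "\n";      // prints "get_weather"
        std::cout << tool.at("function").at("arguments") << "\n"; // throws
    } catch (const json::out_of_range & e) {
        std::cerr << "missing key: " << e.what() << "\n";
    }
    // On a const json, operator[] with a missing key is undefined behavior instead.
}
```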
Updated, thanks!
That's next on my list; I wanted to get the non-streamed logic working well first, then I'll need to revamp the parsers to accept e.g. an unclosed JSON list of "parallel" tool calls and stream them back one by one (a bit of delta bookkeeping to do: tool call deltas give updates to the arguments for the current tool call, then move to the next, etc.). Medium amount of work but probably gnarly haha.
Yes, I would say this will be a hard approach, especially because each model has its own format, so we can't really rely on regex much in stream mode.
Indeed, I assume most of the complication will be about moving away from regex and using some kind of "state machine" to keep track of the generated text. From this perspective, I'm wondering: is it worth inventing our own implementation of regex? Regex is just a state machine under the hood, so by doing this we could fully manage the state on our own.
Written as (pseudo-) code, my idea looks like:
struct chat_regex tool_regex;
tool_regex.add_literal("<tool_name>");
tool_regex.add_string();
tool_regex.add_literal("</tool_name><tool_data>");
tool_regex.add_json();
tool_regex.add_literal("</tool_data>");
tool_regex.end();
Which will be compiled into:
flowchart LR
F@{ shape: dbl-circ, label: "end" }
A --<tool_name>--> B
B --string--> C
C --</tool_name><tool_data>--> D
D --json--> E
E --</tool_data>--> F
We create a "state" object each time we want to use it (i.e. store it in the server slot):
slot.chat_parser_state = chat_parser_state(tool_regex); // initial state A
slot.chat_parser_state << slot.generated_text; // with "generated_text" is the "delta" generated content
slot.chat_parser_state.get_matched(); // not sure yet what it should return
From this perspective, I'm wondering, is it worth inventing our own implementation of regex? Regex is just a state machine under the hood, so by doing this we can fully manage the state on our own.
@ngxson Yesss!!
So, my original dream was to write a recursive descent / backtracking parser based on the existing GBNF grammars, and to use a crude naming convention to extract data out of rules: *-tool-calls-N, *-tool-call-N, *-tool-call-name-N & *-tool-call-arguments-N (the N is there to allow alternative tool call syntaxes to be extracted).
A bit ad hoc and magic, but very limited modifications would be needed in the tool call code (just stick to a consistent naming, and delete all the regexp code) and it's a pretty simple parser to implement (can add some hooks to make it generic wrt/ the naming convention, to extract any kind of data).
It would also make it trivial to support partial extractions / streaming by memorizing the parsing stack (& extracted variables) that consumed the longest text (when parsing fails).
(and +1 to keeping the state in the slot, although TBD whether that should be a parsing stack state - first stack that failed because of an EOF? - or just the JSON tree of the last response returned, doing a React-style full-DOM diff at every step; much slower but might be safer, to be investigated)
If we agree to explore this route, I might start by refactoring the grammar parsing code to output an easier intermediate grammar AST that can then be used directly by the recursive descent parser (and be trivially converted to the pointer-heavy sampling grammar structure).
Written as (pseudo-) code, my idea looks like:
struct chat_regex tool_regex; tool_regex.add_literal("<tool_name>") tool_regex.add_string() tool_regex.add_literal("</tool_name><tool_data>") tool_regex.add_json() tool_regex.add_literal("</tool_data>") tool_regex.end()
@ngxson we could also explore this kind of syntax to build a DSL to create the dual-use GBNF grammar (possibly also llguidance grammar)
cc/ @mmoskal @HanClinto
(and +1 to keeping the state in the slot, although TBD whether that should be a parsing stack state - first stack that failed because of an EOF? - or just the JSON tree of the last response returned, doing a React-style full-DOM diff at every step; much slower but might be safer, to be investigated)
I don't really understand the second part of your phrase about "first stack that failed because of an EOF", but IMO storing the parsing stack is fine. The React-style diff may sound intuitive/safer, but I think nlohmann::json is not performant enough to do that efficiently. I even suspect we may end up with an implementation slower than the JavaScript version used by React.
we could also explore this kind of syntax to build a DSL to create the dual-use GBNF grammar (possibly also llguidance grammar)
I had a quick look at all of the regexes you're currently using in chat.cpp, but I think a DSL is not really needed at the moment, because most of your regexes can be expressed in a more intuitive way using my pseudo-code above. Furthermore, the maintenance cost may be high, given that we're only going to use it internally.
Most of your regexes use [\\s\\n\\r], [^something]+, ([\\s\\S\\r\\n]*?), which can be expressed as cpp functions like maybe_space(), match_until(...).
And to make it look even nicer, we can use cpp operator overloading, for example with operator->:
tool_regex -> literal("<tool_name>") -> string() -> literal("</tool_name>");
Another benefit of this approach is that some expressions can also be optimized during compile time.
Edit: on second thought, using -> could be a bad idea because it can be confused with pointer dereference. >> or << would be a better choice. Or maybe just chain calls, tool_regex.literal(...).string().literal(...), for simplicity.
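A chain-call surface could be as small as this (rough sketch of the proposed DSL's shape only; the actual matching/compilation is deliberately left out):

```cpp
#include <string>
#include <vector>

// Rough sketch of the proposed chain-call DSL surface. The step kinds mirror the
// pseudo-code above; how steps get compiled/matched is intentionally omitted.
struct chat_regex {
    enum step_kind { LITERAL, STRING, JSON, SPACE };
    struct step { step_kind kind; std::string text; };
    std::vector<step> steps;

    chat_regex & literal(const std::string & s) { steps.push_back({LITERAL, s}); return *this; }
    chat_regex & string()                       { steps.push_back({STRING,  ""}); return *this; }
    chat_regex & json()                         { steps.push_back({JSON,    ""}); return *this; }
    chat_regex & maybe_space()                  { steps.push_back({SPACE,   ""}); return *this; }
};

// Usage, mirroring the example in the discussion:
//   chat_regex tool_regex;
//   tool_regex.literal("<tool_name>").string().literal("</tool_name><tool_data>").json().literal("</tool_data>");
```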
@ochafik right! llguidance already supports streaming and emitting capture groups for subgrammars; it will even know to only emit "foo" when the tokens so far are "foo<", but then emit "<text" when "text" is sampled (and not "tool").
There is also some code in there to support general stop regexes using a (lazy) state machine.
Note that as people develop tool calling more in models, they are likely to use special tokens for tool calling, JSON mode, etc. I'm not sure GBNF handles that particularly well (that is, the difference between "<|foo|>" and "<|" "foo" "|>").
If we agree to explore this route, I might start by refactoring the grammar parsing code to output an easier intermediate grammar AST that can then be used directly by the recursive descent parser (and be trivially converted to the pointer-heavy sampling grammar structure).
Most of the grammar and tool functionality is way over my head and I cannot provide very meaningful feedback. But overall I think the llama-grammar module could use some deeper refactoring and maintenance. The main things I would look out for are keeping the implementation simple, no extra dependencies and good performance. The general approach has been to prototype stuff in libcommon and, when it becomes mature enough, to move it to libllama.
One more thought is that long-term we could also think about moving some core grammar functionality to ggml. At some point I was considering it because I wanted to reuse the grammar functionality in whisper.cpp. So it's something to think about, but very low prio atm.
@ochafik here's how the lazy matching is handled in llguidance, see also docs
import llguidance
import huggingface_hub
import json

lark_grm = """
start: "<tool_name>" name "<tool_data>" data "</tool_data>"
name[capture, suffix="</tool_name>"]: /.*/
data[capture]: %json {
    "properties": {
        "foo": { "type": "string" }
    },
    "required": ["foo"]
}
"""

def main():
    tok_name = huggingface_hub.hf_hub_download(
        "microsoft/Phi-3.5-mini-instruct", "tokenizer.json"
    )
    with open(tok_name, "r") as f:
        text = f.read()
    tok = llguidance.LLTokenizer(text)
    interp = llguidance.LLInterpreter(
        tok,
        json.dumps({"grammars": [{"lark_grammar": lark_grm}]}),
        enable_ff_tokens=False,
        enable_backtrack=False,
        log_level=1,
    )
    interp.start_without_prompt()
    toks = tok.tokenize_str("<tool_name>foo<bar></tool_name><tool_data>{\"foo\": \"bar\"}</tool_data>")
    for t in toks:
        mask, r = interp.compute_mask()
        obj = json.loads(r)
        for p in obj["progress"]:
            if p["object"] != "text":
                print("\n ", end="")
                print(p)
        # feeding token now
        print(tok.dbg_tokens([t]), end=" ")
        interp.commit_token(t)
    print("\n")

if __name__ == "__main__":
    main()
When you run it, you get:
⟦<⟧ ⟦tool⟧ ⟦_⟧ ⟦name⟧ ⟦>⟧ ⟦foo⟧ ⟦<⟧ ⟦bar⟧ ⟦></⟧ ⟦tool⟧ ⟦_⟧ ⟦name⟧ ⟦><⟧
{'object': 'capture', 'name': 'name', 'str': 'foo<bar>', 'hex': '666f6f3c6261723e', 'log_prob': 0.0}
⟦tool⟧ ⟦_⟧ ⟦data⟧ ⟦>{⟧ ⟦"⟧ ⟦foo⟧ ⟦":⟧ ⟦ "⟧ ⟦bar⟧ ⟦"}⟧
{'object': 'capture', 'name': 'data', 'str': '{"foo": "bar"}', 'hex': '7b22666f6f223a2022626172227d', 'log_prob': 0.0}
⟦</⟧ ⟦tool⟧ ⟦_⟧ ⟦data⟧ ⟦>⟧
The captures are generated immediately after getting enough tokens.
If the model uses special tokens, you need to write the grammar slightly differently:
start: <|assistant|> name <|end|> /\s*/ data
name[capture]: /.*/
data[capture]: %json {
    "properties": {
        "foo": { "type": "string" }
    },
    "required": ["foo"]
}
Note the lack of suffix= on name - it will extend greedily, until it hits the <|end|> special token. Special tokens are never allowed by regular expressions.
- `--reasoning-format FORMAT` flag that populates `message.reasoning_content` in the response, using the native `<think>` tags for DeepSeek R1 and `<|START_THINKING|>` for Command R7B if the format is `deepseek` (default), otherwise leaving thinking traces as they are in `message.content` if the format is `none`
- Flows the `tool_plan` field added temporarily in #11585 into the new `reasoning_content` (non-streaming API only for now).
Usage

- Get and build this PR's branch
- Run with (add `--verbose` to inspect prompt / grammars used):
- Call the API and profit
Show result w/ `DeepSeek-R1-Distill-Qwen-32B-GGUF:Q6_K_L`
Which is this code:
Not too bad, but it didn't do lower-case and word split is a bit poor.
Trying again w/ the following extra args to make the sampling greedy:
We have a winner:
And the thoughts:
Implementation notes

- #11641
- `<｜tool▁output▁end｜>` or `<｜tool▁call▁end｜>` (need to close the list of outputs / calls w/ plural `<｜tool▁outputs▁end｜>` / `<｜tool▁calls▁end｜>`, respectively, and then missing end of sentence + optional add_generation_prompt)
- `models/templates/llama-cpp-deepseek-r1.jinja`)
- `--reasoning-format` flag, which controls output of `reasoning_content` in the API (see `test_thoughts`)
- `<think>...</think>` tags for DeepSeek R1 and `<|START_THINKING|>...<|END_THINKING|>` for Command R7B.
- `tool_plan` field / now flowing into `reasoning_content` (was added in #11585)
- `test_calc_result` (checking models make some use of tool call results, which some struggle a bit with)

TODOs:
Possible follow ups

- Document differences between stream & non-stream modes (thought & tool_plan not sent in stream)
- Look at the Llama distill more closely (see #11591)
- Reintroduce forced thinking in the generic handler under some `--reasoning` flag (+ explore @ngxson's idea to support a `disabled` value that biases thinking tags)
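On that `disabled` idea, the logit-bias route could look roughly like this (sketch only; it assumes the llama_sampler_init_logit_bias sampler and that the thinking tag maps to a single token id, which needs verifying per model):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

#include "llama.h"

// Sketch of the `disabled` idea: push the "<think>" token's logit to -inf via the
// logit-bias sampler so the model can never open a thinking block. Assumes the tag
// is a single (special) token; real code must verify this per model/tokenizer.
static struct llama_sampler * make_no_think_sampler(int32_t n_vocab, llama_token think_token) {
    std::vector<llama_logit_bias> biases = { { think_token, -INFINITY } };
    return llama_sampler_init_logit_bias(n_vocab, (int32_t) biases.size(), biases.data());
}
```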