fix: clamp max_tokens to the context window for OpenAI-compatible providers #3393
Open
Sayt-0 wants to merge 1 commit into
…viders

max_tokens is the per-response output budget, not the context window. OpenAI-compatible servers such as vLLM require prompt_tokens + max_tokens to fit the context window, so a max_tokens set equal to the window leaves no room for the prompt, and the server rejects every request with a "maximum context length" error (surfaced as a context-window-exceeded warning).

Changes:
- Clamp max_tokens to (context window - headroom) in the OpenAI client, on both the chat-completions and responses paths, when the window is known via provider_opts.context_size or the models.dev catalogue.
- Fix context_size parsing: goccy/go-yaml decodes a positive YAML integer as uint64, which the previous switch dropped to 0, so a YAML context_size was silently ignored (this also affected proactive compaction). Share one parser between the runtime and the provider clients.
- Warn at config load time when max_tokens is set to a context-window-sized value.
- Clarify the max_tokens description in agent-schema.json.

Fixes #3387
dgageot
approved these changes
Jul 1, 2026
Summary
A self-hosted vLLM user (#3387) got "the conversation has exceeded the model's context window" on a bare "hello". Root cause: max_tokens (the per-response output budget) was set equal to the context window and forwarded verbatim. vLLM requires prompt_tokens + max_tokens <= context_window, so no room was left for the prompt.
Root cause
- max_tokens forwarded unclamped
- context_size read as 0 (uint64 not handled)
Changes
- pkg/model/provider/openai/client.go: clamp max_tokens to window - 1024 when the window is known (provider_opts.context_size first, then models.dev), on both the chat-completions and responses paths. When the window is unknown, the value is left unchanged.
- pkg/config/latest/types.go: ContextSizeFromProviderOpts, a single parser now used by the runtime and the client. Handles uint64/uint/uint32 (goccy/go-yaml decodes positive YAML integers as uint64), so a YAML context_size is honored. This also restored proactive compaction for those configs.
- pkg/runtime/session_compaction.go: providerContextLimit delegates to the shared parser.
- pkg/config/max_tokens_warning.go: load-time warning when max_tokens >= context_size, or when it is set to a context-window-sized value with no discoverable window.
- agent-schema.json: max_tokens documented as the output budget, not the context window.
Reproduction (mock enforcing vLLM's rule, max_model_len = 262144)
- … context_size): 12 + 262144 > 262144
- context_size: 262144: 12 + 261120 <= 262144
Design note (open to maintainer preference)
The clamp reserves a fixed 1024-token headroom (window - 1024), matching the Anthropic client's clampMaxTokens. It is deliberately prompt-agnostic (no token-count round-trip): it guarantees room for a small prompt, not necessarily a very large one. A large agent prompt under a known window could still overflow and would fall through to the existing overflow detection and compaction. If a percentage-based margin (for example window - max(1024, window/8)) is preferred to better fit large agent prompts, it is a one-line change.
Testing
- pkg/model/provider/openai: clamp fires when the window is known, verbatim when unknown, plus a unit test of the clamp math.
- pkg/config/latest: parser covers int/uint64/float64/string and a real YAML round-trip.
- pkg/config: load-time warning cases.
- pkg/runtime: uint64 case added to the existing context-limit test.
- task build and task lint pass.
Fixes #3387