fix: clamp max_tokens to the context window for OpenAI-compatible providers #3393

Open
Sayt-0 wants to merge 1 commit into main from fix/3387-clamp-max-tokens-context-window

@Sayt-0 Sayt-0 commented Jul 1, 2026

Summary

A self-hosted vLLM user (#3387) got "the conversation has exceeded the model's context window" on a bare "hello". Root cause: max_tokens (the per-response output budget) was set equal to the context window and forwarded verbatim. vLLM requires prompt_tokens + max_tokens <= context_window, so no room was left for the prompt.

Root cause

| Cause | Handled by | Status |
| --- | --- | --- |
| max_tokens forwarded unclamped | clamp in OpenAI client | fixed |
| YAML context_size read as 0 (uint64 not handled) | shared parser handles unsigned ints | fixed |
| Misleading config, no early signal | load-time warning + schema doc | fixed |

Changes

  • pkg/model/provider/openai/client.go: clamp max_tokens to window - 1024 when the window is known (provider_opts.context_size first, then models.dev), on both the chat-completions and responses paths. When the window is unknown, the value is left unchanged.
  • pkg/config/latest/types.go: ContextSizeFromProviderOpts, a single parser now used by the runtime and the client. Handles uint64/uint/uint32 (goccy/go-yaml decodes positive YAML integers as uint64), so a YAML context_size is honored. This also restored proactive compaction for those configs.
  • pkg/runtime/session_compaction.go: providerContextLimit delegates to the shared parser.
  • pkg/config/max_tokens_warning.go: load-time warning when max_tokens >= context_size, or when it is set to a context-window-sized value with no discoverable window.
  • agent-schema.json: max_tokens documented as the output budget, not the context window.
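
The uint64 issue described above can be illustrated with a minimal sketch. This is not the PR's actual ContextSizeFromProviderOpts: the function name contextSizeFromOpts, its signature, and the choice to treat non-positive or unparsable values as "unknown" are assumptions made for illustration. It only shows why a type switch that omits the unsigned cases returns 0 for a value decoded as uint64.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// contextSizeFromOpts is a hypothetical stand-in for the shared parser.
// It returns the configured window and true when a positive size is found.
func contextSizeFromOpts(opts map[string]any) (int64, bool) {
	raw, ok := opts["context_size"]
	if !ok {
		return 0, false
	}
	var n int64
	switch v := raw.(type) {
	case int:
		n = int64(v)
	case int64:
		n = v
	case uint32:
		n = int64(v)
	case uint:
		if uint64(v) > math.MaxInt64 {
			return 0, false
		}
		n = int64(v)
	case uint64:
		// Without this case, a value decoded as uint64 would hit default.
		if v > math.MaxInt64 {
			return 0, false
		}
		n = int64(v)
	case float64:
		n = int64(v) // truncates any fractional part
	case string:
		parsed, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			return 0, false
		}
		n = parsed
	default:
		return 0, false
	}
	if n <= 0 {
		return 0, false // assumption: non-positive means "unknown"
	}
	return n, true
}

func main() {
	opts := map[string]any{"context_size": uint64(262144)}
	size, ok := contextSizeFromOpts(opts)
	fmt.Println(size, ok) // prints "262144 true"
}
```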

Reproduction (mock enforcing vLLM's rule, max_model_len = 262144)

| Scenario | max_tokens sent | Server | Result |
| --- | --- | --- | --- |
| Reporter config (no context_size) | 262144 | reject: 12 + 262144 > 262144 | bug reproduced, warning shown |
| With context_size: 262144 | 261120 (clamped) | accept: 12 + 261120 <= 262144 | reply returned |
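
The server-side rule enforced by the mock can be written as a one-line check. The sketch below is illustrative only: fitsWindow is a hypothetical helper, not code from the mock or from vLLM, and the 12-token prompt is the figure from the reproduction above.

```go
package main

import "fmt"

// fitsWindow applies the rule prompt_tokens + max_tokens <= context_window.
// Hypothetical helper, named here for illustration.
func fitsWindow(promptTokens, maxTokens, window int64) bool {
	return promptTokens+maxTokens <= window
}

func main() {
	const window = 262144

	// Reporter config: max_tokens equals the window, so any prompt overflows.
	fmt.Println(fitsWindow(12, 262144, window)) // prints "false"

	// Clamped value: window - 1024 leaves room for the 12-token prompt.
	fmt.Println(fitsWindow(12, window-1024, window)) // prints "true"
}
```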

Design note (open to maintainer preference)

The clamp reserves a fixed 1024-token headroom (window - 1024), matching the Anthropic client's clampMaxTokens. It is deliberately prompt-agnostic (no token-count round-trip): it guarantees room for a small prompt, not necessarily a very large one. A large agent prompt under a known window could still overflow and would fall through to the existing overflow detection and compaction. If a percentage-based margin (for example window - max(1024, window/8)) is preferred to better fit large agent prompts, it is a one-line change.
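
To make the trade-off concrete, here is a minimal sketch of both margins. It is not the code in client.go: the names clampFixed and clampProportional are hypothetical, and leaving the value unchanged when the window is unknown or no larger than the headroom is an assumption about edge-case behaviour.

```go
package main

import "fmt"

const fixedHeadroom = 1024

// clampTo caps maxTokens at window - headroom.
// Assumption: an unknown (<= 0) or too-small window leaves the value as-is.
func clampTo(maxTokens, window, headroom int64) int64 {
	limit := window - headroom
	if window <= 0 || limit <= 0 {
		return maxTokens
	}
	if maxTokens > limit {
		return limit
	}
	return maxTokens
}

// clampFixed reserves a fixed 1024-token headroom.
func clampFixed(maxTokens, window int64) int64 {
	return clampTo(maxTokens, window, fixedHeadroom)
}

// clampProportional reserves max(1024, window/8) tokens.
func clampProportional(maxTokens, window int64) int64 {
	headroom := window / 8
	if headroom < fixedHeadroom {
		headroom = fixedHeadroom
	}
	return clampTo(maxTokens, window, headroom)
}

func main() {
	const window = 262144
	fmt.Println(clampFixed(262144, window))        // prints "261120"
	fmt.Println(clampProportional(262144, window)) // prints "229376"
	fmt.Println(clampFixed(4096, window))          // prints "4096" (already below the limit)
	fmt.Println(clampFixed(262144, 0))             // prints "262144" (window unknown)
}
```

With a 262144-token window, the fixed margin leaves 1024 tokens for the prompt, while the proportional variant leaves 32768, at the cost of a smaller output budget.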

Testing

  • pkg/model/provider/openai: clamp fires when the window is known, verbatim when unknown, plus a unit test of the clamp math.
  • pkg/config/latest: parser covers int/uint64/float64/string and a real YAML round-trip.
  • pkg/config: load-time warning cases.
  • pkg/runtime: uint64 case added to the existing context-limit test.
  • task build and task lint pass.

Fixes #3387

…viders

max_tokens is the per-response output budget, not the context window.
OpenAI-compatible servers such as vLLM require prompt_tokens + max_tokens
to fit the context window, so a max_tokens set equal to the window leaves
no room for the prompt and rejects every request with a "maximum context
length" error (surfaced as a context-window-exceeded warning).

Changes:
- Clamp max_tokens to (context window - headroom) in the OpenAI client, on
  both the chat-completions and responses paths, when the window is known
  via provider_opts.context_size or the models.dev catalogue.
- Fix context_size parsing: goccy/go-yaml decodes a positive YAML integer
  as uint64, which the previous switch dropped to 0, so a YAML context_size
  was silently ignored (this also affected proactive compaction). Share one
  parser between the runtime and the provider clients.
- Warn at config load time when max_tokens is set to a context-window-sized
  value.
- Clarify the max_tokens description in agent-schema.json.

Fixes #3387
aheritier added the area/config, area/providers/openai, area/runtime, and kind/fix labels on Jul 1, 2026

Development

Successfully merging this pull request may close these issues.

Getting error "Conversation has exceeded model's context window..." for a simple "hello"

3 participants