fix: clamp max_tokens to the context window for OpenAI-compatible providers by Sayt-0 · Pull Request #3393 · docker/docker-agent

Sayt-0 · 2026-07-01T17:43:05Z

Summary

A self-hosted vLLM user (#3387) got "the conversation has exceeded the model's context window" on a bare "hello". Root cause: max_tokens (the per-response output budget) was set equal to the context window and forwarded verbatim. vLLM requires prompt_tokens + max_tokens <= context_window, so no room was left for the prompt.

Root cause

| Cause | Handled by | Status |
|---|---|---|
| `max_tokens` forwarded unclamped | clamp in OpenAI client | fixed |
| YAML `context_size` read as 0 (`uint64` not handled) | shared parser handles unsigned ints | fixed |
| Misleading config, no early signal | load-time warning + schema doc | fixed |
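The second row's bug is what you get from a type switch that lists only signed integers and `float64`: a `uint64` falls through to the zero value. Below is a minimal sketch of a parser covering the types listed under Changes. It is not the PR's code. `contextSizeFromOpts` is a hypothetical stand-in for `ContextSizeFromProviderOpts`. The `int64` return type and the choice to map negative, out-of-range, or unparseable values to 0 ("unknown") are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// contextSizeFromOpts is a hypothetical stand-in for ContextSizeFromProviderOpts.
// It reads opts["context_size"] and returns 0 when the key is missing, has an
// unsupported type, or holds a negative or out-of-range value.
func contextSizeFromOpts(opts map[string]any) int64 {
	var n int64
	switch v := opts["context_size"].(type) {
	case int:
		n = int64(v)
	case int64:
		n = v
	case uint64: // what goccy/go-yaml produces for a positive YAML integer
		if v > math.MaxInt64 {
			return 0
		}
		n = int64(v)
	case uint:
		if uint64(v) > math.MaxInt64 {
			return 0
		}
		n = int64(v)
	case uint32:
		n = int64(v)
	case float64: // what encoding/json produces for a number decoded into any
		if v > math.MaxInt64 {
			return 0
		}
		n = int64(v)
	case string:
		parsed, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			return 0
		}
		n = parsed
	default:
		return 0 // missing key (nil) or unsupported type
	}
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	fromYAML := map[string]any{"context_size": uint64(262144)}
	fromJSON := map[string]any{"context_size": float64(131072)}
	fromString := map[string]any{"context_size": "32768"}
	missing := map[string]any{}
	// prints: 262144 131072 32768 0
	fmt.Println(contextSizeFromOpts(fromYAML), contextSizeFromOpts(fromJSON),
		contextSizeFromOpts(fromString), contextSizeFromOpts(missing))
}
```

Dropping the `uint64` case from this switch makes the YAML input return 0, which reproduces the silent-ignore behavior in the table.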

Changes

pkg/model/provider/openai/client.go: clamp max_tokens to window - 1024 when the window is known (from provider_opts.context_size first, then models.dev), on both the chat-completions and responses paths. When the window is unknown, the value is forwarded unchanged.
pkg/config/latest/types.go: ContextSizeFromProviderOpts, a single parser now used by the runtime and the client. Handles uint64/uint/uint32 (goccy/go-yaml decodes positive YAML integers as uint64), so a YAML context_size is honored. This also restored proactive compaction for those configs.
pkg/runtime/session_compaction.go: providerContextLimit delegates to the shared parser.
pkg/config/max_tokens_warning.go: load-time warning when max_tokens >= context_size, or when it is set to a context-window-sized value with no discoverable window.
agent-schema.json: max_tokens documented as the output budget, not the context window.
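The load-time check in `pkg/config/max_tokens_warning.go` can be sketched as a pure function. This is a hypothetical illustration only. `maxTokensWarning` and the `suspiciousMaxTokens` threshold are invented here, because the PR does not say how it recognizes a "context-window-sized" value when no window is discoverable. The message wording is also an assumption.

```go
package main

import "fmt"

// suspiciousMaxTokens is a hypothetical threshold standing in for however the
// real check recognizes a context-window-sized max_tokens with no known window.
const suspiciousMaxTokens = 65536

// maxTokensWarning returns a warning, or "" when the config looks fine.
// contextSize is 0 when no window can be discovered.
func maxTokensWarning(model string, maxTokens, contextSize int64) string {
	switch {
	case maxTokens <= 0:
		return "" // max_tokens not set
	case contextSize > 0 && maxTokens >= contextSize:
		return fmt.Sprintf("model %q: max_tokens (%d) >= context_size (%d); "+
			"max_tokens is the output budget and leaves no room for the prompt",
			model, maxTokens, contextSize)
	case contextSize == 0 && maxTokens >= suspiciousMaxTokens:
		return fmt.Sprintf("model %q: max_tokens (%d) looks like a context window size; "+
			"it is the per-response output budget", model, maxTokens)
	default:
		return ""
	}
}

func main() {
	fmt.Println(maxTokensWarning("my-vllm-model", 262144, 262144)) // known window, too large
	fmt.Println(maxTokensWarning("my-vllm-model", 262144, 0))      // unknown window, suspicious
	fmt.Printf("%q\n", maxTokensWarning("my-vllm-model", 8192, 262144))
}
```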

Reproduction (mock enforcing vLLM's rule, `max_model_len = 262144`)

| Scenario | `max_tokens` sent | Server | Result |
|---|---|---|---|
| Reporter config (no `context_size`) | 262144 | reject `12 + 262144 > 262144` | bug reproduced, warning shown |
| With `context_size: 262144` | 261120 (clamped) | accept `12 + 261120 <= 262144` | reply returned |
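The reproduction combines two pieces of arithmetic: the clamp and the rule the mock server enforces. Below is a minimal sketch of both. `clampMaxTokens`, `resolveContextWindow`, and `vllmAccepts` are hypothetical helpers, not the PR's code. The sketch assumes the reporter's window is unknown because neither `context_size` nor a catalogue entry is available. Leaving windows of 1024 tokens or fewer unclamped is also an assumption of the sketch.

```go
package main

import "fmt"

// headroom is the fixed reserve described in this PR (window - 1024).
const headroom = 1024

// resolveContextWindow prefers provider_opts.context_size and falls back to a
// catalogue value (models.dev in the PR). 0 means "unknown".
func resolveContextWindow(providerOptsSize, catalogueSize int64) int64 {
	if providerOptsSize > 0 {
		return providerOptsSize
	}
	return catalogueSize
}

// clampMaxTokens caps max_tokens at window - headroom when the window is known.
// Unknown (0) or implausibly small windows leave max_tokens unchanged.
func clampMaxTokens(maxTokens, window int64) int64 {
	if window <= headroom {
		return maxTokens
	}
	if limit := window - headroom; maxTokens > limit {
		return limit
	}
	return maxTokens
}

// vllmAccepts models the mock server's rule:
// prompt_tokens + max_tokens must not exceed max_model_len.
func vllmAccepts(promptTokens, maxTokens, maxModelLen int64) bool {
	return promptTokens+maxTokens <= maxModelLen
}

func main() {
	const maxModelLen = 262144
	const promptTokens = 12 // prompt size used in the reproduction table

	// Reporter config: no context_size and (assumed) no catalogue entry.
	unclamped := clampMaxTokens(262144, resolveContextWindow(0, 0))
	fmt.Println(unclamped, vllmAccepts(promptTokens, unclamped, maxModelLen)) // 262144 false

	// With context_size: 262144 the clamp applies.
	clamped := clampMaxTokens(262144, resolveContextWindow(262144, 0))
	fmt.Println(clamped, vllmAccepts(promptTokens, clamped, maxModelLen)) // 261120 true
}
```

With the clamped value, the mock accepts any prompt of up to 262144 - 261120 = 1024 tokens. A 1025-token prompt is rejected again, which is the limitation the design note discusses.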

Design note (open to maintainer preference)

The clamp reserves a fixed 1024-token headroom (window - 1024), matching the Anthropic client's clampMaxTokens. It is deliberately prompt-agnostic (no token-count round-trip): it guarantees room for a small prompt, not necessarily a very large one. A large agent prompt under a known window could still overflow and would fall through to the existing overflow detection and compaction. If a percentage-based margin (for example window - max(1024, window/8)) is preferred to better fit large agent prompts, it is a one-line change.
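To make the trade-off concrete, here is a small comparison of the shipped fixed headroom and the proportional alternative max(1024, window/8) for a few window sizes. The function names are hypothetical. Only the two formulas come from this PR.

```go
package main

import "fmt"

// fixedHeadroom is the policy this PR ships: always reserve 1024 tokens.
func fixedHeadroom(window int64) int64 {
	return 1024
}

// proportionalHeadroom is the alternative from the design note: max(1024, window/8).
func proportionalHeadroom(window int64) int64 {
	if h := window / 8; h > 1024 {
		return h
	}
	return 1024
}

func main() {
	for _, window := range []int64{8192, 32768, 262144} {
		fmt.Printf("window=%d fixed=%d proportional=%d\n",
			window, window-fixedHeadroom(window), window-proportionalHeadroom(window))
	}
	// prints:
	// window=8192 fixed=7168 proportional=7168
	// window=32768 fixed=31744 proportional=28672
	// window=262144 fixed=261120 proportional=229376
}
```

At 8192 tokens the two policies agree. At 262144 the proportional margin leaves room for a 32768-token prompt instead of 1024, at the cost of a smaller maximum output budget.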

Testing

pkg/model/provider/openai: clamp fires when the window is known, verbatim when unknown, plus a unit test of the clamp math.
pkg/config/latest: parser covers int/uint64/float64/string and a real YAML round-trip.
pkg/config: load-time warning cases.
pkg/runtime: uint64 case added to the existing context-limit test.
`task build` and `task lint` pass.

Fixes #3387

…viders

max_tokens is the per-response output budget, not the context window. OpenAI-compatible servers such as vLLM require prompt_tokens + max_tokens to fit the context window, so setting max_tokens equal to the window leaves no room for the prompt, and the server rejects every request with a "maximum context length" error (surfaced as a context-window-exceeded warning).

Changes:
- Clamp max_tokens to (context window - headroom) in the OpenAI client, on both the chat-completions and responses paths, when the window is known via provider_opts.context_size or the models.dev catalogue.
- Fix context_size parsing: goccy/go-yaml decodes a positive YAML integer as uint64, which the previous switch dropped to 0, so a YAML context_size was silently ignored (this also affected proactive compaction). Share one parser between the runtime and the provider clients.
- Warn at config load time when max_tokens is set to a context-window-sized value.
- Clarify the max_tokens description in agent-schema.json.

Fixes #3387

Sayt-0 requested a review from a team as a code owner July 1, 2026 17:43

Sayt-0 mentioned this pull request Jul 1, 2026 in #3387: Getting error "Conversation has exceeded model's context window..." for a simple "hello" (Open)

dgageot approved these changes Jul 1, 2026

Sayt-0 wants to merge 1 commit into `main` from `fix/3387-clamp-max-tokens-context-window`
