Problem
#23178 and #23249 both found regressions caused by using EmitTo::First(batch_size) during hash aggregate terminal output.
The short-term fixes materialize aggregate output once with EmitTo::All and then slice the materialized RecordBatch. This avoids repeated EmitTo::First work for final and partial terminal output paths.
However, this does not solve the broader API/design problem:
EmitTo::First(n) sounds like cheap prefix emission, but it also mutates internal state.
- It removes emitted groups and renumbers remaining group indexes.
- It can require expensive copying, shifting, allocation, and lookup-state maintenance.
- It is easy to misuse in terminal output paths where no more groups will be interned.
- It also makes
GroupValues and GroupsAccumulator implementations harder to maintain.
Related discussion
In #23249, @Rachelint suggested it may be better to eliminate EmitTo::First entirely, including sorted cases, because it is expensive, hard to optimize, and difficult to support.
@2010YOUY01 suggested that after blocked state management, the ideal shape may be:
- no
EmitTo::First
- O(1)
EmitTo::FirstBlock
Goal
Track the long-term design for replacing or narrowing EmitTo::First usage in aggregation.
Non-goals
This issue is not intended to block the short-term regression fixes in #23178 or #23249.
Those fixes can continue using materialize-once-and-slice. This issue tracks the broader follow-up design and incremental migration path.
Problem
#23178 and #23249 both found regressions caused by using
EmitTo::First(batch_size)during hash aggregate terminal output.The short-term fixes materialize aggregate output once with
EmitTo::Alland then slice the materializedRecordBatch. This avoids repeatedEmitTo::Firstwork for final and partial terminal output paths.However, this does not solve the broader API/design problem:
EmitTo::First(n)sounds like cheap prefix emission, but it also mutates internal state.GroupValuesandGroupsAccumulatorimplementations harder to maintain.Related discussion
In #23249, @Rachelint suggested it may be better to eliminate
EmitTo::Firstentirely, including sorted cases, because it is expensive, hard to optimize, and difficult to support.@2010YOUY01 suggested that after blocked state management, the ideal shape may be:
EmitTo::FirstEmitTo::FirstBlockGoal
Track the long-term design for replacing or narrowing
EmitTo::Firstusage in aggregation.Non-goals
This issue is not intended to block the short-term regression fixes in #23178 or #23249.
Those fixes can continue using materialize-once-and-slice. This issue tracks the broader follow-up design and incremental migration path.