Skip to content

Track long-term replacement for destructive EmitTo::First in aggregation #23251

Description

@hhhizzz

Problem

#23178 and #23249 both found regressions caused by using EmitTo::First(batch_size) during hash aggregate terminal output.

The short-term fixes materialize aggregate output once with EmitTo::All and then slice the materialized RecordBatch. This avoids repeated EmitTo::First work for final and partial terminal output paths.

However, this does not solve the broader API/design problem:

  • EmitTo::First(n) sounds like cheap prefix emission, but it also mutates internal state.
  • It removes emitted groups and renumbers remaining group indexes.
  • It can require expensive copying, shifting, allocation, and lookup-state maintenance.
  • It is easy to misuse in terminal output paths where no more groups will be interned.
  • It also makes GroupValues and GroupsAccumulator implementations harder to maintain.

Related discussion

In #23249, @Rachelint suggested it may be better to eliminate EmitTo::First entirely, including sorted cases, because it is expensive, hard to optimize, and difficult to support.

@2010YOUY01 suggested that after blocked state management, the ideal shape may be:

  • no EmitTo::First
  • O(1) EmitTo::FirstBlock

Goal

Track the long-term design for replacing or narrowing EmitTo::First usage in aggregation.

Non-goals

This issue is not intended to block the short-term regression fixes in #23178 or #23249.

Those fixes can continue using materialize-once-and-slice. This issue tracks the broader follow-up design and incremental migration path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions