Support collect_list and collect_set for window execution #23261

Description

@comphead

Is your feature request related to a problem or challenge?

Summary

Support collect_list and collect_set as window functions in DataFusion.

These are commonly used in Spark and other query engines to collect values within a window frame and enable use cases such as rolling lists, session analysis, and sequence-based analytics.

Example

SELECT
    user_id,
    ts,
    collect_list(event) OVER (
        PARTITION BY user_id
        ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS events
FROM t;

SELECT
    user_id,
    ts,
    collect_set(event) OVER (
        PARTITION BY user_id
        ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS unique_events
FROM t;
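
To make the expected results of the two queries concrete, here is a minimal reference sketch in plain Python. It is not DataFusion code. The helper name `running_collect` and the sample rows are hypothetical. It assumes that nulls are skipped, as Spark does, and that collect_set keeps first-seen order, which is a choice made only to keep the output deterministic, since Spark does not guarantee element order for collect_set.

```python
from collections import defaultdict


def running_collect(rows, distinct=False):
    """Emulate collect_list / collect_set over
    PARTITION BY user_id ORDER BY ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

    rows: iterable of (user_id, ts, event) tuples.
    Returns a list of (user_id, ts, collected_values).
    """
    # PARTITION BY user_id
    partitions = defaultdict(list)
    for user_id, ts, event in rows:
        partitions[user_id].append((ts, event))

    out = []
    for user_id in sorted(partitions):
        collected = []
        # ORDER BY ts within the partition
        for ts, event in sorted(partitions[user_id], key=lambda r: r[0]):
            # Assumption: nulls are skipped; with distinct=True (collect_set)
            # a value already in the frame is not added again.
            if event is not None and not (distinct and event in collected):
                collected.append(event)
            # Snapshot of the frame that ends at the current row
            out.append((user_id, ts, list(collected)))
    return out


# Hypothetical sample data standing in for table t
rows = [
    (1, 10, "login"),
    (1, 20, "click"),
    (1, 30, "login"),
    (2, 5, "view"),
]
events = running_collect(rows)                        # collect_list
unique_events = running_collect(rows, distinct=True)  # collect_set
```

For user 1 at ts = 30, the collect_list frame holds login, click, login, while the collect_set frame holds only login and click.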

Motivation

Supporting these functions would improve Spark compatibility and align DataFusion with other query engines that support aggregate window functions, including DuckDB (list), Trino/Presto (array_agg), PostgreSQL (array_agg), BigQuery (ARRAY_AGG), and Snowflake.

Acceptance Criteria

  • Support collect_list(...) OVER (...)
  • Support collect_set(...) OVER (...)
  • Support standard window frames (ROWS and RANGE) where applicable
  • Add SQL and DataFrame tests covering common window specifications
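
The ROWS and RANGE criterion matters mostly when the ORDER BY column has ties. The sketch below, again plain Python rather than DataFusion code, shows how the two frame types would differ for collect_list with a frame of UNBOUNDED PRECEDING to CURRENT ROW under standard SQL semantics. The function names and the sample partition are hypothetical, and the input is assumed to be a single partition already sorted by ts.

```python
def collect_list_rows(ordered):
    """ROWS frame: each row sees every physical row up to and including itself."""
    return [[event for _, event in ordered[: i + 1]] for i in range(len(ordered))]


def collect_list_range(ordered):
    """RANGE frame: each row also sees its peers, i.e. all rows with ts <= its own ts."""
    return [[event for t, event in ordered if t <= ts] for ts, _ in ordered]


# One partition, sorted by ts; the two rows at ts = 2 are peers.
partition = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]

rows_frames = collect_list_rows(partition)
range_frames = collect_list_range(partition)
```

With ROWS, the row holding "b" collects a and b. With RANGE, it collects a, b, and c, because the peer row at the same ts falls inside the frame. Tests for the new functions would likely want at least one case with tied ordering values to cover this difference.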

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Labels

enhancement: New feature or request