Is your feature request related to a problem or challenge?
Noticed while reviewing this PR from @pantShrey
`SpillFile` currently has an async-style API for reading spill data, but writing spill data is synchronous.
This is somewhat awkward for spill backends that are naturally async, such as remote object stores. For example, when writing spill files to S3, GCS, Azure, or another `object_store` implementation, uploads are async operations, but the current `SpillWriter` API requires adapting them to a blocking `std::io::Write` interface.
I found this while working on an example showing how to write spill files to an object store:
Describe the solution you'd like
Add an async spill writing API so custom spill backends can write data without forcing async storage systems through a synchronous `Write` abstraction.
For example, DataFusion could introduce an async writer trait, or use an existing async I/O trait if there is a good fit.
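To make the idea concrete, here is a minimal sketch of what an async spill writer could look like. Everything in it is hypothetical: `AsyncSpillWriter`, `PartUploadWriter`, `InMemoryPartStore`, and their methods are illustrative names, not DataFusion or `object_store` APIs. The in-memory store stands in for a remote backend, and the small `block_on` helper exists only so the example runs without an async runtime. It assumes Rust 1.75 or later for `async fn` in traits.

```rust
use std::future::Future;
use std::io;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical async counterpart to a synchronous spill writer.
/// Not DataFusion's API; the names are assumptions for illustration.
#[allow(async_fn_in_trait)]
pub trait AsyncSpillWriter {
    /// Append bytes to the spill file, awaiting the backend when needed.
    async fn write_all(&mut self, buf: &[u8]) -> io::Result<()>;
    /// Upload any remaining bytes and finalize the spill file.
    async fn finish(&mut self) -> io::Result<()>;
}

/// Stand-in for a remote store that accepts the parts of one upload.
#[derive(Default)]
pub struct InMemoryPartStore {
    pub parts: Vec<Vec<u8>>,
}

impl InMemoryPartStore {
    /// A real backend would perform network I/O here.
    async fn put_part(&mut self, part: Vec<u8>) -> io::Result<()> {
        self.parts.push(part);
        Ok(())
    }
}

/// Uploads a part whenever `part_size` bytes are buffered, so memory use
/// stays bounded by one part instead of the whole spill file.
pub struct PartUploadWriter {
    pub store: InMemoryPartStore,
    buffer: Vec<u8>,
    part_size: usize,
}

impl PartUploadWriter {
    pub fn new(part_size: usize) -> Self {
        assert!(part_size > 0, "part_size must be non-zero");
        Self { store: InMemoryPartStore::default(), buffer: Vec::new(), part_size }
    }
}

impl AsyncSpillWriter for PartUploadWriter {
    async fn write_all(&mut self, mut buf: &[u8]) -> io::Result<()> {
        while !buf.is_empty() {
            // Copy as much as fits into the current part.
            let take = (self.part_size - self.buffer.len()).min(buf.len());
            self.buffer.extend_from_slice(&buf[..take]);
            buf = &buf[take..];
            if self.buffer.len() == self.part_size {
                let part = std::mem::take(&mut self.buffer);
                self.store.put_part(part).await?;
            }
        }
        Ok(())
    }

    async fn finish(&mut self) -> io::Result<()> {
        if !self.buffer.is_empty() {
            let part = std::mem::take(&mut self.buffer);
            self.store.put_part(part).await?;
        }
        Ok(())
    }
}

/// Waker that does nothing; sufficient because the futures here never park.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor that polls the future until it is ready.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Writes ten bytes in two calls with a part size of four bytes.
pub fn demo_parts() -> io::Result<Vec<Vec<u8>>> {
    let mut writer = PartUploadWriter::new(4);
    block_on(async {
        writer.write_all(b"abcdef").await?;
        writer.write_all(b"ghij").await?;
        writer.finish().await
    })?;
    Ok(writer.store.parts)
}
```

In this sketch the writer never holds more than `part_size` bytes, which is the property the proposal is after. A real implementation would likely run on the query's existing async runtime rather than a hand-rolled `block_on`.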
Suggested steps:
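`datafusion-execution` `SpillFile`
Acceptance criteria:
Remote/object-store spill implementations do not need to buffer all spill bytes in memory just to bridge from synchronous `Write` to async upload APIs
Existing local disk spilling continues to work
The public API change is documented in the upgrading guide if needed
Describe alternatives you've considered
Keep the current synchronous `SpillWriter` API. This works well for local files, but makes remote spill backends harder to implement efficiently because they must either block on async uploads or buffer data and upload it later.
Add only an object-store-specific spill implementation. That would help one backend, but the underlying mismatch is in the spill file abstraction itself.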
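For comparison, the "buffer data and upload it later" workaround might look roughly like the sketch below. `BufferingSpillWriter`, `buffered_len`, and `finish_with` are hypothetical names, not DataFusion or `object_store` APIs, and the closure passed to `finish_with` stands in for whatever call eventually performs the upload. The only real interface used is `std::io::Write`.

```rust
use std::io::{self, Write};

/// Hypothetical adapter: satisfies a blocking `Write` interface by holding
/// all spill bytes in memory until the upload can happen.
#[derive(Default)]
pub struct BufferingSpillWriter {
    buffer: Vec<u8>,
}

impl Write for BufferingSpillWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // No upload happens here; the bytes are only accumulated.
        self.buffer.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // Nothing to flush: no bytes leave memory until `finish_with`.
        Ok(())
    }
}

impl BufferingSpillWriter {
    /// Bytes currently held in memory.
    pub fn buffered_len(&self) -> usize {
        self.buffer.len()
    }

    /// Hands the complete payload to `upload`, which stands in for a
    /// (possibly blocking) call into an async object store client.
    pub fn finish_with<F>(self, upload: F) -> io::Result<()>
    where
        F: FnOnce(Vec<u8>) -> io::Result<()>,
    {
        upload(self.buffer)
    }
}

/// Writes three 1 KiB batches and returns (peak buffered bytes, uploaded bytes).
pub fn demo_buffering() -> io::Result<(usize, usize)> {
    let mut writer = BufferingSpillWriter::default();
    let batch = [0u8; 1024];
    let mut peak = 0usize;
    for _ in 0..3 {
        writer.write_all(&batch)?;
        peak = peak.max(writer.buffered_len());
    }
    let mut uploaded = 0usize;
    writer.finish_with(|bytes| {
        uploaded = bytes.len();
        Ok(())
    })?;
    Ok((peak, uploaded))
}
```

Here the peak buffered size equals the total number of bytes written, which is the cost the first acceptance criterion aims to avoid for large spill files.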
Additional context
Related PRs / comments: