Spark DataSource backed by a DataFusion TableProvider over ADBC #112

@timsaucer

Is your feature request related to a problem or challenge?

Spark users want to read data from a DataFusion TableProvider as a native Spark DataSourceV2. Today there is no first-class path: the options are either a bespoke per-operation JNI surface (more native code to maintain) or copying data out of process.

Describe the solution you'd like

A Spark DataSourceV2 connector that places the native boundary at a standard ADBC driver. Spark talks to the upstream arrow-adbc Java driver manager (adbc-core + adbc-driver-jni), which loads a native DataFusion ADBC cdylib and returns arrow-java ArrowReaders. Their batches are consumed zero-copy as ArrowColumnVectors, using the Arrow libraries the cluster already provides. This reuses the upstream ADBC bindings rather than reimplementing them.

Scope:

  • adbc-datafusion format registered as a DataSourceV2; schema probed on the driver.
  • Projection / filter / limit pushdown via Substrait, with a SQL fallback.
  • Multi-partition reads (executePartitioned / readPartition) and a target_partitions option.
  • Per-executor connection pool to amortize driver/database setup across task slots.
  • An example DataFusion ADBC driver cdylib plus end-to-end (PySpark) coverage.
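
Two of the scope items above, the SQL fallback for pushdown and the per-executor connection pool, can be illustrated with a small plain-Python sketch. This is not the code from #111: `build_fallback_sql`, `ConnectionPool`, and their parameters are hypothetical names chosen for illustration, filters are assumed to arrive as already-rendered SQL predicates, and the pool takes an arbitrary `opener` callable in place of a real ADBC driver/database/connection setup.

```python
import threading
from typing import Callable, List, Optional, Sequence


def build_fallback_sql(table: str,
                       columns: Sequence[str],
                       filters: Sequence[str],
                       limit: Optional[int]) -> str:
    """Render pushed-down projection/filter/limit as one SQL query.

    Hypothetical helper: shows the idea of a SQL fallback only.
    """
    # An empty projection is treated as "all columns".
    select_list = ", ".join(f'"{c}"' for c in columns) if columns else "*"
    sql = f'SELECT {select_list} FROM "{table}"'
    if filters:
        # Assumption: each filter is already a valid SQL predicate string.
        sql += " WHERE " + " AND ".join(f"({f})" for f in filters)
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql


class ConnectionPool:
    """Process-wide pool that reuses connections across tasks.

    Hypothetical sketch: `opener` stands in for the expensive
    driver/database/connection setup.
    """

    def __init__(self, opener: Callable[[], object]) -> None:
        self._opener = opener
        self._idle: List[object] = []
        self._lock = threading.Lock()
        self.opened = 0  # number of connections actually created

    def acquire(self) -> object:
        with self._lock:
            if self._idle:
                # Reuse an idle connection instead of paying setup cost again.
                return self._idle.pop()
            self.opened += 1
        # Open outside the lock so slow setup does not block other tasks.
        return self._opener()

    def release(self, conn: object) -> None:
        with self._lock:
            self._idle.append(conn)
```

For example, `build_fallback_sql("t", ["a", "b"], ["a > 1"], 10)` yields `SELECT "a", "b" FROM "t" WHERE (a > 1) LIMIT 10`. With the pool, a task that calls `acquire()` after another task has called `release()` gets the same connection back, so setup cost is paid once per idle connection rather than once per task.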

Describe alternatives you've considered

A plain-C scan ABI plus a hand-written JNI shim (discussed on #103 / #104). The ADBC approach instead reuses standard, separately reviewed bindings and a stable driver contract.

Additional context

Implemented in #111.
