fix: Spark-compatible HALF_UP rounding for round() on float types by sjhddh · Pull Request #22813 · apache/datafusion

sjhddh · 2026-06-08T00:09:03Z

Which issue does this PR close?

Rationale for this change

round_float used naive binary-float arithmetic (value * factor).round() / factor, which diverges from Apache Spark's RoundBase. Spark evaluates BigDecimal(d).setScale(scale, HALF_UP), and BigDecimal(Double) parses the shortest round-trip decimal string of the double. As a result:

SELECT round(1.255::double, 2::int);  -- Spark 1.26, was 1.25
SELECT round(1.005::double, 2::int);  -- Spark 1.01, was 1.0
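
To make the divergence concrete, here is a minimal std-only Rust sketch. It is not the PR's implementation (which goes through `BigDecimal`): `round_naive` mirrors the old multiply/round/divide formula, while `round_half_up_via_string` is a hypothetical helper that applies HALF_UP to the digits of the shortest round-trip string and, as a simplifying assumption, only handles non-negative scales.

```rust
/// Old approach: scale up, round, and scale back down in binary floating point.
fn round_naive(value: f64, scale: i32) -> f64 {
    let factor = 10f64.powi(scale);
    (value * factor).round() / factor
}

/// Hypothetical helper (not from the PR): HALF_UP rounding applied to the
/// decimal digits of the shortest round-trip string. Non-negative scales only.
fn round_half_up_via_string(value: f64, scale: usize) -> f64 {
    // NaN and ±Inf have no decimal digits; pass them through unchanged.
    if !value.is_finite() {
        return value;
    }
    // Rust's Display for f64 prints the shortest digits that round-trip,
    // without an exponent, e.g. "1.255".
    let text = value.abs().to_string();
    let (int_part, frac_part) = text.split_once('.').unwrap_or((text.as_str(), ""));
    if frac_part.len() <= scale {
        return value; // nothing to round away
    }
    // Keep the integer digits plus `scale` fractional digits.
    let mut digits: Vec<u8> = int_part
        .bytes()
        .chain(frac_part.bytes().take(scale))
        .map(|b| b - b'0')
        .collect();
    // HALF_UP: a first dropped digit of 5 or more rounds the magnitude up.
    if frac_part.as_bytes()[scale] >= b'5' {
        let mut i = digits.len();
        loop {
            if i == 0 {
                digits.insert(0, 1); // carry out of the top digit: 9.99 -> 10.0
                break;
            }
            i -= 1;
            if digits[i] == 9 {
                digits[i] = 0;
            } else {
                digits[i] += 1;
                break;
            }
        }
    }
    // Rebuild the decimal string and parse it back to the nearest f64.
    let point = digits.len() - scale;
    let mut out = String::new();
    if value.is_sign_negative() {
        out.push('-');
    }
    for (idx, digit) in digits.iter().enumerate() {
        if idx == point {
            out.push('.');
        }
        out.push((b'0' + digit) as char);
    }
    out.parse::<f64>().expect("digits form a valid decimal literal")
}
```

With these definitions, `round_naive(1.255, 2)` yields 1.25 because `1.255 * 100.0` lands just below 125.5 in binary, whereas `round_half_up_via_string(1.255, 2)` sees the digits "1.255" and yields 1.26.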

What changes are included in this PR?

Reimplement round_float to mirror Spark: widen to f64 (matching Spark's f.toDouble for FloatType), pass NaN/Inf through unchanged, then round via a BigDecimal built from the value's shortest-string representation with RoundingMode::HalfUp (ties away from zero).
scale is clamped to ±340 before constructing the decimal. A finite f64 carries at most ~324 fractional digits and its magnitude stays below ~1e309, so a scale beyond those bounds is already a no-op (large positive) or collapses the value to zero (large negative). This also avoids an unbounded 10^scale BigInt allocation on adversarial input like round(x, i32::MAX).
Unit tests for the divergent cases, regression guards (2.675, 8.35), negative values and scales, ties-away-from-zero, NaN/Inf pass-through, and bounded extreme scales.
sqllogictest coverage for the double path.
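
The scale clamp described above can be sketched in isolation. The constant name `MAX_ABS_SCALE` and the helpers `clamp_scale` and `fractional_digits` are hypothetical names chosen for illustration; only the `i64::from(scale).clamp(-340, 340)` expression comes from the PR.

```rust
/// Hypothetical constant mirroring the PR's ±340 bound.
const MAX_ABS_SCALE: i64 = 340;

/// Clamp a user-supplied scale before any 10^scale value is built, so that
/// input such as `i32::MAX` cannot trigger a huge allocation.
fn clamp_scale(scale: i32) -> i64 {
    i64::from(scale).clamp(-MAX_ABS_SCALE, MAX_ABS_SCALE)
}

/// Count the fractional digits in the shortest round-trip string of `value`.
fn fractional_digits(value: f64) -> usize {
    let text = value.to_string();
    text.split_once('.').map_or(0, |(_, frac)| frac.len())
}
```

As a sanity check on the bound, `fractional_digits(f64::from_bits(1))` (the smallest positive subnormal, about 5e-324) and `f64::MAX_10_EXP` both fall inside the ±340 window, which is why the clamp should not change the result for a finite input.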

Are these changes tested?

Yes. New unit tests in round.rs and new cases in spark/math/round.slt. The full datafusion-spark test suite passes and clippy is clean locally.

Are there any user-facing changes?

round() on float/double now matches Spark at the half-way point. This is a correctness fix; results that previously diverged from Spark will change.

Note on performance: the BigDecimal path is heavier than the prior multiply/divide. If maintainers prefer, I'm happy to add a fast path for the common small-scale case and fall back to BigDecimal only when needed — let me know your preference.

`round_float` used naive binary-float arithmetic `(value * factor).round() / factor`, which diverges from Apache Spark's `RoundBase`. Spark evaluates `BigDecimal(d).setScale(scale, HALF_UP)` where `BigDecimal(Double)` parses the shortest round-trip decimal string, so e.g. `round(1.255, 2)` is 1.26 in Spark but produced 1.25 here (and `round(1.005, 2)` gave 1.0 instead of 1.01).

Reimplement `round_float` to match Spark: widen to f64 (mirrors Spark's `f.toDouble` for FloatType), guard NaN/Inf as pass-through, then round via `BigDecimal` built from the value's shortest-string representation using HALF_UP. The function's existing doc comment already described this BigDecimal/HALF_UP behaviour; the code now matches it.

`scale` is clamped to +/-340 before constructing the decimal: a finite f64 carries at most ~324 fractional digits and saturates above ~1e309, so any larger magnitude is a no-op or collapses to zero. This also prevents an unbounded `10^scale` BigInt allocation on adversarial input such as `round(x, i32::MAX)`.

Add unit tests for the divergent cases, regression guards, negative values and scales, ties-away-from-zero, NaN/Inf, and bounded extreme scales; add sqllogictest coverage for the double path.

Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>

Jefffrey · 2026-06-09T01:43:15Z

+    // A finite f64 carries at most ~324 fractional decimal digits and saturates
+    // below ~1e309 in magnitude, so any `scale` past those bounds is already a
+    // no-op (large positive) or collapses the value to zero (large negative).
+    // Clamp before `with_scale_round` so adversarial input such as
+    // `round(x, i32::MAX)` cannot drive an unbounded `10^scale` BigInt
+    // allocation. The clamp is exact for every finite f64.
+    let clamped_scale = i64::from(scale).clamp(-340, 340);
+


We might need to error if following Spark semantics here?

>>> spark.version
'4.1.2'
>>> spark.sql("select round(1.255::double, 2147483647)").show()
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    spark.sql("select round(1.255::double, 2147483647)").show()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/pyspark/sql/classic/dataframe.py", line 285, in show
    print(self._show_string(n, truncate, vertical))
          ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/pyspark/sql/classic/dataframe.py", line 303, in _show_string
    return self._jdf.showString(n, 20, vertical)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/py4j/java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
        answer, self.gateway_client, self.target_id, self.name)
  File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/pyspark/errors/exceptions/captured.py", line 269, in deco
    raise converted from None
pyspark.errors.exceptions.captured.ArithmeticException: BigInteger would overflow supported range

Good observation. Spark 4.1.2 has ANSI mode on by default; in DataFusion we have only just started to support it.

Jefffrey · 2026-06-09T01:44:14Z

+    // Widen to f64 first. For f32 inputs this matches Spark's `f.toDouble`
+    // step (FloatType: `BigDecimal(f.toDouble).setScale(..).toFloat`), which
+    // exposes the binary-float error before rounding. For f64 it is a no-op.
+    let Some(d) = value.to_f64() else {


we could also do it like so

fn round_float<T: num_traits::Float + Into<f64>>(value: T, scale: i32) -> T {
    // Spark returns NaN / ±Inf unchanged; BigDecimal cannot represent them.
    if !value.is_finite() {
        return value;
    }

    // Spark always widens f32: `BigDecimal(f.toDouble).setScale(..).toFloat`
    // This exposes the binary-float error before rounding.
    let d: f64 = value.into();

can check finiteness without f64 conversion

can cast from f32 to f64 without loss, so we don't need the Option unwrapping

streamline comment
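
The lossless-widening point can be illustrated with a small std-only sketch. The helper name `widen` is hypothetical; it relies only on the standard `Into<f64>` conversions that exist for `f32` and `f64`, so no `Option` is involved.

```rust
/// Hypothetical helper: widen any float that converts losslessly into f64.
/// `f32: Into<f64>` is infallible, so there is nothing to unwrap.
fn widen<T: Into<f64>>(value: T) -> f64 {
    value.into()
}

/// Finiteness can be checked on the original f32 before any conversion.
fn is_roundable(value: f32) -> bool {
    value.is_finite()
}
```

Widening `0.1_f32` produces an f64 that differs from `0.1_f64`, which is the "binary-float error" the comment refers to, yet narrowing that f64 back to f32 recovers the original value exactly.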

Jefffrey · 2026-06-09T01:46:52Z

+    // `d.to_string()` produces the shortest round-trip decimal string, matching
+    // Scala's `BigDecimal(d) = java.math.BigDecimal.valueOf(d)` semantics. So
+    // `round(1.255_f64, 2)` parses "1.255" and rounds to 1.26 (not the naive
+    // binary-float 1.25).
+    let Ok(bd) = BigDecimal::from_str(&d.to_string()) else {
+        // Should not happen for a finite f64, but fall back gracefully.
+        return value;
+    };


Something I find interesting is that the Spark code for this apparently differs a bit between the two evaluation paths. For nullSafeEval:

case DoubleType =>
  val d = input1.asInstanceOf[Double]
  if (d.isNaN || d.isInfinite) {
    d
  } else {
    BigDecimal(d).setScale(_scale, mode).toDouble
  }

https://github.com/apache/spark/blob/0993d4345969dfe16b334598dc80a452e4a270f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1609-L1615

uses BigDecimal(double val)

meanwhile for doCodeGen:

case DoubleType => // if child eval to NaN or Infinity, just return it.
  s"""
    if (Double.isNaN(${ce.value}) || Double.isInfinite(${ce.value})) {
      ${ev.value} = ${ce.value};
    } else {
      ${ev.value} = java.math.BigDecimal.valueOf(${ce.value}).
        setScale(${_scale}, java.math.BigDecimal.${modeStr}).doubleValue();
    }"""

https://github.com/apache/spark/blob/0993d4345969dfe16b334598dc80a452e4a270f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1670-L1678

uses BigDecimal.valueOf(double val)

do we need to consider this?
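
For context on why the constructor matters: Java's `new BigDecimal(double)` yields the exact decimal expansion of the stored binary value, while `BigDecimal.valueOf(double)` parses the `Double.toString` text. The std-only Rust sketch below shows the two kinds of expansion side by side. The function names are hypothetical, and Rust's formatting is only a loose analogue of the Java behaviour, not a reproduction of it.

```rust
/// Exact decimal expansion of the stored binary value, cut to `digits`
/// fractional places. Loosely analogous to Java's `new BigDecimal(double)`.
fn exact_expansion(value: f64, digits: usize) -> String {
    format!("{:.*}", digits, value)
}

/// Shortest decimal string that parses back to the same f64. Loosely
/// analogous to the `Double.toString` text that `BigDecimal.valueOf` parses.
fn shortest_round_trip(value: f64) -> String {
    value.to_string()
}
```

For 1.255 the exact expansion begins 1.25499999..., so HALF_UP at scale 2 would give 1.25, while the shortest string is "1.255" and gives 1.26. Whether the two Spark paths can actually disagree therefore depends on which behaviour Scala's `BigDecimal(d)` follows; the code comment in this PR states that it matches `valueOf`.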

comphead · 2026-06-09T02:44:24Z

I would like to check it out

kosiew

Looks good overall. I have one small suggestion that could make the invariant a bit clearer.

kosiew · 2026-06-09T09:02:40Z

+    // Scala's `BigDecimal(d) = java.math.BigDecimal.valueOf(d)` semantics. So
+    // `round(1.255_f64, 2)` parses "1.255" and rounds to 1.26 (not the naive
+    // binary-float 1.25).
+    let Ok(bd) = BigDecimal::from_str(&d.to_string()) else {


Since we've already guarded against non-finite f64 values, BigDecimal::from_str(&d.to_string()) should always succeed. Rust's display output for a finite float is valid decimal input for BigDecimal.

Returning the original value here makes that invariant a little less explicit and could potentially hide a future regression. Would it make sense to encode the assumption directly with something like:

let bd = BigDecimal::from_str(&d.to_string())
    .expect("finite f64 Display parses as BigDecimal");

Alternatively, a debug_assert! plus an explicit fallback could work if panicking is not desirable on this path.
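
A minimal sketch of the `debug_assert!` plus fallback alternative follows. To stay free of external crates it uses `str::parse::<f64>` as a stand-in for `BigDecimal::from_str`, and the function name `reparse_with_fallback` is hypothetical; only the pattern itself is the point.

```rust
/// Stand-in for `BigDecimal::from_str(&d.to_string())`: f64 parsing is used
/// here only so the sketch needs no external crate.
fn reparse_with_fallback(value: f64) -> f64 {
    let parsed = value.to_string().parse::<f64>();
    // Loud in debug and test builds if the invariant is ever broken.
    debug_assert!(
        parsed.is_ok(),
        "finite f64 Display should parse: {}",
        value
    );
    // Release builds still return the input instead of panicking.
    parsed.unwrap_or(value)
}
```

The trade-off is that a regression would surface in tests and debug builds through the assertion, while production queries keep returning the unrounded input rather than aborting.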

comphead · 2026-06-09T15:21:45Z

+    // Widen to f64 first. For f32 inputs this matches Spark's `f.toDouble`
+    // step (FloatType: `BigDecimal(f.toDouble).setScale(..).toFloat`), which
+    // exposes the binary-float error before rounding. For f64 it is a no-op.
+    let Some(d) = value.to_f64() else {


Would appreciate it if we could name the variables more meaningfully than d and bd.

comphead · 2026-06-09T15:24:45Z

    }
 }
+
+#[cfg(test)]


SLT tests should be enough.

We keep Rust tests for SQL functions only when SLT coverage is not sufficient, i.e. when SLT constraints prevent full test coverage.

alamb · 2026-06-16T19:19:19Z

@sjhddh can you please address @comphead's comments so we can merge this PR?

alamb · 2026-06-22T14:58:39Z

Marking as draft, as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look.

github-actions bot added the sqllogictest (SQL Logic Tests, .slt) and spark labels on Jun 8, 2026

Jefffrey reviewed Jun 9, 2026

comphead self-requested a review June 9, 2026 02:44

kosiew approved these changes Jun 9, 2026

comphead reviewed Jun 9, 2026

comphead mentioned this pull request Jun 9, 2026

feat: Support IEEE 754 negative zero semantics #22835

Merged

alamb marked this pull request as draft June 22, 2026 14:58

sjhddh wants to merge 1 commit into apache:main from sjhddh:fix/spark-round-float-halfup

6 participants
