fix: Spark-compatible HALF_UP rounding for round() on float types#22813
fix: Spark-compatible HALF_UP rounding for round() on float types#22813sjhddh wants to merge 1 commit into
Conversation
`round_float` used naive binary-float arithmetic `(value * factor).round() / factor`, which diverges from Apache Spark's `RoundBase`. Spark evaluates `BigDecimal(d).setScale(scale, HALF_UP)` where `BigDecimal(Double)` parses the shortest round-trip decimal string, so e.g. `round(1.255, 2)` is 1.26 in Spark but produced 1.25 here (and `round(1.005, 2)` gave 1.0 instead of 1.01). Reimplement `round_float` to match Spark: widen to f64 (mirrors Spark's `f.toDouble` for FloatType), guard NaN/Inf as pass-through, then round via `BigDecimal` built from the value's shortest-string representation using HALF_UP. The function's existing doc comment already described this BigDecimal/HALF_UP behaviour; the code now matches it. `scale` is clamped to +/-340 before constructing the decimal: a finite f64 carries at most ~324 fractional digits and saturates above ~1e309, so any larger magnitude is a no-op or collapses to zero. This also prevents an unbounded `10^scale` BigInt allocation on adversarial input such as `round(x, i32::MAX)`. Add unit tests for the divergent cases, regression guards, negative values and scales, ties-away-from-zero, NaN/Inf, and bounded extreme scales; add sqllogictest coverage for the double path. Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>
| // A finite f64 carries at most ~324 fractional decimal digits and saturates | ||
| // below ~1e309 in magnitude, so any `scale` past those bounds is already a | ||
| // no-op (large positive) or collapses the value to zero (large negative). | ||
| // Clamp before `with_scale_round` so adversarial input such as | ||
| // `round(x, i32::MAX)` cannot drive an unbounded `10^scale` BigInt | ||
| // allocation. The clamp is exact for every finite f64. | ||
| let clamped_scale = i64::from(scale).clamp(-340, 340); | ||
|
|
There was a problem hiding this comment.
We might need to error if following Spark semantics here?
>>> spark.version
'4.1.2'
>>> spark.sql("select round(1.255::double, 2147483647)").show()
Traceback (most recent call last):
File "<python-input-4>", line 1, in <module>
spark.sql("select round(1.255::double, 2147483647)").show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/pyspark/sql/classic/dataframe.py", line 285, in show
print(self._show_string(n, truncate, vertical))
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/pyspark/sql/classic/dataframe.py", line 303, in _show_string
return self._jdf.showString(n, 20, vertical)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/py4j/java_gateway.py", line 1362, in __call__
return_value = get_return_value(
answer, self.gateway_client, self.target_id, self.name)
File "/Users/jeffrey/.cache/uv/archive-v0/GIQgMkXRrHZBaiUVcMOta/lib/python3.13/site-packages/pyspark/errors/exceptions/captured.py", line 269, in deco
raise converted from None
pyspark.errors.exceptions.captured.ArithmeticException: BigInteger would overflow supported rangeThere was a problem hiding this comment.
Good observation, Spark 4.1.2 has ANSI mode ON by default, in Datafusion we just started to support it.
| // Widen to f64 first. For f32 inputs this matches Spark's `f.toDouble` | ||
| // step (FloatType: `BigDecimal(f.toDouble).setScale(..).toFloat`), which | ||
| // exposes the binary-float error before rounding. For f64 it is a no-op. | ||
| let Some(d) = value.to_f64() else { |
There was a problem hiding this comment.
we could also do it like so
fn round_float<T: num_traits::Float + Into<f64>>(value: T, scale: i32) -> T {
// Spark returns NaN / ±Inf unchanged; BigDecimal cannot represent them.
if !value.is_finite() {
return value;
}
// Spark always widens f32: `BigDecimal(f.toDouble).setScale(..).toFloat`
// This exposes the binary-float error before rounding.
let d: f64 = value.into();- can check finiteness without f64 coversion
- can cast from f32 to f64 without loss so dont need the option unwrapping
- streamline comment
| // `d.to_string()` produces the shortest round-trip decimal string, matching | ||
| // Scala's `BigDecimal(d) = java.math.BigDecimal.valueOf(d)` semantics. So | ||
| // `round(1.255_f64, 2)` parses "1.255" and rounds to 1.26 (not the naive | ||
| // binary-float 1.25). | ||
| let Ok(bd) = BigDecimal::from_str(&d.to_string()) else { | ||
| // Should not happen for a finite f64, but fall back gracefully. | ||
| return value; | ||
| }; |
There was a problem hiding this comment.
something i find interesting is apparently the spark code for this differs a bit. for nullSafeEval:
case DoubleType =>
val d = input1.asInstanceOf[Double]
if (d.isNaN || d.isInfinite) {
d
} else {
BigDecimal(d).setScale(_scale, mode).toDouble
}- https://github.com/apache/spark/blob/0993d4345969dfe16b334598dc80a452e4a270f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1609-L1615
- uses
BigDecimal(double val)
meanwhile for doCodeGen:
case DoubleType => // if child eval to NaN or Infinity, just return it.
s"""
if (Double.isNaN(${ce.value}) || Double.isInfinite(${ce.value})) {
${ev.value} = ${ce.value};
} else {
${ev.value} = java.math.BigDecimal.valueOf(${ce.value}).
setScale(${_scale}, java.math.BigDecimal.${modeStr}).doubleValue();
}"""- https://github.com/apache/spark/blob/0993d4345969dfe16b334598dc80a452e4a270f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1670-L1678
- uses
BigDecimal.valueOf(double val)
do we need to consider this?
|
I would like to check it out |
kosiew
left a comment
There was a problem hiding this comment.
Looks good overall. I have one small suggestion that could make the invariant a bit clearer.
| // Scala's `BigDecimal(d) = java.math.BigDecimal.valueOf(d)` semantics. So | ||
| // `round(1.255_f64, 2)` parses "1.255" and rounds to 1.26 (not the naive | ||
| // binary-float 1.25). | ||
| let Ok(bd) = BigDecimal::from_str(&d.to_string()) else { |
There was a problem hiding this comment.
Since we've already guarded against non-finite f64 values, BigDecimal::from_str(&d.to_string()) should always succeed. Rust's display output for a finite float is valid decimal input for BigDecimal.
Returning the original value here makes that invariant a little less explicit and could potentially hide a future regression. Would it make sense to encode the assumption directly with something like:
let bd = BigDecimal::from_str(&d.to_string())
.expect("finite f64 Display parses as BigDecimal");Alternatively, a debug_assert! plus an explicit fallback could work if panicking is not desirable on this path.
| // Widen to f64 first. For f32 inputs this matches Spark's `f.toDouble` | ||
| // step (FloatType: `BigDecimal(f.toDouble).setScale(..).toFloat`), which | ||
| // exposes the binary-float error before rounding. For f64 it is a no-op. | ||
| let Some(d) = value.to_f64() else { |
There was a problem hiding this comment.
appreciate if we can name vars more meaningfully than d, bd
| } | ||
| } | ||
|
|
||
| #[cfg(test)] |
There was a problem hiding this comment.
SLT tests should be enough.
We keep rust tests for SQL functions only if SLT coverage is not sufficient and SLT constraints full test coverage
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
Which issue does this PR close?
Closes #22812.
Rationale for this change
round_floatused naive binary-float arithmetic(value * factor).round() / factor, which diverges from Apache Spark'sRoundBase. Spark evaluatesBigDecimal(d).setScale(scale, HALF_UP), andBigDecimal(Double)parses the shortest round-trip decimal string of the double. As a result:What changes are included in this PR?
round_floatto mirror Spark: widen tof64(matching Spark'sf.toDoubleforFloatType), pass NaN/Inf through unchanged, then round via aBigDecimalbuilt from the value's shortest-string representation withRoundingMode::HalfUp(ties away from zero).scaleis clamped to ±340 before constructing the decimal. A finitef64carries at most ~324 fractional digits and saturates above ~1e309, so any larger magnitude is already a no-op or collapses to zero. This also avoids an unbounded10^scaleBigIntallocation on adversarial input likeround(x, i32::MAX).2.675,8.35), negative values and scales, ties-away-from-zero, NaN/Inf pass-through, and bounded extreme scales.Are these changes tested?
Yes. New unit tests in
round.rsand new cases inspark/math/round.slt. The fulldatafusion-sparktest suite passes andclippyis clean locally.Are there any user-facing changes?
round()onfloat/doublenow matches Spark at the half-way point. This is a correctness fix; results that previously diverged from Spark will change.Note on performance: the
BigDecimalpath is heavier than the prior multiply/divide. If maintainers prefer, I'm happy to add a fast path for the common small-scale case and fall back toBigDecimalonly when needed — let me know your preference.