Skip to content

Widen narrow integer operands in PPL +/-/* to prevent overflow#5603

Open
ahkcs wants to merge 1 commit into
opensearch-project:mainfrom
ahkcs:pr/integer-arithmetic-overflow-widening
Open

Widen narrow integer operands in PPL +/-/* to prevent overflow#5603
ahkcs wants to merge 1 commit into
opensearch-project:mainfrom
ahkcs:pr/integer-arithmetic-overflow-widening

Conversation

@ahkcs

@ahkcs ahkcs commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Description

Integer arithmetic (+, -, *) on narrow integer types silently overflowed because the operators kept Calcite's LEAST_RESTRICTIVE result type (SMALLINT * SMALLINT -> SMALLINT, INTEGER * INTEGER -> INTEGER). For example eval area = ResolutionWidth * ResolutionHeight (both short) computed 28,777,836 but wrapped into the 16-bit range, so where area > 2000000 returned 0 rows. The reported workaround was cast(... as long).

This widens narrow integer operands before the operation so the result cannot overflow the (mis-inferred) narrow type:

operands result
byte/short int
any int/long involved long (bigint)
any float/double/decimal/datetime unchanged (defers to Calcite)

The widening is a CAST injected on the operands in the SQL-plugin lowering (PPLFuncImpTable), applied to +, -, * and the named add/subtract/multiply functions. The string-concat + and the DATETIME - DATETIME variant are unchanged. Arithmetic inside higher-order-function lambda bodies (reduce/mvmap) is left unwidened — those operands are lambda-parameter placeholders resolved by the HOF's own type inference, and they execute JVM-side where operands already promote to int.

Why widening (and not error-on-overflow)

An alternative is to detect overflow and error (Calcite's CHECKED_PLUS/CHECKED_MINUS/CHECKED_MULTIPLYMath.addExact, explored in #5604). Widening was chosen because it is the only approach that fixes both execution engines from a single change:

  • The CAST is baked into the RelNode before the engine fork. PPL runs on two paths — the Calcite/Enumerable engine and the analytics-engine (DataFusion) backend. Because widening happens in PPLFuncImpTable (shared lowering, upstream of the fork), the widened operand types flow through to both — including through Substrait to DataFusion. Verified live: short * short (30000×30000) returns int 900000000 on the Calcite path and on the DataFusion route (profile confirms *(CAST($n):INTEGER, ...) with backends=[datafusion]); the reported | where area > 2000000 now returns the matching row on both.
  • Error-on-overflow cannot be done consistently across engines. The checked operators are a Calcite-codegen mechanism (Math.addExact in generated Java). CHECKED_* has no Substrait representation, so on the analytics-engine route the arithmetic executes natively in DataFusion and the checked rewrite never reaches it — leaving the overflow unfixed there. Making DataFusion error would require separate backend work (a custom Rust checked-arithmetic UDF). So erroring would need two independent implementations (Calcite checked-arith + a DataFusion UDF) and still diverge; widening needs one and is consistent.

Widening fully resolves the reported short/int overflow on both engines. It does not address long * long overflow (no wider primitive type) — that residual is out of scope here and, if needed, is a separate error-on-overflow effort.

Verification

query before after (both engines)
eval area = short * short (30000²) wraps (smallint, negative) int 900000000
eval area = short * short | where area > 2000000 0 rows 1 row
eval x = int * int (2e9²) wraps bigint, correct
test result
CalcitePPLBuiltinFunctionIT widening ITs (new) pass
CalcitePPLEvalTest/ArrayFunctionTest/DedupTest/etc. + UnifiedQueryPlannerSqlV2Test (plan snapshots) pass (updated for widening CAST)
CalciteExplainIT / CalciteNoPushdownIT (explain fixtures) pass (regenerated)
collection.md doctest (reduce/mvmap unaffected) pass

Related Issues

Resolves the "eval integer arithmetic overflow (no type widening)" bug. See #5164 / #5604 for the alternative error-on-overflow approach (Calcite-path only).

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • Commits are signed per the DCO using --signoff or -s.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit a2e7706)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The coerceNumericComparisonOperands method casts both sides of a comparison to their least-restrictive common integer type when one side is a scalar subquery. However, the method does not verify that the common type returned by leastRestrictive is actually an integer type. If leastRestrictive returns a non-integer type (e.g., DOUBLE when mixing INT and DOUBLE), the cast proceeds anyway, potentially causing incorrect comparisons or runtime errors. This can occur when a subquery returns a floating-point result compared against an integer column.

private static RexNode[] coerceNumericComparisonOperands(
    RexBuilder builder, BuiltinFunctionName functionName, RexNode... args) {
  if (args.length != 2 || !BuiltinFunctionName.COMPARATORS.contains(functionName)) {
    return args;
  }
  if (!containsSubQuery(args[0]) && !containsSubQuery(args[1])) {
    return args;
  }
  RelDataType leftType = args[0].getType();
  RelDataType rightType = args[1].getType();
  if (!SqlTypeName.INT_TYPES.contains(leftType.getSqlTypeName())
      || !SqlTypeName.INT_TYPES.contains(rightType.getSqlTypeName())
      || leftType.getSqlTypeName() == rightType.getSqlTypeName()) {
    return args;
  }
  RelDataType commonType =
      builder.getTypeFactory().leastRestrictive(List.of(leftType, rightType));
  if (commonType == null) {
    return args;
  }
  return new RexNode[] {
    builder.makeCast(
        TYPE_FACTORY.createTypeWithNullability(commonType, leftType.isNullable()), args[0]),
    builder.makeCast(
        TYPE_FACTORY.createTypeWithNullability(commonType, rightType.isNullable()), args[1])
  };
}
Possible Issue

The coerce method performs numeric conversions without checking for overflow or precision loss. For example, converting a BIGINT value to TINYINT via num.byteValue() silently truncates the value if it exceeds the byte range (-128 to 127). This can produce incorrect array elements when widened integer operands (e.g., from a BIGINT arithmetic result) are coerced down to a narrower target element type. The issue arises when mvappend is called with mixed-width integer operands and the inferred element type is narrower than the actual operand values.

private static Object coerce(Object value, SqlTypeName elementType) {
  if (elementType == null || !(value instanceof Number)) {
    return value;
  }
  Number num = (Number) value;
  return switch (elementType) {
    case TINYINT -> num.byteValue();
    case SMALLINT -> num.shortValue();
    case INTEGER -> num.intValue();
    case BIGINT -> num.longValue();
    case FLOAT, REAL -> num.floatValue();
    case DOUBLE -> num.doubleValue();
    case DECIMAL -> num instanceof BigDecimal ? num : BigDecimal.valueOf(num.doubleValue());
    default -> value;
  };
}

@RyanL1997 RyanL1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ahkcs , thanks for the change, and here are somer reviews.

// Register ADDFUNCTION for numeric addition only. Widen narrow integer operands so the sum
// cannot overflow the (mis-)inferred SMALLINT/TINYINT result type; see
// registerWideningIntegerOperator.
registerWideningIntegerOperator(ADDFUNCTION, SqlStdOperatorTable.PLUS);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this means the behavior-change note in the PR description applies to SQL too, not just PPL eval?

RelNode root = getRelNode(ppl);
String expectedLogical =
"LogicalProject(EMPNO=[$0], total=[+(1, +(2, 3))])\n"
"LogicalProject(EMPNO=[$0], total=[+(1, +(2:BIGINT, 3:BIGINT))])\n"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This snapshot is the one I couldn't reconcile: the inner +(2:BIGINT, 3:BIGINT) widens both literals, but the outer operand stays plain 1 (INTEGER), whereas selectExpressionWithoutFrom's 1 + 1 widens both to 1:BIGINT. If the outer + in the sum reduction goes through the same widening operator, I'd expect +(1:BIGINT, +(2:BIGINT, 3:BIGINT)). Can you confirm whether the outer add is intentionally not widened (and that a single + mixing INTEGER + BIGINT operands is expected)?

@ahkcs

ahkcs commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

Superseded by #5604. After discussion on #5164/#5204, the approach changed from operand-widening (silently promoting short→int, int→long, which also changed result column types) to overflow detection via Calcite checked arithmetic (error on overflow, matching the maintainer direction in #5204). Closing this in favor of #5604.

@ahkcs ahkcs closed this Jul 1, 2026
@ahkcs ahkcs reopened this Jul 1, 2026
@ahkcs

ahkcs commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

Reopening. Verified this widening approach fixes the overflow on both execution paths from a single change in the SQL-plugin lowering (PPLFuncImpTable), because the operand CASTs are baked into the RelNode before the engine fork:

  • Calcite/v2 path: short * short (30000×30000) → result type int, value 900000000 (correct, no wrap).
  • Analytics-engine / DataFusion path: same query routes to DataFusion as *(CAST($n):INTEGER, CAST($m):INTEGER) (confirmed via profile backends=[datafusion]) → int 900000000; the reported | where area > 2000000 now returns the matching row instead of 0. int * int widens to bigint.

Note: this is the widening approach; #5604 explores the alternative of erroring on overflow via Calcite checked arithmetic. The checked approach does not reach the DataFusion backend (CHECKED_* has no Substrait representation), so it only fixes the Calcite path. Flagging both for a direction decision.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 6d53547

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to a2e7706

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Verify scalar subquery before widening

The method only checks if either operand contains a subquery but doesn't verify that
the comparison actually involves a scalar subquery. Non-scalar subqueries (e.g., IN
with a multi-row result) should not trigger this widening logic, as they have
different semantics.

core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java [633-640]

 private static RexNode[] coerceNumericComparisonOperands(
     RexBuilder builder, BuiltinFunctionName functionName, RexNode... args) {
   if (args.length != 2 || !BuiltinFunctionName.COMPARATORS.contains(functionName)) {
     return args;
   }
-  if (!containsSubQuery(args[0]) && !containsSubQuery(args[1])) {
+  boolean hasScalarSubquery = (args[0] instanceof RexSubQuery && ((RexSubQuery) args[0]).getKind() == SqlKind.SCALAR_QUERY)
+      || (args[1] instanceof RexSubQuery && ((RexSubQuery) args[1]).getKind() == SqlKind.SCALAR_QUERY);
+  if (!hasScalarSubquery) {
     return args;
   }
   ...
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion identifies a valid concern that the method doesn't distinguish between scalar and non-scalar subqueries. The improved code adds a check for SqlKind.SCALAR_QUERY, which would make the logic more precise. However, the existing containsSubQuery helper may already handle this correctly through Calcite's decorrelation logic, so the impact is moderate.

Medium
Preserve precision in DECIMAL coercion

The DECIMAL case uses doubleValue() which can lose precision for large integers. For
numeric types that aren't already BigDecimal, consider using
BigDecimal.valueOf(num.longValue()) for integer types or checking the actual runtime
type to preserve precision.

core/src/main/java/org/opensearch/sql/expression/function/CollectionUDF/MVAppendCore.java [60-75]

 private static Object coerce(Object value, SqlTypeName elementType) {
   if (elementType == null || !(value instanceof Number)) {
     return value;
   }
   Number num = (Number) value;
   return switch (elementType) {
     case TINYINT -> num.byteValue();
     case SMALLINT -> num.shortValue();
     case INTEGER -> num.intValue();
     case BIGINT -> num.longValue();
     case FLOAT, REAL -> num.floatValue();
     case DOUBLE -> num.doubleValue();
-    case DECIMAL -> num instanceof BigDecimal ? num : BigDecimal.valueOf(num.doubleValue());
+    case DECIMAL -> num instanceof BigDecimal ? num : 
+        (num instanceof Float || num instanceof Double) ? 
+        BigDecimal.valueOf(num.doubleValue()) : 
+        BigDecimal.valueOf(num.longValue());
     default -> value;
   };
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a potential precision loss when coercing to DECIMAL using doubleValue() for integer types. However, the impact is moderate since this affects edge cases with large integers, and the improved code adds complexity. The suggestion is valid but not critical.

Low
Handle nested lambda expressions

The method recursively checks for lambda parameters but doesn't handle RexLambda
nodes themselves. If a lambda expression is nested within the arithmetic operation,
the check should also inspect the lambda body to avoid missing lambda parameter
references.

core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java [847-855]

 private static boolean referencesLambdaParameter(RexNode node) {
   if (node instanceof RexLambdaRef) {
     return true;
+  }
+  if (node instanceof RexLambda lambda) {
+    return referencesLambdaParameter(lambda.getExpression());
   }
   if (node instanceof RexCall call) {
     return call.getOperands().stream().anyMatch(AbstractBuilder::referencesLambdaParameter);
   }
   return false;
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion points out that RexLambda nodes are not explicitly handled in the recursive check. While this is a valid observation, the existing code already handles RexCall which would include lambda expressions as operands. The improvement is minor and may not be necessary given the current implementation context.

Low

Previous suggestions

Suggestions up to commit 82ed0c1
CategorySuggestion                                                                                                                                    Impact
General
Preserve integer precision in DECIMAL coercion

The DECIMAL case uses doubleValue() which can lose precision for large integers.
When coercing integer types to DECIMAL, use BigDecimal.valueOf(num.longValue()) to
preserve exact integer values without floating-point rounding errors.

core/src/main/java/org/opensearch/sql/expression/function/CollectionUDF/MVAppendCore.java [60-75]

 private static Object coerce(Object value, SqlTypeName elementType) {
   if (elementType == null || !(value instanceof Number)) {
     return value;
   }
   Number num = (Number) value;
   return switch (elementType) {
     case TINYINT -> num.byteValue();
     case SMALLINT -> num.shortValue();
     case INTEGER -> num.intValue();
     case BIGINT -> num.longValue();
     case FLOAT, REAL -> num.floatValue();
     case DOUBLE -> num.doubleValue();
-    case DECIMAL -> num instanceof BigDecimal ? num : BigDecimal.valueOf(num.doubleValue());
+    case DECIMAL -> num instanceof BigDecimal ? num : 
+        (num instanceof Byte || num instanceof Short || num instanceof Integer || num instanceof Long) 
+        ? BigDecimal.valueOf(num.longValue()) 
+        : BigDecimal.valueOf(num.doubleValue());
     default -> value;
   };
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a precision loss issue when coercing large integers to DECIMAL via doubleValue(). Using longValue() for integer types preserves exact values. However, the improved code could be simplified and the impact is moderate since DECIMAL coercion is a less common path.

Medium
Document subquery-specific coercion rationale

The method only coerces when a subquery is present, but the join key type mismatch
can occur in other scenarios. Consider broadening the condition or documenting why
this specific case requires special handling while others don't.

core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java [633-640]

 private static RexNode[] coerceNumericComparisonOperands(
     RexBuilder builder, BuiltinFunctionName functionName, RexNode... args) {
   if (args.length != 2 || !BuiltinFunctionName.COMPARATORS.contains(functionName)) {
     return args;
   }
+  // Only coerce for subquery comparisons where decorrelation creates join keys
   if (!containsSubQuery(args[0]) && !containsSubQuery(args[1])) {
     return args;
   }
   ...
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion asks for additional documentation, but the existing comment at lines 617-632 already explains why subquery comparisons require special handling (decorrelation creates join keys with mismatched types). The improved code only adds a redundant comment without substantive improvement.

Low
Suggestions up to commit 6d53547
CategorySuggestion                                                                                                                                    Impact
General
Skip redundant casts for matching types

The cast operation may fail or produce unexpected results if the source type is
already wider than the target. Add a check to skip casting when args[i] already has
the promoted type to avoid redundant operations and potential issues.

core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java [761-772]

 private static RexNode[] widenIntegerOperands(RexBuilder builder, RexNode... args) {
   SqlTypeName promoted = promotedIntegerType(args);
   if (promoted == null) {
     return args;
   }
   RexNode[] widened = new RexNode[args.length];
   for (int i = 0; i < args.length; i++) {
-    RelDataType target = TYPE_FACTORY.createSqlType(promoted, args[i].getType().isNullable());
-    widened[i] = builder.makeCast(target, args[i]);
+    if (args[i].getType().getSqlTypeName() == promoted) {
+      widened[i] = args[i];
+    } else {
+      RelDataType target = TYPE_FACTORY.createSqlType(promoted, args[i].getType().isNullable());
+      widened[i] = builder.makeCast(target, args[i]);
+    }
   }
   return widened;
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that redundant casts could be avoided when the operand already has the promoted type. However, this is a minor optimization that doesn't fix a bug or significantly improve performance. The current implementation is correct and the suggested optimization would only marginally improve efficiency in specific cases.

Low

@ahkcs ahkcs force-pushed the pr/integer-arithmetic-overflow-widening branch from 6d53547 to 2ea4a2d Compare July 2, 2026 09:29
@ahkcs ahkcs added the bug Something isn't working label Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 2ea4a2d

@ahkcs ahkcs force-pushed the pr/integer-arithmetic-overflow-widening branch from 2ea4a2d to 82ed0c1 Compare July 2, 2026 22:02
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 82ed0c1

PPL integer arithmetic (`+`, `-`, `*` and the named add/subtract/multiply
functions) inferred `SMALLINT op SMALLINT -> SMALLINT` (and likewise for
TINYINT) because the operators were registered with Calcite's stock
`SqlStdOperatorTable.PLUS/MINUS/MULTIPLY`, whose return-type inference
(`ReturnTypes.NULLABLE_SUM` / `PRODUCT_NULLABLE`) falls through to
`LEAST_RESTRICTIVE` for non-decimal integers.

As a result the product/sum of two narrow-integer columns overflowed the
inferred type on every backend, just differently:
  - DataFusion / analytics-engine: silently wraps the i16 result, so e.g.
    `eval area = ResolutionWidth * ResolutionHeight | where area > 2000000`
    returned 0 rows instead of the matching row.
  - Calcite Enumerable engine: throws `ArithmeticException: value out of
    range`.
  - v2 legacy engine: `ExprShortValue` narrows via `shortValue()`, wrapping.

Fix in the SQL-plugin lowering so all backends are corrected at once: widen
the operands (byte/short -> INTEGER, any int/long -> BIGINT) before applying
the operator. Casting the operands rather than only relabelling the result
type is required, otherwise DataFusion still computes the narrow multiply and
wraps before any outer cast. Non-integral operands (float/double/decimal/
datetime/mixed) are left untouched and defer to Calcite's default inference.
The string-concat `ADD` variant and the DATETIME-DATETIME `SUBTRACT` variant
are unchanged.

mvindex's internal array-index arithmetic now uses the raw Calcite
PLUS/MINUS operators so array indices stay INTEGER for ITEM/ARRAY_SLICE
codegen (the widened result would otherwise be rejected as a long index).

Note: this changes user-visible result column types for integer arithmetic
(int-operand expressions now report bigint), which is the intended trade-off
for overflow-safe, backend-consistent results.

Adds CalcitePPLBuiltinFunctionIT coverage for the short->int and int->bigint
widening tiers and updates the affected logical-plan / Spark-SQL snapshots.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the pr/integer-arithmetic-overflow-widening branch from 82ed0c1 to a2e7706 Compare July 2, 2026 22:14
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit a2e7706

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants