Skip to content

Overriding profile endpoint with analyze endpoint with operator tree and profiling#5568

Open
Krish-Gandhi wants to merge 9 commits into
opensearch-project:mainfrom
Krish-Gandhi:feature/analyze-endpoint
Open

Overriding profile endpoint with analyze endpoint with operator tree and profiling#5568
Krish-Gandhi wants to merge 9 commits into
opensearch-project:mainfrom
Krish-Gandhi:feature/analyze-endpoint

Conversation

@Krish-Gandhi

@Krish-Gandhi Krish-Gandhi commented Jun 19, 2026

Copy link
Copy Markdown

Description

  • Introduces the analyze endpoint for PPL queries, which can be activated by passing "analyze": true as a request body parameter.
  • Overrides the existing profile endpoint to run analyze functionality.
  • Propagate PPLanalyze flag through transport and request parsing.
  • Combines logical plan and physical plan nodes by walking each tree and grouping nodes into an operator_tree.
  • Maps PPL query segments and estimated/actual rows to operator_tree nodes.
  • Combined current profile plan response with operator_tree nodes to calculate time taken by each node.

Note: This PR provides a simple implementation of creating the operator_tree, which works for simpler queries. The logic doesn't hold for queries that produce non-linear physical plan trees (for example, JOINs). This only affects the operator_tree, meaning the subset of the response that corresponds to profile is 100% functional. Therefore, in this scenario, the analyze endpoint will "fallback" to the profile output by returning a response mirroring the profile endpoint (by only including the profile, schema, datarows, total, and size fields in the response). This operator_tree issue will be addressed in a later PR.


Important

This PR overrides the existing profile endpoint by routing all requests with either "analyze": true or "profile": true (or both) to pplService.analyze() in TransportPPLQueryAction.java. For current end-users of the profile endpoint, this will make little difference, as the response from profile is a subset of the response from analyze. In simpler terms, current end-users can make no changes and have similar results.

All of the existing code for profile still exists and is in place. This makes it extremely simple to separate analyze and profile again. This can be done by removing the || transformedRequest.profile() part of the boolean expression in line 200 of TransportPPLQueryAction.java.

plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryAction.java:193-209:

    if (transformedRequest.isExplainRequest()) {
      pplService.explain(
          transformedRequest, createExplainResponseListener(transformedRequest, clearingListener));
    /**
     * Removing  `|| transformedRequest.profile()` from line 200 will
     * separate the `profile` and `analyze` endpoints. See PR #5568.
     */
    } else if (transformedRequest.analyze() || transformedRequest.profile()) {
      pplService.analyze(
        transformedRequest, createAnalyzeResponseListener(transformedRequest, clearingListener));
    } else {
      pplService.execute(
          transformedRequest,
          createListener(transformedRequest, clearingListener),
          createExplainResponseListener(transformedRequest, clearingListener));
    }
  }

Example Query and Response

The following curl command will run the query source=accounts | where age < 30 | eval full_name = firstname + \" \" + lastname | fields full_name, email, age on the analyze endpoint:

curl -X POST "localhost:9200/_plugins/_ppl" \
  -H "Content-Type: application/json" \
  -d '{"query": "source=accounts | where age < 30 | eval full_name = firstname + \" \" + lastname | fields full_name, email, age", "analyze": true}'

The response of this will be as follows. (NOTE: The "logicalPlan" and "physicalPlan" fields are included for debugging purposes and should not be included in the final version of this endpoint.)

{
  "query": "source=accounts | where age < 30 | eval full_name = firstname + \" \" + lastname | fields full_name, email, age",
  "querySegments": [
    {
      "nodeType": "SearchFrom",
      "source": "source=accounts"
    },
    {
      "nodeType": "WhereCommand",
      "source": "where age < 30"
    },
    {
      "nodeType": "EvalCommand",
      "source": "eval full_name = firstname + \" \" + lastname"
    },
    {
      "nodeType": "FieldsCommand",
      "source": "fields full_name, email, age"
    }
  ],
  "logicalPlan": [
    "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT]): rowcount = 5000.0, cumulative cost = {114000.0 rows, 145000.0 cpu, 0.0 io}, id = 103",
    "LogicalProject(full_name=[||(||($0, ' '), $4)], email=[$3], age=[$2]): rowcount = 5000.0, cumulative cost = {109000.0 rows, 25000.0 cpu, 0.0 io}, id = 102",
    "LogicalFilter(condition=[<($2, 30)]): rowcount = 5000.0, cumulative cost = {104000.0 rows, 10000.0 cpu, 0.0 io}, id = 100",
    "CalciteLogicalIndexScan(table=[[OpenSearch, accounts]]): rowcount = 10000.0, cumulative cost = {99000.0 rows, 0.0 cpu, 0.0 io}, id = 99"
  ],
  "physicalPlan": [
    "EnumerableCalc(expr#0..3=[{inputs}], expr#4=[' '], expr#5=[||($t0, $t4)], expr#6=[||($t5, $t3)], full_name=[$t6], email=[$t2], age=[$t1]): rowcount = 5000.0, cumulative cost = {22996.4 rows, 50000.0 cpu, 0.0 io}, id = 193",
    "CalciteEnumerableIndexScan(table=[[OpenSearch, accounts]], PushDownContext=[[PROJECT->[firstname, age, email, lastname], FILTER-><($1, 30), LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":10000,\"timeout\":\"1m\",\"query\":{
  ],
  "profile": {
    "summary": {
      "total_time_ms": 777.18
    },
    "phases": {
      "analyze": {
        "time_ms": 477.57
      },
      "optimize": {
        "time_ms": 225.46
      },
      "execute": {
        "time_ms": 73.41
      },
      "format": {
        "time_ms": 0.11
      }
    },
    "plan": {
      "node": "EnumerableCalc",
      "time_ms": 64.19,
      "rows": 3,
      "children": [
        {
          "node": "CalciteEnumerableIndexScan",
          "time_ms": 64.0,
          "rows": 3
        }
      ]
    }
  },
  "operator_tree": [
    {
      "source": "source=accounts | where age < 30",
      "node_type": [
        "SearchFrom",
        "WhereCommand"
      ],
      "description": [
        "CalciteLogicalIndexScan(table=[[OpenSearch, accounts]]): rowcount = 10000.0, cumulative cost = {99000.0 rows, 0.0 cpu, 0.0 io}, id = 99",
        "LogicalFilter(condition=[<($2, 30)]): rowcount = 5000.0, cumulative cost = {104000.0 rows, 10000.0 cpu, 0.0 io}, id = 100"
      ],
      "estimated_rows": 5000,
      "actual_time_ms": "64.00 ms",
      "actual_rows": 3,
      "is_pushed_down": true
    },
    {
      "source": "eval full_name = firstname + \" \" + lastname | fields full_name, email, age",
      "node_type": [
        "EvalCommand",
        "FieldsCommand"
      ],
      "description": [
        "LogicalProject(full_name=[||(||($0, ' '), $4)], email=[$3], age=[$2]): rowcount = 5000.0, cumulative cost = {109000.0 rows, 25000.0 cpu, 0.0 io}, id = 102"
      ],
      "estimated_rows": 5000,
      "actual_time_ms": "0.19 ms",
      "actual_rows": 3
    }
  ],
  "recommendations": [
    {
      "serverity": "INFO",
      "rule": "Pushdown visibility",
      "message": "1 of 2 stages pushed down; 1 ran in-memory"
    },
    {
      "serverity": "INFO",
      "rule": "Bottleneck stage",
      "message": "87% of time is in the *SearchFrom, WhereCommand* stage",
      "affected_node": "source=accounts | where age < 30",
      "suggestion": "Consider optimizing the SearchFrom, WhereCommand operation"
    }
  ],
  "schema": [
    {
      "name": "full_name",
      "type": "STRING"
    },
    {
      "name": "email",
      "type": "STRING"
    },
    {
      "name": "age",
      "type": "LONG"
    }
  ],
  "datarows": [
    [
      "Jane Smith",
      "jane@example.com",
      28
    ],
    [
      "Kyle Miller",
      "goat@example.com",
      22
    ],
    [
      "Joette Kap",
      "coast2coast@example.com",
      22
    ]
  ],
  "total": 3,
  "size": 3
}

Performance of analyze

After writing a benchmarking script to run 20 queries on a sample 10 GB dataset, the results were as follows. This shows that analyze has a small enough overhead to justify running it by default on the Discover tab on the OpenSearch Dashboard.

=== PPL Query Benchmark ===

--- Results ---

normal      avg=4489.8ms  p50=206.9ms  p95=1083.1ms  min=28.5ms  max=84080.9ms
profile     avg=4407.9ms  p50=197.9ms  p95=749.1ms  min=19.1ms  max=83598.1ms
analyze     avg=4410.2ms  p50=213.3ms  p95=801.3ms  min=38.3ms  max=83146.3ms

Related Issues

#5500
Resolves #4343

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…and profiling

Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
@Krish-Gandhi Krish-Gandhi marked this pull request as ready for review June 19, 2026 22:16
@ahkcs ahkcs added the enhancement New feature or request label Jun 24, 2026
Comment thread core/src/main/java/org/opensearch/sql/executor/QueryService.java Fixed
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit ff87829)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

In visitChildren, when tracking is enabled, the method iterates over children and calls child.accept(this, context) for each, but only the last child's result is returned. If a node has multiple children, all but the last child's results are discarded. This breaks the visitor pattern for multi-child nodes (e.g., joins, unions). The non-tracking path calls super.visitChildren(node, context) which properly handles multiple children.

if (context.isTrackingEnabled() && node instanceof UnresolvedPlan) {
  // Track each child's total contribution (the subtree it produces)
  RelNode result = null;
  for (org.opensearch.sql.ast.Node child : node.getChild()) {
    int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
    RelNode childResult = child.accept(this, context);
    result = childResult;
    // After child.accept returns, the child's visit* method has fully completed,
    // so all RelNodes produced by that child (including ITS children) are on the stack.
    int idAfter = context.relBuilder.peek().getId();
    if (child instanceof UnresolvedPlan) {
      java.util.List<Integer> producedIds = new java.util.ArrayList<>();
      for (int id = idBefore + 1; id <= idAfter; id++) {
        producedIds.add(id);
      }
      context.recordMapping(child.getClass().getSimpleName(), producedIds);
    }
  }
  if (node instanceof UnresolvedPlan plan) {
    mapPathMaterializer.materializePaths(plan, context);
  }
  return result;
}
Thread Safety

QueryContext.setProfile(true) at line 256 modifies a shared context without synchronization. If multiple threads call analyzeWithCalcite concurrently, they will race on this global flag. One thread's profile setting can leak into another thread's execution, causing incorrect profiling behavior or missing profile data.

QueryContext.setProfile(true);
Resource Leak

At line 367, OpenSearchRelRunners.run(context, calcitePlan) returns a PreparedStatement that is immediately closed in a try-with-resources block. However, if an exception occurs between line 367 and the try block's opening, the statement is never closed. The Hook.Closeable at line 359 is properly closed, but the PreparedStatement is not guaranteed cleanup if run() throws before the try block executes.

    OpenSearchRelRunners.run(context, calcitePlan)) {
} catch (java.sql.SQLException e) {
  throw new RuntimeException(e);
Possible Issue

buildOperatorTree assumes a linear plan structure (single-child chains). When isLinearPlanTree returns false (line 310), the method returns early with a fallback response. However, the fallback response omits query, querySegments, logicalPlan, physicalPlan, and operator_tree fields (lines 330-336), which breaks the documented API contract. Clients expecting these fields will receive incomplete responses for non-linear queries like JOINs.

if (profile != null && profile.getPlan() != null && !isLinearPlanTree(profile)) {
  List<AnalyzeResponse.SchemaColumn> schema = new ArrayList<>();
  if (queryResponse.getSchema() != null) {
    for (ExecutionEngine.Schema.Column col : queryResponse.getSchema().getColumns()) {
      schema.add(
          AnalyzeResponse.SchemaColumn.builder()
              .name(col.getName())
              .type(col.getExprType().typeName())
              .build());
    }
  }
  Object[][] datarows = new Object[queryResponse.getResults().size()][];
  int rowIdx = 0;
  for (var exprValue : queryResponse.getResults()) {
    datarows[rowIdx++] =
        exprValue.tupleValue().entrySet().stream()
            .map(e -> e.getValue().value())
            .toArray(Object[]::new);
  }
  listener.onResponse(
      AnalyzeResponse.builder()
          // .query(query)
          .profile(profile)
          .schema(schema)
          .datarows(datarows)
          .total(datarows.length)
          .size(datarows.length)
          .build());
  return;
}
Possible Issue

In buildOperatorTree, the calculation of pushedNodeCount at line 486 assumes logicalDepth - physicalDepth equals the number of pushed-down logical nodes. This assumption breaks when the physical plan has a different structure than expected (e.g., additional physical operators inserted by the optimizer). If pushedNodeCount is negative or exceeds the actual number of segments, the loop at line 503 will produce incorrect pushedSegments, leading to wrong pushdown attribution in the operator tree.

int physicalDepth = getLinearDepth(physicalPlan);
int logicalDepth = getLinearDepth(logicalPlan);
int pushedNodeCount = logicalDepth - physicalDepth;

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to ff87829

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against empty builder stack

Check if context.relBuilder is empty before calling peek() after accept() returns.
If the builder is empty, peek() will throw NoSuchElementException, causing the
analyze operation to fail unexpectedly.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [222-232]

 if (context.isTrackingEnabled()) {
   int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
   RelNode result = unresolved.accept(this, context);
+  if (context.relBuilder.size() == 0) {
+    return result;
+  }
   int idAfter = context.relBuilder.peek().getId();
   java.util.List<Integer> producedIds = new java.util.ArrayList<>();
   for (int id = idBefore + 1; id <= idAfter; id++) {
     producedIds.add(id);
   }
   context.recordMapping(unresolved.getClass().getSimpleName(), producedIds);
   return result;
 }
Suggestion importance[1-10]: 8

__

Why: This addresses a potential NoSuchElementException when calling peek() on an empty relBuilder stack after accept() returns. The check is necessary to prevent runtime failures during tracking.

Medium
Add timeout to latch await

Add a timeout to latch.await() to prevent indefinite blocking if the async callback
never completes. Without a timeout, the thread could hang forever if
executeWithCalcite fails to invoke either callback, leading to resource exhaustion.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [292-297]

-CountDownLatch latch = new CountDownLatch(1);
-
-executeWithCalcite(
-    plan,
-    queryType,
-    null,
-    new ResponseListener<>() {
-      @Override
-      public void onResponse(ExecutionEngine.QueryResponse response) {
-        ...
-        latch.countDown();
-      }
-
-      @Override
-      public void onFailure(Exception e) {
-        errorRef.set(e);
-        latch.countDown();
-      }
-    });
-
 try {
-  latch.await();
+  if (!latch.await(60, TimeUnit.SECONDS)) {
+    listener.onFailure(new RuntimeException("Timeout waiting for query execution"));
+    return;
+  }
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   listener.onFailure(new RuntimeException("Interrupted while waiting for query execution", e));
   return;
 }
Suggestion importance[1-10]: 7

__

Why: Adding a timeout prevents indefinite blocking if the async callback fails to complete, which is a valid concern for production systems. However, the 60-second timeout value is arbitrary and may need tuning based on actual query execution times.

Medium
General
Prevent negative pushed node count

Validate that pushedNodeCount is non-negative before using it as a loop boundary. If
physicalDepth exceeds logicalDepth due to unexpected plan structure, the negative
value could cause incorrect operator tree grouping or infinite loops.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [486-508]

-int pushedNodeCount = logicalDepth - physicalDepth;
+int pushedNodeCount = Math.max(0, logicalDepth - physicalDepth);
 ...
 long pushedLogicalNodes = 0;
 int pushedSegments = 0;
 for (int idx = 0; idx < querySegments.size() && pushedLogicalNodes < pushedNodeCount; idx++) {
   Set<Integer> ids = idx < exclusiveIds.size() ? exclusiveIds.get(idx) : Set.of();
   long planNodeCount = ids.stream().filter(idToDescription::containsKey).count();
   pushedLogicalNodes += planNodeCount;
   pushedSegments++;
 }
Suggestion importance[1-10]: 6

__

Why: Using Math.max(0, ...) prevents negative values that could cause incorrect operator tree grouping. While this is a defensive programming practice, the scenario where physicalDepth exceeds logicalDepth may indicate a deeper issue that should be investigated rather than silently handled.

Low
Include empty operator tree field

Set operator_tree to an empty list instead of omitting it when falling back to
profile-only mode. Consumers expecting the field to always be present may encounter
null pointer exceptions or deserialization errors if it's missing.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [329-338]

-if (profile != null && profile.getPlan() != null && !isLinearPlanTree(profile)) {
-  ...
-  listener.onResponse(
-      AnalyzeResponse.builder()
-          .profile(profile)
-          .schema(schema)
-          .datarows(datarows)
-          .total(datarows.length)
-          .size(datarows.length)
-          .build());
-  return;
-}
+listener.onResponse(
+    AnalyzeResponse.builder()
+        .profile(profile)
+        .operator_tree(List.of())
+        .schema(schema)
+        .datarows(datarows)
+        .total(datarows.length)
+        .size(datarows.length)
+        .build());
Suggestion importance[1-10]: 5

__

Why: Setting operator_tree to an empty list instead of omitting it improves API consistency and prevents potential null pointer exceptions in consumers. However, the AnalyzeResponse.builder() may already handle null fields appropriately, and the documentation should clarify the field's optionality.

Low

Previous suggestions

Suggestions up to commit 7f164a0
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against empty builder stack

Accessing context.relBuilder.peek() after unresolved.accept() can throw
NoSuchElementException if the builder stack is empty. Verify the builder has
elements before calling peek() to prevent runtime crashes during tracking.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [222-232]

 if (context.isTrackingEnabled()) {
   int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
   RelNode result = unresolved.accept(this, context);
+  if (context.relBuilder.size() == 0) {
+    return result;
+  }
   int idAfter = context.relBuilder.peek().getId();
   java.util.List<Integer> producedIds = new java.util.ArrayList<>();
   for (int id = idBefore + 1; id <= idAfter; id++) {
     producedIds.add(id);
   }
   context.recordMapping(unresolved.getClass().getSimpleName(), producedIds);
   return result;
 }
Suggestion importance[1-10]: 8

__

Why: This is a critical safety check. After unresolved.accept() at line 224, the builder stack could be empty, and calling peek() at line 225 would throw NoSuchElementException. The guard prevents a potential runtime crash during tracking.

Medium
Add timeout to latch await

Add a timeout to latch.await() to prevent indefinite blocking if the async callback
never completes. Without a timeout, the thread could hang forever if
executeWithCalcite fails to invoke either callback, leading to resource exhaustion.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [291-297]

-CountDownLatch latch = new CountDownLatch(1);
-
-executeWithCalcite(
-    plan,
-    queryType,
-    null,
-    new ResponseListener<>() {
-      @Override
-      public void onResponse(ExecutionEngine.QueryResponse response) {
-        ...
-        latch.countDown();
-      }
-
-      @Override
-      public void onFailure(Exception e) {
-        errorRef.set(e);
-        latch.countDown();
-      }
-    });
-
 try {
-  latch.await();
+  if (!latch.await(60, TimeUnit.SECONDS)) {
+    listener.onFailure(new RuntimeException("Query execution timed out after 60 seconds"));
+    return;
+  }
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   listener.onFailure(new RuntimeException("Interrupted while waiting for query execution", e));
   return;
 }
Suggestion importance[1-10]: 7

__

Why: Adding a timeout prevents indefinite blocking if the async callback fails to complete, which is a valid concern for production code. However, the 60-second timeout value is arbitrary and may need tuning based on actual query execution patterns.

Medium
Prevent negative pushed node count

When pushedNodeCount is negative (if physicalDepth > logicalDepth), the loop
condition pushedLogicalNodes < pushedNodeCount will never be satisfied, causing all
segments to be marked as pushed down incorrectly. Add validation to ensure
pushedNodeCount is non-negative.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [484-508]

-int pushedNodeCount = logicalDepth - physicalDepth;
+int pushedNodeCount = Math.max(0, logicalDepth - physicalDepth);
 ...
 long pushedLogicalNodes = 0;
 int pushedSegments = 0;
 for (int idx = 0; idx < querySegments.size() && pushedLogicalNodes < pushedNodeCount; idx++) {
   Set<Integer> ids = idx < exclusiveIds.size() ? exclusiveIds.get(idx) : Set.of();
   long planNodeCount = ids.stream().filter(idToDescription::containsKey).count();
   pushedLogicalNodes += planNodeCount;
   pushedSegments++;
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that pushedNodeCount could be negative if physicalDepth > logicalDepth, which would cause incorrect pushdown marking. Using Math.max(0, ...) at line 486 ensures the value is non-negative and prevents logic errors in the subsequent loop.

Medium
General
Remove or uncomment query field

The commented-out .query(query) field should either be included or removed entirely.
Leaving commented code in production suggests incomplete implementation and may
confuse future maintainers about whether this field should be present.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [329-338]

 if (profile != null && profile.getPlan() != null && !isLinearPlanTree(profile)) {
   ...
   listener.onResponse(
       AnalyzeResponse.builder()
-          // .query(query)
+          .query(query)
           .profile(profile)
           .schema(schema)
           .datarows(datarows)
           .total(datarows.length)
           .size(datarows.length)
           .build());
   return;
 }
Suggestion importance[1-10]: 3

__

Why: While removing commented code improves maintainability, the commented field at line 332 appears intentional (possibly for debugging). The suggestion correctly identifies that the field should be included in the fallback response, as it is included in the main response at line 419.

Low
Suggestions up to commit 727e3da
CategorySuggestion                                                                                                                                    Impact
Possible issue
Prevent NoSuchElementException on empty builder

The code assumes context.relBuilder.peek() is always valid after
unresolved.accept(), but if the builder is empty, this will throw
NoSuchElementException. Add a null/empty check before accessing peek() to prevent
runtime failures when the builder stack is unexpectedly empty.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [222-232]

 if (context.isTrackingEnabled()) {
   int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
   RelNode result = unresolved.accept(this, context);
+  if (context.relBuilder.size() == 0) {
+    return result;
+  }
   int idAfter = context.relBuilder.peek().getId();
   java.util.List<Integer> producedIds = new java.util.ArrayList<>();
   for (int id = idBefore + 1; id <= idAfter; id++) {
     producedIds.add(id);
   }
   context.recordMapping(unresolved.getClass().getSimpleName(), producedIds);
   return result;
 }
Suggestion importance[1-10]: 8

__

Why: The code assumes context.relBuilder.peek() is valid after unresolved.accept(), but if the builder is empty, this will throw NoSuchElementException. This is a critical runtime safety issue that could cause crashes during tracking.

Medium
Guard against empty builder after child processing

Similar to the analyze method, this code accesses context.relBuilder.peek() without
verifying the builder is non-empty after child.accept(). If the child processing
empties the builder, peek() will throw NoSuchElementException. Add a size check
before calling peek().

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [242-247]

 int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
 RelNode childResult = child.accept(this, context);
 result = childResult;
+if (context.relBuilder.size() == 0) {
+  continue;
+}
 int idAfter = context.relBuilder.peek().getId();
Suggestion importance[1-10]: 8

__

Why: Similar to the first suggestion, this code accesses context.relBuilder.peek() without verifying the builder is non-empty after child.accept(). This is a critical safety issue that could cause NoSuchElementException at runtime.

Medium
General
Count only segments with plan nodes

The loop increments pushedSegments even when planNodeCount is zero, potentially
counting segments that contribute no logical nodes. This can lead to incorrect
operator tree grouping. Only increment pushedSegments when planNodeCount > 0 to
accurately track segments with actual plan nodes.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [461-467]

 int pushedSegments = 0;
 for (int idx = 0; idx < querySegments.size() && pushedLogicalNodes < pushedNodeCount; idx++) {
   Set<Integer> ids = idx < exclusiveIds.size() ? exclusiveIds.get(idx) : Set.of();
   long planNodeCount = ids.stream().filter(idToDescription::containsKey).count();
-  pushedLogicalNodes += planNodeCount;
-  pushedSegments++;
+  if (planNodeCount > 0) {
+    pushedLogicalNodes += planNodeCount;
+    pushedSegments++;
+  }
 }
Suggestion importance[1-10]: 7

__

Why: The loop increments pushedSegments even when planNodeCount is zero, which can lead to incorrect operator tree grouping. Only counting segments with actual plan nodes improves the accuracy of the pushdown detection logic.

Medium
Clean up profiling state on interruption

After catching InterruptedException and re-interrupting the thread, the method
returns without cleaning up QueryProfiling state. This could leave profiling in an
inconsistent state. Ensure QueryProfiling.noop() or similar cleanup is called before
returning to prevent resource leaks.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [284-290]

 try {
   latch.await();
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
+  QueryProfiling.noop();
   listener.onFailure(new RuntimeException("Interrupted while waiting for query execution", e));
   return;
 }
Suggestion importance[1-10]: 6

__

Why: After catching InterruptedException, the method returns without cleaning up QueryProfiling state. Adding QueryProfiling.noop() before returning ensures consistent state management and prevents potential resource leaks.

Low
Suggestions up to commit 127dfe7
CategorySuggestion                                                                                                                                    Impact
Possible issue
Prevent peek on empty builder

Check if context.relBuilder is empty before calling peek() after accept(). The
accept() call might not push any nodes onto the builder, causing peek() to throw
NoSuchElementException.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [223-225]

 int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
 RelNode result = unresolved.accept(this, context);
-int idAfter = context.relBuilder.peek().getId();
+int idAfter = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
Suggestion importance[1-10]: 8

__

Why: The code calls context.relBuilder.peek() after accept() without checking if the builder is empty, which could throw NoSuchElementException. This is a potential runtime error that should be prevented.

Medium
Add timeout to latch await

Add a timeout to latch.await() to prevent indefinite blocking if the async execution
never completes. Without a timeout, the thread could hang forever if the callback is
never invoked due to an unexpected error path.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [283-290]

-CountDownLatch latch = new CountDownLatch(1);
-
-executeWithCalcite(
-    plan,
-    queryType,
-    null,
-    new ResponseListener<>() {
-      @Override
-      public void onResponse(ExecutionEngine.QueryResponse response) {
-        ...
-        latch.countDown();
-      }
-
-      @Override
-      public void onFailure(Exception e) {
-        errorRef.set(e);
-        latch.countDown();
-      }
-    });
-
 try {
-  latch.await();
+  if (!latch.await(60, TimeUnit.SECONDS)) {
+    listener.onFailure(new RuntimeException("Query execution timed out after 60 seconds"));
+    return;
+  }
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   listener.onFailure(new RuntimeException("Interrupted while waiting for query execution", e));
   return;
 }
Suggestion importance[1-10]: 7

__

Why: Adding a timeout to latch.await() prevents indefinite blocking if the async callback never completes. This is a valid defensive programming practice that improves robustness, though the system may have other safeguards in place.

Medium
General
Validate ID range before iteration

Validate that idAfter >= idBefore before creating the range. If node IDs are not
monotonically increasing or if the builder state is inconsistent, this could produce
incorrect or negative ranges.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [227-229]

-for (int id = idBefore + 1; id <= idAfter; id++) {
-  producedIds.add(id);
+if (idAfter >= idBefore) {
+  for (int id = idBefore + 1; id <= idAfter; id++) {
+    producedIds.add(id);
+  }
 }
Suggestion importance[1-10]: 6

__

Why: While validating idAfter >= idBefore adds safety, the logic assumes monotonically increasing IDs. The check prevents incorrect ranges but may indicate a deeper issue if the condition fails. This is a reasonable defensive check.

Low
Restore profiling state after tracking

Ensure QueryProfiling.noop() is properly restored after the tracking phase
completes. If an exception occurs during the second phase, profiling state might
remain in noop mode, affecting subsequent queries.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [305-313]

 QueryProfiling.noop();
-CalciteClassLoaderHelper.withCalciteClassLoader(
-    () -> {
-      CalcitePlanContext context =
-          CalcitePlanContext.create(
-              buildFrameworkConfig(), SysLimit.fromSettings(settings), queryType);
-      context.setTrackingEnabled(true);
-      RelNode relNode = analyze(plan, context);
-      ...
-    },
-    QueryService.class);
+try {
+  CalciteClassLoaderHelper.withCalciteClassLoader(
+      () -> {
+        CalcitePlanContext context =
+            CalcitePlanContext.create(
+                buildFrameworkConfig(), SysLimit.fromSettings(settings), queryType);
+        context.setTrackingEnabled(true);
+        RelNode relNode = analyze(plan, context);
+        ...
+      },
+      QueryService.class);
+} finally {
+  QueryProfiling.current();
+}
Suggestion importance[1-10]: 5

__

Why: The suggestion to restore profiling state in a finally block is reasonable, but the improved_code calls QueryProfiling.current() which may not properly restore the previous state. The concern is valid but the solution needs verification.

Low
Suggestions up to commit 50ddb92
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null safety check

The code assumes context.relBuilder.peek() is always non-null after
unresolved.accept(), which may not be true if the visitor empties the builder stack.
Add a null check before calling getId() to prevent potential NullPointerException.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [222-232]

 if (context.isTrackingEnabled()) {
   int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
   RelNode result = unresolved.accept(this, context);
-  int idAfter = context.relBuilder.peek().getId();
-  java.util.List<Integer> producedIds = new java.util.ArrayList<>();
-  for (int id = idBefore + 1; id <= idAfter; id++) {
-    producedIds.add(id);
+  if (context.relBuilder.size() > 0) {
+    int idAfter = context.relBuilder.peek().getId();
+    java.util.List<Integer> producedIds = new java.util.ArrayList<>();
+    for (int id = idBefore + 1; id <= idAfter; id++) {
+      producedIds.add(id);
+    }
+    context.recordMapping(unresolved.getClass().getSimpleName(), producedIds);
   }
-  context.recordMapping(unresolved.getClass().getSimpleName(), producedIds);
   return result;
 }
Suggestion importance[1-10]: 8

__

Why: The code assumes context.relBuilder.peek() is non-null after unresolved.accept(), which could cause a NullPointerException if the visitor empties the builder stack. This is a critical safety issue that should be addressed.

Medium
Prevent null pointer access

Similar to the analyze method, context.relBuilder.peek() may be null after
child.accept() completes. Verify the builder is non-empty before accessing
peek().getId() to avoid NullPointerException.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [242-247]

 int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
 RelNode childResult = child.accept(this, context);
 result = childResult;
-int idAfter = context.relBuilder.peek().getId();
+if (context.relBuilder.size() > 0) {
+  int idAfter = context.relBuilder.peek().getId();
+  ...
+}
Suggestion importance[1-10]: 8

__

Why: Similar to the first suggestion, context.relBuilder.peek() may be null after child.accept() completes. This is a critical safety issue that could cause a NullPointerException in production.

Medium
General
Validate hook callback type

The hook callback casts obj to RelRoot without validation. If the hook receives an
unexpected object type, this will throw ClassCastException. Add type checking before
casting to handle unexpected hook invocations gracefully.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [316-330]

-AtomicReference<RelNode> physicalRelRef = new AtomicReference<>();
 try (Hook.Closeable closeable =
     Hook.PLAN_BEFORE_IMPLEMENTATION.addThread(
         obj -> {
-          RelRoot relRoot = (RelRoot) obj;
-          physicalRelRef.set(relRoot.rel);
-          physicalPlanRef.set(
-              RelOptUtil.toString(relRoot.rel, SqlExplainLevel.ALL_ATTRIBUTES));
+          if (obj instanceof RelRoot relRoot) {
+            physicalRelRef.set(relRoot.rel);
+            physicalPlanRef.set(
+                RelOptUtil.toString(relRoot.rel, SqlExplainLevel.ALL_ATTRIBUTES));
+          }
         })) {
-  try (java.sql.PreparedStatement ignored =
-      OpenSearchRelRunners.run(context, calcitePlan)) {
-  } catch (java.sql.SQLException e) {
-    throw new RuntimeException(e);
-  }
+  ...
 }
Suggestion importance[1-10]: 7

__

Why: The unchecked cast to RelRoot could throw ClassCastException if the hook receives an unexpected object type. Adding type validation improves robustness and prevents unexpected runtime failures.

Medium
Cleanup profiling on interruption

After catching InterruptedException and re-interrupting the thread, the method
returns without cleaning up QueryProfiling state. Ensure QueryProfiling.noop() or
equivalent cleanup is called before returning to prevent resource leaks.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [284-290]

 try {
   latch.await();
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
+  QueryProfiling.noop();
   listener.onFailure(new RuntimeException("Interrupted while waiting for query execution", e));
   return;
 }
Suggestion importance[1-10]: 6

__

Why: Adding QueryProfiling.noop() cleanup before returning on interruption prevents potential resource leaks and ensures consistent state management, though the impact is moderate since interruption is an exceptional case.

Low
Suggestions up to commit 726b7e6
CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure latch countdown on all exceptions

If executeWithCalcite throws an exception before invoking the listener callback, the
latch will never count down, causing the subsequent latch.await() to block
indefinitely. Wrap the executeWithCalcite call in a try-catch block and ensure
latch.countDown() is called in all error paths.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [255-282]

 AtomicReference<ExecutionEngine.QueryResponse> queryResponseRef = new AtomicReference<>();
 AtomicReference<QueryProfile> profileRef = new AtomicReference<>();
 AtomicReference<Exception> errorRef = new AtomicReference<>();
 CountDownLatch latch = new CountDownLatch(1);
 
-executeWithCalcite(
-    plan,
-    queryType,
-    null,
-    new ResponseListener<>() {
-      @Override
-      public void onResponse(ExecutionEngine.QueryResponse response) {
-        ...
-        latch.countDown();
-      }
+try {
+  executeWithCalcite(
+      plan,
+      queryType,
+      null,
+      new ResponseListener<>() {
+        @Override
+        public void onResponse(ExecutionEngine.QueryResponse response) {
+          ...
+          latch.countDown();
+        }
 
-      @Override
-      public void onFailure(Exception e) {
-        errorRef.set(e);
-        latch.countDown();
-      }
-    });
+        @Override
+        public void onFailure(Exception e) {
+          errorRef.set(e);
+          latch.countDown();
+        }
+      });
+} catch (Exception e) {
+  errorRef.set(e);
+  latch.countDown();
+}
Suggestion importance[1-10]: 9

__

Why: If executeWithCalcite throws before invoking the listener, the latch never counts down, causing latch.await() to block indefinitely. Wrapping in try-catch and ensuring countdown in all error paths prevents deadlock.

High
Prevent null pointer on empty builder

The code assumes context.relBuilder.peek() will not be null after
unresolved.accept(), but if the accept method doesn't push any nodes, this will
throw a NoSuchElementException. Add a null/empty check before accessing peek() to
prevent crashes when tracking is enabled for nodes that don't produce RelNodes.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [222-232]

 if (context.isTrackingEnabled()) {
   int idBefore = context.relBuilder.size() > 0 ? context.relBuilder.peek().getId() : -1;
   RelNode result = unresolved.accept(this, context);
-  int idAfter = context.relBuilder.peek().getId();
-  java.util.List<Integer> producedIds = new java.util.ArrayList<>();
-  for (int id = idBefore + 1; id <= idAfter; id++) {
-    producedIds.add(id);
+  if (context.relBuilder.size() > 0) {
+    int idAfter = context.relBuilder.peek().getId();
+    java.util.List<Integer> producedIds = new java.util.ArrayList<>();
+    for (int id = idBefore + 1; id <= idAfter; id++) {
+      producedIds.add(id);
+    }
+    context.recordMapping(unresolved.getClass().getSimpleName(), producedIds);
   }
-  context.recordMapping(unresolved.getClass().getSimpleName(), producedIds);
   return result;
 }
Suggestion importance[1-10]: 8

__

Why: The code assumes context.relBuilder.peek() will not throw after unresolved.accept(), but if no nodes are pushed, peek() will throw NoSuchElementException. Adding a size check prevents a potential crash in tracking mode.

Medium
Add timeout to prevent indefinite blocking

The latch.await() call blocks indefinitely without a timeout, which can cause the
thread to hang forever if the async callback never completes. Add a timeout to
await() and handle the timeout case to prevent resource exhaustion and improve
system resilience.

core/src/main/java/org/opensearch/sql/executor/QueryService.java [284-290]

 try {
-  latch.await();
+  if (!latch.await(60, java.util.concurrent.TimeUnit.SECONDS)) {
+    listener.onFailure(new RuntimeException("Query execution timed out after 60 seconds"));
+    return;
+  }
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   listener.onFailure(new RuntimeException("Interrupted while waiting for query execution", e));
   return;
 }
Suggestion importance[1-10]: 7

__

Why: The latch.await() blocks indefinitely without a timeout. If the async callback never completes, the thread hangs forever. Adding a timeout improves resilience and prevents resource exhaustion.

Medium

Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
@Krish-Gandhi Krish-Gandhi force-pushed the feature/analyze-endpoint branch from 726b7e6 to 50ddb92 Compare June 25, 2026 17:58
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ff87829.

PathLineSeverityDescription
core/src/main/java/org/opensearch/sql/executor/QueryService.java280mediumThe analyze endpoint silently routes existing `profile: true` requests to the new analyze handler (`transformedRequest.analyze() || transformedRequest.profile()`), changing the response format for all current profile consumers. While the profile sub-object is preserved, additional fields (logicalPlan, physicalPlan, operator_tree, datarows, schema, etc.) are now returned unconditionally. Consumers expecting the old profile-only response shape may inadvertently expose or log internal query plan details (table structures, filter predicates, cost estimates, row counts) that were not previously surfaced.
core/src/main/java/org/opensearch/sql/executor/QueryService.java302lowCountDownLatch.await() is called with no timeout. If the inner executeWithCalcite callback never fires (e.g., thread pool exhaustion or a hung query), the calling thread blocks indefinitely, potentially exhausting the thread pool and causing a denial-of-service condition.
core/src/main/java/org/opensearch/sql/executor/QueryService.java358lowThe analyze path executes the user query twice — once via executeWithCalcite for profiling and a second time via OpenSearchRelRunners.run() for plan capture — doubling datastore load per analyze request. On expensive queries this doubles resource consumption, which could be exploited to amplify resource usage compared to normal query execution.

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 0 | Medium: 1 | Low: 2


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 50ddb92

Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 127dfe7

@Krish-Gandhi Krish-Gandhi marked this pull request as draft June 25, 2026 21:56
@Krish-Gandhi Krish-Gandhi marked this pull request as ready for review June 30, 2026 16:21
Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 727e3da

@ahkcs

ahkcs commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Let's update the endpoint.md(https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/interfaces/endpoint.md) and integ-test/src/yamlRestTest/resources/rest-api-spec/test/api/ppl.profile.yml file for our profile API change

@ahkcs

ahkcs commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Let's also document the current limitation for join, also we can provide fallback option to original profile API when we meet limitations

…' on complex queries

Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 7f164a0

…, which is not a part of this PR

Signed-off-by: Krish Gandhi <kjg2352@gmail.com>
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit ff87829

@Krish-Gandhi

Copy link
Copy Markdown
Author

Updated docs/user/ppl/interfaces/endpoint.md to add documentation for current state of analyze and update note for profile endpoint, as well as a note about when the logic of analyze falls apart and fallback to profile. Also updated integ-test/src/yamlRestTest/resources/rest-api-spec/test/api/ppl.profile.yml and added integ-test/src/yamlRestTest/resources/rest-api-spec/test/api/ppl.analyze.yml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Support analyze alongside explain

3 participants