Optimize XML tool parsers with incremental streaming and fast-path buffering #4664

Draft
lvhan028 wants to merge 1 commit into InternLM:main from lvhan028:optimize-xml-tool-parser

Conversation

lvhan028 commented Jun 9, 2026

Collaborator

Motivation

XML tool parsers (GLM47, Qwen3Coder) run on the API server after each streamed decode step. The previous implementation rescanned the entire accumulated payload on every chunk and re-coerced all parameters, giving O(n²) CPU cost as tool argument values grow or arrive token-by-token.

In real streaming workloads, especially long parameter values split across hundreds or thousands of small chunks, this added unnecessary CPU overhead on the API server hot path. The goal is to keep parsing semantics identical while reducing total parsing work to roughly O(n), so that per-chunk work scales with the size of the new chunk rather than the whole payload, and to skip parsing entirely while plain value text is still being buffered.
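To make the quadratic cost concrete, here is a toy cost model. It only counts how many characters each strategy examines and is not taken from the parser code; both function names are hypothetical.

```python
def chars_scanned_full_rescan(chunks):
    """Cost model: every chunk triggers a scan of the whole accumulated payload."""
    payload_len = 0
    scanned = 0
    for chunk in chunks:
        payload_len += len(chunk)
        scanned += payload_len  # the entire payload is examined again
    return scanned


def chars_scanned_incremental(chunks):
    """Cost model: only text past a saved cursor is examined."""
    cursor = 0
    payload_len = 0
    scanned = 0
    for chunk in chunks:
        payload_len += len(chunk)
        scanned += payload_len - cursor  # just the newly arrived text
        cursor = payload_len
    return scanned


# A 2048-character value arriving one character at a time.
chunks = ["x"] * 2048
full = chars_scanned_full_rescan(chunks)
incremental = chars_scanned_incremental(chunks)
```

Under this model the full rescan examines 2048 * 2049 / 2 characters (about 2.1 million), while the cursor-based scan examines 2048. Real speedups are smaller because scanning is only part of the per-chunk work.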

Modification

  • xml_tool_parser.py: Added incremental streaming infrastructure:

    • Payload buffering via _payload_parts with deferred join
    • Fast path for in-progress parameter values that contain no XML structural characters (<, >, /)
    • Coercion cache (_coerced_args) so only newly completed parameters are coerced
    • Cheaper string coercion: skip json.loads unless the value looks like a JSON string
  • glm47_tool_parser.py / qwen3coder_tool_parser.py:

    • Incremental parse state (function name, open param/value, scan cursor)
    • Complete open parameters only when closing tags arrive
    • Fix scan cursor advancing past incomplete parameter headers during split-chunk streaming
    • Qwen3Coder: strip outer <tool_call> tag once instead of on every chunk
  • benchmark/benchmark_xml_tool_parser.py: Added benchmark comparing legacy (main-branch) vs optimized parsers on long-value and tokenized streaming scenarios (~3.5x average speedup on 2048-char values).
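The scan-cursor behavior listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the class name, the `<parameter=name>` tag syntax, and the attribute names are all assumptions. The point it shows is that the cursor moves only past fully parsed parameters, so a header split across chunks is retried on the next chunk instead of being skipped.

```python
PARAM_OPEN = "<parameter="    # assumed tag syntax, for illustration only
PARAM_CLOSE = "</parameter>"


class IncrementalParamScanner:
    """Hypothetical scanner that resumes from a cursor between chunks."""

    def __init__(self):
        self.buffer = ""
        self.cursor = 0       # everything before this index is fully consumed
        self.completed = {}   # parameter name -> raw value text

    def feed(self, chunk: str) -> None:
        self.buffer += chunk
        while True:
            start = self.buffer.find(PARAM_OPEN, self.cursor)
            if start == -1:
                return  # no (complete) opening marker yet
            header_end = self.buffer.find(">", start)
            if header_end == -1:
                # Header is split across chunks: keep the cursor where it is
                # so the same header is retried when more text arrives.
                return
            close = self.buffer.find(PARAM_CLOSE, header_end + 1)
            if close == -1:
                return  # value still open; finalize only when the close tag arrives
            name = self.buffer[start + len(PARAM_OPEN):header_end]
            self.completed[name] = self.buffer[header_end + 1:close].strip("\n")
            self.cursor = close + len(PARAM_CLOSE)


scanner = IncrementalParamScanner()
for piece in ["<param", "eter=ci", "ty>\nPar", "is\n</parameter><parameter=n>", "3</parameter>"]:
    scanner.feed(piece)
```

After the last chunk, `scanner.completed` holds both parameters even though the first header arrived in three pieces. A fuller implementation would also remember where the open value starts so the close-tag search does not restart from the header each time.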

Copilot AI review requested due to automatic review settings June 9, 2026 14:41
@lvhan028 lvhan028 marked this pull request as draft June 9, 2026 14:41

Copilot AI left a comment

Contributor

Pull request overview

This PR optimizes the XML-like tool parsers used during streamed decoding by adding incremental parsing state, buffering plain value text to avoid repeated full rescans, and caching schema coercion so only newly completed parameters are coerced.

Changes:

  • Added shared incremental streaming infrastructure to XmlToolParser (payload part buffering, “in-progress value” fast-path, and coerced-args caching).
  • Reworked Glm47ToolParser and Qwen3CoderToolParser to maintain incremental parse state and only finalize parameters when closing tags arrive.
  • Added a benchmark script to compare legacy vs optimized implementations under long-value and tokenized streaming scenarios.
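The coercion caching described in these changes might look roughly like the sketch below. The names `coerce_value` and `CoercionCache`, the schema-type strings, and the "looks like a JSON string" test (a value wrapped in double quotes) are assumptions for illustration; the PR's actual heuristic is not shown on this page.

```python
import json


def coerce_value(raw: str, schema_type: str):
    """Coerce raw parameter text using a (hypothetical) schema type name."""
    if schema_type == "string":
        # Cheap check first: call json.loads only when the text is wrapped
        # in double quotes, i.e. it could be a JSON string literal.
        if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                return raw
        return raw
    try:
        return json.loads(raw)  # numbers, booleans, arrays, objects
    except json.JSONDecodeError:
        return raw


class CoercionCache:
    """Coerces each completed parameter once and reuses the result afterwards."""

    def __init__(self, schema_types):
        self._schema_types = schema_types
        self._coerced = {}
        self.coercions = 0  # counts how often the coercion path actually ran

    def update(self, completed_raw):
        for name, raw in completed_raw.items():
            if name in self._coerced:
                continue  # already coerced on an earlier chunk
            schema_type = self._schema_types.get(name, "string")
            self._coerced[name] = coerce_value(raw, schema_type)
            self.coercions += 1
        return dict(self._coerced)


cache = CoercionCache({"city": "string", "days": "integer"})
first = cache.update({"city": "Paris"})
second = cache.update({"city": "Paris", "days": "3"})
```

Here `city` is coerced once even though it is passed to `update` twice, which is the property that keeps per-chunk coercion work proportional to newly completed parameters.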

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| lmdeploy/serve/parsers/tool_parser/xml_tool_parser.py | Adds shared buffering / incremental parsing hooks and coercion caching for streamed XML tool parsing. |
| lmdeploy/serve/parsers/tool_parser/glm47_tool_parser.py | Implements incremental state tracking for GLM-4.7 XML tool payloads. |
| lmdeploy/serve/parsers/tool_parser/qwen3coder_tool_parser.py | Implements incremental state tracking and outer-tag stripping changes for Qwen3Coder XML tool payloads. |
| benchmark/benchmark_xml_tool_parser.py | Adds a local benchmark harness comparing legacy vs optimized streaming behavior and performance. |

Comment on lines +59 to +63
def _payload_text(self) -> str:
    if not self._payload_parts:
        return self._tool_payload
    return ''.join(self._payload_parts)
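For context, a self-contained sketch of the deferred-join idea behind an accessor like `_payload_text` is given below. The `PayloadBuffer` class and its `joins` counter are hypothetical, and the sketch assumes the parser is already inside an open parameter value: chunks are appended to a list, and the join happens only when a chunk contains one of the structural characters `<`, `>`, or `/`.

```python
class PayloadBuffer:
    """Hypothetical buffer: collects chunks and joins them only on demand."""

    STRUCTURAL = frozenset("<>/")

    def __init__(self):
        self._parts = []
        self.joins = 0  # how many times the full text was materialized

    def append(self, chunk: str) -> bool:
        """Store a chunk; return True if it may contain XML structure."""
        self._parts.append(chunk)
        # Fast path: plain value text cannot open or close a tag, so the
        # caller can skip parsing (and joining) for this chunk.
        return not self.STRUCTURAL.isdisjoint(chunk)

    def text(self) -> str:
        self.joins += 1
        return "".join(self._parts)


buf = PayloadBuffer()
latest = None
for piece in ["The quick ", "brown fox ", "jumps", "</parameter>"]:
    if buf.append(piece):
        latest = buf.text()  # slow path: join and hand off to the parser
```

Only the final chunk triggers a join here. A plain value that happens to contain `/`, such as a file path, would take the slow path; that costs time but does not change the result.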

Comment on lines +105 to +106
search_idx = 0
while True:
Comment on lines +112 to +113
search_idx = 0
while True:
Comment on lines +61 to +68
def _strip_outer_open_tag_once(self, payload: str) -> str:
    if self._qwen_open_tag_stripped:
        return payload
    open_tag = self.get_tool_open_tag()
    if open_tag and payload.startswith(open_tag):
        self._qwen_open_tag_stripped = True
        return payload[len(open_tag):]
    return payload
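A standalone version of this strip-once logic, useful for checking its behavior in isolation, could look like the following. The `OuterTagStripper` class is a hypothetical stand-in for the parser, and the default `<tool_call>` tag is taken from the PR description rather than from `get_tool_open_tag()`.

```python
class OuterTagStripper:
    """Hypothetical stand-in that mirrors the strip-once helper shown above."""

    def __init__(self, open_tag: str = "<tool_call>"):
        self._open_tag = open_tag
        self._stripped = False

    def strip_once(self, payload: str) -> str:
        if self._stripped:
            return payload  # already handled; later payloads pass through
        if self._open_tag and payload.startswith(self._open_tag):
            self._stripped = True
            return payload[len(self._open_tag):]
        return payload  # tag not seen (or not complete) yet; try again later


stripper = OuterTagStripper()
partial = stripper.strip_once("<tool_")                  # tag still incomplete
first = stripper.strip_once("<tool_call><function=f>")   # tag removed here
later = stripper.strip_once("<tool_call>literal text")   # not stripped again
```

In this sketch an incomplete tag is returned unchanged and the flag stays unset, so the strip is attempted again once the caller passes a payload that starts with the full tag.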