+ "details": "### Summary\n\njusthtml through 1.9.1 allows denial of service via deeply nested HTML. During parsing, `JustHTML.__init__()` always reaches `TreeBuilder.finish()`, which unconditionally calls `_populate_selectedcontent()`. That function recursively traverses the DOM via `_find_elements()` / `_find_element()` without a depth bound, allowing attacker-controlled deeply nested input to trigger an unhandled `RecursionError` on CPython. Depending on the host application's exception handling, this can abort parsing, fail requests, or terminate a worker/process.\n\n### Details\n\n`TreeBuilder.finish()` ([`treebuilder.py#L476`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/treebuilder.py#L476)) unconditionally calls `_populate_selectedcontent(self.document)` at [line 494](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/treebuilder.py#L494). `_populate_selectedcontent()` ([`treebuilder.py#L1243`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/treebuilder.py#L1243)) calls `_find_elements()` ([`treebuilder.py#L1280`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/treebuilder.py#L1280)) to recursively search the DOM tree for `<select>` elements:\n\n```python\ndef _find_elements(self, node: Any, name: str, result: list[Any]) -> None:\n    \"\"\"Recursively find all elements with given name.\"\"\"\n    if node.name == name:\n        result.append(node)\n    if node.has_child_nodes():\n        for child in node.children:\n            self._find_elements(child, name, result)  # recursive call\n```\n\nWhen the DOM tree depth exceeds CPython's default recursion limit (1000), this raises an unhandled `RecursionError`. 
The full call path is:\n\n`JustHTML(html)` → `tokenizer.run()` → `tree_builder.finish()` → `_populate_selectedcontent(document)` → `_find_elements(root, \"select\", selects)` (recursive)\n\nDeeply nested DOM trees can be produced by nesting `<div>` tags ~1000 levels deep. On CPython with the default recursion limit, approximately 11 KB of `<div>` nesting is sufficient to trigger the error. The exact depth threshold is environment-dependent (CPython version, recursion limit setting, call stack depth at invocation).\n\nAdditional recursive functions are affected on already-parsed deep trees:\n- `Node.clone_node(deep=True)` ([`node.py#L523`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/node.py#L523)) — called during sanitization\n- `_node_to_html()` ([`serialize.py#L580`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/serialize.py#L580)) — used by `to_html(pretty=True)`\n- `_to_markdown_walk()` ([`node.py#L817`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/node.py#L817)) — used by `to_markdown()`\n\nNote: the library already uses iterative traversal in several comparable functions (e.g., `_node_to_html_compact` at [`serialize.py#L197`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/serialize.py#L197), `_to_text_collect` at [`node.py#L161`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/node.py#L161), `_is_blocky_element` at [`serialize.py#L405`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/serialize.py#L405), `apply_to_children` at [`transforms.py#L1642`](https://github.com/EmilStenstrom/justhtml/blob/a866b6077770d9ec4cb6b6f9bfe7c918f98455e4/src/justhtml/transforms.py#L1642)), demonstrating the correct pattern.\n\n### PoC\n\n```python\nfrom 
justhtml import JustHTML\n\nhtml = \"<div>\" * 1000 + \"x\" + \"</div>\" * 1000\ndoc = JustHTML(html)  # raises RecursionError\n```\n\nTest environment: CPython 3.14.3, macOS ARM64 (Apple Silicon), justhtml 1.9.1, default recursion limit (1000)\n\n| Input | Size | Result |\n|-------|------|--------|\n| `<div>` × 500 | 5,501 bytes | OK |\n| `<div>` × 800 | 8,801 bytes | OK |\n| `<div>` × 1000 | 11,001 bytes | RecursionError |\n\nThe error occurs with both `sanitize=True` (default) and `sanitize=False`.\n\n### Impact\n\nAn attacker who can supply HTML for parsing can trigger an unhandled `RecursionError` during `JustHTML()` construction. The error is triggered during construction and is not avoided by `justhtml` configuration alone; mitigating it requires host-application exception handling or input constraints. Depending on the host application's exception handling, this can abort parsing, fail requests, or terminate a worker/process.\n\n### Suggested Fix\n\nConvert the recursive tree traversal functions to iterative implementations using an explicit stack. Example for `_find_elements`:\n\n```python\ndef _find_elements(self, node: Any, name: str, result: list[Any]) -> None:\n    stack = [node]\n    while stack:\n        current = stack.pop()\n        if current.name == name:\n            result.append(current)\n        if current.has_child_nodes():\n            stack.extend(reversed(current.children))\n```\n\nThe same conversion should be applied to `_find_element`, `clone_node(deep=True)`, `_node_to_html()`, and `_to_markdown_walk()`.",