fix: drop processing instructions in HTML parser by assinscreedFC · Pull Request #4361 · Unstructured-IO/unstructured

assinscreedFC · 2026-06-05T11:50:10Z

Summary

partition_html() crashes when the input contains a processing instruction such as <?xml version="1.0" encoding="UTF-8"?>:

AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'

Closes #4358.

Root cause

The HTML parser (unstructured/partition/html/parser.py) assigns custom element classes via ElementNamespaceClassLookup with an ElementDefaultClassLookup(element=DefaultElement) fallback. That fallback only covers elements — a processing instruction reaches the tree as a plain lxml.etree._ProcessingInstruction node, which lacks the parser's custom interface (.is_phrasing). Phrasing/flow iteration then hits while q and q[0].is_phrasing: and raises AttributeError.
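The gap described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: DefaultElement is a hypothetical stand-in that exposes only is_phrasing, and comment/PI stripping is left off so that non-element nodes stay in the tree. Whether the <?xml ...?> node surfaces as a processing instruction can depend on the lxml/libxml2 build, so the sketch only separates nodes that carry the custom interface from nodes that do not.

```python
from lxml import etree


class DefaultElement(etree.ElementBase):
    """Hypothetical stand-in for the parser's custom element class."""

    @property
    def is_phrasing(self) -> bool:
        # Placeholder value; the real classification logic is not shown here.
        return False


def build_parser(**parser_kwargs) -> etree.HTMLParser:
    """Return an HTMLParser whose fallback lookup only covers elements."""
    parser = etree.HTMLParser(**parser_kwargs)
    fallback = etree.ElementDefaultClassLookup(element=DefaultElement)
    parser.set_element_class_lookup(etree.ElementNamespaceClassLookup(fallback))
    return parser


def split_by_interface(html: bytes, parser: etree.HTMLParser):
    """Partition tree nodes by whether they expose .is_phrasing."""
    root = etree.fromstring(html, parser)
    with_interface, without_interface = [], []
    for node in root.iter():  # iter() also yields comments and PIs
        target = with_interface if hasattr(node, "is_phrasing") else without_interface
        target.append(node)
    return with_interface, without_interface


SAMPLE = b'<html><body><p>Hello</p><?xml version="1.0"?><p>World</p></body></html>'

if __name__ == "__main__":
    ok, missing = split_by_interface(SAMPLE, build_parser())
    print("nodes with .is_phrasing:", [n.tag for n in ok])
    print("nodes without it:", [type(n).__name__ for n in missing])
```

Any node that lands in the second list would raise AttributeError as soon as iteration code reads .is_phrasing on it, which is the failure mode reported here.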

Fix

html_parser = etree.HTMLParser(remove_comments=True, remove_pis=True)

Add remove_pis=True, mirroring the existing remove_comments=True. Processing instructions carry no rendered text, so stripping them at parse time matches browser behavior and keeps non-element nodes out of the element iteration entirely (a single-point fix rather than guarding every iteration site).
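As a rough illustration of the effect, the sketch below parses a small document with both flags set and checks that only element nodes remain. The helper names (parse_html, non_element_nodes) and the sample input are assumptions made for this example and do not appear in the repository.

```python
from lxml import etree

SAMPLE = b'<html><body><p>Hello</p><?xml version="1.0"?><p>World</p></body></html>'


def parse_html(html: bytes, *, remove_pis: bool):
    """Parse HTML, always dropping comments and optionally dropping PIs."""
    parser = etree.HTMLParser(remove_comments=True, remove_pis=remove_pis)
    return etree.fromstring(html, parser)


def non_element_nodes(root) -> list:
    """Return nodes whose tag is not a string (comments, PIs, entities)."""
    return [node for node in root.iter() if not isinstance(node.tag, str)]


if __name__ == "__main__":
    root = parse_html(SAMPLE, remove_pis=True)
    print("non-element nodes:", non_element_nodes(root))
    print("text:", "".join(root.itertext()))
```

Because the filtering happens inside the parser, code that walks the tree afterwards needs no isinstance guards, which is the single-point property the fix relies on.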

Tests

- Added test_partition_html_ignores_processing_instructions, a regression test for #4358 ("bug/xml indentation causes parsing error"): a <?xml ...?> inside the body no longer crashes, and the surrounding text is extracted. It fails on main (AttributeError) and passes with the fix.
- test_parser.py + test_partition.py suites: 236 passed, no new failures. One unrelated pre-existing failure, test_partition_html_on_ideas_page, fails identically on main in this environment.
- ruff check / ruff format --check clean; bumped __version__ and CHANGELOG.md to 0.22.32 per CONTRIBUTING.

Summary by cubic

Fixes a crash in partition_html() and partition_md() when documents include processing instructions like <?xml ...?>. We now drop them during HTML parsing to prevent AttributeError and match browser behavior.

Bug Fixes
- Pass remove_pis=True to lxml.etree.HTMLParser to drop processing instructions and avoid _ProcessingInstruction nodes without .is_phrasing. Fixes #4358 ("bug/xml indentation causes parsing error").
- Add regression tests for HTML and Markdown entry points to confirm PIs are ignored.

Written for commit aa7c725. Summary will update on new commits.

partition_html() crashed with 'lxml.etree._ProcessingInstruction object has no attribute is_phrasing' when the document contained a processing instruction such as '<?xml ...?>'.

The HTML parser's element-class lookup only maps real elements, so a processing instruction reaches the tree as a plain lxml _ProcessingInstruction node without the parser's custom interface, then crashes the phrasing/flow iteration.

Pass remove_pis=True to the HTMLParser so processing instructions are stripped during parsing, mirroring the existing remove_comments=True. They carry no rendered text, so dropping them matches browser behavior.

Fixes Unstructured-IO#4358

cubic-dev-ai

No issues found across 4 files

Shadow auto-approve: would auto-approve. This PR adds a single parameter, remove_pis=True, to the HTML parser, mirroring the existing remove_comments=True, to safely drop processing instructions and prevent an AttributeError crash with no change to business logic or data flow.

…#4358) The markdown entry path routes through the same HTML parser, so add a regression test that exercises partition_md directly with a <?xml ?> PI.

cubic-dev-ai

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would auto-approve. This change adds a single parameter to drop processing instructions during HTML parsing, fixing a specific crash without affecting other functionality, and includes regression tests.

assinscreedFC mentioned this pull request Jun 5, 2026

bug/xml indentation causes parsing error #4358

Open

cubic-dev-ai Bot reviewed Jun 5, 2026

test: cover partition_md processing-instruction path (Unstructured-IO…

aa7c725

cubic-dev-ai Bot reviewed Jun 5, 2026

assinscreedFC wants to merge 2 commits into Unstructured-IO:main from assinscreedFC:fix/html-parser-processing-instruction
