fix: drop processing instructions in HTML parser #4361

Open
assinscreedFC wants to merge 2 commits into Unstructured-IO:main from assinscreedFC:fix/html-parser-processing-instruction

assinscreedFC commented Jun 5, 2026

Summary

partition_html() crashes when the input contains a processing instruction such as <?xml version="1.0" encoding="UTF-8"?>:

AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'

Closes #4358.

Root cause

The HTML parser (unstructured/partition/html/parser.py) assigns custom element classes via ElementNamespaceClassLookup with an ElementDefaultClassLookup(element=DefaultElement) fallback. That fallback only covers elements — a processing instruction reaches the tree as a plain lxml.etree._ProcessingInstruction node, which lacks the parser's custom interface (.is_phrasing). Phrasing/flow iteration then hits while q and q[0].is_phrasing: and raises AttributeError.
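The gap in the fallback lookup can be reproduced in isolation. The sketch below is an illustration rather than the code in parser.py: DefaultElement is a stand-in class with a made-up is_phrasing attribute, and it uses etree.XMLParser instead of the HTML parser so that the processing instruction is reliably kept as a node.

```python
from lxml import etree


class DefaultElement(etree.ElementBase):
    """Stand-in for the parser's fallback element class (illustrative only)."""

    is_phrasing = False


# The fallback lookup receives a class for elements only; nothing is passed for `pi`.
lookup = etree.ElementDefaultClassLookup(element=DefaultElement)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring("<root><?demo data?><p>text</p></root>", parser)
pi_node, p_node = root[0], root[1]

# The element gets the custom class; the processing instruction does not.
print(type(p_node).__name__, hasattr(p_node, "is_phrasing"))
print(type(pi_node).__name__, hasattr(pi_node, "is_phrasing"))
```

ElementDefaultClassLookup also accepts a pi= class, so mapping processing instructions to a custom class would be another option. This PR removes them at parse time instead.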

Fix

html_parser = etree.HTMLParser(remove_comments=True, remove_pis=True)

Add remove_pis=True, mirroring the existing remove_comments=True. Processing instructions carry no rendered text, so stripping them at parse time matches browser behavior and keeps non-element nodes out of the element iteration entirely (a single-point fix rather than guarding every iteration site).
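A minimal way to check the effect of the flag outside the library, assuming only lxml is installed. The sample markup and the variable names other than html_parser are made up for illustration; the parser arguments are the ones from the fix.

```python
from lxml import etree

# Hypothetical sample input: a processing instruction sitting inside the body.
html = (
    b"<html><body><p>Hello</p>"
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<p>World</p></body></html>"
)

# Same parser arguments as in the fix.
html_parser = etree.HTMLParser(remove_comments=True, remove_pis=True)
root = etree.fromstring(html, html_parser)

# iter() with no tag filter visits every node, so a surviving PI would show up here.
leftover_pis = [n for n in root.iter() if isinstance(n, etree._ProcessingInstruction)]
text = "".join(root.itertext())

print(leftover_pis)
print(text)
```

With both flags set, every node that the iteration sees should be a regular element, which is what the phrasing/flow code expects.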

Tests

  • Added test_partition_html_ignores_processing_instructions (regression for #4358, "bug/xml indentation causes parsing error"): a <?xml ...?> inside the body no longer crashes and the surrounding text is extracted. Fails on main (AttributeError), passes with the fix.
  • test_parser.py + test_partition.py suites: 236 passed, no new failures (one unrelated pre-existing failure, test_partition_html_on_ideas_page, fails identically on main in this environment).
  • ruff check / ruff format --check clean; bumped __version__ + CHANGELOG.md to 0.22.32 per CONTRIBUTING.

Summary by cubic

Fixes a crash in partition_html() and partition_md() when documents include processing instructions like <?xml ...?>. We now drop them during HTML parsing to prevent AttributeError and match browser behavior.

  • Bug Fixes
    • Pass remove_pis=True to lxml.etree.HTMLParser to drop processing instructions and avoid _ProcessingInstruction nodes without .is_phrasing. Fixes #4358 ("bug/xml indentation causes parsing error").
    • Add regression tests for HTML and Markdown entry points to confirm PIs are ignored.

Written for commit aa7c725. Summary will update on new commits.

partition_html() crashed with "'lxml.etree._ProcessingInstruction' object
has no attribute 'is_phrasing'" when the document contained a processing
instruction such as '<?xml ...?>'. The HTML parser's element-class lookup
only maps real elements, so a processing instruction reaches the tree as a
plain lxml _ProcessingInstruction node without the parser's custom
interface, then crashes the phrasing/flow iteration.

Pass remove_pis=True to the HTMLParser so processing instructions are
stripped during parsing, mirroring the existing remove_comments=True. They
carry no rendered text, so dropping them matches browser behavior.

Fixes Unstructured-IO#4358

cubic-dev-ai (bot) left a comment

No issues found across 4 files

Shadow auto-approve: would auto-approve. This PR adds a single parameter remove_pis=True to the HTML parser, mirroring the existing remove_comments=True, to safely drop processing instructions and prevent an AttributeError crash with no change to business logic or data flow.

…#4358)

The markdown entry path routes through the same HTML parser, so add a
regression test that exercises partition_md directly with a <?xml ?> PI.

cubic-dev-ai (bot) left a comment

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would auto-approve. This change adds a single parameter to drop processing instructions during HTML parsing, fixing a specific crash without affecting other functionality, and includes regression tests.
