fix: drop processing instructions in HTML parser #4361
Open
assinscreedFC wants to merge 2 commits into
partition_html() crashed with 'lxml.etree._ProcessingInstruction object has no attribute is_phrasing' when the document contained a processing instruction such as '<?xml ...?>'. The HTML parser's element-class lookup only maps real elements, so a processing instruction reaches the tree as a plain lxml _ProcessingInstruction node without the parser's custom interface, then crashes the phrasing/flow iteration. Pass remove_pis=True to the HTMLParser so processing instructions are stripped during parsing, mirroring the existing remove_comments=True. They carry no rendered text, so dropping them matches browser behavior. Fixes Unstructured-IO#4358
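The effect of the flag described in this commit message can be sketched with plain lxml, outside of unstructured. This is a minimal illustration, not the project's parser: the sample document and the helper names `parse` and `non_element_nodes` are assumptions made for the example. Which node type appears without the flag may vary with the installed lxml/libxml2 version, so no output is asserted for that case.

```python
from lxml import etree

# Sample input (hypothetical): a processing instruction inside the body.
HTML = b'<html><body><p>Hello</p><?xml version="1.0"?><p>World</p></body></html>'


def parse(remove_pis: bool) -> etree._Element:
    """Parse HTML with lxml, optionally dropping processing instructions."""
    parser = etree.HTMLParser(remove_comments=True, remove_pis=remove_pis)
    return etree.fromstring(HTML, parser)


def non_element_nodes(root) -> list:
    """Return nodes whose tag is not a string (PIs, comments, entities)."""
    return [node for node in root.iter() if not isinstance(node.tag, str)]


kept = parse(remove_pis=False)
dropped = parse(remove_pis=True)

# Compare what survives in the tree with and without the flag.
print("non-element nodes without remove_pis:", non_element_nodes(kept))
print("non-element nodes with remove_pis:", non_element_nodes(dropped))
print("paragraph text with remove_pis:", [p.text for p in dropped.iter("p")])
```

With `remove_pis=True` the tree contains only real elements, so code that assumes an element interface on every node has nothing unexpected to trip over.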
No issues found across 4 files
Shadow auto-approve: would auto-approve. This PR adds a single parameter remove_pis=True to the HTML parser, mirroring the existing remove_comments=True, to safely drop processing instructions and prevent an AttributeError crash with no change to business logic or data flow.
…#4358) The markdown entry path routes through the same HTML parser, so add a regression test that exercises partition_md directly with a <?xml ?> PI.
0 issues found across 1 file (changes from recent commits).
Shadow auto-approve: would auto-approve. This change adds a single parameter to drop processing instructions during HTML parsing, fixing a specific crash without affecting other functionality, and includes regression tests.
Summary
`partition_html()` crashes with `AttributeError` when the input contains a processing instruction such as `<?xml version="1.0" encoding="UTF-8"?>`.
Closes #4358.
Root cause
The HTML parser (`unstructured/partition/html/parser.py`) assigns custom element classes via `ElementNamespaceClassLookup` with an `ElementDefaultClassLookup(element=DefaultElement)` fallback. That fallback only covers elements, so a processing instruction reaches the tree as a plain `lxml.etree._ProcessingInstruction` node, which lacks the parser's custom interface (`.is_phrasing`). Phrasing/flow iteration then hits `while q and q[0].is_phrasing:` and raises `AttributeError`.
Fix
Add `remove_pis=True` to the `HTMLParser` constructor, mirroring the existing `remove_comments=True`. Processing instructions carry no rendered text, so stripping them at parse time matches browser behavior and keeps non-element nodes out of the element iteration entirely (a single-point fix rather than guarding every iteration site).
Tests
- `test_partition_html_ignores_processing_instructions` (regression for #4358, "bug/xml indentation causes parsing error"): a `<?xml ...?>` inside the body no longer crashes and the surrounding text is extracted. Fails on `main` (`AttributeError`), passes with the fix.
- `test_parser.py` + `test_partition.py` suites: 236 passed, no new failures (one unrelated pre-existing failure, `test_partition_html_on_ideas_page`, fails identically on `main` in this environment).
- `ruff check` / `ruff format --check` clean; bumped `__version__` + `CHANGELOG.md` to 0.22.32 per CONTRIBUTING.
Summary by cubic
Fixes a crash in `partition_html()` and `partition_md()` when documents include processing instructions like `<?xml ...?>`. We now drop them during HTML parsing to prevent `AttributeError` and match browser behavior.
- Passes `remove_pis=True` to `lxml.etree.HTMLParser` to drop processing instructions and avoid `_ProcessingInstruction` nodes without `.is_phrasing`. Fixes #4358, "bug/xml indentation causes parsing error".
Written for commit aa7c725. Summary will update on new commits.
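The root cause named in the description, an element-only class-lookup fallback, can be sketched in isolation. This is a simplified illustration under stated assumptions: the `DefaultElement` class here is a hypothetical stand-in with a toy `is_phrasing` rule, not the class from `unstructured/partition/html/parser.py`, and `build_parser` and `nodes_missing_interface` are names invented for the example.

```python
from lxml import etree

# Sample input (hypothetical): a processing instruction inside the body.
HTML = b'<html><body><p>Hello</p><?xml version="1.0"?><p>World</p></body></html>'


class DefaultElement(etree.ElementBase):
    """Hypothetical stand-in for the parser's custom element class."""

    @property
    def is_phrasing(self) -> bool:
        # Toy rule for illustration only; the real parser's logic differs.
        return self.tag in {"span", "b", "i", "em", "a"}


def build_parser(remove_pis: bool) -> etree.HTMLParser:
    """Build an HTMLParser whose elements use DefaultElement."""
    parser = etree.HTMLParser(remove_comments=True, remove_pis=remove_pis)
    # element= covers real elements only; PI nodes keep lxml's built-in
    # class because no pi= class is supplied.
    parser.set_element_class_lookup(
        etree.ElementDefaultClassLookup(element=DefaultElement)
    )
    return parser


def nodes_missing_interface(html: bytes, remove_pis: bool) -> list:
    """Return every node in the tree that lacks .is_phrasing."""
    root = etree.fromstring(html, build_parser(remove_pis))
    return [node for node in root.iter() if not hasattr(node, "is_phrasing")]


print("missing interface, PIs kept:", nodes_missing_interface(HTML, False))
print("missing interface, PIs dropped:", nodes_missing_interface(HTML, True))
```

Any node reported by the first call is one that an unguarded `q[0].is_phrasing` access would fail on. Dropping processing instructions at parse time empties that list without touching the iteration code, which is the single-point design the description argues for.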