feat: derive category_depth from heading level in the v2 (ontology) HTML parser (ML-1328) #4360
Draft
qued wants to merge 3 commits into
…level
The v2 (ontology) HTML parser -- the path the VLM partitioner uses via
partition_html(html_parser_version="v2") -> ontology_to_unstructured_elements --
set category_depth to the DOM/recursion nesting depth. That encoded physical
layout (Page -> Column -> block) rather than logical hierarchy, so every flat
sibling heading collapsed to the same depth and a multi-column page bumped every
element's depth. It diverged from the v1 parser and from the documented
metadata contract (depth derives from native heading levels h1/h2/h3).
Changes:
- Add shared helper `category_depth_from_html_tag` in partition/common/metadata.py:
Title (ontology Title/Subtitle/Heading -> h1..h6) -> int(tag[1]) - 1
(h1->0, h2->1, ...); ListItem -> enclosing ol/ul/dl count; else None.
- v1 parser `_category_depth` now delegates to the shared helper (DRY; behavior
unchanged).
- v2 converter computes category_depth from the element's heading level (not
nesting). Layout containers and non-heading content get None.
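The mapping described in the Changes list can be sketched as below. This is a minimal illustration, not the helper's actual code: the function name `category_depth_sketch`, its signature, and the string-based `category` argument are hypothetical, and the handling of a Title on a non-heading tag is an assumption.

```python
from typing import Optional, Sequence

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")
LIST_TAGS = ("ol", "ul", "dl")


def category_depth_sketch(
    category: str, tag: str, ancestor_tags: Sequence[str] = ()
) -> Optional[int]:
    """Hypothetical stand-in for the shared helper; not the library's signature."""
    if category == "Title" and tag in HEADING_TAGS:
        # Heading level drives depth: h1 -> 0, h2 -> 1, ..., h6 -> 5.
        return int(tag[1]) - 1
    if category == "ListItem":
        # Depth is the number of enclosing list containers.
        return sum(1 for t in ancestor_tags if t in LIST_TAGS)
    # Layout containers and non-heading content carry no depth.
    # (Assumption: a Title on a non-heading tag also falls through to None.)
    return None
```

Because the result depends only on the tag and the list ancestors, wrapping the same heading in extra Page/Column containers does not change its depth.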
parent_id approach:
- Layout/container elements (Page, Column, Section, ...) keep their tree
parent_id so the physical layout structure is preserved.
- Content elements are emitted with parent_id=None so the shared
set_element_hierarchy post-processor (run by @apply_metadata on
partition_html) assigns a heading-based parent -- subsections now chain under
their enclosing section heading. (set_element_hierarchy skips elements that
already have a parent_id, hence the None.)
- unstructured_elements_to_ontology (the reverse path) no longer reads content
parent_id (now heading-based) to rebuild the tree; it reconstructs layout
nesting from the layout-container elements' retained tree parent_id plus
document order, via a container stack. Round-trip is preserved exactly.
- Inline-merge ("same level in the tree") previously reused category_depth as a
nesting signal; it now uses a transient, non-serialized _ontology_nesting_depth
attribute so heading-level category_depth doesn't change merge behavior.
Verified: category_depth follows heading level (h1->0, h2->1, h3->2); depth is
unaffected by multi-column layout; parent_id reflects section->subsection;
text_as_html, Page/Column containers, and table/form structure are preserved.
v2 now matches v1 on heading-level depth. Confirmed on the real Kaiser
EOC_CA_Individual-Family page 6 ("Cost Share Summary"->0, subsections->1,
"How to read..."->2, subsections chained under their section).
New tests cover a multi-level page and a multi-column page; existing v2, ontology
round-trip, and set_element_hierarchy tests stay green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Harden `unstructured_elements_to_ontology` against malformed input flagged by
the external review:
- Return an empty `Document` for empty input instead of raising `IndexError`
(pre-existing behavior; cheap, lossless guard).
- A layout container whose `parent_id` matches no currently-open container now
nests in the innermost open container rather than popping past valid ancestors
to the document root (which mis-nested subsequent content). For the real
producer (`ontology_to_unstructured_elements`) the parent is always an open
ancestor, so this only changes behavior for input that violates the documented
parent-before-child precondition, and it is lossless.
Add round-trip tests: empty input, multi-column container nesting, and a
container with an unknown `parent_id`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
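A rough sketch of the container-stack idea behind this hardening, under stated assumptions: `Node` and `rebuild_layout` are hypothetical names, the synthetic `"document"` root is invented for the example, and placing a container with `parent_id=None` directly under the root is a guess about top-level handling. It is not the converter's real code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    id: str
    is_container: bool
    parent_id: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def rebuild_layout(elements: List[Node]) -> Node:
    """Rebuild layout nesting from containers' parent_id plus document order."""
    root = Node(id="document", is_container=True)  # hypothetical synthetic root
    stack = [root]  # currently open containers, outermost first
    for el in elements:  # empty input just returns the empty root
        if not el.is_container:
            # Content ignores its heading-based parent_id for layout purposes.
            stack[-1].children.append(el)
            continue
        open_ids = [c.id for c in stack]
        if el.parent_id in open_ids:
            # Close containers until the declared parent is the innermost one.
            del stack[open_ids.index(el.parent_id) + 1:]
        elif el.parent_id is None:
            # Assumption: a container without a parent sits under the root.
            del stack[1:]
        # Otherwise the parent is unknown: keep the stack as-is so the
        # container nests in the innermost open container.
        stack[-1].children.append(el)
        stack.append(el)
    return root
```

Leaving the stack untouched for an unknown parent is what keeps later content attached to the still-valid ancestors instead of the document root.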
Address the external-review finding that `ontology_to_unstructured_elements`
emitted content elements with `parent_id=None`, relying on the `@apply_metadata`
decorator on the *separate* `partition_html` function to run
`set_element_hierarchy` and fill them in. The converter now applies
`set_element_hierarchy` itself over its full output, so a direct caller gets
complete heading-based hierarchy.
Split the function into a public wrapper (runs hierarchy once over the flat
list) and a private recursive worker `_ontology_to_unstructured_elements`
(builds the list with container tree `parent_id` set and content
`parent_id=None`). `set_element_hierarchy` only assigns `parent_id` to elements
that lack one, so layout containers keep their tree parent and the reverse
converter still rebuilds nesting from containers + document order.
When `partition_html` re-runs `set_element_hierarchy` via `@apply_metadata`,
every element already has a `parent_id`, so the second pass is a verified no-op
(no reassignment/reorder).
Add tests: converter assigns heading-based parent_id without the decorator;
re-running set_element_hierarchy on the output is a no-op.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
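To illustrate why a second hierarchy pass changes nothing, here is a simplified stand-in for a heading-based parent pass. `El` and `assign_heading_parents` are hypothetical names, and the real `set_element_hierarchy` uses its own rules; the only behaviors borrowed from the description above are "headings parent what follows them" and "elements that already have a `parent_id` are skipped".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class El:
    id: str
    depth: Optional[int] = None  # heading-level depth; None for non-headings
    parent_id: Optional[str] = None


def assign_heading_parents(elements: List[El]) -> List[El]:
    """Simplified stand-in for a heading-based hierarchy pass (not library code)."""
    open_headings: List[El] = []  # headings enclosing the current position
    for el in elements:
        if el.depth is not None:
            # A heading closes every open heading at the same or a deeper level.
            while open_headings and open_headings[-1].depth >= el.depth:
                open_headings.pop()
        if el.parent_id is None and open_headings:
            # Only fill missing parents; pre-set (layout) parents are left alone.
            el.parent_id = open_headings[-1].id
        if el.depth is not None:
            open_headings.append(el)
    return elements
```

Fed a flat list in which a column already points at its page, the pass leaves that link intact and chains an h2 and its body text under the preceding h1. Running it again over its own output finds no element it would assign differently.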
Problem
The v2 ontology HTML parser (`partition_html(html_parser_version="v2")` → `ontology_to_unstructured_elements`) set `category_depth` from DOM/container nesting (Page→Column→block), so it tracked physical layout, not section hierarchy: depth was flat (all depth 1) on single-column pages and bumped by columns elsewhere. This discarded the `<h1>`/`<h2>`/`<h3>` levels the producer (VLM partitioner) already emits. Surfaced by a customer (ML-1301).
Change
- `category_depth_from_html_tag` helper in `partition/common/metadata.py` (h1→0, h2→1, …; ListItem→list-ancestor depth) now feeds both the v1 parser and the v2 converter (v1 delegates to it; behavior unchanged).
- `category_depth` is now heading-level.
- `ontology_to_unstructured_elements` is self-sufficient: it assigns content elements' heading-based `parent_id` via `set_element_hierarchy` itself (no longer relying on the `@apply_metadata` decorator); layout containers keep their tree `parent_id`, and the reverse converter `unstructured_elements_to_ontology` rebuilds nesting from those + document order, so the round-trip stays exact.
- `@apply_metadata`'s later `set_element_hierarchy` pass is a verified no-op.
Tests
367 passing (html partition + ontology round-trip + `set_element_hierarchy`); new tests for heading-level depth, multi-column non-bump, self-sufficient `parent_id`, and round-trip.
Rollout note
This lands in `unstructured`; reaching the VLM partitioner (the v2 consumer, e.g. the customer's path) additionally needs an `unstructured` release + a bump of the VLM partitioner's `unstructured` pin.
Why draft
Pending Alan's review.
🤖 Generated with Claude Code
Summary by cubic
The v2 (ontology) HTML parser now derives `category_depth` from HTML heading level (`h1`->0, `h2`->1, …) and assigns heading-based `parent_id`s, removing layout-driven depth bumps and aligning with v1. Addresses ML-1328.
Refactors
- `category_depth_from_html_tag` (used by both v1 and v2): `Title` → heading level; `ListItem` → list ancestor count; others → `None`.
- `ontology_to_unstructured_elements` now runs `set_element_hierarchy` itself so outputs have heading-based `parent_id`; layout containers keep their tree `parent_id`; inline merge uses a transient nesting depth (not `category_depth`).
- `unstructured_elements_to_ontology` rebuilds layout from container elements and document order (ignores content `parent_id` for layout), preserving exact round-trip.
Bug Fixes
- Empty input returns an empty `Document` instead of raising `IndexError`.
- A layout container with an unknown `parent_id` nests in the current container instead of popping to root.
Written for commit b07bfe8. Summary will update on new commits.