feat: derive category_depth from heading level in the v2 (ontology) HTML parser (ML-1328)#4360

Draft
qued wants to merge 3 commits into main from alan/ml-1328-category-depth-heading-level
qued (Contributor) commented Jun 4, 2026

Problem

The v2 ontology HTML parser (partition_html(html_parser_version="v2") → ontology_to_unstructured_elements) set category_depth from DOM/container nesting (Page→Column→block), so it tracked physical layout rather than section hierarchy: depth was flat (all 1) on single-column pages and bumped by columns elsewhere. This discarded the <h1>/<h2>/<h3> levels that the producer (the VLM partitioner) already emits. Surfaced by a customer (ML-1301).
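
To make the difference concrete, here is a minimal sketch that uses only the standard library's `html.parser`, not the unstructured parser. The class name `HeadingDepthProbe` and the sample markup are illustrative assumptions, and the nesting numbers are raw open-tag counts rather than the exact values the old v2 converter produced.

```python
from html.parser import HTMLParser


class HeadingDepthProbe(HTMLParser):
    """Hypothetical probe: records (tag, nesting depth, heading-level depth)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open ancestor tags
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Only h1..h6 are recorded; other tags just contribute to nesting.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            nesting_depth = len(self.open_tags)  # layout-driven signal
            heading_depth = int(tag[1]) - 1      # h1 -> 0, h2 -> 1, ...
            self.rows.append((tag, nesting_depth, heading_depth))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


SAMPLE = (
    '<div class="page">'
    '<div class="column"><h1>Plan</h1><h2>Costs</h2></div>'
    '<div class="column"><h3>Notes</h3></div>'
    "</div>"
)

probe = HeadingDepthProbe()
probe.feed(SAMPLE)
probe.close()
for row in probe.rows:
    print(row)
# ('h1', 2, 0)
# ('h2', 2, 1)
# ('h3', 2, 2)
```

All three headings sit at the same nesting depth because they share the same Page/Column wrappers, while the heading-level depth distinguishes them.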

Change

  1. A shared category_depth_from_html_tag helper in partition/common/metadata.py (h1→0, h2→1, …; ListItem→list-ancestor depth) now feeds both the v1 parser and the v2 converter (v1 delegates to it; behavior unchanged).
  2. v2 category_depth is now derived from heading level rather than container nesting.
  3. ontology_to_unstructured_elements is self-sufficient: it assigns content elements' heading-based parent_id via set_element_hierarchy itself (no longer relying on the @apply_metadata decorator); layout containers keep their tree parent_id, and the reverse converter unstructured_elements_to_ontology rebuilds nesting from those + document order, so the round-trip stays exact. @apply_metadata's later set_element_hierarchy pass is a verified no-op.

Tests

367 passing (html partition + ontology round-trip + set_element_hierarchy); new tests for heading-level depth, multi-column non-bump, self-sufficient parent_id, and round-trip.

Rollout note

This lands in unstructured; reaching the VLM-partitioner (the v2 consumer, e.g. the customer's path) additionally needs an unstructured release + a bump of the VLM partitioner's unstructured pin.

Why draft

Pending Alan's review.

🤖 Generated with Claude Code


Summary by cubic

The v2 (ontology) HTML parser now derives category_depth from HTML heading level (h1->0, h2->1, …) and assigns heading-based parent_ids, removing layout-driven depth bumps and aligning with v1. Addresses ML-1328.

  • Refactors

    • Added shared category_depth_from_html_tag (used by both v1 and v2): Title → heading level; ListItem → list ancestor count; others → None.
    • ontology_to_unstructured_elements now runs set_element_hierarchy itself so outputs have heading-based parent_id; layout containers keep their tree parent_id; inline merge uses a transient nesting depth (not category_depth).
    • unstructured_elements_to_ontology rebuilds layout from container elements and document order (ignores content parent_id for layout), preserving exact round-trip.
  • Bug Fixes

    • Empty input now returns an empty Document instead of raising IndexError.
    • A layout container with an unknown parent_id nests in the current container instead of popping to root.

Written for commit b07bfe8. Summary will update on new commits.


qued and others added 3 commits June 4, 2026 14:46
…level

The v2 (ontology) HTML parser -- the path the VLM partitioner uses via
partition_html(html_parser_version="v2") -> ontology_to_unstructured_elements --
set category_depth to the DOM/recursion nesting depth. That encoded physical
layout (Page -> Column -> block) rather than logical hierarchy, so every flat
sibling heading collapsed to the same depth and a multi-column page bumped every
element's depth. It diverged from the v1 parser and from the documented
metadata contract (depth derives from native heading levels h1/h2/h3).

Changes:
- Add shared helper `category_depth_from_html_tag` in partition/common/metadata.py:
  Title (ontology Title/Subtitle/Heading -> h1..h6) -> int(tag[1]) - 1
  (h1->0, h2->1, ...); ListItem -> enclosing ol/ul/dl count; else None.
- v1 parser `_category_depth` now delegates to the shared helper (DRY; behavior
  unchanged).
- v2 converter computes category_depth from the element's heading level (not
  nesting). Layout containers and non-heading content get None.
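
The mapping above could be sketched roughly as follows. This is not the actual `category_depth_from_html_tag`; its real signature is not shown in this PR text, so the function name `depth_from_tag_sketch` and its parameters (`ancestor_tags`, `is_list_item`) are hypothetical stand-ins for illustration only.

```python
import re
from typing import Optional, Sequence

_HEADING_TAG = re.compile(r"h([1-6])")
_LIST_TAGS = frozenset({"ol", "ul", "dl"})


def depth_from_tag_sketch(
    tag: str,
    ancestor_tags: Sequence[str] = (),
    is_list_item: bool = False,
) -> Optional[int]:
    """Hypothetical stand-in for the shared helper described above."""
    match = _HEADING_TAG.fullmatch(tag.lower())
    if match:
        # Heading level minus one: h1 -> 0, h2 -> 1, ..., h6 -> 5.
        return int(match.group(1)) - 1
    if is_list_item:
        # Count enclosing list containers among the ancestors.
        return sum(1 for t in ancestor_tags if t.lower() in _LIST_TAGS)
    # Layout containers and non-heading content carry no depth.
    return None


print(depth_from_tag_sketch("h1"))                                      # 0
print(depth_from_tag_sketch("h3"))                                      # 2
print(depth_from_tag_sketch("li", ["body", "ul", "li", "ol"], True))    # 2
print(depth_from_tag_sketch("p"))                                       # None
```

Keeping the rule in one function is what lets v1 and v2 agree by construction instead of by parallel maintenance.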

parent_id approach:
- Layout/container elements (Page, Column, Section, ...) keep their tree
  parent_id so the physical layout structure is preserved.
- Content elements are emitted with parent_id=None so the shared
  set_element_hierarchy post-processor (run by @apply_metadata on
  partition_html) assigns a heading-based parent -- subsections now chain under
  their enclosing section heading. (set_element_hierarchy skips elements that
  already have a parent_id, hence the None.)
- unstructured_elements_to_ontology (the reverse path) no longer reads content
  parent_id (now heading-based) to rebuild the tree; it reconstructs layout
  nesting from the layout-container elements' retained tree parent_id plus
  document order, via a container stack. Round-trip is preserved exactly.
- Inline-merge ("same level in the tree") previously reused category_depth as a
  nesting signal; it now uses a transient, non-serialized _ontology_nesting_depth
  attribute so heading-level category_depth doesn't change merge behavior.

Verified: category_depth follows heading level (h1->0, h2->1, h3->2); depth is
unaffected by multi-column layout; parent_id reflects section->subsection;
text_as_html, Page/Column containers, and table/form structure are preserved.
v2 now matches v1 on heading-level depth. Confirmed on the real Kaiser
EOC_CA_Individual-Family page 6 ("Cost Share Summary"->0, subsections->1,
"How to read..."->2, subsections chained under their section).

New tests cover a multi-level page and a multi-column page; existing v2, ontology
round-trip, and set_element_hierarchy tests stay green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Harden `unstructured_elements_to_ontology` against malformed input flagged
by the external review:

- Return an empty `Document` for empty input instead of raising `IndexError`
  (pre-existing behavior; cheap, lossless guard).
- A layout container whose `parent_id` matches no currently-open container now
  nests in the innermost open container rather than popping past valid
  ancestors to the document root (which mis-nested subsequent content). For the
  real producer (`ontology_to_unstructured_elements`) the parent is always an
  open ancestor, so this only changes behavior for input that violates the
  documented parent-before-child precondition, and it is lossless.
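
A rough sketch of the container-stack idea, including both guards described above. The `Flat` and `Tree` types and `rebuild_layout` are hypothetical simplifications, not the real `unstructured_elements_to_ontology`; treating `parent_id=None` on a container as "attach to the document root" is also an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Flat:  # hypothetical flat element
    id: str
    is_container: bool
    parent_id: Optional[str] = None


@dataclass
class Tree:  # hypothetical rebuilt node
    id: str
    children: List["Tree"] = field(default_factory=list)


def rebuild_layout(elements: List[Flat]) -> Tree:
    root = Tree("document")
    if not elements:
        # Empty input yields an empty document instead of an error.
        return root
    stack = [root]  # open containers, innermost last
    for el in elements:
        node = Tree(el.id)
        if not el.is_container:
            # Content ignores its own parent_id; document order decides.
            stack[-1].children.append(node)
            continue
        if el.parent_id is None:
            del stack[1:]  # assumption: top-level container hangs off root
        elif el.parent_id in [n.id for n in stack]:
            # Close containers until the declared parent is innermost.
            while stack[-1].id != el.parent_id:
                stack.pop()
        # Unknown parent_id: leave the stack alone, so the container
        # nests in the innermost open container.
        stack[-1].children.append(node)
        stack.append(node)
    return root


flat = [
    Flat("page", True),
    Flat("colA", True, "page"),
    Flat("t1", False),
    Flat("colB", True, "page"),
    Flat("t2", False),
    Flat("ghost", True, "missing"),
    Flat("t3", False),
]
tree = rebuild_layout(flat)
page = tree.children[0]
print([c.id for c in page.children])              # ['colA', 'colB']
print([c.id for c in page.children[1].children])  # ['t2', 'ghost']
```

Here `ghost` names a parent that was never opened, so it stays inside `colB` rather than resetting the stack, and `t3` follows it there.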

Add round-trip tests: empty input, multi-column container nesting, and a
container with an unknown `parent_id`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address the external-review finding that `ontology_to_unstructured_elements`
emitted content elements with `parent_id=None`, relying on the
`@apply_metadata` decorator on the *separate* `partition_html` function to run
`set_element_hierarchy` and fill them in.

The converter now applies `set_element_hierarchy` itself over its full output,
so a direct caller gets complete heading-based hierarchy. Split the function
into a public wrapper (runs hierarchy once over the flat list) and a private
recursive worker `_ontology_to_unstructured_elements` (builds the list with
container tree `parent_id` set and content `parent_id=None`).

`set_element_hierarchy` only assigns `parent_id` to elements that lack one, so
layout containers keep their tree parent and the reverse converter still
rebuilds nesting from containers + document order. When `partition_html` re-runs
`set_element_hierarchy` via `@apply_metadata`, every element already has a
`parent_id`, so the second pass is a verified no-op (no reassignment/reorder).
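
The wrapper/worker split and the repeat-pass behavior can be illustrated with a simplified model. Everything here is hypothetical: `Node`, `Flat`, `_flatten`, `_assign_heading_parents`, and the `"document"` root id are stand-ins, and the hierarchy pass is a much-reduced approximation of `set_element_hierarchy`, which applies richer rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:  # hypothetical ontology node
    id: str
    kind: str  # "container", "heading", or "text"
    level: Optional[int] = None  # heading level 1..6
    children: List["Node"] = field(default_factory=list)


@dataclass
class Flat:  # hypothetical output element
    id: str
    kind: str
    category_depth: Optional[int]
    parent_id: Optional[str]


def _flatten(node: Node, container_id: Optional[str]) -> List[Flat]:
    """Private worker: containers keep tree parent, content gets None."""
    if node.kind == "container":
        out = [Flat(node.id, node.kind, None, container_id)]
        for child in node.children:
            out.extend(_flatten(child, node.id))
        return out
    depth = node.level - 1 if node.kind == "heading" else None
    return [Flat(node.id, node.kind, depth, None)]


def _assign_heading_parents(elements: List[Flat]) -> List[Flat]:
    """Simplified hierarchy pass: only fills in missing parent_id values."""
    open_headings: List[Flat] = []
    for el in elements:
        if el.parent_id is not None:
            continue  # already parented: left untouched
        if el.kind == "heading":
            while open_headings and open_headings[-1].category_depth >= el.category_depth:
                open_headings.pop()
        if open_headings:
            el.parent_id = open_headings[-1].id
        if el.kind == "heading":
            open_headings.append(el)
    return elements


def flatten_with_hierarchy(root: Node) -> List[Flat]:
    """Public wrapper: build the flat list, then run the pass once."""
    return _assign_heading_parents(_flatten(root, "document"))


doc = Node("page", "container", children=[
    Node("col", "container", children=[
        Node("sec", "heading", level=1),
        Node("p1", "text"),
        Node("sub", "heading", level=2),
        Node("p2", "text"),
    ]),
])
out = flatten_with_hierarchy(doc)
first = [(e.id, e.parent_id) for e in out]
second = [(e.id, e.parent_id) for e in _assign_heading_parents(out)]
print(first)
print(first == second)  # True
```

In this model a top-level heading keeps `parent_id=None`, so a second pass revisits it and recomputes the same value; every other element is skipped outright.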

Add tests: converter assigns heading-based parent_id without the decorator;
re-running set_element_hierarchy on the output is a no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>