unstructured
aa332101 - fix: fix header and footer not parsed as Header/Footer types (#4041)

Commit
183 days ago
## Summary

This PR fixes an issue where header/footer content in HTML is not partitioned as `unstructured` `Header` or `Footer` element types. Instead, it is either partitioned as `UncategorizedText` or takes on the type of the structure nested inside the header/footer. E.g., `<header class="Header"><h1 class="Title">Header Title</h1></header>` would be partitioned as a `Title` instead of a `Header`.

## Bug description

This behavior occurs because the ontology definition treats header and footer as layout, i.e., containers. As a result, during parsing we [unwrap](https://github.com/Unstructured-IO/unstructured/blob/ec209c6b5f9f24b4aabfa3bc8145ab896e7afd66/unstructured/partition/html/transformations.py#L361-L378) the container and parse its contents as if they came from the main text, even though they are still part of the header/footer. The fix is to treat header/footer as text instead of layout in the ontology, so that all content inside them is properly gathered under the `Header`/`Footer` element types.
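The difference between the layout and text treatments can be illustrated with a small standalone sketch. It does not use the `unstructured` code base: the `partition` function, the `header_footer_as_text` flag, and the tag tables are hypothetical names invented for this illustration, and the standard library XML parser stands in for the real HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag-to-type tables; NOT unstructured's real ontology.
TEXT_TYPES = {"h1": "Title", "p": "NarrativeText"}
HEADER_FOOTER = {"header": "Header", "footer": "Footer"}


def gather_text(node):
    """Join all nested text of a node into one string."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def partition(node, header_footer_as_text):
    """Return (element_type, text) pairs for a parsed fragment."""
    tag = node.tag
    if header_footer_as_text and tag in HEADER_FOOTER:
        # Text treatment: everything inside becomes one Header/Footer element.
        return [(HEADER_FOOTER[tag], gather_text(node))]
    if tag in TEXT_TYPES:
        return [(TEXT_TYPES[tag], gather_text(node))]
    # Layout treatment: unwrap the container and partition its contents.
    results = []
    if node.text and node.text.strip():
        results.append(("UncategorizedText", node.text.strip()))
    for child in node:
        results.extend(partition(child, header_footer_as_text))
        if child.tail and child.tail.strip():
            results.append(("UncategorizedText", child.tail.strip()))
    return results


header = ET.fromstring(
    '<header class="Header"><h1 class="Title">Header Title</h1></header>'
)
footer = ET.fromstring("<footer>Copyright notice</footer>")

# Header/footer treated as layout (the behavior described as the bug).
header_as_layout = partition(header, header_footer_as_text=False)
footer_as_layout = partition(footer, header_footer_as_text=False)

# Header/footer treated as text (the behavior described as the fix).
header_as_text = partition(header, header_footer_as_text=True)
footer_as_text = partition(footer, header_footer_as_text=True)

print(header_as_layout)  # [('Title', 'Header Title')]
print(header_as_text)    # [('Header', 'Header Title')]
print(footer_as_layout)  # [('UncategorizedText', 'Copyright notice')]
print(footer_as_text)    # [('Footer', 'Copyright notice')]
```

In this toy model, the only change between the two runs is whether `header`/`footer` are handled before the container fallback, which mirrors the idea of moving them from layout to text in the ontology.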