unstructured
bcd0eee7 - Feat: Detect all text in HTML Heading tags as titles (#1556)

Commit
2 years ago
## Summary

This increases the accuracy of hierarchies in HTML documents and provides more accurate element categorization. If text is in an HTML heading tag and is not a list item or address, categorize it as a title.

## Testing

```
from unstructured.partition.html import partition_html

elements = partition_html(url="https://www.eda.gov/grants/2015")
```

Before this change, the date headers at the given URL were not parsed as titles; they are now correctly identified.

A unit test has been added to verify the functionality: `test_html_partition::test_html_heading_title_detection`. It includes values that were previously detected as narrative text and uncategorized text.
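The rule described in the summary can be illustrated with a small standalone sketch. This is not the code from the commit and does not use `unstructured`; it relies only on the standard library `html.parser`. The names `HeadingClassifier`, `classify`, and `BULLET_RE` are hypothetical. The bullet check is a simplified assumption, and the address check is omitted.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Assumption: text counts as a list item if it starts with a bullet glyph
# or with a dash followed by whitespace. The real library's check may differ.
BULLET_RE = re.compile(r"^\s*(?:[\u2022\u00b7*]|-\s)")


class HeadingClassifier(HTMLParser):
    """Label text inside heading tags as "Title" unless it looks bulleted."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.elements = []  # (category, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        in_heading = any(t in HEADING_TAGS for t in self.stack)
        if in_heading and not BULLET_RE.match(text):
            category = "Title"
        else:
            category = "Other"
        self.elements.append((category, text))


def classify(html):
    """Return (category, text) pairs for every text node in the HTML string."""
    parser = HeadingClassifier()
    parser.feed(html)
    parser.close()
    return parser.elements
```

Under these assumptions, a date inside an `<h2>` tag is labeled `"Title"` even though it does not read like a typical title, which mirrors the date-header case mentioned in the testing notes.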