unstructured
a9b65067 - Fix: `partition_html()` fails parsing simple html (#2849)

Commit
1 year ago
Closes #2520.

Previously, `partition_html()` did not extract text from `<b>` tags nested inside container tags (such as `<div>` or `<pre>`). This PR adds support for extracting that text.

### Testing

```
from unstructured.partition.html import partition_html

html_text = """
<!DOCTYPE html>
<html>
<head>
<title>A page</title>
</head>
<body>
<div>
<h1>Header 1</h1>
<p>Text </p>
<h2>Header 2</h2>
<pre><b>Param1</b> = Y<br><b>Param2</b> = 1<br><b>Param3</b> = 2<br><b>Param4</b> = A <br><b>Param5</b> = A,B,C,D,E<br><b>Param6</b> = 7<br><b>Param7</b> = Five<br></pre>
</div>
</body>
</html>
"""

# Partition the raw HTML string and print one element per paragraph.
elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```

**Expected behavior**

```
Header 1
Text
Header 2
Param1 = Y
Param2 = 1
Param3 = 2
Param4 = A
Param5 = A,B,C,D,E
Param6 = 7
Param7 = Five
```
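
For illustration only, the behavior described above (text inside inline tags such as `<b>` being kept when they sit inside a container such as `<pre>`) can be sketched with the standard-library `html.parser`. This is not the code from this PR and not how `unstructured` implements it; `TextCollector`, `extract_lines`, and the tag sets are hypothetical names chosen for the sketch, and it assumes that `<br>` and block-level tags end a line.

```
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collect visible text, including text in inline tags."""

    # Assumed set of tags that start/end a line of output.
    BLOCK_TAGS = {"div", "p", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}
    # Assumed set of tags whose text is not visible page content.
    SKIP_TAGS = {"title", "script", "style"}

    def __init__(self):
        super().__init__()
        self.lines = []   # finished lines of text
        self._buf = []    # text fragments of the line being built
        self._skip = 0    # nesting depth inside SKIP_TAGS

    def flush(self):
        # Join fragments, normalize whitespace, and keep non-empty lines.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag == "br" or tag in self.BLOCK_TAGS:
            self.flush()
        # Inline tags such as <b> fall through: their text stays on the line.

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def extract_lines(html):
    """Return the visible text of `html` as a list of lines."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.lines


sample = "<div><pre><b>Param1</b> = Y<br><b>Param2</b> = 1<br></pre></div>"
print(extract_lines(sample))  # ['Param1 = Y', 'Param2 = 1']
```

The point of the sketch is that `<b>` triggers neither a flush nor a skip, so its text is appended to the same line as the surrounding `= Y` text rather than being dropped.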