unstructured
c0457c1c - feat: include images when partitioning html (#3945)

Commit
300 days ago
Currently we [filter img tags](https://github.com/Unstructured-IO/unstructured/blob/2addb19473ba9e27af995291f57d35fb50bec4b0/unstructured/partition/html/partition.py#L226-L229) before tags are converted to Elements by the HTML partitioner. More importantly, we also don’t currently have a defined “block” / mapping to support them. This change adds those mappings and the logic to process them.

It also respects `extract_image_block_types` and `extract_image_block_to_payload` (as we do with PDFs) to determine whether base64 is included in the metadata.

Each partitioned Image Element sets its text to the img tag’s alt text, if available. It includes the [url in the metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209) (rather than `image_base64`) if the img tag’s src is a URL.

## Testing

Unit tests have been added for explicit coverage. Existing integration tests and other unit test fixtures have been updated to account for the `Image` elements now present.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
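The behavior described in this commit message can be illustrated with a small standard-library sketch. It is not the library's implementation: `ImageSketch`, `ImgCollector`, and `collect_images` are hypothetical names, and the rule that base64 is kept only when `"Image"` is listed in `extract_image_block_types` and `extract_image_block_to_payload` is true is an assumption modeled on the PDF behavior the message refers to.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ImageSketch:
    """Hypothetical stand-in for an Image element (not the library's class)."""
    text: str
    metadata: dict = field(default_factory=dict)


class ImgCollector(HTMLParser):
    """Collects <img> tags instead of filtering them out."""

    def __init__(self, include_base64):
        super().__init__()
        self.include_base64 = include_base64
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or ""
        # The element text is the alt text, if the tag has one.
        image = ImageSketch(text=attr.get("alt") or "")
        if src.startswith(("http://", "https://")):
            # A URL src is recorded as a url, not as base64.
            image.metadata["url"] = src
        elif src.startswith("data:") and ";base64," in src and self.include_base64:
            # Assumption: inline data is kept only when extraction is enabled.
            image.metadata["image_base64"] = src.split(";base64,", 1)[1]
        self.images.append(image)


def collect_images(html, extract_image_block_types=None,
                   extract_image_block_to_payload=False):
    """Return one ImageSketch per <img> tag found in the HTML string."""
    include_base64 = bool(extract_image_block_to_payload) and (
        "Image" in (extract_image_block_types or [])
    )
    parser = ImgCollector(include_base64)
    parser.feed(html)
    parser.close()
    return parser.images


if __name__ == "__main__":
    sample = '<p>Hello</p><img src="https://example.com/cat.png" alt="A cat">'
    for image in collect_images(sample, ["Image"], True):
        print(image.text, image.metadata)
```

In the real partitioner, the equivalent options would be passed to the HTML partitioning call; the exact signature should be checked against the library's documentation.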