fix: parse URL response Content-Type according to RFC 9110 (#2950)
Currently, `file_and_type_from_url()` does not correctly handle the
`Content-Type` header. It assumes that the header contains only the
MIME type (e.g. `text/html`), but [RFC
9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows
additional parameters, most commonly `charset`, to follow the media type
in the header. As a result, the file type is not recognized and a
`ValueError` is raised when loading a URL whose response `Content-Type`
header is, for example, `text/html; charset=UTF-8`.
To reproduce the issue:
```python
from unstructured.partition.auto import partition
url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```
This results in the following exception:
```python
{
"name": "ValueError",
"message": "Invalid file. The FileType.UNK file type is not supported in partition.",
"stack": "---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 4
1 from unstructured.partition.auto import partition
3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"
----> 4 partition(url=url)
File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)
539 else:
540 msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\"
--> 541 raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\")
543 for element in elements:
544 element.metadata.url = url
ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```
This PR fixes the issue by extracting the MIME type from the
`Content-Type` header value and ignoring any parameters that follow it.
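For illustration, here is a minimal sketch of this kind of parsing. The helper name `mime_type_from_content_type` is hypothetical and is not necessarily how the change in this PR is implemented. It assumes the header value is a media type optionally followed by `;`-separated parameters.

```python
def mime_type_from_content_type(content_type: str) -> str:
    """Return the bare media type from a Content-Type header value.

    Hypothetical helper for illustration only, not the code in this PR.
    """
    # Parameters such as charset follow the first ";", so keep only the part before it.
    media_type = content_type.split(";", 1)[0]
    # Media types are case-insensitive, so normalize case and surrounding whitespace.
    return media_type.strip().lower()


print(mime_type_from_content_type("text/html; charset=UTF-8"))
```

With this approach, a header without parameters, such as `application/pdf`, passes through unchanged apart from case normalization.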
Closes #2257