unstructured
7b25dfc3 - fix(CVE-2024-39705): remove nltk download (#3361)

Commit
317 days ago
### Summary

Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which highlights the risk of remote code execution when running `nltk.download`. Removes `nltk.download` in favor of downloading a `.tgz` file that contains the appropriate NLTK data files and checking its SHA256 hash to validate the download. An error is now raised if `nltk.download` is invoked. The logic for determining the NLTK download directory is borrowed from `nltk`, so users can still set `NLTK_DATA` as they did previously.

### Testing

1. Create a directory called `~/tmp/nltk_test`. Set `NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a Python interactive session, run:

   ```python
   from unstructured.nlp.tokenize import download_nltk_packages

   download_nltk_packages()
   ```

3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded data.

---------

Co-authored-by: Steve Canny <stcanny@gmail.com>
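The hash-then-extract approach described in the summary can be illustrated with a minimal sketch. This is not the code from the commit: the helper names `sha256_of_file` and `verify_and_extract` are hypothetical, and the archive path and expected digest are assumed to be supplied by the caller. It only shows the general technique of refusing to unpack an archive whose SHA256 digest does not match a pinned value.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive: Path, expected_sha256: str, dest_dir: Path) -> None:
    """Extract a .tgz archive only if its digest matches the pinned value.

    Hypothetical helper for illustration; names and behavior are assumptions.
    """
    actual = sha256_of_file(archive)
    if actual != expected_sha256.lower():
        # Fail before touching the destination so nothing untrusted is unpacked.
        raise ValueError(
            f"SHA256 mismatch for {archive}: expected {expected_sha256}, got {actual}"
        )
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject
            # unsafe members such as absolute paths or path traversal.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
```

Checking the digest before extraction means a tampered or truncated download is rejected without writing any of its contents to the data directory.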
  • .github/workflows
    • ci.yml
  • CHANGELOG.md
  • Dockerfile
  • test_unstructured/nlp
    • test_tokenize.py
  • typings/nltk
    • __init__.pyi
    • data.pyi
    • downloader.pyi
    • internals.pyi
    • tag.pyi
    • tokenize.pyi
  • unstructured
    • __version__.py
    • nlp
      • tokenize.py