feat: add filter element types as post processing function (#1014)
* don't push
* enhancement: improve json detection by detect_filetype (#971)
* update regex pattern
* improve json regex pattern checks and add test file
* update file name
* update tests and formatting
* update changelog and version
* refactor: simplifies JSON detection and add tests (#975)
* refactor json detection
* version and changelog
* fix mock in test
* feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
* Roman/expose dpi param (#966)
* Bump inference version
* Pass through the dpi param if available
* Update CHANGELOG
* Check dpi param passed in via unit test
* Bump inference version
* Fix unit test around file info to work on mac as well
* chore: cleanup changelog for 0.8.2 (#976)
* Update `partition_via_api` to not post a strategy value if not user specified (#967)
* remove default strategy
* working on test
* fixed test, coordinates param needed to be included
* nits
* update changelog
* lint
* update requirements
* build(release): cut 0.8.4 release (#979)
* feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)
* feat: add functions for getting modification date
* feat: add date field to metadata from csv file
* feat: add tests for csv partition
* feat: add date field to metadata from html file
* feat: add tests for html partition
* fix: return file name only if possible
* feat: add csv tests
* fix: renaming
* feat: add field metadata_date as date of last mod
* feat: add tests for partition_docx
* feat: add field metadata_date to .doc file
* feat: add tests for partition_doc
* feat: add metadata_date to .epub file
* feat: add tests for partition_epub
* fix: fix test mocking
* feat: add metadata_date for image partition
* feat: add test for image partition
* feat: add coordinate system argument
* feat: add date to element metadata
* feat: add metadata_date for JSON partition
* feat: add test for JSON partition
* fix: rename variable
* feat: add metadata_date for md partition
* feat: add test for md partition
* feat: update doc string
* feat: add metadata_date for .odt partition
* feat: update .odt string
* feat: add metadata_date for .org partition
* feat: add tests for .org partition
* feat: add metadata_date for .pdf partition
* feat: add tests for .pdf partition
* feat: add metadata_date for .pptx partition
* feat: add metadata_date for .ppt partition
* feat: add tests for .ppt partition
* feat: add tests for .pptx partition
* feat: add metadata_date for .rst partition
* feat: add tests for .rst partition
* fix: get modification date after file checking
* feat: add tests for .rtf partition
* feat: add tests for .rtf partition
* feat: add metadata_date for .txt partition
* fix: rename argument
* feat: add tests for .txt partition
* feat: update doc string rst partition function
* feat: add metadata_date for .tsv partition
* feat: add tests for .tsv partition
* feat: add metadata_date for .xlsx partition
* feat: add tests for .xlsx partition
* fix: clean up
* feat: add tests for .xml partition
* feat: add tests for .xml partition
* fix: use `or` instead of `if`
* fix: fix epub tests
* fix: remove not used code
* fix: add try block for getting file name
* fix: applying linter changes
* fix: fix test_partition_file
* feat: add metadata_date for email
* feat: add test for email partition
* feat: add metadata_date for msg
* feat: add tests for msg partition
* feat: update CHANGELOG file
* fix: update partitions doc string
* don't push
* fix: clean up code
* linting, linting, linting
* remove unnecessary example doc
* update version and changelog
* ingest-test-fixtures-update
* set metadata date in test
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* ingest-test-fixtures-update
* Update ingest test fixtures (#970)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Revert "Update ingest test fixtures (#970)"
This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.
* remove date from metadata in outputs
* update docstring ordering
* remove print
* remove print
* remove print
* linting, linting, linting
* fix version and test
* fix changelog
* fix changelog
* update version
---------
Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Chore: add uns api repo unittests (#954)
* stage
* git clone
* ci ignore markdown file
* make install
* use env instead
* remove md
* add script
* wrong env value
* add note
* maybe don't rm
* no cd../
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
* fix: handling for empty tables in word docs and powerpoints (#982)
* fix table index error
* changelog and version
* fix: only download nltk packages if necessary (#985)
* fix: only download nltk if necessary
* changelog and version
* Chore: Pass table support param to partition image (#973)
* add param and test in image table extraction
* version and changelog
* need to publish this one for api repo
* add new param skip_infer_table_types
* use warning
* clean up with mapping
* add test for tsv
* fix test fail
* weird change from merge
* doc nit
* don't use mapping
* correct conflict
* Update pip in makefile (#981)
* update pip in makefile
* merge and update requirements
* update version
* update outlook requirements
* chore: remove debug printing (#988)
* fix: correct nltk download arg order (#991)
* fix: correct download order to nltk args
* add smoke test for tokenizers
* Chore: put back function `split_by_paragraph` (#992)
* put back function
* not really fixes
* don't push
* fix: clean up code
* fix: clean up
* fix: clean up
* Roman/ingest refactor (#978)
* Pull out s3 code as subcommand
* Pull out dropbox code as subcommand
* Pull out azure code as subcommand
* Pull out fsspec code as subcommand
* Pull out github code as subcommand
* Pull out gitlab code as subcommand
* Pull out reddit code as subcommand
* Pull out slack code as subcommand
* Pull out discord code as subcommand
* Pull out wikipedia code as subcommand
* Pull out gdrive code as subcommand
* Pull out biomed code as subcommand
* rename parameters
* Pull out onedrive code as subcommand
* Pull out outlook code as subcommand
* Pull out local code as subcommand
* Pull out elasticsearch code as subcommand
* Pull out confluence code as subcommand
* Drop previous main file
* update changelog
* Add back in mp.Pool
* Fix mypy issues with click
* Make sure all tests run with verbose flag
* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data
* Pull out some more shared options
* Support running code via python as well as cli
* update ingest readme and move it to the ingest folder
* update usage in connector docs
* move local command arg in test
* Separate out cli code from logic running unstructured
* Make some cli fields required rather than optional
* rename process -> processor
* Improve logger to avoid duplicate handlers
---------
Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* feat: adds Box connector (#996)
* chore: rename Element's "date" field to "last_modified" (#997)
Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.
* don't push
* fix: remove prints
* remove unused file
* fix: apply linter
* feat: add post processing filter_element_types
* feat: add tests for filter_element_types
* feat: update changelog
* feat: add doc string for filter_element_types
* fix: change the version
* feat: update documentation
* bump dev version number
* cleanup changelog
* linting, linting, linting
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>