langchain
d4dc98a9 - community[patch]: RecursiveUrlLoader: add base_url option (#19421)

Commit

2 years ago

community[patch]: RecursiveUrlLoader: add base_url option (#19421)

RecursiveUrlLoader currently provides no option to set a `base_url` different from `url`, even though it uses a function that accepts one. For example, this leaves it unable to crawl `https://python.langchain.com/docs`: that URL returns the 404 page, and `https://python.langchain.com/docs/get_started/introduction` has no child routes to parse.

`base_url` allows setting `https://python.langchain.com/docs` as the prefix to filter by, while the starting URL can be any page inside it that contains relevant links to continue crawling.

I understand that the Docusaurus loader could be used for this case, but this is a common issue with many websites.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
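The filtering problem described in the commit message can be illustrated with a small standard-library sketch. This is not the loader's actual implementation: `filter_links` and the sample `hrefs` list are hypothetical, made up for illustration, and the only assumption is that links are kept when their absolute form starts with the chosen prefix.

```python
from urllib.parse import urljoin


def filter_links(page_url, hrefs, base_url=None):
    """Hypothetical helper: resolve hrefs against page_url and keep
    only those that start with base_url (defaulting to page_url)."""
    prefix = base_url if base_url is not None else page_url
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # turn relative links into absolute URLs
        if absolute.startswith(prefix) and absolute not in kept:
            kept.append(absolute)
    return kept


# Starting page taken from the commit message; the hrefs are invented samples.
start = "https://python.langchain.com/docs/get_started/introduction"
hrefs = [
    "/docs/modules/",
    "/docs/get_started/quickstart",
    "https://github.com/langchain-ai/langchain",
]

# Filtering by the starting URL itself: no sibling page shares that prefix.
without_base = filter_links(start, hrefs)

# Filtering by a separate, broader prefix keeps the in-docs links.
with_base = filter_links(start, hrefs, base_url="https://python.langchain.com/docs")

print(without_base)
print(with_base)
```

With the loader itself, this would presumably correspond to passing the starting page as `url` and the broader prefix as `base_url`.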

Author

german-swan

Parents

e71daa7a
