PR #9829 Feature: PGVector Collection Documents Update

lorenzofavaro wants to merge 2 commits into langchain-ai:master from lorenzofavaro:feature/pgvector-update_documents

lorenzofavaro 1 year ago

Description

Enhancement to PGVector: adds an update function, update_documents(...).

Currently, updating the documents of a collection requires emptying the collection and refilling it, which can trigger more calls to the embedding model than necessary. If a text chunk (and therefore its embedding) is already present in the current collection, it is deleted and then inserted again, so the same embedding is requested from the model a second time.

The new feature identifies differences between input and existing documents. It requests new embeddings only for different documents, inserting them into the DB, and deletes missing ones.
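
A minimal sketch of the diffing step described above, assuming documents are compared by their text alone. `Doc` and `plan_update` are hypothetical names used only for illustration; they are not the code in this PR or part of the PGVector API.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def plan_update(existing_texts, incoming_docs):
    """Split an update into documents to embed and stored texts to delete.

    existing_texts: set of page_content strings already in the collection.
    incoming_docs: list of Doc describing the desired state of the collection.
    """
    incoming_texts = {d.page_content for d in incoming_docs}
    # Only documents whose text is not stored yet need a new embedding.
    to_embed = [d for d in incoming_docs if d.page_content not in existing_texts]
    # Stored texts that no longer appear in the input are stale.
    to_delete = existing_texts - incoming_texts
    return to_embed, to_delete


# Mirrors the sample below: "bar" becomes "baz" and "far" is appended.
existing = {"foo", "bar"}
incoming = [Doc("foo"), Doc("baz"), Doc("far")]
to_embed, to_delete = plan_update(existing, incoming)
print([d.page_content for d in to_embed])  # ['baz', 'far']
print(sorted(to_delete))                   # ['bar']
```

In this sketch "foo" is left untouched, so the embedding model would be called for two texts instead of three.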

Issue

#9461 (Add Functionality to Update Embeddings in pgvector)

Usage Sample

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.pgvector import PGVector
from langchain.docstore.document import Document

COLLECTION_NAME = "..."
CONNECTION_STRING = '...'
documents = [Document(page_content="foo", metadata={"page": "0"}), 
             Document(page_content="bar", metadata={"page": "1"})]

# Create a new collection in pgvector from the documents
pgvector = PGVector.from_documents(
    embedding=OpenAIEmbeddings(),
    documents=documents,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True)

# Update/Add some documents
documents[1].page_content = "baz"
documents.append(Document(page_content="far", metadata={"page": "2"}))

# Call update function
pgvector.update_documents(documents)

Feature: PGVector Collection Documents Update

f21a5e07

dosubot added Ɑ: vector store

dosubot added 🤖:improvement

baskaryan assigned eyurtsev 1 year ago

hwchase17 1 year ago

@lorenzofavaro - we are planning to use https://python.langchain.com/docs/modules/data_connection/indexing to do this type of updating. would that satisfy your requirements?

eyurtsev requested changes on 2023-08-30

eyurtsev 1 year ago

Thanks for the contribution! See if the indexing code shared by @hwchase17 will work for your use case!

libs/langchain/langchain/vectorstores/pgvector.py

301	301	texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
302	302	)
303	303
	304	def update_documents(

eyurtsev 1 year ago

@lorenzofavaro take a look at the indexing code that @hwchase17 referenced. It should be able to solve this use case.
I'm OK adding an upsert_documents functionality, but it would need to be added on the base class as well, and would require the user to provide ids as part of the interface, and would need to implement upsert semantics.

The code should not assume that the content is identical just from the page_content, as there are use-cases when the relevant content lives in the metadata (e.g., metadata about two different products that share the same basic description). In this case, we'd want both documents to be indexed.
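
One way to address the point above, sketched under assumptions: derive a document's identity from both its text and its metadata, so that two documents sharing a description but differing in metadata get different keys. `document_key` is a hypothetical helper, not part of langchain, and it assumes the metadata is JSON-serializable.

```python
import hashlib
import json


def document_key(page_content: str, metadata: dict) -> str:
    """Derive a deterministic key from both the text and the metadata."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two products with the same description but different metadata.
a = document_key("Blue cotton shirt", {"sku": "A-1"})
b = document_key("Blue cotton shirt", {"sku": "B-2"})
c = document_key("Blue cotton shirt", {"sku": "A-1"})
print(a != b)  # True: different metadata, so both would be indexed
print(a == c)  # True: identical text and metadata give the same key
```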

lorenzofavaro 1 year ago (edited 1 year ago) 👍 1

Thanks @hwchase17, I've looked at the indexing API and it does solve my problem.
That said, I wouldn't mind working on the solution proposed by @eyurtsev. I implemented the upsert semantics in the last commit.

If upsert_documents is also added to the base class (VectorStore), it will need to be implemented in all the other vector stores too, not just PGVector.

If that's okay, I can start by adding this functionality for PGVector and adding the abstract method upsert_documents to the base class (in another commit), then work on the other vector stores in separate PRs.
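
A rough sketch of what such an interface could look like, with caller-provided ids and upsert semantics (insert when the id is new, overwrite when it already exists). `VectorStoreSketch`, `InMemoryStore`, and the `upsert_documents` signature shown here are assumptions made for illustration; they are not langchain's actual VectorStore API or the code in this PR. Skipping the embedding call for unchanged text is an extra optimization on top of plain upsert.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple


class VectorStoreSketch(ABC):
    """Hypothetical base class; not langchain's actual VectorStore."""

    @abstractmethod
    def upsert_documents(
        self, texts: List[str], ids: List[str]
    ) -> Tuple[List[str], List[str]]:
        """Insert new ids and overwrite existing ones; return (added, updated)."""


class InMemoryStore(VectorStoreSketch):
    """Toy implementation that keeps rows in a dict keyed by id."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self._embed = embed
        self._rows: Dict[str, Tuple[str, List[float]]] = {}

    def upsert_documents(self, texts, ids):
        if len(texts) != len(ids):
            raise ValueError("texts and ids must have the same length")
        added, updated = [], []
        for text, doc_id in zip(texts, ids):
            stored = self._rows.get(doc_id)
            if stored is not None and stored[0] == text:
                continue  # unchanged text: skip the embedding call
            (updated if stored is not None else added).append(doc_id)
            self._rows[doc_id] = (text, self._embed(text))
        return added, updated


calls = []


def fake_embed(text):
    """Stand-in embedding function that records every call."""
    calls.append(text)
    return [float(len(text))]


store = InMemoryStore(fake_embed)
store.upsert_documents(["foo", "bar"], ["0", "1"])
added, updated = store.upsert_documents(["foo", "baz", "far"], ["0", "1", "2"])
print(added, updated)  # ['2'] ['1']
print(calls)           # ['foo', 'bar', 'baz', 'far']
```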

Refactor: Upsert semantics implemented

80f91572

lorenzofavaro marked this pull request as draft 1 year ago

HamzaNiaziBS 1 year ago

@eyurtsev Could you, please, review it?

efriis 1 year ago

Hey @lorenzofavaro ! Looks like this draft hasn't been worked on or marked as ready to review in a while. Is this something you'd still like to work on, or can I close it?

hwchase17 closed this 1 year ago

baskaryan reopened this 1 year ago

ccurme added community

ccurme added langchain

hwchase17 363 days ago

now a separate package

hwchase17 closed this 363 days ago

Reviewers

eyurtsev

Assignees

eyurtsev

Labels

Ɑ: vector store 🤖:improvement community langchain

Milestone

No milestone
