langchain
Feature: PGVector Collection Documents Update
#9829
Closed

lorenzofavaro, 1 year ago

Description

This PR enhances PGVector by adding an update function, update_documents(...).

Currently, updating the documents of a collection requires emptying the collection and filling it again, which can cause more calls to the embedding model than are actually needed. If a text chunk (and therefore its embedding) is already present in the current collection, it is deleted and then inserted again, so the embedding model is called to recompute an embedding the store already had.

The new feature identifies differences between input and existing documents. It requests new embeddings only for different documents, inserting them into the DB, and deletes missing ones.
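
The diffing step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Doc` is a hypothetical hashable stand-in for langchain's `Document`, and `plan_update` is an assumed helper name. Documents in `to_embed` are the only ones that would need a call to the embedding model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical, hashable stand-in for langchain's Document."""

    page_content: str
    metadata: tuple = ()  # (key, value) pairs, kept as a tuple so Doc is hashable


def plan_update(existing, incoming):
    """Return (to_embed, to_delete) for moving a collection from `existing` to `incoming`."""
    existing_set = set(existing)
    incoming_set = set(incoming)
    # Only documents not already stored need a new embedding.
    to_embed = [d for d in incoming if d not in existing_set]
    # Stored documents that are absent from the input are removed.
    to_delete = [d for d in existing if d not in incoming_set]
    return to_embed, to_delete


existing = [Doc("foo", (("page", "0"),)), Doc("bar", (("page", "1"),))]
incoming = [
    Doc("foo", (("page", "0"),)),
    Doc("baz", (("page", "1"),)),
    Doc("far", (("page", "2"),)),
]
to_embed, to_delete = plan_update(existing, incoming)
```

With the inputs above, "foo" is left untouched, so it costs no embedding call.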

Issue

#9461 (Add Functionality to Update Embeddings in pgvector)

Usage Example

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.pgvector import PGVector
from langchain.docstore.document import Document

COLLECTION_NAME = "..."
CONNECTION_STRING = '...'
documents = [Document(page_content="foo", metadata={"page": "0"}), 
             Document(page_content="bar", metadata={"page": "1"})]

# Create a new collection in pgvector from the documents
pgvector = PGVector.from_documents(
    embedding=OpenAIEmbeddings(),
    documents=documents,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True)

# Update/Add some documents
documents[1].page_content = "baz"
documents.append(Document(page_content="far", metadata={"page": "2"}))

# Call update function
pgvector.update_documents(documents)
lorenzofavaro Feature: PGVector Collection Documents Update
f21a5e07
dosubot added Ɑ: vector store
dosubot added 🤖:improvement
baskaryan assigned eyurtsev 1 year ago
hwchase17, 1 year ago

@lorenzofavaro - We are planning to use https://python.langchain.com/docs/modules/data_connection/indexing to do this type of updating. Would that satisfy your requirements?

eyurtsev requested changes on 2023-08-30
eyurtsev, 1 year ago

Thanks for the contribution! See if the indexing code shared by @hwchase17 will work for your use case!

libs/langchain/langchain/vectorstores/pgvector.py
            texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
        )

    def update_documents(
eyurtsev, 1 year ago

@lorenzofavaro take a look at the indexing code that @hwchase17 referenced. It should be able to solve this use case.
I'm OK with adding upsert_documents functionality, but it would also need to be added to the base class, would require the user to provide ids as part of the interface, and would need to implement upsert semantics.

The code should not assume that the content is identical just from the page_content, as there are use-cases when the relevant content lives in the metadata (e.g., metadata about two different products that share the same basic description). In this case, we'd want both documents to be indexed.
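
One way to account for this is to derive a document's identity from both its text and its metadata. The sketch below is an assumption for illustration, not code from this PR: `fingerprint` is a hypothetical helper, and it assumes the metadata is JSON-serializable.

```python
import hashlib
import json


def fingerprint(page_content: str, metadata: dict) -> str:
    """Hash the text together with the metadata (hypothetical helper)."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two products sharing one description but differing in metadata.
a = fingerprint("Cotton T-shirt", {"sku": "A-1"})
b = fingerprint("Cotton T-shirt", {"sku": "B-2"})
# The same document seen again.
c = fingerprint("Cotton T-shirt", {"sku": "A-1"})
```

Here `a` and `b` differ, so both products would be indexed, while `a` and `c` match, so the repeated document would not be embedded twice.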

lorenzofavaro, 1 year ago (edited 1 year ago) 👍 1

Thanks @hwchase17, I've looked at the indexing API and it does solve my problem.
That said, I wouldn't mind working on the solution proposed by @eyurtsev. I implemented the upsert semantics in the last commit.

Note that if upsert_documents is also added to the base class (VectorStore), it must be implemented in all the other vector stores as well, not only in PGVector.

If that's okay, I can start by adding this functionality for PGVector, add the abstract method upsert_documents to the base class in a separate commit, and work on the other vector stores in follow-up PRs.
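
A rough sketch of what such a base-class method could look like follows. All names here are hypothetical (`VectorStoreSketch` is not langchain's actual `VectorStore`, and the signature is an assumption), metadata is omitted for brevity, and an in-memory dict stands in for the database.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence, Tuple


class VectorStoreSketch(ABC):
    """Hypothetical base class; not langchain's actual VectorStore."""

    @abstractmethod
    def upsert_documents(self, texts: Sequence[str], ids: Sequence[str]) -> List[str]:
        """Insert new ids, overwrite changed ones, and return the ids written."""


class InMemoryStore(VectorStoreSketch):
    def __init__(self, embed: Callable[[str], List[float]]):
        self._embed = embed
        self.rows: Dict[str, Tuple[str, List[float]]] = {}

    def upsert_documents(self, texts, ids):
        if len(texts) != len(ids):
            raise ValueError("texts and ids must have the same length")
        written = []
        for text, doc_id in zip(texts, ids):
            current = self.rows.get(doc_id)
            if current is not None and current[0] == text:
                continue  # unchanged under this id: skip the embedding call
            self.rows[doc_id] = (text, self._embed(text))
            written.append(doc_id)
        return written


calls = []


def fake_embed(text):
    """Stand-in for an embedding model that records each call."""
    calls.append(text)
    return [float(len(text))]


store = InMemoryStore(fake_embed)
first = store.upsert_documents(["foo", "bar"], ["0", "1"])
second = store.upsert_documents(["foo", "baz", "far"], ["0", "1", "2"])
```

Because the method is abstract, every concrete store would have to provide its own implementation, which is the cost raised in the comment above.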

lorenzofavaro Refactor: Upsert semantics implemented
80f91572
lorenzofavaro marked this pull request as draft 1 year ago
HamzaNiaziBS, 1 year ago

@eyurtsev Could you please review it?

efriis, 1 year ago

Hey @lorenzofavaro ! Looks like this draft hasn't been worked on or marked as ready to review in a while. Is this something you'd still like to work on, or can I close it?

hwchase17 closed this 1 year ago
baskaryan reopened this 1 year ago
ccurme added community
ccurme added langchain
hwchase17, 363 days ago

now a separate package

hwchase17 closed this 363 days ago