Enabling Upserts For VectorStore (Using Langchain Code Node)

Jim_Le · June 1, 2024, 9:49pm

This is a quick tutorial on how you can use the rarely mentioned Langchain Code Node to support upserts for your favourite vectorstore.
Disclaimer: I’m still trying to wrap my head around this node, so this might not be the best/recommended way to achieve this. Feedback is very welcome. Use at your own peril!

Background

At time of writing, n8n’s Vectorstore nodes do not support upserts because you can’t define IDs to go with your embeddings. This means you’ll get duplicate vector documents if you try to run the same content through a second time. Not an issue if you are able to clear the vectorstore when you insert… but what if you just can’t and only want to update a few specific documents at a time? If this is you, then using the Langchain Code Node is one way to achieve this.
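To make the difference concrete, here is a minimal sketch using a toy in-memory store. `ToyStore` is made up for illustration and is not any real vectorstore client; it only shows why caller-supplied IDs prevent duplicates when the same content is run twice.

```javascript
// Toy in-memory stand-in for a vectorstore (hypothetical, illustration only).
class ToyStore {
  constructor() {
    this.records = new Map();
    this.autoId = 0;
  }

  // Insert without an ID: the store generates a fresh ID on every call.
  insert(text) {
    const id = `auto-${this.autoId++}`;
    this.records.set(id, text);
    return id;
  }

  // Upsert with a caller-supplied ID: the same ID overwrites the old record.
  upsert(id, text) {
    this.records.set(id, text);
    return id;
  }
}

const withoutIds = new ToyStore();
withoutIds.insert('hello world');
withoutIds.insert('hello world'); // second run of the same content

const withIds = new ToyStore();
withIds.upsert('page-1|0', 'hello world');
withIds.upsert('page-1|0', 'hello world'); // second run of the same content

console.log(withoutIds.records.size); // 2 (duplicate stored)
console.log(withIds.records.size); // 1 (record overwritten)
```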

Prerequisites

Self-hosted n8n. The Langchain Code Node is only available on the self-hosted version.
Ability to set the NODE_FUNCTION_ALLOW_EXTERNAL environment variable. For this tutorial, you kinda need this to access the Pinecone client library. I suspect the same to be true for other vectorstore services.
- For this tutorial, you’ll need to set the following: NODE_FUNCTION_ALLOW_EXTERNAL=@pinecone-database/pinecone
A Vectorstore that supports upserts. I think all the major ones supported by Langchain do but no harm in mentioning it here.
You’re not afraid of a little code. I’ve attached the template below so you can copy/paste the code as is but if that’s not enough, I’m happy to answer any questions in this thread or can offer paid support for more custom requirements.
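If you want to sanity-check the allowlist value before starting, something like the sketch below might help. It assumes NODE_FUNCTION_ALLOW_EXTERNAL holds a comma-separated list of module names; `isModuleAllowed` is a hypothetical helper written for this example and is not part of n8n.

```javascript
// Returns true if moduleName appears in a comma-separated allowlist string.
// Assumption: the env var value looks like "moduleA,moduleB".
function isModuleAllowed(allowlist, moduleName) {
  if (!allowlist) return false;
  const entries = allowlist.split(',').map((entry) => entry.trim());
  return entries.includes(moduleName);
}

// Example with a literal value; in practice you would pass
// process.env.NODE_FUNCTION_ALLOW_EXTERNAL from the n8n host environment.
console.log(
  isModuleAllowed('@pinecone-database/pinecone,lodash', '@pinecone-database/pinecone')
); // true
```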

Step 1. Add the Langchain Code Node

The Langchain Code Node is an advanced node designed to fill in for functionality n8n doesn’t currently support. As such, it’s pretty raw, light on documentation and intended for the technically inclined - especially those who have used Langchain outside of n8n.

In your workflow, open the nodes sidepanel.
Select Advanced AI → Other AI Nodes → Miscellaneous → Langchain Code
The Langchain Code Node should open in edit mode but if not, you can double click the node to bring up its editor.
Under Inputs,
- Add an input with type “Main”, max connections set to “1” and required set to “true”
- Add an input with type “Embedding”, max connections set to “1” and required set to “true”
Under Outputs, add an output with type “main”.
Go back to the canvas.
On the Langchain Code node you just created, add an Embedding Subnode. I’ve gone with OpenAI Embeddings but you can use any you like. We do this to save on writing extra code for this later.

Step 2. Writing the Langchain Code

Now the fun part! For this tutorial, we’ve set up a scenario where we want to vectorise a webpage to power our website search. The previous node supplies the webpage URL and our Langchain Code node will load and vectorise the webpage’s contents into our Pinecone Vectorstore. It’s a good use-case for using upserts because some webpages change often whilst others do not. We will be able to make frequent updates to this webpage’s vectors without duplicates or rebuilding the entire index. Sweet!

Open the Langchain Code Node in edit mode again.
Under Code → Add Code, select the Execute option.
- Tip: “Execute” for main node, “Supply Data” for subnodes.
In the Javascript - Execute textarea, we’ll enter the following code.
- Be sure to change <MY_API_KEY>, <MY_PINECONE_INDEX> and <MY_PINECONE_NAMESPACE> before running the code!

// 1. Get node inputs
const inputData = this.getInputData();
const embeddingsInput = await this.getInputConnectionData('ai_embedding', 0);

// 2. Setup Pinecone
const { PineconeStore } = require('@langchain/pinecone');
const { Pinecone } = require('@pinecone-database/pinecone');
const pinecone = new Pinecone({ apiKey: '<MY_API_KEY>' });
const pineconeIndex = pinecone.Index('<MY_PINECONE_INDEX>');
const pineconeNamespace = '<MY_PINECONE_NAMESPACE>';

const vectorStore = new PineconeStore(embeddingsInput, {
  namespace: pineconeNamespace || undefined,
  pineconeIndex,
});

// 3. load webpage url
const url = $json.url; // "https://docs.n8n.io/learning-path/"
const { CheerioWebBaseLoader } = require("langchain/document_loaders/web/cheerio");
const loader = new CheerioWebBaseLoader(url, { selector: '.md-content' });
const webpageContents = await loader.load();

// 4. initialise a text splitter (optional)
const { RecursiveCharacterTextSplitter } = require("langchain/text_splitter");
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 0,
});

// 5. Create smaller docs to vectorise (optional)
// - Depends on your use-case: smaller docs are usually preferred for RAG applications.
const docs = [];
for (const contents of webpageContents) {
  const { pageContent, metadata } = contents;
  // Collapse runs of whitespace (indentation, newlines) into single spaces so words are not fused together.
  const cleanContent = pageContent.replace(/\s+/g, ' ').trim();
  const fragments = await splitter.createDocuments([cleanContent]);
  docs.push(...fragments.map((fragment, idx) => {
    fragment.metadata = { ...fragment.metadata, ...metadata };
    
    // 5.1 Our IDs look like this "https://docs.n8n.io/learning-path/|0", "https://docs.n8n.io/learning-path/|1"
    // but is only specific to this tutorial, use whatever suits you but make sure IDs are unique!
    fragment.id = `${metadata.source}|${idx}`;
    return fragment;
  }));
}

// 6. Define IDs to enable upserts.
// - You can now run this as many times without worrying about duplicates!
const ids = docs.map(doc => `${doc.id}`);
await vectorStore.addDocuments(docs, ids);

// 7. Return for further processing
return docs.map(doc => ({ json: doc }));
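One case the "source|index" ID scheme does not handle by itself: if a page shrinks (say from 5 chunks to 3), the upsert overwrites chunks 0 to 2 but chunks 3 and 4 from the earlier run stay in the index. Below is a minimal sketch for working out those leftover IDs. It assumes you keep track of the previous chunk count somewhere yourself; `chunkIds` and `staleChunkIds` are hypothetical helpers for illustration.

```javascript
// Build the same "<source>|<index>" IDs used in the tutorial code.
function chunkIds(source, count) {
  return Array.from({ length: count }, (_, idx) => `${source}|${idx}`);
}

// IDs written by a previous run that the current run will not overwrite.
// Returns an empty array when the page has not shrunk.
function staleChunkIds(source, previousCount, currentCount) {
  return chunkIds(source, previousCount).slice(currentCount);
}

const stale = staleChunkIds('https://docs.n8n.io/learning-path/', 5, 3);
console.log(stale);
// [ 'https://docs.n8n.io/learning-path/|3', 'https://docs.n8n.io/learning-path/|4' ]
```

The resulting IDs could then be passed to whatever delete-by-ID call your vectorstore client offers.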

Step 3. We’re Done!

We’ve now successfully built our own custom Vectorstore node which supports upserts! Pretty rad if you ask me. I think I’ll experiment a bit more with the Langchain Code node and see what other fun things it’ll allow me to do… until next time!

This code can be modified to work with other popular vectorstores such as PgVector, Redis, Qdrant, Chroma etc. Change the client library (and remember to add it to NODE_FUNCTION_ALLOW_EXTERNAL).
Unfortunately, it doesn’t seem like you can access credentials from inside the node so you’ll either have to hardcode your API keys/tokens as we’ve done here or pass them through via variables maybe?
The document to vectorise doesn’t necessarily need to be loaded in the Langchain Code. You can bring it in through previous nodes and use this.getInputData() to access it.
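As a rough sketch of that last point, the helper below turns n8n-style input items into document-like objects with an ID attached. The `url` and `text` field names are assumptions about what the previous node outputs, and `itemsToDocs` is a hypothetical helper; adjust both to your own data.

```javascript
// Convert n8n-style items ({ json: {...} }) into document-like objects.
// Assumption: each item carries one page, with `url` and `text` fields.
function itemsToDocs(items) {
  return items.map((item) => ({
    pageContent: item.json.text,
    metadata: { source: item.json.url },
    // One document per page here, so the URL alone is a unique ID.
    id: item.json.url,
  }));
}

// Stand-in for the array that this.getInputData() would return inside the node.
const exampleItems = [
  { json: { url: 'https://example.com/a', text: 'Page A contents' } },
  { json: { url: 'https://example.com/b', text: 'Page B contents' } },
];

const exampleDocs = itemsToDocs(exampleItems);
console.log(exampleDocs.map((doc) => doc.id));
```

Inside the node, you would call it with this.getInputData() and then pass the documents and their IDs to vectorStore.addDocuments as in step 6.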

Cheers,
Jim
Follow me on LinkedIn or Twitter.

Demo Template


ridingthedragon · July 16, 2024, 2:18pm

Awesome @Jim_Le - Thanks so much for sharing, I hadn’t thought about it that way. It is a nice workaround to upsert. I just created this workflow on my localhost n8n and will start playing around with it.

Have you made any updates or come across any new thoughts since you posted this in June?

Jim_Le · July 16, 2024, 2:24pm

@ridingthedragon Thanks!
You may not need to use this hack as there is a planned update to the Pinecone and Supabase vector store nodes to allow for updating/upserting built-in. So definitely watch out for this in the upcoming releases.

If you’re using something like Qdrant however, this is probably still a good technique to learn. Having learned more about how n8n works under the hood since June, I now recommend a different way to achieve the same result, with the following advantages:

A lot less code by using existing nodes
Safer than hardcoding/exposing credentials by piggy-backing off the existing vector store node.

ridingthedragon · July 17, 2024, 8:46am

@Jim_Le Ah, that’s good to know about Pinecone and Supabase. Are you involved in developing these nodes?

Re: Qdrant. I actually use it in my localhost setting as you can self-host the Qdrant vectorstore. So for the “privacy-first” setting, your hack still stays relevant.

And the updated design is elegant. Thanks for sharing.
I’ve been playing around with the HTML node (instead of cheerio) to give users a simpler way to define which parts of a website to upsert.
But I guess a more robust setting would be done with puppeteer or similar web scraping libraries.

I’ll play around with this and share once I have something useful.

Alex5 · October 21, 2024, 6:28pm

Hi, Jim. According to the Qdrant docs, to make upserts you just need to set the same ID. But I don’t see how I can set the ID in n8n. Is it even possible without custom code?

Morriz · November 7, 2024, 11:08pm

yes, that is possible by setting metadata with expression values

Morriz · November 7, 2024, 11:10pm

but what I am missing here @Jim_Le, is the use case that most of us have:

we want to check the page’s lastmod datetime to check if we need to upsert at all, so all we need is some logic sandbox in the vector store node that allows us to do so

Jim_Le · November 9, 2024, 12:05pm

@Alex5 If you mean the way the docs describe it, then no, it’s not possible outside of custom code. But @Morriz’s solution is definitely the next best thing. I did post a short bit on why I think you should avoid upserting if possible: How to set ID in Qdrant points in n8n - #3 by Jim_Le

@Morriz Hmm I wonder if doing this logic check outside the vector store would be a better idea? For example and assuming you’re working with websites, periodically downloading the sitemap.xml and tracking diffs between each fetch, isolating the pages which have changed and then running the upserts.
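A minimal sketch of that diffing idea, assuming each sitemap fetch has already been parsed into a Map of URL to lastmod string. The parsing itself is left out, and `diffSitemaps` is a hypothetical helper written for this example.

```javascript
// Compare two sitemap snapshots, each a Map of URL -> lastmod string.
function diffSitemaps(previous, current) {
  const changed = [];
  const removed = [];
  for (const [url, lastmod] of current) {
    // New pages and pages whose lastmod differs both need (re)vectorising.
    if (previous.get(url) !== lastmod) changed.push(url);
  }
  for (const url of previous.keys()) {
    // Pages that disappeared from the sitemap may need their vectors deleted.
    if (!current.has(url)) removed.push(url);
  }
  return { changed, removed };
}

const previousFetch = new Map([
  ['https://example.com/a', '2024-11-01'],
  ['https://example.com/b', '2024-11-01'],
  ['https://example.com/c', '2024-10-01'],
]);
const currentFetch = new Map([
  ['https://example.com/a', '2024-11-01'],
  ['https://example.com/b', '2024-11-08'],
  ['https://example.com/d', '2024-11-09'],
]);

const sitemapDiff = diffSitemaps(previousFetch, currentFetch);
console.log(sitemapDiff.changed); // [ 'https://example.com/b', 'https://example.com/d' ]
console.log(sitemapDiff.removed); // [ 'https://example.com/c' ]
```

Only the URLs in `changed` would go through the upsert; the `removed` list is there so deleted pages are not forgotten.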

Morriz · November 9, 2024, 12:19pm

Yes, that is what I am doing now. (Actually I delete and insert, as n8n has no notion of upsert because it has no access to IDs.)

I have an idempotent workflow that I will share here soon after cleaning it up

Morriz · November 9, 2024, 12:22pm

“yes, that is possible by setting metadata with expression values”

So what I said earlier is actually NOT possible, as the ID is not in metadata.

Morriz · November 9, 2024, 1:02pm

@Jim_Le here is a full example to get WooCommerce products and WordPress pages into Qdrant with an idempotent workflow. This allows for a cronjob to trigger the workflow:

As you can see it requires a lot of plumbing, so a UI solution in the n8n vectorstore node would be preferable

Jim_Le · November 9, 2024, 2:51pm

@Morriz thanks for sharing.

I don’t have the full specifications or the usual number of products, so forgive me if I assume too much, but for this type of workflow, personally I wouldn’t bother with the upsert. Just clear the vector store and reinsert everything. The reason is that if pages or products are removed/deleted, it’s likely your vector store is going to get out of sync.

But here’s an alternative implementation which uses Redis instead to keep track of modified items. It does add another component to the stack but you can also use the excellent KV storage community node instead.
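For readers without the workflow to hand, here is one way such tracking could look. This is a sketch under assumptions and not necessarily what the workflow above does: a plain Map stands in for Redis or the KV storage node, and a content hash decides whether an item was modified. `hasChanged` is a hypothetical helper.

```javascript
const { createHash } = require('node:crypto');

// `store` is any Map-like key-value store; a plain Map stands in for Redis here.
// Returns true (and records the new hash) when the content differs from the last run.
function hasChanged(store, key, content) {
  const hash = createHash('sha256').update(content).digest('hex');
  if (store.get(key) === hash) return false;
  store.set(key, hash);
  return true;
}

const seen = new Map();
console.log(hasChanged(seen, 'product-42', 'Blue mug, 300ml')); // true (first time seen)
console.log(hasChanged(seen, 'product-42', 'Blue mug, 300ml')); // false (unchanged)
console.log(hasChanged(seen, 'product-42', 'Blue mug, 350ml')); // true (modified)
```

A real Redis client would be asynchronous, so the get and set calls would need to be awaited.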

brauliodiasribeiro · November 10, 2024, 8:29pm

@Jim_Le maybe you can help me.
I need to solve a similar problem about inserting into a database.
I loaded a RAG model with Supabase.

I’m trying to switch the Supabase node to the Postgres node, but I don’t know how to execute a function on the Postgres node.
Maybe you can show me a way.

Supabase
Delete
metadata->>file_id=like.{{ $json.file_id }}

Insert
match_documents

Read
match_documents

Postgres
Delete
DELETE FROM documents
WHERE metadata->>'file_id' ILIKE '%' || '{{ $json.file_id }}' || '%';

Insert
I left the “metadata” function blank

Read
I don’t know how to indicate the “match” query

Jim_Le · November 11, 2024, 9:20am

Hey @brauliodiasribeiro

Would it be okay if you post a new topic for your question? That way I’m sure you’ll get more of the community able to help with your issue.

Thanks!

brauliodiasribeiro · November 11, 2024, 10:52am

Yes, sure…tks for your attention.
For those who want to follow

cagri · November 12, 2024, 11:30am

This helped me a lot! Pretty elegant and simple.

Using this method, I created an automation where I crawl my website’s URLs, check the lastmod info in the sitemap, and if lastmod is newer than the last one, I execute the Upsert workflow.

I’ll publish the final version of my workflows as a template and share them here.