I am trying to build a training dataset using OpenAI’s Embeddings model and use it later to classify text and recommend users from a new dataset. For example, I have a Google Sheets file containing the user data (name, age, qualification, skills, degree, etc.).
Each user has an individual row, and this data must be passed to the OpenAI node to generate and search embeddings. I am not able to find this option in the OpenAI node, nor am I able to format the Google Sheets node to send the data in full. Does the data need to be sent as an array or something?
Also, I am trying to use a ‘Set’ node to combine the values of all the fields for a particular user.
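As a reference for that combination step, here is a minimal sketch in plain JavaScript of what the Set node is meant to produce: one text string per user built from their fields. The field names below are hypothetical examples; adjust them to the actual sheet columns.

```javascript
// Combine a user's fields into a single text string suitable for embedding.
// Field names (name, age, qualification, skills, degree) are placeholders.
function combineFields(user) {
  return [
    `Name: ${user.name}`,
    `Age: ${user.age}`,
    `Qualification: ${user.qualification}`,
    `Skills: ${user.skills}`,
    `Degree: ${user.degree}`,
  ].join('; ');
}

const example = combineFields({
  name: 'Jane',
  age: 28,
  qualification: 'MSc',
  skills: 'Python, SQL',
  degree: 'Computer Science',
});
console.log(example);
// "Name: Jane; Age: 28; Qualification: MSc; Skills: Python, SQL; Degree: Computer Science"
```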
The Expected Output:
Generate the embeddings and store them under a property named ‘Embeddings’ (just like the Set node stores its output under the name ‘data’), which can then be used for search and classification.
If I share a new set of CSV files with similar data, the model should classify it and search the embeddings to output the recommended users’ registration numbers. For example, in a new Google Sheets file where I give the desired qualification, skills, and preferred age and leave the last column empty, the workflow should take each row, search against the generated embeddings, and return the suggested user names in that last column.
So it looks like you’re already generating the embeddings and are now looking for a way to add them to your Google Sheet. Did I get that right?
From looking at your data it seems to me the Register Number field in your dataset contains unique values. So you probably want to use the “Update” operation of the Google Sheets node in the next step, matching each value against the respective Register Number field.
If you’re expecting a different output from the OpenAI node, it could make sense to disable the “Simplify” option so n8n doesn’t alter the API response.
No, I am not able to generate the embeddings in the first place. Some of the basic issues I am running into:
How do I send the whole CSV (all rows) to the model at once? Currently only a single row is sent to the model.
I am unable to find a dedicated embeddings option in n8n. (The model is listed under the ‘Text’ resource, which only offers ‘Complete’, ‘Edit’, and ‘Moderate’, plus support for a custom API call.) So I am guessing I should try a custom call, but I have never used that option before, so I am not sure how to use it.
Your suggestion about the Register Number field is perfect for storing the data. But how do I run the model for each entry? Won’t we need to run the model once over all the rows to get the complete set of embeddings?
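One detail worth knowing here: the OpenAI embeddings endpoint (`POST /v1/embeddings`) accepts an array of input strings, so a single custom API call can embed every row at once rather than one call per row. Below is a hedged sketch of building such a request, e.g. for an HTTP Request node; the model name and row texts are placeholders, and `YOUR_API_KEY` must be replaced with a real credential.

```javascript
// Sketch: build ONE embeddings request covering all rows.
// The /v1/embeddings endpoint accepts an array under "input",
// so each row's combined text becomes one array entry.
function buildEmbeddingsRequest(texts, apiKey) {
  return {
    url: 'https://api.openai.com/v1/embeddings',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: 'text-embedding-ada-002', // placeholder model name
      input: texts, // one entry per spreadsheet row
    }),
  };
}

const req = buildEmbeddingsRequest(
  ['Qualification: MSc; Skills: Python, SQL', 'Qualification: BEng; Skills: Java'],
  'YOUR_API_KEY'
);
console.log(req.body);
```

The response then contains one embedding per input, in the same order, so each vector can be matched back to its Register Number.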
We also need a way to store the generated embeddings in a service like Weaviate or a similar vector database; we can’t rely on Google Sheets alone for production operations. (This is something for later.)
The main question is how to compare the embeddings within the n8n app itself. For example, say we have generated embeddings for the user data and we provide a new sheet for the workflow to fill in the user names. It should compare the inputs in the new sheet (such as desired skills) against the previously generated embeddings so that we can recommend users.
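The comparison step is a nearest-neighbour search: embed the query text, then rank the stored user embeddings by cosine similarity. This can run in a plain JavaScript function (e.g. inside a Code node); the `users` shape with `registerNumber` and `embedding` fields is a hypothetical layout, not an n8n built-in.

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored users by similarity to a query embedding and
// return the top-K Register Numbers (hypothetical record shape).
function recommend(queryEmbedding, users, topK = 3) {
  return users
    .map(u => ({ ...u, score: cosine(queryEmbedding, u.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(u => u.registerNumber);
}

// Tiny 3-dimensional vectors for illustration; real embeddings
// from the API have hundreds or thousands of dimensions.
const users = [
  { registerNumber: 'R001', embedding: [1, 0, 0] },
  { registerNumber: 'R002', embedding: [0, 1, 0] },
];
console.log(recommend([0.9, 0.1, 0], users, 1)); // → [ 'R001' ]
```

For a handful of rows this brute-force scan is fine; a vector store like Weaviate becomes worthwhile once the dataset grows.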
I apologize for asking more unrelated questions in the community, but my organization uses n8n to automate many other things, so we are looking to implement this within n8n itself.