Here how can I convert a scraped text data to json formatting with gemini.
I am assuming you can get your output properly from 1st agent and only problem is structuring the second agent.
In my case I tried to parse second hand e-bike listing for ebay. For second hand listings, users doesn’t provide same structured info and also content is German. So I used Gemini to understand listing and structure it json. On browser I tested and got JSON correctly. When I done on API with n8n, I can get the same result. I add my parser after agent with JS code. I will share my chat and JS code.
CHAT:
You are a professional data extraction assistant for eBike listings. Your task is to analyze the provided description and extract structured information about the eBike. If the description is in a foreign language (e.g., German), translate it internally to English before processing.
Your key objectives are:
1-Extract relevant details about the eBike in a structured JSON format.
2-Identify any negative information such as damage, issues, missing components, abnormal wear, or any statement that may suggest a potential concern regarding the bike's condition, documentation, or usage.
3-If the model year is described in relative terms (e.g., "five years old" or "3 years ago"), calculate the Year_Model by subtracting the duration from the current year. If a specific year is not mentioned and cannot be inferred, return "Not Provided."
4-If any essential information is missing (e.g., battery capacity, motor model), return "Not Provided."
5-Include any important details like warranty status, maintenance history, available documents, or reported issues in the "Additional_Details" field.
6-Specifically list any negative information in the "Negative_Details" field.
7-Output only the structured JSON with English values for each key. Do not add any explanation or additional comments.
**Output Format**
{
"Document": {
"Available": "Yes/No",
"Type": "e.g., Purchase Receipt, Warranty, User Manual, etc."
},
"Brand_Model": "Brand and Model Name (e.g., Haibike XDURO AllMtn)",
"Year_Model": "Year (e.g., 2022 or Not Provided)",
"Motor_Model": "Motor Model (e.g., Bosch Performance Line CX)",
"Motor_Volt": "Motor Volt (e.g., 36V, 48V, 56V or Not Provided)",
"Battery_Power": "Battery Capacity (e.g., 500Wh, 625Wh, or Not Provided)",
"Milage_KM": "Milage in KM (e.g., 1200 km or Not Provided)",
"Additional_Details": "Any other important details such as warranty status, maintenance history, available accessories, or anything else noteworthy.",
"Negative_Details": "Any negative information, such as damage, issues, missing components, abnormal wear, or concerns raised by the seller in the description. If no negative details are found, return 'None.'"
}
Guidelines:
1- Translation: If the description is in a foreign language, translate it to English before extracting the data. Record the extracted details in English in the JSON output.
2- Completeness: Ensure the JSON output covers all key details. If information is not provided in the description, explicitly mark it as "Not Provided."
3- Negative Information: Pay special attention to any statements indicating potential problems or concerns with the eBike, including missing documentation, signs of damage, or abnormal wear. Capture these details in the "Negative_Details" field.
4- Document Verification: Check if the seller mentions having important documents (e.g., purchase receipts, warranty, user manuals). If such documents are missing, this should also be noted in the output.
5- Exact Year Mentioned: If the description includes a specific year like 2018 or 2020, use that directly for Year_Model.
6- Relative Time Mentioned: If the description includes phrases like "five years old" or "bought 3 years ago", calculate the year by subtracting the time from the current year (e.g., if it's 2025 and the bike is "five years old," the model year is 2020).
7- No Time or Year Mentioned: If neither a specific year nor a relative time reference is provided, return "Not Provided" for Year_Model.
TITLE of LISTING: E-Bike Flyer c 5.1, 26 Räder, 45 Rahmen, ges.500km gefahren
**LISTING Description**
Schweizer Markenfahrrad !
Neuwertiger Zustand ohne Probleme,
Tiefeinsteiger mit Bordkomputer, abgelesene 508km.
Mittelmotor 36 V, 432 Wh, 8 Gang Narbenschaltung,
freilauf, hydraulische Felgenbremsen v.+hi.
Schiebehilfe, Berganfahrhilfe, Felgenschloß,
Komfortabel zu fahren, mit Anleitung
und alle Papiere......Fahrradpaß
5 Jahre alt, wegen Krankheit nur wenig genutzt,
Batterie immer im warmen Wohnzimmer gelagert.
Fahrleistung bis zu 80 Km nie genutzt.
Anschauen lohnt sich, "sieht aus wie neu"
LP: 2800
OUTPUT from AI
JS Code for Formating the output
// Loop over input items and transform the output JSON
const outputArray = [];
for (const item of $input.all()) {
// Clean up the JSON string before parsing
let cleanOutput = item.json.output.replace(/```json|```/g, "").trim();
let extractedJson = JSON.parse(cleanOutput);
// Format and modify the extracted JSON
outputArray.push({
"Document": {
"Available": extractedJson.Document.Available,
"Type": "Fahrradpaß, Anleitung"
},
"Brand_Model": extractedJson.Brand_Model,
"Year_Model": "2020", // Manually set the new Year_Model
"Motor_Model": "Not Provided", // Manually set default value
"Motor_Volt": extractedJson.Motor_Volt,
"Battery_Power": extractedJson.Battery_Power,
"Milage_KM": extractedJson.Milage_KM,
"Additional_Details": "Mittelmotor, 8 Gang Narbenschaltung, freilauf, hydraulische Felgenbremsen v.+hi., Schiebehilfe, Berganfahrhilfe, Felgenschloß",
"Negative_Details": extractedJson.Negative_Details
});
}
return outputArray;
Result:
[
{
"Document": {
"Available": "Yes",
"Type": "Fahrradpaß, Anleitung"
},
"Brand_Model": "Flyer c 5.1",
"Year_Model": "2020",
"Motor_Model": "Not Provided",
"Motor_Volt": "36V",
"Battery_Power": "432Wh",
"Milage_KM": "508 km",
"Additional_Details": "Mittelmotor, 8 Gang Narbenschaltung, freilauf, hydraulische Felgenbremsen v.+hi., Schiebehilfe, Berganfahrhilfe, Felgenschloß",
"Negative_Details": "None."
}
]
Same test result on UI too:
Workflow:
Problem probably you are not cleaning up json format. Because AI returns this:
{
"response": {
"generations": [
[
{
"text": "```json\n{\n\"Document\": {\n\"Available\": \"Yes\",\n\"Type\": \"User Manual, All Papers, Bicycle Passport\"\n},\n\"Brand_Model\": \"Flyer c 5.1\",\n\"Year_Model\": \"2019\",\n\"Motor_Model\": \"Not Provided\",\n\"Motor_Volt\": \"36V\",\n\"Battery_Power\": \"432Wh\",\n\"Milage_KM\": \"508 km\",\n\"Additional_Details\": \"Low entry bike with on-board computer, 8-speed hub gears, freewheel, hydraulic rim brakes front and rear, push assistance, hill start assist, rim lock. Comfortable to ride. Battery always stored in the warm living room. Mileage up to 80 km never used. Original price: 2800\",\n\"Negative_Details\": \"None.\"\n}\n```",
"generationInfo": {
"finishReason": "STOP"
}
}
]
]
},
"tokenUsageEstimate": {
"completionTokens": 164,
"promptTokens": 978,
"totalTokens": 1142
}
}
NOTE:
My perspective is that let the AI generate JSON format and respons should be only the JSON format like my example, then try to extract that json from “text” (you can see above json AI return) with cleaning extras with code (like JS that I provided above)
If my reply solves your problem, please remember to mark it as a solution.