Hey, first time here posting a question.
I have a huge PDF file of a floor plan. It’s around 17 MB.
I need to extract specific data from the PDF. The thing is, since it’s so big and has so many irrelevant details, the simple “information extractor” isn’t doing the job.
The other thing is, the relevant information that I need to extract is always located on the right side of the PDF file.
So maybe there is a way to make the information extractor scan only that specific area?
I hope I was clear, I’ll be glad to answer questions if needed.
Thanks for the help!
Your question is simple, but the answer is not. After a bit of searching, it seems Tesseract could be your best option, but you’ll need some custom implementation for this to work. According to Perplexity, you can extract x,y coordinates using Tesseract. The other option is to split your PDF into multiple PDFs, page by page, to reduce the size of the actual document. The question is, does the information you need always appear on the same page of the document?
Thank you for your reply!
Yes, it’s always located in the same area (it’s a right panel with all the details I need).
I tried to use the pdf.co API with coordinates but I got lost in the process. It seems very complicated…
Yeah, unfortunately working with PDFs is not very simple.
Here’s what Perplexity gave me:
To extract text specifically from the right side of each page in a large document using Tesseract, you typically follow this workflow:
- Convert PDF pages to images (if starting from PDF).
- Determine the coordinates for the right-side region.
- Crop each image to that region.
- Run Tesseract OCR on the cropped region.
Below is a practical example using Python with OpenCV and pytesseract (a Python wrapper for Tesseract):
```python
import cv2
import numpy as np  # needed for np.array below
import pytesseract
from pdf2image import convert_from_path

# Step 1: Convert PDF to images (one image per page)
pages = convert_from_path('large_document.pdf', dpi=300)

for idx, page in enumerate(pages):
    # Convert PIL image to OpenCV format
    img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2BGR)
    height, width, _ = img.shape

    # Step 2: Define the right-side region (e.g., rightmost 30% of the page)
    left = int(width * 0.7)
    top = 0
    region_width = int(width * 0.3)
    region_height = height

    # Step 3: Crop the right-side region (NumPy slices are [rows, columns])
    cropped_img = img[top:top + region_height, left:left + region_width]

    # Step 4: OCR on the cropped region
    text = pytesseract.image_to_string(cropped_img, config='--oem 1 --psm 6')
    print(f"Page {idx + 1} right side text:\n{text}\n")
```
Key points:
- The region is defined by pixel coordinates: `left`, `top`, `width`, and `height`. For the right side, set `left` to a high value (e.g., 70% of the width) and `width` to cover the remainder [1][2][3].
- This approach works for each page in a multi-page document.
- You can adjust the percentage or pixel values based on your document’s layout.
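To make the percentage easy to adjust, the region arithmetic can be pulled into a small helper. This is only a minimal sketch: `right_region` and its `fraction` parameter are hypothetical names, not part of Tesseract or pytesseract. The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects.

```python
def right_region(width, height, fraction=0.3):
    """Return (left, top, right, bottom) for the rightmost `fraction` of an image.

    `fraction` is a hypothetical tuning knob: 0.3 means the rightmost 30%.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Compute the strip width first, then measure it back from the right edge,
    # so the box always ends exactly at the image border.
    region_width = round(width * fraction)
    left = width - region_width
    return left, 0, width, height


# Example: rightmost 30% of a 1000 x 2000 px page
print(right_region(1000, 2000, fraction=0.3))  # (700, 0, 1000, 2000)
```

With a PIL page from `convert_from_path`, this could be used as `page.crop(right_region(*page.size))`. With an OpenCV/NumPy image, unpack the tuple and slice `img[top:bottom, left:right]`.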
For Tesseract.js (JavaScript):
You can use the `rectangle` parameter:
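```javascript
const { createWorker } = require('tesseract.js');

// Top-level await is not available in CommonJS, so wrap the calls in an async function.
(async () => {
  // Tesseract.js v5+ takes the language here; older versions instead need
  // worker.loadLanguage('eng') and worker.initialize('eng') after createWorker().
  const worker = await createWorker('eng');

  // Example: rightmost 300px of an 800px-wide image
  const { data: { text } } = await worker.recognize('page.png', {
    rectangle: { left: 500, top: 0, width: 300, height: 1200 }
  });
  console.log(text);

  await worker.terminate();
})();
```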
Adjust the rectangle as needed for your image size [1][3].
This method allows you to efficiently extract only the content from the right side of each page in a large document using Tesseract.
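If cropping ends up cutting through words at the panel boundary, another route is the x,y-coordinate approach mentioned earlier: OCR the whole page, then keep only the words whose left edge falls inside the right panel. The sketch below assumes the input has the shape of `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)`, which returns parallel lists under keys such as `text`, `left`, and `conf`. `words_right_of` is a hypothetical helper, and the sample values are made up for illustration.

```python
def words_right_of(data, min_left, min_conf=0):
    """Keep OCR words whose left edge is at or beyond `min_left` pixels.

    `data` is assumed to look like pytesseract's image_to_data DICT output:
    parallel lists under 'text', 'left' and 'conf'.
    """
    words = []
    for text, left, conf in zip(data["text"], data["left"], data["conf"]):
        # Skip empty entries (layout rows that carry no word) and
        # anything below the confidence threshold.
        if text.strip() and left >= min_left and float(conf) >= min_conf:
            words.append(text)
    return words


# Made-up sample standing in for real OCR output
sample = {
    "text": ["Kitchen", "", "Scale", "1:100"],
    "left": [120, 0, 1450, 1520],
    "conf": [91, -1, 88, 95],
}
print(" ".join(words_right_of(sample, min_left=1400)))  # Scale 1:100
```

The trade-off is speed: this runs OCR on the full page, so for a very large floor plan the cropping approach is likely to be faster.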
1. https://app.studyraid.com/en/read/15018/519357/extracting-text-from-specific-image-areas
2. How to extract data using Tesseract OCR?
3. https://app.studyraid.com/en/read/15018/519356/defining-rectangle-regions-for-targeted-ocr
4. https://stackoverflow.com/questions/53185255/select-part-of-text-that-was-extracted-using-the-tesseract-ocr
5. https://www.youtube.com/watch?v=tFW0ExG4QZ4
6. https://www.reddit.com/r/MachineLearning/comments/1f87yfg/p_tesseract_ocr_has_anybody_used_it_for_reading/
7. Tesseract OCR Guide: Exploring Capabilities & Performance
8. Tesseract Python: Extract text from images using Tesseract OCR | Nutrient
9. Python OCR Tutorial: Tesseract, Pytesseract, and OpenCV
10. Tesseract Ocr: How do I use tesseract OCR to create bounding boxes? - OneLinerHub