Extract from PDF - huge file, need to extract only from a specific area of the PDF

Erez_Dagan · June 28, 2025, 10:31am

Hey, first time here posting a question.
I have a huge pdf file of a floor plan. It’s around 17 MB.
I need to extract specific data from the PDF. The thing is, since it’s so big and has so many irrelevant details, the simple “information extractor” isn’t doing the job.
The other thing is, the relevant information that I need to extract is always located on the right side of the PDF.
So maybe there is a way to make the information extractor scan only that specific area?

I hope I was clear. I’ll be glad to answer questions if needed.
Thanks for the help!

Wouter_Nigrini · June 28, 2025, 12:12pm

Your question is simple, but the answer is not. After a bit of searching, it seems Tesseract could be your best option, but you’ll need to do a bit of custom implementation for this to work. According to Perplexity, you can extract x,y coordinates using Tesseract. The other option is to split your PDF into multiple PDFs, page by page, to reduce the size of the actual document. The question is, does the information you need always appear on the same page of the document?
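As a rough illustration of the coordinate idea: pytesseract's `image_to_data` can return each recognized word together with its `left`, `top`, `width`, and `height` in pixels. The sketch below assumes that dictionary layout (`Output.DICT`) and keeps only the words whose left edge falls in the right part of the page. `words_right_of`, `min_left`, and the sample data are hypothetical, made up for illustration.

```python
def words_right_of(data, min_left):
    """Return (text, left, top) for each non-empty word starting at or beyond min_left.

    `data` is assumed to have the shape produced by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT),
    i.e. parallel lists under keys such as 'text', 'left', and 'top'.
    """
    hits = []
    for text, left, top in zip(data["text"], data["left"], data["top"]):
        # Tesseract emits empty strings for layout-only entries; skip them
        if text.strip() and left >= min_left:
            hits.append((text, left, top))
    return hits


# Hand-made sample standing in for real OCR output
sample = {
    "text": ["Kitchen", "", "Scale", "1:100"],
    "left": [120, 0, 2100, 2230],
    "top": [300, 0, 80, 80],
}
print(words_right_of(sample, min_left=2000))
```

With real output you would pick `min_left` from the page width, for example 70% of it, instead of a fixed pixel value.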

Erez_Dagan · June 28, 2025, 12:17pm

Thank you for your reply!

Yes, it’s always located in the same area (it’s a right panel with all the details I need).
I tried to use the pdf.co API with coordinates, but I got lost in the process. It seems very complicated…

Wouter_Nigrini · June 28, 2025, 12:45pm

Yeah, unfortunately working with PDFs is not very simple.

Here’s what Perplexity gave me:

To extract text specifically from the right side of each page in a large document using Tesseract, you typically follow this workflow:

1. Convert PDF pages to images (if starting from PDF).
2. Determine the coordinates for the right-side region.
3. Crop each image to that region.
4. Run Tesseract OCR on the cropped region.

Below is a practical example using Python with OpenCV and pytesseract (a Python wrapper for Tesseract):

python

import cv2
import numpy as np  # needed for np.array() when converting the PIL page below
import pytesseract
from pdf2image import convert_from_path

# Step 1: Convert PDF to images (one image per page)
pages = convert_from_path('large_document.pdf', dpi=300)
for idx, page in enumerate(pages):
    # Convert PIL image to OpenCV format
    img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2BGR)
    height, width, _ = img.shape

    # Step 2: Define the right-side region (e.g., rightmost 30% of the page)
    left = int(width * 0.7)
    top = 0
    region_width = int(width * 0.3)
    region_height = height

    # Step 3: Crop the right-side region
    cropped_img = img[top:top+region_height, left:left+region_width]

    # Step 4: OCR on the cropped region
    text = pytesseract.image_to_string(cropped_img, config='--oem 1 --psm 6')
    print(f"Page {idx+1} right side text:\n{text}\n")

Key points:

The region is defined by pixel coordinates: left, top, width, and height. For the right side, set left to a high value (e.g., 70% of the image width) and width to cover the remainder.
This approach works for each page in a multi-page document.
You can adjust the percentage or pixel values based on your document’s layout.
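To make the percentage-to-pixels step concrete, here is a minimal sketch of a helper that turns a page size and a fraction into the left/top/width/height values described above. `right_side_box` and its `fraction` parameter are hypothetical names, not part of Tesseract or pytesseract.

```python
def right_side_box(width, height, fraction=0.3):
    """Return the pixel box covering the rightmost `fraction` of a page.

    The result uses the same keys as the region description above:
    left, top, width, height (all in pixels).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    region_width = round(width * fraction)
    # Derive left from the region width so left + width always equals the page width
    left = width - region_width
    return {"left": left, "top": 0, "width": region_width, "height": height}


# Example: an 800x1200 px page, keeping the rightmost 37.5%
box = right_side_box(800, 1200, fraction=0.375)
print(box)
```

Computing `left` from the rounded region width avoids leaving a one-pixel strip uncovered at the right edge. With NumPy/OpenCV images, the crop would then be `img[box["top"]:box["top"] + box["height"], box["left"]:box["left"] + box["width"]]`.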

For Tesseract.js (JavaScript):
You can use the rectangle parameter:

javascript

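const { createWorker } = require('tesseract.js');

// CommonJS modules do not allow top-level await, so the calls run inside an async function
(async () => {
  const worker = await createWorker();

  // Older Tesseract.js releases need these two calls; recent versions
  // take the language as createWorker('eng') instead
  await worker.loadLanguage('eng');
  await worker.initialize('eng');

  // Example: rightmost 300px of an 800px-wide image
  const { data: { text } } = await worker.recognize('page.png', {
    rectangle: { left: 500, top: 0, width: 300, height: 1200 }
  });
  console.log(text);

  await worker.terminate();
})();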

Adjust the rectangle as needed for your image size.

This method allows you to efficiently extract only the content from the right side of each page in a large document using Tesseract.