
I am new to RAG applications. I am currently working on a RAG application where I have to ingest PDF invoice documents, fill a JSON structure with some of the extracted data, and store it in a NoSQL database.

My PDF invoices mostly contain table-based content: either a single table or multiple tables with rows and columns. I will fetch each data element, convert it to JSON, and store it in the NoSQL DB.

Now I am confused: for a PDF that mostly consists of tables and boxes/placeholders holding the data, which chunking process should I use so that I get the most meaningful chunks to store in a vector database, then run a similarity search and use the results to prepare the standard JSON?

What I tried:

Following https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d

I tried Agentic Chunking, but the output was not satisfactory because in my case the PDF mostly contains table data and data in boxes; an example is in the attachment.

Can anyone suggest a suitable approach for chunking and embedding for this type of document, where the data is mostly given as tables or boxes with a heading followed by a description, as shown in the attachment? Thanks in advance.

PDF invoices: (screenshot attached to the question)

Sujoy

1 Answer


If your PDF invoices have well-defined, clearly bordered tables, tools like Camelot and Tabula can handle table extraction directly, without any preliminary OCR. These tools read the PDF's internal structure to extract tables, provided the PDF is not scanned and already contains selectable text. Camelot offers two extraction modes: stream (for tables without borders) and lattice (for tables with ruled borders). This flexibility makes it suitable for a variety of table structures.
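A minimal sketch of that Camelot workflow, assuming the `camelot-py` package is installed and `invoice.pdf` is a stand-in path for one of your invoices (both hypothetical here):

```python
def choose_flavor(has_ruled_borders: bool) -> str:
    # Camelot's "lattice" mode detects tables from drawn ruling lines;
    # "stream" infers columns from whitespace alignment instead.
    return "lattice" if has_ruled_borders else "stream"

def extract_invoice_tables(pdf_path: str, has_ruled_borders: bool = True):
    # Returns one pandas DataFrame per detected table.
    import camelot  # pip install "camelot-py[cv]"
    tables = camelot.read_pdf(pdf_path, pages="all",
                              flavor=choose_flavor(has_ruled_borders))
    return [t.df for t in tables]

# frames = extract_invoice_tables("invoice.pdf", has_ruled_borders=True)
```

If lattice mode misses borderless summary boxes, rerunning the same pages with `has_ruled_borders=False` is a cheap second pass.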

For scanned PDFs where text is represented as images, an OCR (Optical Character Recognition) tool like Tesseract is essential for converting images of text into machine-readable text. Table extraction tools like Camelot and Tabula rely on having text data to work with. Tesseract can preprocess the PDF by converting scanned images into text, which the table extraction tools can then process further.
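A hedged sketch of that OCR preprocessing step, assuming the `pdf2image` and `pytesseract` wrappers plus the Poppler and Tesseract system binaries are available (none of these are named in the original answer beyond Tesseract itself):

```python
def join_pages(page_texts) -> str:
    # Keep a form-feed between pages so per-page splitting stays easy downstream.
    return "\f".join(page_texts)

def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    # Rasterize each PDF page, then OCR each page image with Tesseract.
    # Requires: pip install pdf2image pytesseract
    # (plus poppler-utils and the tesseract binary on the system).
    from pdf2image import convert_from_path
    import pytesseract
    pages = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(p) for p in pages)
```

A DPI of around 300 is a common starting point for invoice scans; lower resolutions tend to degrade digit recognition in amount columns.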

Based on the final table extracted from an invoice PDF, you can break the structured data into semantically meaningful chunks simply by rows and columns, and convert these chunks into vector representations using models such as Word2Vec, GloVe, or more advanced transformer models like BERT. Row chunks are flexible for querying and access, while column chunks can be used for analytical purposes or for operations like computing statistics. This is a type of semantic chunking in your reference's taxonomy.
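One way the row-wise chunking could look, as a sketch: flatten each row into a self-describing text string, which you can then pass to whichever embedding model you choose. The column names (`InvoiceNo`, `Amount`, etc.) are illustrative, not taken from your invoices:

```python
def row_to_chunk(header, row):
    # Pair each cell with its column name so the chunk is meaningful on
    # its own, e.g. "InvoiceNo: 123 | Date: 2024-01-05 | Amount: 99.00".
    return " | ".join(f"{h}: {v}" for h, v in zip(header, row))

def table_to_chunks(df):
    # Camelot returns raw DataFrames; here the first row is assumed to
    # hold the column headers and the rest hold the data rows.
    header = list(df.iloc[0])
    return [row_to_chunk(header, list(df.iloc[i])) for i in range(1, len(df))]
```

Because each chunk names its fields, a similarity search hit already carries the key/value pairs you need to fill your target JSON structure.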
