I still love a hard copy book to read, but when it comes to day-to-day documents I'm very much a read-on-screen person. I rarely print these days, and when I do it's because someone requires a wet signature. Which is great for the environment: far less physical paper.
Sadly, we're still stuck creating our documents with pages tied to an outdated physical format of a particular size. This means paragraphs, sentences, and tables flow over page breaks. It's slightly annoying when you read a document, but even more annoying when you want to process your documents and use them with AI.
A lot of the default parsers will chunk by page. It's a decent baseline, and people understand pages, but it can be a pain when you get to the retrieval/search part.
If the creator has done the right thing and created the PDF with structure, you'll want to look at a different chunking method.
I took a look at llmsherpa – LayoutPDFReader
Prerequisites
- nlm-ingestor – I recommend running the Docker server
- pip install llmsherpa
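For reference, starting the parsing service looks something like this. The image name and port mapping here are based on my reading of the nlm-ingestor project README, so double-check them against the project's own docs:

```shell
# Pull and run the nlm-ingestor parsing service that llmsherpa talks to.
# The container listens on 5001 internally; 5010 on the host is the
# port the llmsherpa examples typically point at.
docker pull ghcr.io/nlmatics/nlm-ingestor:latest
docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest
```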
Problem
Looking at some of my government budget papers: they're pretty good at keeping sentences on a single page, but quite often the paragraphs for a section span more than one page.
You'll generally want to keep this related content together, or you might want to retrieve the whole section at once.
Exploring llmsherpa
After installing the Docker image and the latest llmsherpa, I can use the LayoutPDFReader on my document.
Once I have this I can query it and extract what I need. Here I find the section from above using a string search on the title, then extract the section with all of its children and output it.
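The lookup can be sketched as a small helper over the Document's `sections()` API. The section title in the usage comment is just an illustrative example, not one of my actual documents:

```python
def extract_section(doc, title_fragment: str) -> str:
    """Return the text of the first section whose title contains
    title_fragment, along with everything nested under that heading."""
    for section in doc.sections():
        if title_fragment in section.title:
            # include_children pulls in the blocks directly under the
            # heading; recurse walks the whole subtree, so paragraphs
            # that spilled onto the next page come along too
            return section.to_text(include_children=True, recurse=True)
    return ""


# print(extract_section(doc, "Output Summary"))  # hypothetical title
```

Because the section is reassembled from the document's structure rather than its pages, a paragraph that starts at the bottom of one page and finishes at the top of the next comes back as a single piece of text.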
Which now gives us the whole section.