It doesn’t take long after running a basic RAG pipeline over a PDF that contains charts and tables to find cases where it works poorly or not at all.
I took the last 5 years of local (Brisbane City Council), state (Queensland) and federal (Australia) budget papers (yes, boring, I know), loaded them into Azure AI Studio using all the defaults possible, and used the Chat Playground to query the documents. See my post here on how I set this up.
I wanted to see how it handled some specific tables in the documents, so I chose this one on page 60.
I then asked “In the latest federal women’s budget what is the median superannuation balance for a 50-54 year old.”
The result was very generic, not quoting any specific figures. I checked the reference and documents that it had matched on. It had correctly identified the document (out of 20 or so), the right page and chart but couldn’t interpret the chart.
I decided to pull this apart a bit more and look at some different parsers.
In my resources post I link to a bunch of the common ones like pyPDF, pdfminer.six, etc., but today I’m going to compare the SimpleDirectoryReader in LlamaIndex with LlamaParse.
When you use the SimpleDirectoryReader and output the content of page 60 of the document, you’ll notice it is fairly close to our basic Azure example above.
The figures for super, percentages, axes are all on single lines and it’s not obvious what belongs with each.
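A minimal sketch of that step, assuming a recent LlamaIndex release where `SimpleDirectoryReader` is imported from `llama_index.core` and returns one document per PDF page with `file_name` and `page_label` in its metadata. The folder and file names in the comments are hypothetical, and the PDF's page label may not match the printed page number.

```python
def load_pages(folder):
    """Load every file in `folder`; PDFs come back as one document per page."""
    # Imported lazily so page_text() below can be used without LlamaIndex installed.
    from llama_index.core import SimpleDirectoryReader

    return SimpleDirectoryReader(input_dir=folder).load_data()


def page_text(docs, file_name, page_label):
    """Return the extracted text of one page, or None if it is not found."""
    for doc in docs:
        meta = doc.metadata
        same_file = meta.get("file_name") == file_name
        same_page = meta.get("page_label") == str(page_label)
        if same_file and same_page:
            return doc.text
    return None


# Usage (hypothetical folder and file name):
# docs = load_pages("./budget_papers")
# print(page_text(docs, "womens-budget-statement.pdf", 60))
```

Printing the raw page text like this is the quickest way to see what the retriever and the LLM are actually given for a table or chart.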
If I then ask it the same question:
Slightly better than our basic RAG: it correctly identifies the page, chart etc. It gets the gap percentage correct but incorrectly gives us the super balance for the 45-49 year old cohort.
You can see that the output for the chart from this default parser looks nothing like the chart itself.
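The query step can be sketched roughly as follows. This assumes the default LlamaIndex setup (`VectorStoreIndex` with the default embedding model and LLM, which normally needs an OpenAI API key) and is not necessarily the exact configuration used for the results described here. The `matched_pages` helper is a hypothetical name; it just lists which pages were retrieved so you can check the references.

```python
QUESTION = (
    "In the latest federal women's budget what is the median "
    "superannuation balance for a 50-54 year old."
)


def ask(docs, question=QUESTION):
    """Index the documents with default settings and run one query."""
    # Lazy import so matched_pages() can be used without LlamaIndex installed.
    from llama_index.core import VectorStoreIndex

    index = VectorStoreIndex.from_documents(docs)
    return index.as_query_engine().query(question)


def matched_pages(response):
    """List (file_name, page_label, score) for each retrieved chunk."""
    return [
        (
            hit.node.metadata.get("file_name"),
            hit.node.metadata.get("page_label"),
            hit.score,
        )
        for hit in response.source_nodes
    ]


# Usage, with docs loaded by SimpleDirectoryReader:
# response = ask(docs)
# print(response)
# print(matched_pages(response))
```

Checking the retrieved pages separately from the answer makes it clear whether a wrong figure comes from retrieval or from how the page text was parsed.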
Next we’ll try a more specialised parser, LlamaParse:
You can see straight away that the output here looks more like the chart from the document, with the axis labels lining up with the rows.
When asked the same question, it identifies the correct row in the chart.
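As a rough sketch, parsing with LlamaParse might look like the code below. It assumes the `llama_parse` package is installed and a `LLAMA_CLOUD_API_KEY` environment variable is set. The `table_rows` helper is hypothetical, and it assumes the chart comes back as a Markdown table with the age group in the first column, which may not match the real output.

```python
def parse_to_markdown(pdf_path):
    """Parse one PDF with LlamaParse and return its documents as Markdown."""
    # Lazy import; LlamaParse reads LLAMA_CLOUD_API_KEY from the environment.
    from llama_parse import LlamaParse

    parser = LlamaParse(result_type="markdown")
    return parser.load_data(pdf_path)


def table_rows(markdown, label):
    """Return the cells of each Markdown table row whose first cell is `label`."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not part of a Markdown table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells and cells[0] == label:
            rows.append(cells)
    return rows


# Usage (hypothetical file name):
# docs = parse_to_markdown("womens-budget-statement.pdf")
# for doc in docs:
#     print(table_rows(doc.text, "50-54"))
```

Because each table row stays on one line, the values for an age group remain next to its label, which is what the default parser output loses.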
This parser works quite well on charts and tables with textual representations.
See my multi-modal example where something like LlamaParse isn’t enough by itself.