A lot of documents you might want to process and gather intelligence from will contain charts. For instance, in our Naive RAG chat we want to pull figures out of the Federal Women’s Budget Statement for 2024-2025. I also loaded all local, state and federal budget papers from 2021 to 2024 into an index using Azure AI Studio. See how to do that here.
I started in the chat playground and asked:
“In the latest federal women’s budget how many hours does a 46-50 year old spend on unpaid caring work”
Based on the document, you might expect a number around 28 if you average all four numbers.
While it identifies the correct section of the document and makes some broad conclusions, it’s unable to read the graph or provide specific numbers.
Models such as GPT-4o and Gemini are multimodal, which gives them the ability to handle text, audio and images. Previously, a model such as GPT-4 would take text in and return text, while a dedicated image-recognition model would take an image as input and output text.
Using LlamaParse with a multimodal model
Utilising these new models in combination with a library such as LlamaParse, you can generate an image for each page of your document. You can then attach the image as metadata to the text nodes in your vector index and use your multimodal model to query it.
First we use LlamaParse with openai-gpt4o as the vendor model and multimodal mode enabled to parse the file and split the document into one image per page.
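A minimal sketch of that step, assuming the statement is saved locally (the file name and download directory below are illustrative):

```python
from llama_parse import LlamaParse

# Parse with LlamaParse, using GPT-4o as the vendor multimodal model so that
# chart-heavy pages are interpreted by a vision model rather than plain OCR.
parser = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
)

# Get the structured JSON result (one entry per page), then download an image
# of each page to a local directory.
md_json_objs = parser.get_json_result("womens-budget-statement-2024-25.pdf")
md_json_list = md_json_objs[0]["pages"]
image_dicts = parser.get_images(md_json_objs, download_path="data_images")
```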
This outputs the images to a directory on my machine.
If you’re experimenting with this and will be running it more than once, I’d also recommend storing the list of text nodes to disk so you don’t have to parse the document on every run.
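For example, a small (hypothetical) caching helper you could wrap around the node-building step in the next section:

```python
import os
import pickle

NODE_CACHE = "text_nodes.pkl"

def load_or_build_nodes(build_fn):
    """Return cached text nodes if present, otherwise build and cache them.

    build_fn is whatever function produces the node list (see next step).
    """
    if os.path.exists(NODE_CACHE):
        with open(NODE_CACHE, "rb") as f:
            return pickle.load(f)
    nodes = build_fn()
    with open(NODE_CACHE, "wb") as f:
        pickle.dump(nodes, f)
    return nodes
```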
Next we’ll create the list of text nodes and associate the corresponding page image with each one as metadata:
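A sketch of that step, assuming the page images in data_images sort into page order by file name:

```python
from pathlib import Path

from llama_index.core.schema import TextNode

def get_text_nodes(md_json_list, image_dir="data_images"):
    """Create one TextNode per parsed page, with its page image path as metadata."""
    image_files = sorted(Path(image_dir).glob("*.jpg"))
    text_nodes = []
    for idx, page in enumerate(md_json_list):
        node = TextNode(
            text=page["md"],
            metadata={
                "page_num": idx + 1,
                "image_path": str(image_files[idx]),
            },
        )
        text_nodes.append(node)
    return text_nodes

text_nodes = get_text_nodes(md_json_list)
```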
And persist the index to disk (rather than keeping it only in memory) so we can reuse it across runs:
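For example, assuming a local ./storage directory and the default embedding model:

```python
from llama_index.core import VectorStoreIndex

# Build the vector index over the text nodes and write it out to disk.
index = VectorStoreIndex(text_nodes)
index.storage_context.persist(persist_dir="./storage")
```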
Now that we have it locally, we can load it from disk whenever we want:
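Assuming the same ./storage directory as above:

```python
from llama_index.core import StorageContext, load_index_from_storage

# Rebuild the index from the persisted storage directory instead of re-parsing.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```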
Here we set up a class that helps us retrieve relevant information using both the text nodes and their page images:
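A sketch built on LlamaIndex’s CustomQueryEngine; the class and field names here are illustrative:

```python
from llama_index.core.base.response.schema import Response
from llama_index.core.prompts import PromptTemplate
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import ImageNode, MetadataMode, NodeWithScore
from llama_index.multi_modal_llms.openai import OpenAIMultiModal


class MultimodalQueryEngine(CustomQueryEngine):
    """Retrieve top text nodes, load their page images, and ask a multimodal LLM."""

    qa_prompt: PromptTemplate
    retriever: BaseRetriever
    multi_modal_llm: OpenAIMultiModal

    def custom_query(self, query_str: str) -> Response:
        # Retrieve the most relevant text nodes for the question.
        nodes = self.retriever.retrieve(query_str)

        # Turn each node's page image (stored in its metadata) into an ImageNode.
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=n.node.metadata["image_path"]))
            for n in nodes
        ]

        # Build the text context, format the prompt, and query the multimodal
        # model with both the prompt and the page images.
        context_str = "\n\n".join(
            n.node.get_content(metadata_mode=MetadataMode.LLM) for n in nodes
        )
        fmt_prompt = self.qa_prompt.format(
            context_str=context_str, query_str=query_str
        )
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=[img.node for img in image_nodes],
        )
        return Response(response=str(llm_response), source_nodes=nodes)
```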
To help our queries make use of the images, we can craft a prompt that nudges the model to use the image information before the text:
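Something along these lines (the exact wording is up to you):

```python
from llama_index.core.prompts import PromptTemplate

QA_PROMPT_TMPL = """\
Below is parsed text from a page of a document, along with an image of that page.
Use the page image as the primary source of truth, especially for charts and
figures, and the parsed text only as a supplement.

---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.

Query: {query_str}
Answer: """

QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)
```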
and set up our query engine:
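A sketch, assuming GPT-4o as the multimodal LLM and a top-k of 3:

```python
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# GPT-4o handles both the text context and the page images.
gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)

query_engine = MultimodalQueryEngine(
    qa_prompt=QA_PROMPT,
    retriever=index.as_retriever(similarity_top_k=3),
    multi_modal_llm=gpt_4o,
)
```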
Query results
Let’s compare this to our original query from the chat playground:
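This time the question goes through the multimodal query engine:

```python
# Run the same question we asked in the chat playground.
response = query_engine.query(
    "In the latest federal women's budget how many hours does a 46-50 year old "
    "spend on unpaid caring work"
)
print(str(response))
```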
This time it identified the correct page of the document and the correct years. 24 hours isn’t quite right no matter which way I try to average the numbers, so I’m not sure where it’s coming from, but it is in the right range.
Let’s get more specific and ask for the comparison for men across the two years:
The result has changed. It correctly identifies the page number and what the graph is showing, but incorrectly says 24 hours for both years; we’d expect 20 and 24. My only guess is that it has picked the numbers for either the 36-40 or 66-70 year olds instead of the correct age group.
Finally, let’s compare the women’s numbers:
It correctly identifies the page number and what the graph is showing, but incorrectly says 30 and 38 hours; we’d expect 34 and 36. Here it looks to have grabbed the figures for the 56-60 group instead.
Conclusion
The parsing and multimodal query allow us to get much more specific results than text alone. It appears to identify data from the correct chart but not the correct figures, which is somewhat concerning.