A couple of weeks ago, a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case using the latest multi-modal Large Language Models (LLMs). Multi-modal LLMs take not only text inputs via their prompt like earlier LLMs, but can also accept non-text modalities such as images and audio. Some examples of multi-modal LLMs are GPT-4o from OpenAI, Gemini 1.5 from Google, and Claude-3.5 Sonnet from Anthropic. The hackathon provided access to GPT-4o through Azure, Microsoft's cloud computing platform. We didn't win; there were other entries that were better than ours both in terms of the originality of their idea and the quality of their implementations. However, we learned some cool new things during the hackathon, and figured they might be of general interest to others as well, hence this post.
Our idea was to use GPT-4o to extract and codify tables found in academic papers as semi-structured data (i.e. JSON). We could then either query the JSON data to search within tables, or convert it to Markdown so downstream LLMs could query it easily through their text interface. We had originally intended to extend the idea to figures and charts, but we couldn't get that pipeline working end to end.
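As an illustration of the Markdown path, here is a minimal sketch of converting such a JSON table into a Markdown table. The `columns`/`rows` schema and the example values are hypothetical; the actual structure was whatever GPT-4o returned for a given table.

```python
# Minimal sketch: turn a JSON table into a Markdown table.
# The {"columns": [...], "rows": [...]} schema is a hypothetical example,
# not the exact structure GPT-4o returned for us.
import json

def json_table_to_markdown(json_str: str) -> str:
    table = json.loads(json_str)
    columns, rows = table["columns"], table["rows"]
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    return "\n".join(lines)

if __name__ == "__main__":
    example = '{"columns": ["model", "score"], "rows": [{"model": "baseline", "score": 0.5}]}'
    print(json_table_to_markdown(example))
```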
Here is what our pipeline looked like.
- Academic papers are usually available as PDFs. We use the PyMuPDF library to split the PDF file into a set of image files, where each image file corresponds to a page in the paper.
- We then send each page image through the Table Transformer, which returns bounding box information for each table it detects on the page, as well as a confidence score. The Table Transformer model we used was microsoft/table-transformer-detection (a sketch of these first two steps follows this list).
- We crop each table out of its page using the bounding box information, and then send each table to GPT-4o as part of a prompt asking it to convert the table to a JSON structure. GPT-4o responds with a JSON structure representing the table.
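A rough sketch of the first two steps is below, following the documented Hugging Face usage for microsoft/table-transformer-detection. The 150 DPI rendering and the 0.9 confidence threshold are arbitrary illustrative values, not necessarily what we used.

```python
# Sketch of steps 1 and 2: render PDF pages to images with PyMuPDF,
# then detect tables on each page with the Table Transformer.
import fitz  # PyMuPDF
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

def pdf_to_page_images(pdf_path: str, dpi: int = 150) -> list[Image.Image]:
    """Render each PDF page to a PIL image (the DPI value is an arbitrary choice)."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)
        pages.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return pages

def detect_tables(page_image: Image.Image, threshold: float = 0.9):
    """Return (confidence, [x0, y0, x1, y1]) pairs for tables detected on one page."""
    inputs = processor(images=page_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([page_image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [(score.item(), [round(v) for v in box.tolist()])
            for score, box in zip(results["scores"], results["boxes"])]
```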
This pipeline was based on my colleague's idea. I like how it progressively simplifies the task by splitting each page of the incoming PDF into its own image, then using a pre-trained Table Transformer to crop the tables out of the pages, and only then passing each table to GPT-4o to convert to JSON. The table image is passed into the prompt as a "data URL", which is just the base-64 encoding of the image formatted as "data:{mime_type};base64,{base64_encoded_data}".
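The cropping and data URL construction look roughly like the sketch below; the padding around the bounding box is an arbitrary choice.

```python
# Sketch: crop a detected table out of a page image and wrap it in a data URL.
import base64
import io
from PIL import Image

def crop_table(page_image: Image.Image, box: list[int], padding: int = 10) -> Image.Image:
    """Crop the table region, with a little padding around the bounding box."""
    x0, y0, x1, y1 = box
    return page_image.crop((max(x0 - padding, 0), max(y0 - padding, 0),
                            x1 + padding, y1 + padding))

def to_data_url(image: Image.Image, mime_type: str = "image/png") -> str:
    """Encode an image as data:{mime_type};base64,{base64_encoded_data}."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:{mime_type};base64,{b64}"
```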
The Table Transformer, while not perfect, proved remarkably successful at identifying tables on the pages. I say remarkable because we used a pre-trained model out of the box, though perhaps it is not that remarkable when you consider that it was probably trained on tables in academic papers as well.
Our prompt for GPT-4o looked something like this:
System: You are an AI model that specializes in detecting tables and extracting and interpreting table content from images. Follow the instructions below step by step:
1. Recognize whether the given image is a table or not. If it is not a table, print "None". If it is a table, go to the next step.
2. Accurately convert the table's content into a structured JSON format.
General instructions:
1. Do not output anything extra.
2. A table must contain rows and columns.

User: Given the image, detect whether it is a table or not. If it is a table, convert it to JSON format. {image_data_url}
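Tying the pieces together, a call to GPT-4o on Azure with this prompt and the table's data URL looks roughly like the sketch below. The deployment name, API version, and environment variable names are placeholders rather than our actual hackathon configuration.

```python
# Sketch: send the prompt plus the table image (as a data URL) to GPT-4o on Azure.
# Deployment name, API version, and env var names are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

SYSTEM_PROMPT = (
    "You are an AI model that specializes in detecting tables and extracting "
    "and interpreting table content from images. ..."  # abridged, see prompt above
)

def table_image_to_json(image_data_url: str) -> str:
    """Ask GPT-4o to convert the table image (given as a data URL) to JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name; a placeholder here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text",
                 "text": "Given the image, detect whether it is a table or not. "
                         "If it is a table, convert it to JSON format."},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ]},
        ],
    )
    return response.choices[0].message.content
```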
For the figure pipeline, I tried to use an OWL-ViT (Vision Transformer for Open World Localization) model in place of the Table Transformer. But it was not as successful at detecting figures in the papers, probably because it seems to be fine-tuned to detect objects in natural images. Unfortunately, we could not find a pre-trained model that would work for this particular case. Another challenge was converting the figure into a semi-structured JSON representation; we ended up asking GPT-4o to describe the image as text instead.
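For completeness, a zero-shot figure detection attempt with OWL-ViT through the Hugging Face pipeline looks roughly like this; the checkpoint, text queries, and threshold are illustrative choices rather than the exact ones we used.

```python
# Sketch: zero-shot figure detection with OWL-ViT via the Hugging Face pipeline.
# The checkpoint, candidate labels, and threshold are illustrative choices.
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

def detect_figures(page_image: Image.Image, threshold: float = 0.1):
    """Return candidate figure/chart regions as (label, score, box) tuples."""
    detections = detector(page_image, candidate_labels=["figure", "chart", "diagram"])
    return [(d["label"], d["score"], d["box"])
            for d in detections if d["score"] >= threshold]
```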
One suggestion from some of my TWIML (non-work) colleagues was to ask GPT-4o to return the bounding boxes for the figures it finds in the page, and then use those to extract the figures and send them back to GPT-4o for describing. It didn't work, unfortunately, but it was definitely worth trying. As LLMs get more and more capable, I think it makes sense to rethink our pipelines to delegate more and more work to the LLM, or at least to verify that it can't do something before moving on to older (and harder to implement) solutions.