Efficiently Applying LLMs to Transform Semi-Structured Data

By Sergio De Simone

Article originally posted on InfoQ.

LLMs can be an effective way to generate structured data from semi-structured data, although an expensive one. A team of Stanford and Cornell researchers claims to have found a technique that reduces inference costs by 110x while improving inference quality.

According to Simran Arora, first author of the paper, using LLMs for inference on unstructured documents can become expensive as the corpus grows, with an estimated cost of at least $0.001 per 1K tokens. The strategy she and her colleagues propose promises to reduce inference cost by 110 times using a code synthesis tool dubbed EVAPORATE.

The basic task that EVAPORATE aims to solve can be described in the following terms: starting from heterogeneous documents, such as HTML files, PDFs, and text files, identify a suitable schema and extract data to populate a table. Traditional approaches to extracting structured data from semi-structured data often rely on a number of simplifying assumptions, for example regarding the position of tags in HTML documents or the existence of annotations, which necessarily end up reducing the generality of the system. EVAPORATE instead maintains generality by leveraging large language models.
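As a concrete illustration of the task, consider the following made-up example (the documents and schema are hypothetical, not taken from the paper):

```python
# Hypothetical input: the same kind of information expressed in different formats.
documents = [
    "<html><body><h1>Falcon 9</h1><p>Manufacturer: SpaceX. First flight: 2010.</p></body></html>",
    "Ariane 5\nManufacturer: Arianespace\nFirst flight: 1996",
]

# Desired output: a table whose schema is identified from the corpus itself.
# | rocket   | manufacturer | first_flight |
# |----------|--------------|--------------|
# | Falcon 9 | SpaceX       | 2010         |
# | Ariane 5 | Arianespace  | 1996         |
```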

In their paper, the researchers explore two alternative ways to extract data: prompting an LLM to extract values directly from the documents and build a tabular representation of the data, an approach dubbed EVAPORATE-DIRECT, or synthesizing code that is later used to process the documents at large scale. The two approaches have different trade-offs in terms of cost and quality: while the direct approach performs very well in comparison to traditional techniques, it is also very expensive.

LLMs are optimized for interactive, human-in-the-loop applications (e.g. ChatGPT), not high-throughput data processing tasks. The number of tokens processed by an LLM in EVAPORATE-DIRECT grows linearly with the size of the data lake.
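A minimal sketch of the direct approach makes that cost profile concrete; the call_llm helper below is a hypothetical stand-in for whatever completion API is used, not part of EVAPORATE:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion API."""
    raise NotImplementedError

def extract_direct(documents: list[str], attributes: list[str]) -> list[dict]:
    """Direct extraction: one LLM call per document, for every document."""
    rows = []
    for doc in documents:
        prompt = (
            f"Extract the attributes {attributes} from the document below "
            f"and reply with a single JSON object.\n\nDocument:\n{doc}"
        )
        rows.append(json.loads(call_llm(prompt)))  # token cost grows with corpus size
    return rows
```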

On the other hand, the code-based approach, dubbed EVAPORATE-CODE, uses the LLM on only a small subset of the documents to generate a schema and synthesize a number of functions in a traditional programming language, e.g., Python, which are then used to extract the data from the whole set of documents. This approach is clearly less expensive than the former, but the synthesized functions tend to be of varying quality, which affects the quality of the output table.
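In rough terms, and reusing the hypothetical call_llm helper from the sketch above, the code-synthesis route might look as follows; the prompt and function names are assumptions for illustration:

```python
def synthesize_extractor(attribute: str, sample_docs: list[str]) -> str:
    """One LLM call on a few sample documents; the model returns code, not data."""
    prompt = (
        f"Write a Python function extract(doc: str) that returns the value of "
        f"the attribute '{attribute}' from documents like the following:\n\n"
        + "\n---\n".join(sample_docs)
    )
    return call_llm(prompt)

def apply_extractor(source: str, documents: list[str]) -> list:
    """Run the synthesized function over the whole corpus at no LLM cost."""
    namespace: dict = {}
    exec(source, namespace)  # note: executes model-generated code
    extract = namespace["extract"]
    return [extract(doc) for doc in documents]
```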

To strike a better balance between quality and cost, the researchers added a further element to their recipe: generating many candidate functions, estimating their quality, and aggregating the results those functions produce using weak supervision. This solution helps reduce the variability across generated functions, especially those that work only on a specific subset of documents, as well as the impact of functions containing syntactic or logical errors.
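The sketch below is a deliberately simplified stand-in for this idea: each candidate function is scored against LLM extractions on a small sample, treated as noisy references, and outputs are combined by accuracy-weighted voting. The paper uses a more principled weak-supervision estimator; the scoring scheme and names here are illustrative assumptions:

```python
from collections import defaultdict
from typing import Callable, Optional

def estimate_weights(candidates: list[Callable], sample_docs: list[str],
                     llm_labels: list[str]) -> list[float]:
    """Score each candidate against noisy LLM-produced labels on a small sample."""
    weights = []
    for fn in candidates:
        hits = 0
        for doc, label in zip(sample_docs, llm_labels):
            try:
                hits += fn(doc) == label
            except Exception:
                pass  # a crashing candidate earns no credit
        weights.append(hits / len(sample_docs))
    return weights

def aggregate(candidates: list[Callable], weights: list[float],
              doc: str) -> Optional[str]:
    """Combine candidate outputs for one document by weighted vote."""
    votes: defaultdict = defaultdict(float)
    for fn, w in zip(candidates, weights):
        try:
            votes[fn(doc)] += w  # low-quality functions contribute little
        except Exception:
            continue  # functions with syntactic or logical errors are skipped
    return max(votes, key=votes.get) if votes else None
```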

Based on their evaluation on 16 sets of documents spanning a range of formats, topics, and attribute types, the researchers say the extended approach, named EVAPORATE-CODE+, outperforms state-of-the-art systems that make simplifying assumptions, while achieving a 110x reduction in inference cost in comparison to EVAPORATE-DIRECT.

Our findings demonstrate the promise of function synthesis as a way to mitigate cost when using LLMs. We study the problem of materializing a structured view of an unstructured dataset, but this insight may be applicable in a broader suite of data wrangling tasks.

According to the researchers, there are many opportunities to develop their system further, including the possibility of generating functions that invoke other AI models, such as those available on Hugging Face or through OpenAI. Another dimension to explore, they say, is iterating on function generation so that, when a sub-optimal or incorrect function is generated, it is fed back to the LLM to produce an improved version, as sketched below.
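A minimal sketch of such a feedback loop, building on the hypothetical helpers above (the retry prompt and round limit are assumptions, not details from the paper):

```python
def refine_extractor(attribute: str, sample_docs: list[str],
                     max_rounds: int = 3) -> str:
    """Regenerate the extraction function until it runs cleanly on the samples."""
    source = synthesize_extractor(attribute, sample_docs)
    for _ in range(max_rounds):
        try:
            apply_extractor(source, sample_docs)  # smoke test on the sample documents
            return source
        except Exception as err:
            # Feed the failing function and its error back to the LLM.
            source = call_llm(
                f"The Python function below fails with '{err}'. "
                f"Return a corrected version.\n\n{source}"
            )
    return source
```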
