OCR just extracts text and provides its location on a document.
In the late 80s and 90s, companies started using Optical Character Recognition (OCR) to digitize their paper documents. The idea was to optically detect characters on a digital image of a document and produce a computer-readable form of the text for use in digital systems. Once text was “read” off the document, various text-manipulation techniques were used to determine its context and “meaning”. In simple terms, OCR technology was no smarter than a 1st grader who can identify the letters of the alphabet.
For most of the 90s and 2000s, OCR technology made only modest advancements. Product firms such as ABBYY and Kofax continued to add functionality; however, “reading” text off a document remained a template-based affair. Someone would identify areas on a document that corresponded to a piece of data to be extracted. Should the document format change, the template had to be updated, and if a new format of a document containing the same data needed to be processed, a new template had to be created. For example, invoices from different vendors may require you to maintain a template per vendor.
ML model-based processing adds context and meaning to text on a document.
In the era of machine learning, the digitization of documents, which had become synonymous with OCR, evolved and took the name “Intelligent Document Processing” (IDP). IDP platforms introduced machine learning models that rely on a data set labelled by humans, combining the text on a document, the position of that text, and the context around it to identify data elements. This was more akin to a 5th grader who can not only read words but also apply some context to those words and ascertain their “meaning”.
Data labelling for ML model-based data extraction is time-consuming.
However, the big challenge for machine learning model-based document processing was the amount of time and effort spent labelling data. ML models learn by ingesting data that a human has labelled and determining an output based on that learning. For example, a human may identify fields on 1,000 documents; using this “labelling” data, the model learns to identify each field from its text and positional context. The labelling is time-consuming, and if done incorrectly it can result in a poorly performing model.
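To make the labelling effort concrete, here is a minimal sketch of what a single labelled training record might look like for this kind of model. The field names and structure are illustrative assumptions, not the schema of any specific IDP platform: each token on the page carries its text, its bounding box, and a human-assigned label.

```python
# A hypothetical labelled example for ML-based field extraction.
# A human annotator tags each token on the page; thousands of such
# pages are needed before the model can generalize.
labelled_page = {
    "document_id": "invoice_0042",
    "tokens": [
        {"text": "Invoice", "bbox": [40, 30, 110, 48], "label": "other"},
        {"text": "No:",     "bbox": [115, 30, 145, 48], "label": "other"},
        {"text": "INV-981", "bbox": [150, 30, 230, 48], "label": "invoice_number"},
    ],
}

def labelled_tokens(page: dict, label: str) -> list:
    """Return the text of all tokens carrying a given label."""
    return [t["text"] for t in page["tokens"] if t["label"] == label]
```

Multiplying records like this across a thousand documents is exactly the effort that makes this approach expensive.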
LLM/GPT is a giant leap in intelligent document processing – No more labelling.
Enter Large Language Models, or LLMs (more commonly known as generative AI or GPT). Because LLMs have been pre-trained on a tremendous amount of data, they have contextual knowledge of language and can extract information from text contextually. By asking a question such as “What is the Invoice Number on this invoice?”, the model recognizes that the document being fed to it is an invoice, and its pre-training allows it to respond with the appropriate text from the provided data. The use of LLMs has proven to be a giant leap in document processing: LLMs do not require task-specific data labelling or model training, which significantly reduces the time required to implement document-based automation.
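The question-asking approach above can be sketched as code. This is a minimal illustration, assuming a chat-style LLM API that takes a list of role-tagged messages (the common shape of hosted LLM endpoints); the function names are hypothetical, and the actual network call is left out:

```python
import json

def build_extraction_prompt(document_text: str, field: str) -> list:
    """Build chat-style messages asking an LLM to extract one field.

    The JSON-only instruction makes the reply easy to parse downstream.
    """
    return [
        {"role": "system",
         "content": "You extract fields from documents. "
                    'Reply with JSON of the form {"value": ...}.'},
        {"role": "user",
         "content": f"What is the {field} in this document?\n\n{document_text}"},
    ]

def parse_extraction_reply(reply: str):
    """Parse the model's JSON reply; return None if it is malformed."""
    try:
        return json.loads(reply).get("value")
    except (json.JSONDecodeError, AttributeError):
        return None
```

The messages would be sent to whichever hosted LLM the platform uses; note that no labelled training set was needed to ask for a new field — changing the `field` argument is enough.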
Data size challenges
There are, however, some challenges in processing documents with LLMs. Most services that host LLMs, such as Azure or OpenAI, have a limit on the amount of data that can be sent in a single request, so processing very large documents can be a challenge. This is usually solved by first extracting relevant pages from a document using a contextual search service such as MS Cognitive Search, which can retrieve the chunks of text relevant to the topic. Text from these pages is then fed to the LLM to extract the required data.
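The chunk-then-select step can be illustrated with a simple sketch. The keyword-overlap scoring below is a naive stand-in for a real contextual search service (which would use semantic indexing), and the word-based size limit is an assumption standing in for a provider's actual token limit:

```python
def chunk_text(text: str, max_words: int = 300) -> list:
    """Split a long document into word-bounded chunks that fit a request limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def top_chunks(chunks: list, query: str, k: int = 2) -> list:
    """Rank chunks by naive keyword overlap with the query -- a toy
    stand-in for a contextual search service -- and return the best k."""
    terms = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: sum(w in terms for w in c.lower().split()),
        reverse=True,
    )
    return scored[:k]
```

Only the top-ranked chunks are then sent to the LLM, keeping each request under the service's size limit.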
LLMs/GPT Hallucinate
Keep in mind that, in its current form, LLM and GPT technology, especially for textual use cases, simply predicts the next best word given its training and context. This means that if context is missing and the prompts do not specifically prevent it, the model may predict a word based on its training rather than the data in the document. This is referred to as hallucination. To reduce hallucinations, most platforms use prompt engineering and provide additional context, often retrieved from vector databases.
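Two of the simplest guards against hallucination can be sketched in a few lines: a prompt that explicitly restricts the model to the supplied text, and a post-hoc check that the returned answer actually appears in the document. This is an illustrative sketch, not any platform's actual implementation:

```python
def grounded_instruction(document_text: str, question: str) -> str:
    """Build a prompt that forbids answering from outside the supplied
    text -- a basic prompt-engineering hallucination guard."""
    return (
        "Answer using ONLY the document below. "
        "If the answer is not present, reply exactly UNKNOWN.\n\n"
        f"Document:\n{document_text}\n\nQuestion: {question}"
    )

def is_grounded(answer: str, document_text: str) -> bool:
    """Post-hoc check: accept an answer only if it is UNKNOWN or
    literally appears in the source document."""
    return answer == "UNKNOWN" or answer.lower() in document_text.lower()
```

A literal-substring check is crude (it would reject legitimately reworded answers), but it illustrates the principle: verify the model's output against the source data rather than trusting it blindly.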
Conclusion
Using generative AI technology to process document-based use cases can be extremely efficient. This technology marks a sea change in the paradigms of Intelligent Document Processing. As it evolves and becomes more mainstream, we foresee significant advancement in document processing use cases through LLMs. Many platforms, including OpenBots, UiPath, and most recently Snowflake, have already introduced LLM-based document processing capabilities that show significant promise.