Series 3/6: Methodology for Data Extraction and Prompt Generation with LLMs
Blog Post 3: Methodology for Data Extraction: From Datasets to Model Training
Introduction
In the field of document data extraction, integrating Large Language Models (LLMs) such as GPT-3.5, Llama, and Google Bard has significantly improved accuracy and efficiency. This blog post details our research methodology and the process of creating effective prompts for data extraction with LLMs.
Extracting structured information from unstructured data is a major challenge in data science. Our research pipeline includes five key stages: Optical Character Recognition (OCR), chunking, document classification, prompt generation, and LLM inference and decoding. This comprehensive approach optimizes the efficiency and accuracy of information extraction.
We begin by crafting a well-designed prompt to guide the LLM. By leveraging the flexibility of LLMs, we avoid extensive model training and use prompts to convert unstructured text into structured outputs. By ensuring the LLM produces valid JSON, we make it easy to integrate with Python using tools like Function Calling. This ensures the output matches our predefined data structure, enhancing both interpretability and usability.
Methodology
Data Selection
Our custom datasets are derived from diverse sources, including IRS forms like W-9, W-8BEN, and W-2. These datasets focus on structured, standardized forms and are organized around two primary components: structure templates and instances.
Structure Templates: Define the expected format and organization of data within each document category.
Instances: Specific filled-in examples of each document category, drawn from diverse sources, that guide the model on how to interpret and extract information.
Data Preparation
Handling and manipulating PDFs is crucial for generating proprietary datasets. We use Python packages like "fillpdf" and "pypdf" for this purpose; a short usage sketch follows the feature lists below.
fillpdf Package:
Filling PDFs: Automates the filling of PDF forms.
Listing Fields: Extracts field information within PDFs.
Flattening PDFs: Converts editable PDFs into non-editable ones.
Inserting Images and Text: Enhances the realism of generated datasets.
Rotating PDFs and Placing Images: Contributes to the versatility of the package.
pypdf Package:
Splitting, Merging, and Cropping PDFs: Allows the creation of diverse datasets with varied structures.
Transforming Pages: Adapts forms to different layouts.
Adding Custom Data and Passwords: Simulates real-world scenarios.
Retrieving Text and Metadata: Ensures rich information in generated datasets.
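Here is a minimal sketch of how the two packages fit together, assuming a blank fillable W-9 saved as "w9_blank.pdf" (the field name "f1_1" is a hypothetical placeholder; actual field names come from get_form_fields):

import pytest  # not needed; see below
from fillpdf import fillpdfs
from pypdf import PdfReader

# List the fillable fields of the blank form to learn their names.
fields = fillpdfs.get_form_fields("w9_blank.pdf")
print(fields)

# Fill a field ("f1_1" is a hypothetical field name) and flatten the result
# into a non-editable PDF, producing a realistic dataset instance.
fillpdfs.write_fillable_pdf("w9_blank.pdf", "w9_filled.pdf", {"f1_1": "John Doe"})
fillpdfs.flatten_pdf("w9_filled.pdf", "w9_flat.pdf")

# Use pypdf to split pages and retrieve text from the generated instance.
reader = PdfReader("w9_flat.pdf")
first_page_text = reader.pages[0].extract_text()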
OCR
Optical Character Recognition (OCR) converts non-machine-readable documents, such as scanned PDFs or images, into text. We use Tesseract OCR for its accuracy and efficiency. However, other tools like Adobe Acrobat OCR, ABBYY FineReader, Google Cloud Vision OCR, Azure OCR, and AWS OCR cater to diverse needs based on document complexity, language diversity, and processing speed.
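As a sketch, scanned pages can be run through Tesseract via the pytesseract wrapper; pdf2image (which requires the Poppler utilities to be installed) converts PDF pages to images first:

import pytesseract
from pdf2image import convert_from_path

# Render each PDF page as an image, then OCR it with Tesseract.
pages = convert_from_path("w9_flat.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)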
Proposed Method
Our research proposes a novel approach using GPT-3.5 to extract key-value pairs from documents. This method aims to overcome the limitations of traditional and machine learning-based approaches by leveraging the advanced capabilities of LLMs.
Document Classification and Prompt Extraction
The first step in our approach involves preprocessing documents into pages and converting them into machine-readable text using Optical Character Recognition (OCR). We then embed these documents using Sentence Transformer to capture nuanced understanding and store the embeddings in Vector Databases for organized retrieval. This allows us to classify documents and extract relevant prompts based on predefined configurations.
Figure: Process flow for document classification and prompt extraction.
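A minimal sketch of the classification step, assuming one representative structure-template text per document class (the w9_template_text, w8ben_template_text, and w2_template_text variables are placeholders); the model name "all-MiniLM-L6-v2" and the in-memory similarity lookup stand in for our actual Sentence Transformer checkpoint and vector database:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# One reference text per document class, taken from the structure templates.
templates = {
    "W-9": w9_template_text,
    "W-8BEN": w8ben_template_text,
    "W-2": w2_template_text,
}
template_embeddings = {name: model.encode(text) for name, text in templates.items()}

def classify_page(page_text):
    # Return the class whose template embedding is closest in cosine similarity.
    page_embedding = model.encode(page_text)
    return max(
        template_embeddings,
        key=lambda name: float(util.cos_sim(page_embedding, template_embeddings[name])),
    )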
Document data extraction has evolved with the advent of Large Language Models (LLMs). Our method leverages LLMs to achieve structured outputs in JSON format.
LLM-Based Extraction and JSON Output:
We harness LLMs to process unstructured text and generate structured outputs in JSON format. For our target field output, we define a Pydantic BaseModel, specifying the desired structure.
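For instance, a W-9 target schema might be declared as follows (a minimal sketch using Pydantic v2; the class and field names are illustrative, and llm_json_output stands for the raw JSON string returned by the model):

from pydantic import BaseModel

# Illustrative target structure for W-9 extraction.
class W9Fields(BaseModel):
    name: str = ""
    business_name: str = ""
    tin: str = ""
    address: str = ""

# Parse and validate the LLM's JSON output against the declared structure.
fields = W9Fields.model_validate_json(llm_json_output)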
Exploration of Different LLM Models
We evaluate various models to understand their capabilities:
GPT-3.5: An autoregressive language model by OpenAI, known for capturing complex contextual relationships.
Google Gemini Pro 1.0 and Gemini Pro Vision: Integrate language understanding with vision capabilities.
Llama 2 models (7B, 13B, 70B): Offer scalable performance, with varying sizes catering to specific tasks.
Chunking
Large Language Models (LLMs) such as GPT-3.5 have a maximum token limit per input, typically around 4096 tokens. Processing long documents that exceed this limit requires a systematic approach to ensure that each segment (or chunk) is within the allowable token range. Here’s how we implement the chunking strategy:
1. Document Segmentation into Pages
The initial step involves breaking down the long document into manageable pages. This segmentation is crucial for handling documents that are significantly longer than the LLM's input token limit. Each page represents a portion of the document that is more likely to fit within the token limit.
2. Iterative Trimming of Pages
For each page segmented from the document:
Token Count Calculation: Count the tokens in the page's text content using the same tokenizer as the target LLM, which accounts for spaces, punctuation, and special characters.
Chunking Logic: Check if the token count exceeds the LLM's maximum input token limit (e.g., 4096 tokens). If it does, proceed with iterative trimming.
3. Iterative Chunking Process
The iterative process involves:
Initial Chunk Definition: Start with an initial chunk from the beginning of the page.
Token Limit Check: Continuously check the token count of the current chunk against the LLM's token limit.
Trimming Criteria: Define criteria for trimming the chunk:
Token Threshold: Choose a threshold slightly below the LLM's maximum token limit to leave a safety margin.
Sentence Boundaries: Preferably trim at the end of complete sentences to maintain syntactical correctness.
Chunk Adjustment: Adjust the chunk boundary iteratively until the token count of the chunk fits within the defined limit.
4. Chunk Processing and LLM Input
Once a chunk fits within the token limit:
LLM Input Preparation: Prepare the chunked text as input for the LLM. This involves formatting the text appropriately, including any necessary context or instructions relevant to the document.
Sequential Processing: Submit each chunk sequentially to the LLM for processing. Depending on the application, results from each chunk may be aggregated or processed individually before aggregation.
5. Results Aggregation
After processing all chunks:
Aggregation of Outputs: Combine the outputs from each processed chunk to reconstruct the complete document, or aggregate the results according to the application's requirements.
Benefits of Chunking Strategy
Token Limit Compliance: Ensures compliance with the LLM's token limit, avoiding input truncation and maximizing utilization of the model's capacity.
Syntactical Integrity: Preserves syntactical integrity by trimming at sentence boundaries, maintaining readability and correctness across chunks.
Scalability: Facilitates processing of arbitrarily long documents by breaking them down into smaller, manageable units.
Implementation Considerations
Chunk Size Optimization: Optimize chunk sizes based on empirical performance and the specific requirements of the application.
Error Handling: Implement robust error handling mechanisms to manage cases where chunks exceed permissible limits or encounter processing errors.
By employing this chunking strategy, we effectively manage and process long documents within the constraints imposed by the token limits of Large Language Models. This approach ensures efficient and accurate extraction of information from extensive textual inputs, leveraging the capabilities of LLMs for diverse document processing tasks.
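Below is a minimal sketch of the strategy, using the tiktoken tokenizer for GPT-3.5 and a naive sentence splitter; the 3500-token threshold is an illustrative safety margin below the 4096-token limit:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def chunk_text(text, max_tokens=3500):
    # Split on sentence boundaries (naively), then grow each chunk until the
    # next sentence would push it past the token threshold.
    sentences = text.split(". ")
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current}. {sentence}" if current else sentence
        if len(encoding.encode(candidate)) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # assumes a single sentence fits the limit
    if current:
        chunks.append(current)
    return chunks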
Model
After categorizing the document, we retrieve a corresponding prompt template containing essential elements like the document context, task description, and schema representation. This template becomes the basis for constructing a specific prompt tailored for the LLM. The crafted prompt is sent to the LLM through the chat API; we use OpenAI's ChatGPT (GPT-3.5) to execute the extraction process.
Prompt Generation
Crafting Effective Prompts for Data Extraction with LLMs
An essential part of our methodology was generating effective prompts to guide the LLMs in extracting data from documents.
Prompt Generation Process
Document Context: Providing context about the type of document and its purpose. For example:
- “This is a W-9 form used to collect taxpayer identification information.”
Task Description: Clearly stating the extraction task. For example:
- “Extract the name, business name, TIN, and address from the W-9 form.”
Schema Representation: Including the structure of the desired output. For example:
{ "Name": "", "Business name": "", "TIN": "", "Address": "" }
Examples of Prompt Templates
For W-9 Form:
{
  "context": "This is a W-9 form used to collect taxpayer identification information.",
  "task": "Extract the following fields: Name, Business name, TIN, Address.",
  "schema": {
    "Name": "",
    "Business name": "",
    "TIN": "",
    "Address": ""
  }
}
For W-8BEN Form:
{
  "context": "This is a W-8BEN form used by non-U.S. persons to certify their foreign status.",
  "task": "Extract the following fields: Name, Country of citizenship, Permanent residence address, Foreign tax identifying number.",
  "schema": {
    "Name": "",
    "Country of citizenship": "",
    "Permanent residence address": "",
    "Foreign tax identifying number": ""
  }
}
For W-2 Form:
{
  "context": "This is a W-2 form reporting wages paid to employees and the taxes withheld.",
  "task": "Extract the following fields: Employer’s name and address, Employee’s name and address, Wages, Social Security wages, Federal income tax withheld.",
  "schema": {
    "Employer’s name and address": "",
    "Employee’s name and address": "",
    "Wages": "",
    "Social Security wages": "",
    "Federal income tax withheld": ""
  }
}
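A sketch of how a retrieved template can be rendered into the final prompt string; build_prompt is an illustrative helper, not a library function:

import json

def build_prompt(template, document_text):
    # Concatenate context, task, and schema, then append the OCR'd document.
    return (
        f"{template['context']}\n"
        f"{template['task']}\n"
        f"Return valid JSON matching this schema: {json.dumps(template['schema'])}\n\n"
        f"Document:\n{document_text}"
    )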
Importance of Including Context, Task, and Schema
Including document context, task description, and schema representation in prompts is crucial because it:
Provides Clarity: Helps the LLM understand what to extract and how to format the output.
Improves Accuracy: Reduces ambiguity and ensures that the model focuses on the relevant parts of the document.
Enhances Efficiency: Streamlines the extraction process, making it faster and more reliable.
Example and Image Reference
Let’s consider a W-9 form as an example. Below is an image of a sample W-9 form, annotated with key fields for extraction:
Using the prompt template for W-9 forms, the LLM extracts the following information:
{
  "Name": "John Doe",
  "Business name": "Doe Enterprises",
  "TIN": "123-45-6789",
  "Address": "123 Main St, Springfield, IL"
}
LLM Function Calling
Introduction to Function Calling:
Function Calling enhances the interpretability and functionality of LLMs by creating custom functions tailored to specific tasks.
Custom Function Definition:
Here's an example function definition, in the OpenAI function-calling format, for extracting student information from text:
student_custom_functions = [
    {
        'name': 'extract_student_info',
        'description': 'Get the student information from the body of the input text',
        'parameters': {
            'type': 'object',
            'properties': {
                'name': {
                    'type': 'string',
                    'description': 'Name of the person'
                },
                'major': {
                    'type': 'string',
                    'description': 'Major subject of the student.'
                },
                'school': {
                    'type': 'string',
                    'description': 'The university name.'
                },
                'grades': {
                    'type': 'number',
                    'description': 'GPA of the student.'
                },
                'club': {
                    'type': 'string',
                    'description': 'School club for extracurricular activities.'
                }
            }
        }
    }
]
Function Integration in LLM Inference:
Integrate the custom function into the LLM inference process:
import json
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# student_1_description and student_2_description are free-text student bios.
student_descriptions = [student_1_description, student_2_description]

for description in student_descriptions:
    response = client.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': description}],
        functions=student_custom_functions,
        function_call='auto'
    )
    # The model returns its arguments as a JSON string; parse it into a dict.
    json_response = json.loads(response.choices[0].message.function_call.arguments)
    print(json_response)
Output Analysis:
The generated JSON output demonstrates the consistency achieved through Function Calling.
Sample Output JSON:
{
  "name": "David Nguyen",
  "major": "computer science",
  "school": "Stanford University",
  "grades": 3.8,
  "club": "Robotics Club"
}
{
  "name": "Ravi Patel",
  "major": "computer science",
  "school": "University of Michigan",
  "grades": 3.7,
  "club": "Chess Club"
}
Evaluation
After conducting zero-shot predictions, it is crucial to evaluate the model's performance.
Ground Truth Comparison:
- Compare the model's predictions with manually annotated correct extractions for a subset of documents.
Metric Selection:
- Use evaluation metrics like precision, recall, and F1-score (a scoring sketch follows this list).
Entity-level Evaluation:
- Assess the model's performance on individual extracted entity types.
Error Analysis:
- Identify patterns in misclassifications and consider refining the training data or adjusting the model architecture.
Scalability Testing:
- Evaluate the model's performance on varying document lengths and complexities.
Continuous Monitoring:
- Implement continuous monitoring and retraining with new data.
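As a sketch, field-level precision, recall, and F1 can be computed by exact string match against the annotated ground truth; real evaluation would normalize values first. Here a wrong prediction counts as a false positive and a missing one as a false negative:

def field_scores(predictions, ground_truths):
    # predictions and ground_truths are parallel lists of field dicts.
    tp = fp = fn = 0
    for pred, truth in zip(predictions, ground_truths):
        for field, true_value in truth.items():
            predicted = pred.get(field)
            if predicted == true_value:
                tp += 1
            elif predicted:
                fp += 1
            else:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1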
Conclusion
This comprehensive methodology ensures that LLMs like GPT-3.5, Llama, and Google Bard are effectively utilized for document data extraction. By carefully selecting and preparing datasets, creating structured templates, and generating effective prompts, we achieve high accuracy and efficiency in extracting relevant information from various document types.
Coming Up Next
Stay tuned for our next blog post where we delve into the practical applications and case studies of LLM-based data extraction, showcasing real-world examples and highlighting both successes and challenges.
Blog Post 4: Evaluating the Effectiveness of LLMs in Data Extraction
Explain the evaluation framework used to assess model performance.
Discuss metrics like precision, recall, and F1-score.
Highlight the importance of ground truth comparison and error analysis.
For a more detailed exploration of this topic, including methodologies, data sets, and further analysis, please refer to my Master's Thesis and Thesis Presentation.
LinkedIn link - https://www.linkedin.com/in/pramod-gupta-b1027361/