How do I Fine-Tune the PDF/COA Extraction Tool for Better Accuracy?
Fine-tuning the PDF/COA extraction tool helps ensure it is configured correctly and captures accurate data.
There are several ways to make your setup as accurate as possible.
This feature is available with an active subscription to the Incoming Goods Inspections within the Supplier Quality module.
Fine-Tune by Adjusting Your Setup
- Rename fields to match PDF/COA wording
- Ensure the correct field types are selected (date, numeric or text)
- Remove any unused or irrelevant fields
- Use similar PDF/COA document formats where possible
- Group similar documents in one set
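One way to check that each field has the right type is to look at a few sample values from your COAs. The sketch below is illustrative only and is not part of the extraction tool; the function name `guess_field_type` and the date formats in `DATE_FORMATS` are assumptions chosen for the example.

```python
from datetime import datetime

# Assumed date formats; adjust to the formats your suppliers actually use.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def guess_field_type(sample):
    """Return 'date', 'numeric' or 'text' for a single sample value."""
    text = sample.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            pass  # not this date format, try the next one
    try:
        # Accept a decimal comma as well as a decimal point.
        float(text.replace(",", "."))
        return "numeric"
    except ValueError:
        return "text"


for value in ["2024-02-01", "7,2", "AB1234"]:
    print(value, "->", guess_field_type(value))
```

If most sample values for a field come back as a different type than the one configured, the field type is worth reviewing.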
Fine-Tune when Handling Different PDF/COA Formats
Use ONE analysis set if:
- COAs share a similar structure
- Most fields overlap
Use SEPARATE analysis sets if:
- Different suppliers use very different formats
- Different product categories require different data
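To make "most fields overlap" concrete, you could compare the field lists of two COA formats. The following is a rough sketch, not a feature of the tool: the helper names, the sample field lists, and the 0.5 threshold are all assumptions for illustration.

```python
def field_overlap(fields_a, fields_b):
    """Share of field names that two COA formats have in common (0.0 to 1.0)."""
    a = {f.strip().lower() for f in fields_a}
    b = {f.strip().lower() for f in fields_b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suggest_grouping(fields_a, fields_b, threshold=0.5):
    """Suggest one analysis set when the overlap reaches the assumed threshold."""
    if field_overlap(fields_a, fields_b) >= threshold:
        return "one set"
    return "separate sets"


# Hypothetical field lists taken from three suppliers' COAs.
supplier_a = ["Batch No", "pH", "Solid Content", "Document Date"]
supplier_b = ["batch no", "pH", "Solid Content", "Viscosity"]
supplier_c = ["Lot", "Delta E", "Gloss"]

print(suggest_grouping(supplier_a, supplier_b))
print(suggest_grouping(supplier_a, supplier_c))
```

The threshold is a judgment call; treat the result as a prompt to look at the documents side by side rather than as a rule.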
Fine-Tune with Custom Instructions
This guide provides a structured method for writing custom prompts that improve variable extraction accuracy with a large language model (LLM). Clear, context-specific instructions and examples lead to consistent outputs.
In line with OpenAI's prompt engineering guidelines, it emphasizes detailed instructions that guide the model effectively.
Define Fixed Variable Names and Values
For certain variables, establish standardized names or values to ensure consistency.
- Example: Always return the variable "unique id" as "unique_id_".
- Purpose: Standardized naming ensures that certain variables have fixed names or values, making the data reliable for downstream processing.
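If you process the extracted data further yourself, the same naming rule can be double-checked after extraction. This is a minimal sketch under the assumption that the results are available as a Python dictionary; `FIXED_NAMES` and `apply_fixed_names` are hypothetical names, not part of the tool.

```python
# Mapping from the name as it may appear in a result to the fixed name.
FIXED_NAMES = {"unique id": "unique_id_"}


def apply_fixed_names(record, fixed_names=FIXED_NAMES):
    """Return a copy of the record with standardized variable names."""
    renamed = {}
    for name, value in record.items():
        key = name.strip().lower()
        renamed[fixed_names.get(key, name)] = value
    return renamed


print(apply_fixed_names({"Unique ID": "A-17", "pH": "7.2"}))
```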
Provide Detailed Instructions per Variable
Specify precise instructions for each variable to make the model’s task clear, as detailed guidance improves relevance and accuracy.
- Example for "document date": Exclude expiration dates; only extract the issuance date of the document, not the production date.
- Purpose: Detailed instructions clarify the target value and prevent extraction errors, ensuring that the LLM focuses on the right information.
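The intent of the document date instruction can be illustrated with a small sketch. It assumes the dates found on a COA are available as (label, value) pairs; the excluded keywords and the function name are assumptions for this example and do not describe how the tool works internally.

```python
from datetime import date

# Label fragments that indicate a date other than the issuance date (assumed).
EXCLUDED_KEYWORDS = ("expir", "best before", "production", "manufactur")


def pick_document_date(candidates):
    """Return the first date whose label is not an expiry or production date."""
    for label, value in candidates:
        lowered = label.lower()
        if any(keyword in lowered for keyword in EXCLUDED_KEYWORDS):
            continue
        if "date" in lowered:
            return value
    return None


found = [
    ("Expiry Date", date(2026, 1, 1)),
    ("Production Date", date(2024, 1, 5)),
    ("Date of Issue", date(2024, 2, 1)),
]
print(pick_document_date(found))
```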
Specify Variables to Ignore or Modify as Needed
For variables that should be ignored or modified, provide explicit instructions to exclude them from the final output.
- Example: Disregard "temp value" and "review comments" in the final response.
- Purpose: Keeps the output focused by excluding irrelevant variables, enhancing clarity and accuracy.
Outline Specific Formatting Rules
For variables with different possible formats, specify the exact expected structure to reduce ambiguity.
- Example for "order code": Allow values with both letters and numbers, e.g., "AB1234".
- Purpose: Ensures accurate data capture by accommodating variations in structure and preventing extraction of incorrectly formatted data.
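A formatting rule like this can also be expressed as a pattern and used to spot-check extracted values. The sketch below assumes an order code consists only of letters and digits; that exact pattern is an assumption and should be adapted to your own codes.

```python
import re

# Assumed format: one or more letters and/or digits, nothing else.
ORDER_CODE_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_valid_order_code(value):
    """Check whether an extracted value looks like an order code."""
    return bool(ORDER_CODE_PATTERN.fullmatch(value.strip()))


for candidate in ["AB1234", "1234", "AB-12 34"]:
    print(candidate, "->", is_valid_order_code(candidate))
```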
Filter Based on Keywords or Labels
Define keywords or labels that must be present for a variable to be considered valid, following the OpenAI guideline to include specific details for relevance.
- Example for "product id": Extract only if labeled with terms like "Product ID", "Customer Product No.", or similar. Ignore ambiguous terms such as "material".
- Purpose: Ensures relevance by filtering out irrelevant terms, aligning with OpenAI’s prompt engineering principles for focused extractions.
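The label filter amounts to an allow-list: a value counts as a product ID only when its label is on the list. The following sketch shows the idea; the set of accepted labels and the function name are assumptions for illustration.

```python
# Labels that are accepted as identifying a product ID (assumed list).
ACCEPTED_LABELS = {"product id", "customer product no."}


def extract_product_id(label, value):
    """Return the value only if its label is on the allow-list."""
    # Normalize case, whitespace and a trailing colon before comparing.
    key = " ".join(label.lower().split()).rstrip(":")
    if key in ACCEPTED_LABELS:
        return value.strip()
    return None


print(extract_product_id("Product ID:", " P-4471 "))
print(extract_product_id("Material", "Resin X"))
```

Because "Material" is not on the allow-list, its value is ignored rather than guessed at.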
Clarify Multiple Parts Extraction Rules
For variables containing multiple parts, specify which segments to include or exclude.
- Example for "reference number": Extract only the initial values before any semicolon; ignore labels like "Ref No".
- Purpose: Keeps the output concise by focusing on the relevant data, excluding unnecessary parts.
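As a worked example of this rule, the sketch below keeps the part before the first semicolon and strips a leading "Ref No" label. The accepted label spellings in the pattern are an assumption; the function name is hypothetical.

```python
import re

# Leading label variants such as "Ref No", "Ref. No." or "Ref No:" (assumed).
LABEL_PATTERN = re.compile(r"^\s*ref\.?\s*no\.?\s*:?\s*", re.IGNORECASE)


def clean_reference_number(raw):
    """Keep the part before the first semicolon and drop a leading label."""
    first_part = raw.split(";", 1)[0]
    return LABEL_PATTERN.sub("", first_part).strip()


print(clean_reference_number("Ref No: 12345; 67890"))
```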
Include Alternate Names or Translations
For variables that may appear under multiple names or translations, list the acceptable variations. In line with OpenAI’s recommendation to “include details”, listing all potential labels improves recognition.
- Example for "color metric": Accept alternative names such as "Delta L", "Delta E", and "Delta H".
- Purpose: Expands recognition of relevant data by capturing various valid labels, ensuring flexibility across documents.
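Alternate names can be thought of as a lookup from each accepted label to one variable. This is a small sketch of that mapping, using the labels from the example above; the `ALIASES` structure and the function name are assumptions.

```python
# One variable, several accepted labels (taken from the example above).
ALIASES = {
    "color metric": {"delta l", "delta e", "delta h"},
}


def canonical_field(label, aliases=ALIASES):
    """Map a label found in a document to its variable name, if any."""
    key = " ".join(label.lower().split())
    for variable, names in aliases.items():
        if key in names:
            return variable
    return None


print(canonical_field("Delta  E"))
print(canonical_field("Gloss"))
```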
Set Contextual Limits for Extraction
Restrict extraction to cases where exact terms or variations appear, avoiding loose matches to ensure data precision.
- Example for "acidity level": Only extract if the exact term "pH" or "Acidity" appears.
- Purpose: Limits scope to precise terms, aligning with OpenAI’s guidance to specify exact language and avoid false positives.
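The difference between an exact match and a loose match is easy to see in code. The sketch below uses word boundaries so that "pH" is matched as a whole term and words such as "Phase" are not; the pattern and function name are assumptions for illustration.

```python
import re

# "pH" must match exactly (case-sensitive); "Acidity" may be capitalized or not.
ACIDITY_TERM = re.compile(r"\bpH\b|\b[Aa]cidity\b")


def mentions_acidity(text):
    """True only when the exact term 'pH' or 'Acidity' appears as a whole word."""
    return bool(ACIDITY_TERM.search(text))


for line in ["pH (25 °C): 7.2", "Phase separation: none", "Acidity: 0.3 %"]:
    print(line, "->", mentions_acidity(line))
```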
Use Conditional Extraction Rules
Set specific conditions for extracting a variable based on context, making sure it only occurs when relevant.
- Example for "solid content": Extract only if the term includes "Solid Content" or similar phrases.
- Purpose: Provides control over extraction, ensuring it happens only under the right conditions and reducing irrelevant extractions.
Example Custom Instructions
- Custom instructions for "unique id": Always return "unique id" as "unique_id_".
- Custom instructions for "document date": Exclude expiration dates. Only extract the issuance date, not the production date.
- Custom instructions for "temp value" and "review comments": Ignore these variables if present in the document.
- Custom instructions for "order code": Extract values with both numbers and letters if applicable (e.g., "AB1234").
- Custom instructions for "reference number": Extract only the initial numeric values before any semicolon; ignore labels like "Ref No".
- Custom instructions for "color measurements": Only extract if it includes the exact term "Delta" followed by a single letter, such as "Delta L" or "Delta E".
- Custom instructions for "solid content": Include only if the term contains "Solid Content" or a similar phrase.
- Custom instructions for "acidity level": Only extract if "pH" or "Acidity" is specifically mentioned.
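If you maintain many custom instructions, keeping them in one place and rendering them in a consistent form can help avoid stray spaces and missing punctuation. The sketch below is only a suggestion for drafting the text before pasting it into the tool; the dictionary layout and `render_instructions` are assumptions and not part of the product.

```python
def render_instructions(instructions):
    """Render one tidy instruction line per variable."""
    lines = []
    for variable, rule in instructions.items():
        rule = rule.strip()
        if not rule.endswith("."):
            rule += "."  # keep punctuation consistent
        lines.append(f'- Custom instructions for "{variable.strip()}": {rule}')
    return "\n".join(lines)


drafts = {
    "document date ": "Exclude expiration dates. Only extract the issuance date",
    "acidity level": 'Only extract if "pH" or "Acidity" is specifically mentioned.',
}
print(render_instructions(drafts))
```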