COA PDF extraction deployment
Enhance your understanding of supplier quality performance and save valuable time by automatically extracting data from incoming certificates of analysis.
This article outlines the steps to deploy and configure COA PDF extraction in AlisQI.
Step-by-step deployment checklist
- Deploy the "Incoming COAs" template to create the analysis set, the selection list for extracted products, and the extraction workflow.
- Create a dedicated user account:
  - Create a new user group to isolate permissions.
  - Create a new user account so that API calls are properly traceable.
- Configure access and permissions for the new user group:
  - Grant all permissions for the Incoming COAs set.
  - Grant View and Manage permissions for the "Incoming COA | Product" selection list.
  - Grant access to the Specification Management feature in Module access management.
- Generate an API authentication token for this new user and give it a descriptive name, such as "PDF COA Parser".
- Share the access token with Thomas/Gerben.
Workflow
The template includes a workflow that starts the extraction process. The workflow references the setId and the tenant's subdomain (tenant.alisqi.com). It runs automatically when a new result is created with a single PDF attachment. You can also trigger it manually, for example after adding fields or updating custom instructions.
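The automatic trigger condition can be pictured as a small predicate. This is only a sketch: the `Result` class, its `attachments` field, and the function name are hypothetical and are not part of the AlisQI workflow configuration. It also assumes that "a single PDF attachment" means exactly one attachment, and that this attachment is a PDF.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical stand-in for a new result in the Incoming COAs set."""
    attachments: list[str] = field(default_factory=list)  # attachment file names


def should_start_extraction(result: Result) -> bool:
    """Return True only for a result with exactly one attachment that is a PDF."""
    if len(result.attachments) != 1:
        return False
    # Assumption: the file extension is a good enough signal for this sketch.
    return result.attachments[0].lower().endswith(".pdf")
```

Under these assumptions, a result with no attachment, with two PDFs, or with a single image would not start the workflow automatically and would need a manual trigger.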
Map values to fields
This feature requires minimal configuration. You do not need to map values from the COA to fields in the analysis set manually, because the application attempts to match them automatically.
Simply add the fields you wish to extract from the PDF document.
Tip: The more closely the field title aligns with the value's label in the PDF, the higher the extraction accuracy will be.
To improve accuracy, you can add custom instructions that give more context or examples on how to extract the right values from the PDF document. See the "Fine tuning with custom instructions" section below.
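To get a rough feel for how close your field titles are to the labels printed on a supplier's COA, you could compare them yourself before deploying. The sketch below uses Python's standard `difflib` module. It does not reproduce how AlisQI matches values (that matching happens inside the extraction service), and the function names and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def label_similarity(field_title: str, pdf_label: str) -> float:
    """Return a 0..1 similarity score between a field title and a COA label."""
    a = field_title.strip().lower()
    b = pdf_label.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def weakest_matches(pairs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """List field titles whose score against their PDF label is below threshold."""
    return [
        title
        for title, label in pairs.items()
        if label_similarity(title, label) < threshold
    ]


# Field title -> label as it appears in the supplier's PDF (made-up examples).
pairs = {"Viscosity": "Viscosity", "Visc": "Dynamic viscosity at 25 C"}
print(weakest_matches(pairs))
```

Titles that show up in the list are candidates for renaming, or for a custom instruction that lists the supplier's label as an alternate name.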
Limitations
Currently, we can extract values from COAs under the following conditions:
- The document must be a native PDF. Scanned images are not supported because we do not use OCR.
- Each document must cover a single batch, lot, or delivery. All of its data is consolidated into one result in AlisQI.
- The COA may span one or more pages, provided all pages pertain to the same batch or delivery.
Fine tuning with custom instructions
This guide provides a structured method for writing custom instructions that improve the accuracy of variable extraction by a large language model (LLM). Clear, context-specific instructions and examples lead to more consistent outputs.
In line with OpenAI's prompt engineering guidelines, it emphasizes detailed instructions that guide the model effectively.
Define Fixed Variable Names and Values
For certain variables, establish standardized names or values to ensure consistency.
- Example: Always return the variable "unique id" as "unique_id_".
- Purpose: Fixed names and values keep the extracted data consistent and reliable for downstream processing.
Provide Detailed Instructions per Variable
Specify precise instructions for each variable to make the model’s task clear, as detailed guidance improves relevance and accuracy.
- Example for document date: Exclude expiration dates; only extract the issuance date of the document, not the production date.
- Purpose: Detailed instructions clarify the target value and prevent extraction errors, ensuring that the LLM focuses on the right information.
Specify Variables to Ignore or Modify as Needed
For variables that should be ignored or modified, provide explicit instructions to exclude them from the final output.
- Example: Disregard "temp value" and "review comments" in the final response.
- Purpose: Keeps the output focused by excluding irrelevant variables, enhancing clarity and accuracy.
Outline Specific Formatting Rules
For variables with different possible formats, specify the exact expected structure to reduce ambiguity.
- Example for order code: Allow values with both letters and numbers, e.g., "AB1234".
- Purpose: Ensures accurate data capture by accommodating variations in structure and preventing extraction of incorrectly formatted data.
Filter Based on Keywords or Labels
Define keywords or labels that must be present for a variable to be considered valid, following the OpenAI guideline to include specific details for relevance.
- Example for product id: Extract only if labeled with terms like "Product ID", "Customer Product No.", or similar. Ignore ambiguous terms such as "material".
- Purpose: Ensures relevance by filtering out irrelevant terms, aligning with OpenAI’s prompt engineering principles for focused extractions.
Clarify Multiple Parts Extraction Rules
For variables containing multiple parts, specify which segments to include or exclude.
- Example for reference number: Extract only the initial values before any semicolon; ignore labels like "Ref No".
- Purpose: Keeps the output concise by focusing on the relevant data, excluding unnecessary parts.
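When you spot-check extraction results against a rule like this, it can help to write down the expected behavior as a small function. The sketch below is a hypothetical checker that runs outside AlisQI, not the way the extraction itself works. The function name and the label pattern are assumptions, and you would extend the pattern with the labels your suppliers actually use.

```python
import re

# Hypothetical label pattern: matches "Ref No", "Ref. No.", "Ref No:" and similar.
_LABEL = re.compile(r"^\s*ref\.?\s*no\.?\s*[:#]?\s*", re.IGNORECASE)


def expected_reference_number(raw: str) -> str:
    """Keep the part before the first semicolon and drop a leading 'Ref No' label."""
    first_part = raw.split(";", 1)[0]
    return _LABEL.sub("", first_part).strip()


print(expected_reference_number("Ref No: 12345; 67890"))
```

Comparing this expected value with the value stored in the result gives a quick signal on whether the custom instruction is being followed.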
Include Alternate Names or Translations
For variables that may appear under multiple names or translations, list acceptable variations. OpenAI’s recommendation to “include details” helps by listing all potential labels to improve recognition.
- Example for color metric: Accept alternative names such as "Delta L", "Delta E", and "Delta H".
- Purpose: Expands recognition of relevant data by capturing various valid labels, ensuring flexibility across documents.
Set Contextual Limits for Extraction
Restrict extraction to cases where exact terms or variations appear, avoiding loose matches to ensure data precision.
- Example for acidity level: Only extract if the exact term "pH" or "Acidity" appears.
- Purpose: Limits scope to precise terms, aligning with OpenAI’s guidance to specify exact language and avoid false positives.
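To make "exact term" concrete, the following sketch shows one possible reading of the rule as a whole-word, case-sensitive match. It is an illustration for reviewing extraction results, not part of AlisQI, and the function name is hypothetical.

```python
import re

# Whole-word, case-sensitive match, so "pH" does not match "Phosphate" or "phenol".
_EXACT_TERMS = re.compile(r"\b(?:pH|Acidity)\b")


def label_allows_acidity(label: str) -> bool:
    """Return True if the label contains the exact term "pH" or "Acidity"."""
    return _EXACT_TERMS.search(label) is not None
```

With this reading, a label such as "pH (10% solution)" qualifies, while "Phosphate content" or "acid value" does not. If your suppliers write "acidity" in lowercase, say so explicitly in the custom instruction rather than relying on a loose match.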
Use Conditional Extraction Rules
Set specific conditions for extracting a variable based on context, making sure it only occurs when relevant.
- Example for solid content: Extract only if the term includes "Solid Content" or similar phrases.
- Purpose: Provides control over extraction, ensuring it happens only under the right conditions and reducing irrelevant extractions.
Putting it together, a set of custom instructions might look like this:
- Custom instructions for "unique id": Always return "unique id" as "unique_id_".
- Custom instructions for "document date": Exclude expiration dates. Only extract the issuance date, not the production date.
- Custom instructions for "temp value" and "review comments": Ignore these variables if present in the document.
- Custom instructions for "order code": Extract values with both numbers and letters if applicable (e.g., "AB1234").
- Custom instructions for "reference number": Extract only the initial numeric values before any semicolon; ignore labels like "Ref No".
- Custom instructions for "color measurements": Only extract if it includes the exact term "Delta" followed by a single letter, such as "Delta L" or "Delta E".
- Custom instructions for "solid content": Include only if the term contains "Solid Content" or a similar phrase.
- Custom instructions for "acidity level": Only extract if "pH" or "Acidity" is specifically mentioned.
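If you maintain instructions for many fields, it may be easier to keep them as data and render the text in one go. The sketch below mirrors the wording of the list above. The function name and the dictionary layout are assumptions made for illustration; AlisQI does not prescribe this format, and the rendered text still has to be pasted into the custom instructions yourself.

```python
def build_custom_instructions(rules: dict[str, str]) -> str:
    """Render one 'Custom instructions for "<variable>": <rule>' line per variable."""
    lines = []
    for variable, rule in rules.items():
        name = variable.strip()
        text = rule.strip()
        if not name or not text:
            continue  # skip incomplete entries rather than emit a half-empty line
        lines.append(f'- Custom instructions for "{name}": {text}')
    return "\n".join(lines)


rules = {
    "document date": "Exclude expiration dates. Only extract the issuance date, "
                     "not the production date.",
    "order code": 'Extract values with both numbers and letters if applicable '
                  '(e.g., "AB1234").',
    "acidity level": 'Only extract if "pH" or "Acidity" is specifically mentioned.',
}
print(build_custom_instructions(rules))
```

Keeping one rule per variable makes it easier to see which instruction to adjust when a particular field is extracted incorrectly.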