How UpBrains AI Extractors Work on Files and Text Inputs
UpBrains AI offers powerful extractors that transform unstructured content—whether in text or file format—into structured, machine-readable data. This capability plays a crucial role in automating customer operations, especially in document-heavy workflows involving emails, PDFs, images, and other file types.
In this article, we explain how UpBrains extractors operate on different input types and what output formats to expect. If you're looking for step-by-step instructions on how to define or configure an extractor, please refer to our dedicated articles on Creating Custom Extractors, Extracting Text from PDF and Images (OCR), and Extracting Structured Data.
Operating Modes: Text vs. File Inputs
1. Text Inputs (e.g., Emails, Chat Messages, Web Form Inputs)
When the extractor receives plain text, it analyzes the content using NLP models to identify key data points. These could include things like names, dates, order numbers, invoice totals, or addresses. The result is structured JSON data that can be passed into business workflows.
2. File Inputs (e.g., PDFs, Images, Office Documents)
When the extractor processes files, it may do one or both of the following:
Extract text: Using OCR or native PDF parsing.
Extract structured data: Identifying entities like invoice numbers, line items, and totals.
Expected Output: JSON Schema
Depending on the extraction mode, the JSON output will differ slightly.
A. Output When Extracting Only Text (e.g., OCR from a PDF or Image)
This mode is typically used when the goal is to retrieve all visible text content from a file. Here's an example output:
[{ "status": 200, "extractor_info": { "id": "ext_3843991ae47947e1837224c1621a6819", "model_id": "mdl_6079e11a359f451481a4c2a4b78da202", "model_name": "Default OCR Model", "service_name": "Xtract-OCR", "extractor_name": "OCR Extractor", "only_file_processable": true, "display_name": "OCR Extractor" }, "extractor_result": { "file_name": "wordpress-pdf-invoice-plugin-sample.pdf", "attachment_id": null, "document_type": "Document", "pages": [{ "page_number": 1, "angle": null, "width": 8.2639, "height": 11.6806, "unit": "inch", "text": "...extracted text here..." }], "content": "Page 1:\n...full page content as plain text..." }, "metadata": { "service": "Xtract", "category": "Document Skills", "action": "Extract Information - OCR Extractor", "processed_data": "Attachment - wordpress-pdf-invoice-plugin-sample.pdf" }, "status_summary": "The information extraction process was completed successfully." }]
This format is useful for general OCR pipelines or as a pre-step to semantic analysis.
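As a rough illustration of consuming this output, the following Python sketch collects the text of each page from a response shaped like the example above. The sample string is a trimmed copy of that example; the function name page_texts is hypothetical, and treating any status other than 200 as "skip this result" is an assumption rather than documented behavior.

```python
import json

# Trimmed copy of the OCR example above; the text values are placeholders.
SAMPLE_OCR_RESPONSE = """
[{
  "status": 200,
  "extractor_result": {
    "file_name": "wordpress-pdf-invoice-plugin-sample.pdf",
    "pages": [
      {"page_number": 1, "width": 8.2639, "height": 11.6806, "unit": "inch",
       "text": "...extracted text here..."}
    ],
    "content": "Page 1:\\n...full page content as plain text..."
  },
  "status_summary": "The information extraction process was completed successfully."
}]
"""


def page_texts(response_json: str) -> list[tuple[str, int, str]]:
    """Return (file_name, page_number, text) for every page of every successful result."""
    rows = []
    for result in json.loads(response_json):
        # Assumption: a status other than 200 means the result should be skipped.
        if result.get("status") != 200:
            continue
        extracted = result["extractor_result"]
        for page in extracted.get("pages", []):
            rows.append((extracted["file_name"], page["page_number"], page["text"]))
    return rows


pages = page_texts(SAMPLE_OCR_RESPONSE)
for file_name, number, text in pages:
    print(f"{file_name} (page {number}): {text}")
```

If only the combined text is needed, the "content" field of "extractor_result" already holds all pages as a single string, so the per-page loop can be skipped.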
B. Output When Extracting Structured Information
When using extractors designed to capture structured information—like invoices, purchase orders, or certificates of analysis—the output includes both individual fields and line items. OCR text may also be included if the extractor performs hybrid processing.
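The following JSON comes from a custom extractor configured with the fields listed below. The supported field types are Number, Text, Date, and Address. For each non-Text type, the output includes a 'details' section with parsed values (for example, the numeric value and currency unit of a Number, or the day, month, and year of a Date) that can be used for more fine-grained processing. Text fields appear with type "String" in the output and carry no parsed details.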
Header Fields
Invoice Total (Type: Number)
Invoice Number (Type: Text / String)
Invoice Date (Type: Date)
Vendor Address (Type: Address)
Line Item Fields
Description (Type: Text / String)
Unit Price (Type: Text / String; set to Text on purpose to show how the type affects the 'details' section: the output keeps the raw "$85.00" and adds no parsed numeric value)
[{ "status": 200, "extractor_info": { "id": "ext_2ab6d904b138445d8a304a0265826e95", "model_id": "mdl_8498cc8de0234bf287a6446543ab44e3", "model_name": "Default UpBrains Model", "service_name": "Xtract-AnyDoc", "extractor_name": "My custom invoice extractor", "only_file_processable": false, "display_name": "My custom invoice extractor" }, "extractor_result": { "file_name": "wordpress-pdf-invoice-plugin-sample.pdf", "attachment_id": null, "document_type": "Invoice", "pages": [{ "page_number": 1, "text": " SlicedInvoices\n Invoice\n\n\n Invoice Number INV-3337\nFrom:\nDEMO - Sliced Invoices Order Number 12345\nSuite 5A-1204 Invoice Date January 25, 2016\n123 Somewhere Street Due Date January 31, 2016\nYour City AZ 12345\n Total Due $93.50\nadmin@slicedinvoices.com\n\nTo:\nTest Business\n123 Somewhere St\nMelbourne, VIC 3000\ntest@test.com\n\n Hrs/Qty Service Rate/Price Adjust Sub Total\n 1.00 Web Design $85.00 0.00% $85.00\n This is a sample description ...\n\n Sub Total $85.00\n Tax $8.50\n Total $93.50\n Par\n\nANZ Bank\nACC # 1234 1234\nBSB # 4321 432\n\n\nPayment is due within 30 days from date of invoice.\nLate payment is subject to fees of 5% per month.\nThanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices.com\nPage 1/1" }], "documents": [{ "fields": { "Invoice Total": { "content": "$93.50", "field_name": "Invoice Total", "page_number": "1", "type": "Number", "details": { "value": 93.5, "unit": { "symbol": "$", "name": "USD" } } }, "Invoice Number": { "content": "INV-3337", "field_name": "Invoice Number", "page_number": "1", "type": "String", "details": null }, "Invoice Date": { "content": "January 25, 2016", "field_name": "Invoice Date", "page_number": "1", "type": "Date", "details": { "value": "01/25/2016", "day": 25, "month": 1, "year": 2016 } }, "Vendor Address": { "content": "Suite 5A-1204, 123 Somewhere Street, Your City AZ 12345", "field_name": "Vendor Address", "page_number": "1", "type": "Address", "details": { "street_number": 123, "city": "Your City", "state": "AZ", "zip_code": "12345", "country": "USA" } } }, "items": [{ "Description": { "type": "String", "content": "Web Design", "field_name": "Description" }, "Unit Price": { "type": "String", "content": "$85.00", "field_name": "Unit Price" }, "page_number": "1" }], "content": "File Name: wordpress-pdf-invoice-plugin-sample.pdf\nNumber of Pages: 1\nDocument Class: Invoice\n\nFields:\n\nInvoice Total: $93.50\nInvoice Number: INV-3337\nInvoice Date: January 25, 2016\n\n\nTotal Line Items: 1\n\nLine Item 1:\nDescription: Web Design\nUnit Price: $85.00", "csv_content": "Invoice Total,Invoice Number,Invoice Date,line_items\n$93.50,INV-3337,\"January 25, 2016\",\"Total Line Items: 1\n\nLine Item 1:\nDescription: Web Design\nUnit Price: $85.00\n\"\n" }] }, "result_id": null, "metadata": { "service": "Xtract", "category": "Document Skills", "action": "Extract Information - My custom invoice extractor", "processed_data": "Attachment - wordpress-pdf-invoice-plugin-sample.pdf" }, "status_summary": "The information extraction process was completed successfully." }]
This format is ideal when downstream systems (like ERPs, CRMs, or workflow tools) require well-typed data fields such as currency values, dates, and itemized tables.
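To make the role of the 'details' section concrete, here is a minimal Python sketch that converts the header fields of one "documents" entry into native values and flattens the line items. The sample is a trimmed copy of the example above. The helper names typed_value and line_items are hypothetical, and the sketch assumes that 'details' is either missing or null for Text fields and that Date details always carry day, month, and year, as they do in this one example.

```python
import json
from datetime import date

# Trimmed copy of one entry from "documents" in the example above.
SAMPLE_DOCUMENT = json.loads("""
{
  "fields": {
    "Invoice Total": {"content": "$93.50", "type": "Number",
                      "details": {"value": 93.5, "unit": {"symbol": "$", "name": "USD"}}},
    "Invoice Number": {"content": "INV-3337", "type": "String", "details": null},
    "Invoice Date": {"content": "January 25, 2016", "type": "Date",
                     "details": {"value": "01/25/2016", "day": 25, "month": 1, "year": 2016}}
  },
  "items": [
    {"Description": {"type": "String", "content": "Web Design", "field_name": "Description"},
     "Unit Price": {"type": "String", "content": "$85.00", "field_name": "Unit Price"},
     "page_number": "1"}
  ]
}
""")


def typed_value(field: dict):
    """Prefer the parsed value in 'details'; fall back to the raw 'content' string."""
    details = field.get("details")
    if not details:
        return field["content"]  # Text fields: no parsed details available
    if field["type"] == "Number":
        return details["value"]
    if field["type"] == "Date":
        # Assumption: day, month, and year are always present for Date fields.
        return date(details["year"], details["month"], details["day"])
    return details  # e.g. Address: return the component dict unchanged


def line_items(document: dict) -> list[dict]:
    """Flatten each line item to {field name: raw content}, skipping keys like page_number."""
    return [
        {name: cell["content"] for name, cell in item.items() if isinstance(cell, dict)}
        for item in document.get("items", [])
    ]


header = {name: typed_value(field) for name, field in SAMPLE_DOCUMENT["fields"].items()}
rows = line_items(SAMPLE_DOCUMENT)
print(header)
print(rows)
```

Because Unit Price was defined as Text in this extractor, its value stays the string "$85.00"; defining it as Number would presumably add a parsed value under 'details', as it does for Invoice Total.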
Summary
UpBrains extractors can operate on both raw text and uploaded files (PDFs, images, and more).
When extracting only text, expect a simpler JSON structure that contains the text of each page plus the combined content, with no typed fields or line items.
When extracting structured information, the output includes typed fields, itemized sections, and optionally OCR content.
The JSON schema is standardized to simplify integration into downstream processes or systems.
For instructions on how to create or customize extractors, see our companion article on How to Create Custom Extractor.