HomeAI AutomationHow to Auto-Extract Invoice Data from…
AI Automation

How to Auto-Extract Invoice Data from PDF with AI Using n8n

How to Auto-Extract Invoice Data from PDF with AI Using n8n

Every month, the same grind: open an email, download the PDF, squint at the invoice number, manually type the amount into a spreadsheet, and pray you didn’t transpose a digit. If your business processes more than a handful of invoices per week, that manual routine is silently eating hours you could spend on work that actually matters. What if every PDF invoice that landed in your inbox was automatically read, parsed, and filed into a clean database — without you touching it once?

That’s exactly what you’ll build in this guide. Using n8n, OpenAI’s GPT-4o mini, and Airtable, you’ll create a workflow that watches your Gmail inbox for invoice emails, extracts text from attached PDFs, sends that text to AI for structured parsing, and logs every field — vendor name, invoice number, amounts, line items, due dates — into an Airtable base. The whole thing runs on autopilot every 30 minutes.

Prefer to skip the setup? Grab the ready-made template → and be up and running in under 10 minutes.

What You’ll Build

  1. A schedule trigger checks your Gmail inbox every 30 minutes for unread emails with PDF attachments matching invoice-related subjects.
  2. The workflow downloads the PDF, extracts raw text, and sends it to OpenAI with a carefully tuned prompt that returns structured JSON — vendor name, invoice number, dates, amounts, line items, and payment status.
  3. A Code node validates and formats the AI output, then pushes a clean record into your Airtable base with 13 fields covering every detail you’d normally type by hand.
  4. The source email is marked as read so it won’t be processed twice.

How It Works — The Big Picture

The workflow runs as a single linear pipeline with one conditional branch. Emails without PDF attachments are silently skipped, while PDFs flow through extraction, AI parsing, and database logging.

┌──────────────────────────────────────────────────────────────────────────┐
│  INVOICE DATA EXTRACTOR — AI POWERED                                     │
│                                                                          │
│  [Schedule]  →  [Gmail: Fetch]  →  [IF: Has PDF?]                       │
│                                         │ Yes          │ No              │
│                                         ↓              ↓                │
│                                  [Extract PDF]    [Skip]                │
│                                         ↓                               │
│                                  [OpenAI: Parse]                        │
│                                         ↓                               │
│                                  [Code: Format]                         │
│                                         ↓                               │
│                                  [Airtable: Save]                       │
│                                         ↓                               │
│                                  [Gmail: Mark Read]                     │
└──────────────────────────────────────────────────────────────────────────┘
  

What You’ll Need

  • n8n — self-hosted or cloud (version 1.0 or higher)
  • Gmail account — with OAuth2 credentials configured in n8n
  • OpenAI API key — with access to the gpt-4o-mini model (or gpt-4o for higher accuracy on complex invoices)
  • Airtable account — free tier works fine; you’ll need one base with a table matching the schema below

Estimated build time: 30–45 minutes from scratch, or under 10 minutes with the template.

Building the Workflow

1 Check Inbox Every 30 Minutes (Schedule Trigger)

This node kicks off the workflow on a regular interval. Open n8n, create a new workflow, and add a Schedule Trigger node. Set it to run every 30 minutes. You can adjust this later — every 15 minutes if you process high volumes, or once per hour if invoices trickle in slowly.

💡

Tip: During testing, switch this to “manual” execution so you can trigger runs on demand without waiting for the timer.

2 Fetch Unread Invoice Emails (Gmail)

Add a Gmail node and set the operation to Get Many. This node searches your inbox for unread emails that have PDF attachments and match invoice-related subject lines. Configure it as follows:

  • Limit: 10 (processes up to 10 emails per run)
  • Search query: has:attachment filename:pdf is:unread subject:(invoice OR receipt OR bill)
  • Download Attachments: Enabled, with prefix attachment_

The search query is the key here — it filters for only unread emails containing a PDF file whose subject mentions “invoice,” “receipt,” or “bill.” You can customize these keywords to match the vocabulary your vendors use.

{
  "id": "18e3a5b7c9d2f100",
  "subject": "Invoice #INV-2026-0847 from Acme Supplies",
  "from": { "emailAddress": "billing@acmesupplies.com" },
  "attachment_0": {
    "fileName": "INV-2026-0847.pdf",
    "mimeType": "application/pdf"
  }
}
💡

Tip: If you receive invoices from specific vendors, tighten the query with from:billing@vendor.com to avoid picking up random PDFs like flight confirmations.

3 Has PDF Attachment? (IF Node)

Add an IF node to verify that the attachment is actually a PDF. Check that attachment_0.mimeType equals application/pdf. Emails that pass go to the extraction step. Emails that fail (wrong attachment type, or the attachment was an image) are routed to a No Operation node and silently skipped.

📌

This guard prevents errors downstream. Some emails match the search query but attach Excel files or images instead of PDFs — the IF node filters those out cleanly.

4 Extract Text from PDF (Extract From File)

Add an Extract From File node. Set the operation to PDF and the binary property name to attachment_0 (matching the prefix you set in the Gmail node). This node converts the PDF binary data into raw text that the AI can read.

After this node runs, the output looks like this:

{
  "data": "INVOICE\nAcme Supplies Inc.\n742 Evergreen Terrace\nSpringfield, IL 62704\n\nBill To: Carter & Associates LLC\nInvoice #: INV-2026-0847\nDate: March 28, 2026\nDue Date: April 27, 2026\n\nDescription          Qty    Price     Amount\nConsulting Services    40   $150.00   $6,000.00\nSoftware License        1   $499.00     $499.00\n\nSubtotal: $6,499.00\nTax (8.25%): $536.17\nTotal: $7,035.17\n\nPayment Terms: Net 30"
}
💡

Tip: The Extract From File node works well with text-based PDFs. If your invoices are scanned images (no selectable text), you’ll need to add an OCR step before this node — tools like Google Cloud Vision or AWS Textract can handle that.

5 Parse Invoice with AI (OpenAI)

This is where the magic happens. Add an OpenAI node configured for the Chat resource. Set the model to gpt-4o-mini (fast, cheap, and surprisingly accurate for structured extraction) and the temperature to 0.1 (low randomness = consistent output).

The prompt instructs the AI to return a strict JSON object with 12 fields covering every piece of data you’d want from an invoice. Here’s what the prompt asks for:

Field Type Example
vendor_name String Acme Supplies Inc.
vendor_address String 742 Evergreen Terrace, Springfield, IL 62704
invoice_number String INV-2026-0847
invoice_date Date 2026-03-28
due_date Date 2026-04-27
subtotal Number 6499.00
tax_amount Number 536.17
total_amount Number 7035.17
currency String USD
payment_status String unpaid
line_items Array [{description, qty, unit_price, amount}]
notes String Payment Terms: Net 30
💡

Tip: For invoices with complex layouts or non-English text, switch the model to gpt-4o. It costs a bit more per call but handles edge cases much better. At typical invoice lengths (under 1,000 tokens), even gpt-4o costs less than a penny per invoice.

6 Format Extracted Data (Code Node)

The AI returns a JSON string inside its message content. This Code node does three things: extracts the JSON from the response (handling cases where the AI wraps it in markdown code blocks), validates each field with fallback defaults, and formats the line items array into a human-readable string for Airtable’s long text field.

The node also carries forward metadata from the original email — the subject line and sender address — so you always know where each invoice came from.

{
  "vendor_name": "Acme Supplies Inc.",
  "invoice_number": "INV-2026-0847",
  "invoice_date": "2026-03-28",
  "due_date": "2026-04-27",
  "subtotal": 6499.00,
  "tax_amount": 536.17,
  "total_amount": 7035.17,
  "currency": "USD",
  "payment_status": "unpaid",
  "line_items_text": "1. Consulting Services — Qty: 40 × $150 = $6000\n2. Software License — Qty: 1 × $499 = $499",
  "notes": "Payment Terms: Net 30",
  "source_email_from": "billing@acmesupplies.com",
  "processed_at": "2026-04-08T14:30:00.000Z"
}

7 Save to Airtable

Add an Airtable node set to the Create operation. Point it at your base and table, then map each field from the Code node output to the corresponding Airtable column. The mapping uses n8n expressions like ={{ $json.vendor_name }} to pull values dynamically.

Every processed invoice becomes a new row in your Airtable base — fully structured, searchable, and ready for reporting.

📌

Make sure your Airtable column names match exactly — including capitalization and spaces. “Invoice Number” is not the same as “invoice_number” in Airtable.

8 Mark Email as Read (Gmail)

The final node marks the original email as read so it won’t be picked up again on the next run. It references the email ID from the “Fetch Unread Invoice Emails” node using the expression ={{ $('Fetch Unread Invoice Emails').item.json.id }}.

💡

Tip: Want extra safety? Add a Gmail “Add Label” step before this one to move processed emails into a “Processed Invoices” label. That gives you an audit trail even if something goes wrong with Airtable.

The Data Structure

Before activating the workflow, create an Airtable base called “Invoice Tracker” with a single table named “Invoices” using this schema:

Column Type Example Description
Vendor Name Single line text Acme Supplies Inc. Company that issued the invoice
Invoice Number Single line text INV-2026-0847 Unique invoice identifier
Invoice Date Date 2026-03-28 Date the invoice was issued
Due Date Date 2026-04-27 Payment deadline
Subtotal Currency $6,499.00 Amount before tax
Tax Currency $536.17 Tax amount
Total Amount Currency $7,035.17 Final amount due
Currency Single line text USD Currency code
Payment Status Single select unpaid Options: paid, unpaid, unknown
Line Items Long text 1. Consulting — 40 × $150 Formatted breakdown of charges
Notes Long text Net 30 Payment terms, PO numbers, etc.
Source Email Email billing@acmesupplies.com Sender of the original email
Processed At Date 2026-04-08 Timestamp when the workflow ran
📌

Column names must match the mapping in the Airtable node exactly. The easiest way: duplicate the table from the template’s included Airtable CSV, or create columns one by one following this schema.

Full System Flow

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   GMAIL INBOX                          AIRTABLE                             │
│   ┌──────────────┐                     ┌──────────────────┐                 │
│   │ Unread email  │                     │ Invoice Tracker   │                │
│   │ + PDF attach  │                     │ ┌──────────────┐ │                │
│   └──────┬───────┘                     │ │ Vendor: Acme │ │                │
│          ↓                             │ │ #INV-0847    │ │                │
│   [Schedule: 30 min]                   │ │ $7,035.17    │ │                │
│          ↓                             │ └──────────────┘ │                │
│   [Gmail: Fetch unread]                 └────────▲─────────┘                │
│          ↓                                       │                          │
│   [IF: Has PDF?]──No──→ [Skip]                   │                          │
│          │ Yes                                   │                          │
│          ↓                                      │                          │
│   [Extract PDF Text]                            │                          │
│          ↓                                      │                          │
│   [OpenAI: Parse JSON]                          │                          │
│          ↓                                      │                          │
│   [Code: Format Data]──────────────────────────→┘                          │
│          ↓                                                                 │
│   [Gmail: Mark Read]                                                       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
  

Testing Your Workflow

  1. Send yourself a test email with a sample PDF invoice attached. Use a subject line like “Invoice #TEST-001 from Test Vendor” so it matches the search query.
  2. In n8n, click Test Workflow (the play button) to trigger a manual run.
  3. Watch the execution — each node should show a green checkmark. Click any node to inspect its output data.
  4. Check your Airtable base. You should see a new row with all 13 fields populated from the test invoice.
  5. Verify the original email is now marked as read in Gmail.
Problem Likely Cause Fix
Gmail node returns 0 items No unread emails match the search query Send a test email with “invoice” in the subject and a PDF attached, then wait a moment for Gmail to index it
Extract From File returns empty text The PDF is a scanned image, not text-based Add an OCR step (Google Cloud Vision or AWS Textract) before the extraction node
OpenAI returns malformed JSON Invoice text was too garbled or incomplete Check the extracted text — if it’s unreadable, the source PDF may need OCR. Also try switching to gpt-4o for better handling
Airtable node fails with “field not found” Column name mismatch between n8n mapping and Airtable Compare column names character by character — watch for extra spaces, capitalization differences, or missing columns
Same invoice processed twice Mark as Read step failed or email wasn’t marked before the next scheduled run Check the Gmail node credentials have write permissions. Consider adding a label-based filter as a secondary guard

Frequently Asked Questions

How much does each invoice cost to process with OpenAI?

Very little. A typical invoice is 300–800 tokens of input text. With GPT-4o mini at $0.15 per million input tokens, processing one invoice costs roughly $0.0001 — less than a hundredth of a penny. Even at 1,000 invoices per month, your OpenAI bill would be under $0.50.

Does this work with scanned PDF invoices (images, not text)?

The Extract From File node handles text-based PDFs natively. For scanned invoices where the text is embedded in an image, you’ll need to add an OCR step before extraction. Google Cloud Vision, AWS Textract, or the n8n HTTP Request node calling an OCR API all work. The rest of the workflow stays the same.

Can I use Google Sheets instead of Airtable?

Yes. Replace the Airtable node with a Google Sheets “Append Row” node and map the same fields to spreadsheet columns. The rest of the workflow doesn’t need any changes. Airtable gives you richer field types (currency, single select, date pickers), but Sheets is perfectly functional.

What if I use Outlook instead of Gmail?

Swap the two Gmail nodes for their Microsoft Outlook equivalents — “Microsoft Outlook Trigger” for fetching emails and “Microsoft Outlook” for marking as read. The search query syntax differs slightly, but the concept is identical. Update the credential type in n8n accordingly.

Can the workflow handle invoices in languages other than English?

GPT-4o mini handles most major languages well. The prompt asks for specific JSON field names in English, but the AI will extract data from French, Spanish, German, and many other languages and map it into the English schema. For best results with non-Latin scripts, consider upgrading to GPT-4o.

What happens if two invoices arrive in the same email?

The current workflow processes the first PDF attachment per email. If a sender regularly attaches multiple PDFs to one email, you can modify the workflow to loop through all attachments by adding a “Split In Batches” node after the IF check and adjusting the binary property name dynamically.

Get the Invoice Data Extractor Template

Stop building from scratch — get the complete workflow JSON, Airtable schema, and step-by-step setup guides. Import it, add your credentials, and start processing invoices in under 10 minutes.

Get the Template →

Instant download · Works on n8n Cloud and self-hosted

What’s Next?

  • Add Slack notifications — send a message to your #accounting channel whenever a high-value invoice (over $5,000) is processed.
  • Auto-flag overdue invoices — add a second scheduled workflow that checks Airtable daily and flags rows where the due date has passed and payment status is still “unpaid.”
  • Build a payment approval flow — route invoices above a threshold to a manager via email for approval before marking them as “approved” in Airtable.
  • Connect to accounting software — push the structured data into QuickBooks, Xero, or FreshBooks via their APIs to eliminate double entry entirely.
n8n
PDF
OpenAI
Airtable
invoice processing
automation
AI
accounts payable