PDF2XLS converts PDF invoices into rows in a Google Spreadsheet. It supports three independent extraction workflows that you can switch between via configuration. Optionally it can upload the original PDF to a public location and write the link to the spreadsheet.
The APIs are not perfect — always double-check the output and fix errors manually.
You choose the workflow with the PreferredAPI setting in appsettings.json. Valid values:
| `PreferredAPI` value | What it does | External services required |
|---|---|---|
| `NuDelta` | Sends the PDF to the NuDelta Invoice service and polls until extraction is done. | NuDelta Invoice account |
| `OpenAIResponses` | Sends the PDF as an `input_file` to the OpenAI Responses API with a structured prompt and JSON schema. | OpenAI account with API key |
| `AzureDocumentIntelligence` | Sends the PDF to the Azure Document Intelligence `prebuilt-invoice` model and maps the structured fields to the internal schema. | Azure subscription with a Document Intelligence resource |
All three workflows share the Google Sheets, Seq, and (optional) PDF upload configuration described below.
The application validates that the configuration fields required by the selected workflow are present at startup and exits with a clear error message if anything is missing.
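The startup check can be pictured with a short Python sketch. It is illustrative only and not the application's code: the helper names `lookup` and `missing_settings` are hypothetical, and the set of `GoogleSheets` keys treated as mandatory here is an assumption.

```python
# Hypothetical sketch of per-workflow configuration validation.
# Key names follow this README; which GoogleSheets keys are mandatory is assumed.
REQUIRED_KEYS = {
    "NuDelta": ["NuDeltaCredentials:Username", "NuDeltaCredentials:Password"],
    "OpenAIResponses": ["OpenAI:OpenAI_APIKey", "OpenAI:OpenAI_Model", "OpenAI:Prompt"],
    "AzureDocumentIntelligence": [
        "AzureDocumentIntelligence:Endpoint",
        "AzureDocumentIntelligence:ApiKey",
    ],
}
COMMON_KEYS = [
    "GoogleSheets:ServiceAccountFile",
    "GoogleSheets:SpreadsheetId",
    "GoogleSheets:SheetName",
]


def lookup(config, key):
    """Resolve a 'Section:Key' path in a nested dict; return None if absent."""
    node = config
    for part in key.split(":"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_settings(config):
    """Return the settings that are absent or empty for the selected workflow."""
    workflow = config.get("PreferredAPI")
    if workflow not in REQUIRED_KEYS:
        return ["PreferredAPI"]
    keys = COMMON_KEYS + REQUIRED_KEYS[workflow]
    return [key for key in keys if lookup(config, key) in (None, "")]
```

A caller would load `appsettings.json` with `json.load`, pass the resulting dict to `missing_settings`, and exit with an error message if the returned list is not empty.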
Regardless of which workflow you use, you always need:
- A Google Service Account with access to your target spreadsheet (JSON key file).
- A Google Spreadsheet with column headers matching the `Mappings` values.
- (Optional) A Seq server if you want centralised logs.
- (Optional) The PDF2URL helper executable if you want to upload PDFs and store a link in the sheet.
Then, depending on the workflow:
- NuDelta workflow → NuDelta Invoice account.
- OpenAIResponses workflow → OpenAI account with a funded API key.
- AzureDocumentIntelligence workflow → Azure subscription with a Document Intelligence (Form Recognizer) resource.
- Go to https://console.cloud.google.com/ and create (or select) a project.
- Open APIs & Services → Library and enable Google Sheets API.
- Open APIs & Services → Credentials → Create Credentials → Service Account. Give it a name (this name becomes `GoogleSheets:ApplicationName`).
- Open the new service account, go to the Keys tab, then Add Key → Create new key → JSON. Save the downloaded file somewhere safe; its path goes into `GoogleSheets:ServiceAccountFile`.
- Open the JSON file and copy the `client_email` value (looks like `name@project.iam.gserviceaccount.com`). In your Google Spreadsheet, click Share and grant Editor access to that email.
- From the spreadsheet URL `https://docs.google.com/spreadsheets/d/<SPREADSHEET_ID>/edit#gid=0`, copy `<SPREADSHEET_ID>` into `GoogleSheets:SpreadsheetId`.
- Set `GoogleSheets:SheetName` to the tab name (e.g. `Sheet1`).
- Fill in `GoogleSheets:Mappings`: each field's value is the column letter in the sheet where that field should be written (e.g. `"InvoiceNumber": "A"`). Leave blank to skip a field.
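To make the role of the column letters concrete, here is a minimal Python sketch of how a mapping could be turned into a sheet row. It is an illustration under stated assumptions, not the program's implementation: `column_index` and `build_row` are hypothetical helpers, and the field names `Total` and `Vendor` in the usage note are made up.

```python
def column_index(letters):
    """Convert a column letter such as 'A' or 'AB' to a zero-based index."""
    index = 0
    for ch in letters.strip().upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"not a column letter: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def build_row(mappings, fields):
    """Place each extracted field in the column named by its mapping."""
    cells = {}
    for name, letter in mappings.items():
        if not letter:
            continue  # a blank mapping means the field is skipped
        cells[column_index(letter)] = fields.get(name, "")
    width = max(cells) + 1 if cells else 0
    # Columns that no field maps to stay empty.
    return [cells.get(i, "") for i in range(width)]
```

With the mapping `{"InvoiceNumber": "A", "Total": "C", "Vendor": ""}`, the invoice number lands in column A, the total in column C, column B stays empty, and the vendor is not written at all.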
- Sign up at the NuDelta Invoice portal and confirm you have an active subscription.
- Put your portal login into `NuDeltaCredentials:Username` and `NuDeltaCredentials:Password`. These are used with HTTP Basic auth against the NuDelta API.
- Create an account at https://platform.openai.com/.
- Add credit or a payment method under Billing. The Responses API with `input_file` requires a funded account.
- Go to API keys → Create new secret key. Copy the value (`sk-…`) into `OpenAI:OpenAI_APIKey`. You won't be able to see it again, so store it safely.
- Choose a model that supports the `input_file` content type and set it in `OpenAI:OpenAI_Model`. Recommended: `gpt-4o-mini` (good cost/quality trade-off). Other supported options at the time of writing: `gpt-4o`, `gpt-4.1`, `gpt-4.1-mini`.
- Leave `OpenAI:Prompt` at the default unless you know what you're doing. The placeholder `{schema}` is substituted at runtime with the JSON schema the program expects; do not remove it.

Cost note: each invoice costs a few cents on `gpt-4o-mini`. Larger or multi-page PDFs cost more because the file is sent as base64.
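The `{schema}` substitution in `OpenAI:Prompt` can be sketched in a few lines of Python. This is only an assumption about how such a replacement might work: `render_prompt` is a hypothetical name, and the schema in the usage note is a placeholder, not the program's real schema.

```python
import json


def render_prompt(template, schema):
    """Insert the JSON schema where the template says {schema}."""
    if "{schema}" not in template:
        raise ValueError("prompt template must contain the {schema} placeholder")
    # Plain replacement is used instead of str.format, because the JSON text
    # itself contains braces that str.format would try to interpret.
    return template.replace("{schema}", json.dumps(schema, indent=2))
```

For example, `render_prompt("Return JSON matching: {schema}", {"type": "object"})` produces a prompt that ends with the pretty-printed schema. A template without the placeholder is rejected, which is why the setting should keep it.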
- In the Azure Portal, click Create a resource → AI + Machine Learning → Document Intelligence (previously called Form Recognizer).
- Choose a subscription, a resource group, a region (e.g. `westeurope`), and a pricing tier. F0 (free) lets you process 500 pages/month; S0 (standard) is pay-as-you-go.
- After deployment, open the resource and go to Keys and Endpoint in the left menu.
- Copy Endpoint (looks like `https://<resource>.cognitiveservices.azure.com/`) into `AzureDocumentIntelligence:Endpoint`.
- Copy KEY 1 into `AzureDocumentIntelligence:ApiKey`.
- This workflow uses the `prebuilt-invoice` model, which is built into the service, so you do not need to train anything. It is available in all regions that support Document Intelligence v4.
Note: Document Intelligence is deterministic and returns structured fields directly. It does not use OpenAI.
If you want a DocumentLink column in the sheet pointing to the uploaded PDF:
- Obtain or build a small uploader CLI (the `PDF2URL` helper) that takes a file path as an argument and prints the public URL to stdout.
- Set `UploadPDF:Enabled` to `"true"` and `UploadPDF:PDF2URLPath` to the executable path.
- Add `DocumentLink` to `GoogleSheets:Mappings` with the column letter to write the URL into.
Set UploadPDF:Enabled to "false" to skip uploading.
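The helper contract (file path in as an argument, public URL out on stdout) can be exercised with a short Python sketch. It is a hedged illustration, not the application's code: `run_uploader` is a hypothetical function, the 300-second default mirrors the 5-minute PDF2URL timeout mentioned elsewhere in this README, and treating a non-zero exit code as "no URL" is an assumption.

```python
import subprocess


def run_uploader(command, pdf_path, timeout_seconds=300):
    """Run the uploader and return the URL it prints, or '' on any failure.

    `command` is the argv prefix, e.g. [r"C:\\tools\\PDF2URL.exe"].
    """
    try:
        result = subprocess.run(
            [*command, pdf_path],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return ""
    if result.returncode != 0:
        return ""  # assumption: a failing helper yields no usable URL
    return result.stdout.strip()
```

An empty return value stands for "no link available", so a caller can decide not to write the row and to leave the PDF in place.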
- Run a Seq server (locally or remote).
- Put its URL into `Seq:ServerAddress` (e.g. `http://localhost:5341/`).
- Create an API key in Seq and set it in `Seq:ApiKey`.
- `Seq:AppName` is the value of the `Application` property used to filter events in Seq.
If you don't have a Seq server, leave Seq:ServerAddress empty — the file logs under logs/ still work.
- Open the GitHub Releases page and download the latest `PDF2XLS-<version>-win-x64.zip` asset.
- Unpack the archive into a folder. It contains:
  - `PDF2XLS.exe`: self-contained, single-file executable (64-bit Windows)
  - `appsettings.json`: configuration file (kept alongside the executable so you can edit it without rebuilding)
- Open `appsettings.json` and fill in:
  - Always: `PreferredAPI`, `GoogleSheets.*`.
  - Workflow-specific (only for the workflow you chose; see above).
  - Optional: `UploadPDF.*`, `Seq.*`.
| Section | Key | Required for | Description |
|---|---|---|---|
| (root) | `PreferredAPI` | All | `NuDelta`, `OpenAIResponses`, or `AzureDocumentIntelligence`. |
| `NuDeltaCredentials` | `Username`, `Password` | NuDelta | NuDelta Invoice portal login (HTTP Basic auth). |
| `OpenAI` | `OpenAI_APIKey`, `OpenAI_Model`, `Prompt` | OpenAIResponses | OpenAI key, model id, and prompt containing `{schema}`. |
| `AzureDocumentIntelligence` | `Endpoint`, `ApiKey` | AzureDocumentIntelligence | Azure Document Intelligence endpoint and key. |
| `GoogleSheets` | `ServiceAccountFile`, `SpreadsheetId`, `ExpectedSpreadsheetName`, `SheetName`, `ApplicationName`, `Mappings` | All | Google Sheets target. `ExpectedSpreadsheetName` is verified against the live spreadsheet title at startup; leave blank to skip the check. `Mappings` values are spreadsheet column letters. |
| `UploadPDF` | `Enabled`, `PDF2URLPath` | Optional | Enable to upload the PDF and store the link in `DocumentLink`. |
| `Seq` | `ServerAddress`, `ApiKey`, `AppName` | Optional | Centralised Seq logging. `ApiKey` is required when `ServerAddress` is set. |
This is a CLI application. Pass one or more PDF file paths, or a folder path, as arguments. You can also drag-and-drop a PDF (or folder) onto the executable.

    PDF2XLS.exe <file.pdf> [file2.pdf ...] | <folder path>

For a single file:

    PDF2XLS.exe "C:\invoices\invoice-001.pdf"

The selected workflow processes the file, parses the response into the internal schema, and appends a row to the configured Google Sheet.
Pass two or more PDF paths on the command line:

    PDF2XLS.exe "C:\invoices\invoice-001.pdf" "C:\invoices\invoice-002.pdf"

- Files are processed in the order given on the command line.
- Duplicate paths are skipped automatically.
Pass a single folder path to process all PDF files inside it:

    PDF2XLS.exe "C:\invoices\2026-05"

- The app scans for all `.pdf` files in the folder (top-level only; subfolders are not scanned).
- Files are processed in alphabetical order.
When more than one file is processed in a single application run (multiple CLI paths or a folder):
- Each file gets its own unique RunID (GUID), but they all share the same RunTime timestamp.
- Each processed file is appended as a separate row in the Google Sheet.
- Each processed file is renamed with a `.bak` suffix after it succeeds.
- If a file fails, it is left untouched and processing continues with the next file.
- If a folder contains no PDF files, the app exits with a warning message.
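The input rules above (argument order preserved, duplicates skipped, folders scanned at the top level in alphabetical order) can be sketched in Python. This is an illustrative reading of those rules, not the application's code: `collect_pdfs` is a hypothetical name, and matching the `.pdf` extension and sorting without regard to letter case are assumptions.

```python
import os
from pathlib import Path


def collect_pdfs(args):
    """Expand command-line arguments into the ordered list of PDFs to process."""
    if len(args) == 1 and Path(args[0]).is_dir():
        folder = Path(args[0])
        # Top level only: iterdir() does not descend into subfolders.
        pdfs = [p for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() == ".pdf"]
        return sorted(pdfs, key=lambda p: p.name.lower())

    seen = set()
    ordered = []
    for arg in args:
        # normcase folds case on Windows only, so equal paths compare equal there.
        key = os.path.normcase(os.path.abspath(arg))
        if key not in seen:
            seen.add(key)
            ordered.append(Path(arg))
    return ordered
```

An empty result for a folder argument corresponds to the "no PDF files" warning case.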
On startup the program creates a logs/ subfolder next to the executable and writes a daily rolling log file (365 days retained, up to 365 files).
If Seq:ServerAddress is set, events are also forwarded to Seq using the configured Seq:ApiKey. After each PDF file finishes processing (success or failure), buffered log events are flushed to both the local file and Seq before the next file starts.
| Destination | When active | Notes |
|---|---|---|
| Local file | Always | `{exe}/logs/log-YYYYMMDD.txt` |
| Seq | When `Seq:ServerAddress` is configured | Requires `Seq:ApiKey` |
When the Google Sheets write succeeds, the original PDF is renamed in-place using the following convention:

    {RunTime} {RunID} {OriginalFileName}.bak

| Part | Format | Details |
|---|---|---|
| `RunTime` | `yyyyMMdd HHmmss` | UTC timestamp of application start. Set once per batch run; all files in the same run share this value. |
| `RunID` | GUID | A unique ID generated for each file. Also written to the Google Sheet (column mapped to `RunID`). |
| `OriginalFileName` | n/a | The original filename, unchanged, including its extension. |
| `.bak` | n/a | Fixed suffix appended after the original extension. |
All three parts are separated by a single space. There are no underscores in the filename.
Example:
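
    20260517 013221 48070c04-bce5-4205-8ee7-ff506c8f2533 invoice-001.pdf.bak

Why spaces? Spaces make the three logical parts visually distinct without requiring a special delimiter character. The GUID already contains hyphens internally, so using an underscore as a separator would be ambiguous when splitting the name programmatically. Because `RunTime` itself contains one space (between the date and the time), a `Split(' ', 4)` yields the date, the time, `RunID`, and the original filename with its `.bak` suffix; joining the first two pieces recovers `RunTime`. The original filename survives intact even if it contains spaces, because it is always the last piece.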
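The naming convention and its parsing can be checked with a small Python sketch. It is illustrative only, with hypothetical helper names `backup_name` and `parse_backup_name`. Because `RunTime` contains a space of its own, the name is cut into four pieces (date, time, `RunID`, remainder); in Python that is `split(" ", 3)`, where the number is the maximum count of splits, not the number of pieces.

```python
import uuid
from datetime import datetime, timezone


def backup_name(run_time, run_id, original_name):
    """Build '{RunTime} {RunID} {OriginalFileName}.bak'."""
    return f"{run_time:%Y%m%d %H%M%S} {run_id} {original_name}.bak"


def parse_backup_name(name):
    """Recover (RunTime, RunID, OriginalFileName) from a renamed file."""
    # At most three splits, so spaces inside the original filename are kept.
    date_part, time_part, run_id, rest = name.split(" ", 3)
    original = rest[:-len(".bak")] if rest.endswith(".bak") else rest
    return f"{date_part} {time_part}", run_id, original


run_time = datetime(2026, 5, 17, 1, 32, 21, tzinfo=timezone.utc)
run_id = uuid.UUID("48070c04-bce5-4205-8ee7-ff506c8f2533")
name = backup_name(run_time, run_id, "invoice-001.pdf")
```

Here `name` reproduces the example filename shown above, and `parse_backup_name(name)` returns the timestamp, the GUID, and `invoice-001.pdf`.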
The file is left untouched if processing fails or the Google Sheets write does not succeed.
- NuDelta — polls for the result with exponential backoff (up to 5 attempts, 1-second base delay). The outer operation also retries on exception with a 1-second delay (up to 5 attempts).
- OpenAIResponses — the inner HTTP client retries on HTTP 429 / 5xx with exponential backoff (3 attempts). The outer operation also retries on exception with exponential backoff (3 attempts).
- AzureDocumentIntelligence — retries on exception with exponential backoff (3 attempts, 2 s → 4 s → 8 s). The Azure SDK additionally handles low-level transient HTTP errors.
The application is designed so that a network outage or a service disruption during any stage of processing always leaves the source file untouched. The next run will pick it up and try again.
A file is only renamed/deleted after all of the following have succeeded:
- PDF extraction (NuDelta / OpenAI / Azure DI)
- PDF upload to public URL (if `UploadPDF:Enabled` is `"true"`)
- Google Sheets row write
If any step fails — even after all retries are exhausted — the file stays in place.
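The "rename only after everything succeeded" rule can be expressed as a small Python sketch. It is a simplified illustration rather than the application's code: `process_file` and its callback parameters are hypothetical, and retries are assumed to happen inside the callbacks.

```python
def process_file(pdf_path, extract, write_row, rename, upload=None):
    """Run the stages in order; rename the PDF only if all of them succeed.

    Returns True when the file was fully processed and renamed.
    """
    try:
        fields = extract(pdf_path)
        if upload is not None:
            url = upload(pdf_path)
            if not url:
                return False  # no URL: the sheet write is not attempted
            fields = {**fields, "DocumentLink": url}
        write_row(fields)
    except Exception:
        # Any failure leaves the source file in place for the next run.
        return False
    rename(pdf_path)
    return True
```

Keeping the rename as the very last statement means a failure at any earlier stage can never leave a file marked as processed without a matching row in the sheet.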
| Layer | Timeout | Behaviour on expiry |
|---|---|---|
| Azure DI polling | 5 minutes | OperationCanceledException thrown; retried by outer policy (up to 3×) |
| NuDelta HTTP client (upload + poll requests) | 5 minutes per request | TaskCanceledException thrown; propagates to outer retry policy |
| OpenAI HTTP client | 5 minutes per request | TaskCanceledException thrown; propagates to outer retry policy |
| Google Sheets API calls (all three calls per write) | 5 minutes per call | OperationCanceledException thrown; propagates as a write failure — file stays |
| PDF2URL process | 5 minutes | Process is killed; empty URL returned → write not attempted → file stays |
| Workflow / stage | Retries | Delay strategy | What triggers a retry |
|---|---|---|---|
| NuDelta outer (whole operation) | up to 5 | 1 s fixed | Any exception except `OperationCanceledException` |
| NuDelta inner (result polling) | up to 5 | Exponential (2^n s) | Document state is not done |
| OpenAI outer (whole operation) | up to 3 | Exponential (2^n s) | Any exception except `OperationCanceledException` |
| OpenAI inner (HTTP call) | up to 3 | Exponential (2^n s) | HTTP 5xx or HTTP 429 |
| Azure DI (whole operation) | up to 3 | Exponential (2^n s) | Any exception except `OperationCanceledException` |
| Google Sheets (each API call) | up to 3 | Exponential (2^n s) | HTTP 5xx, HTTP 429, `HttpRequestException`, `IOException` |
OperationCanceledException is never retried in any policy — it signals an intentional 5-minute timeout and should propagate immediately so the file is left untouched for the next run.
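A retry policy of this shape can be sketched in Python. This is not the application's code: `retry_with_backoff` and the `OperationCanceled` class are hypothetical stand-ins, and the sketch assumes the retry count means retries after the first attempt, with delays of 2 s, 4 s and 8 s.

```python
import time


class OperationCanceled(Exception):
    """Stand-in for a timeout-driven cancellation that must not be retried."""


def retry_with_backoff(operation, retries=3, sleep=time.sleep):
    """Call operation(); on failure wait 2^n seconds and try again."""
    attempt = 0
    while True:
        try:
            return operation()
        except OperationCanceled:
            raise  # propagate immediately so the file is left untouched
        except Exception:
            attempt += 1
            if attempt > retries:
                raise  # retries exhausted: surface the last error
            sleep(2 ** attempt)
```

Passing `sleep` as a parameter keeps the policy testable without real waiting. A fixed-delay policy such as the NuDelta outer one would replace `2 ** attempt` with a constant.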
Before processing any files the application fetches the spreadsheet title and compares it to GoogleSheets:ExpectedSpreadsheetName. If the names do not match, or if the API call fails after 3 retries, the application logs the error and exits without processing any files. This prevents writing data to the wrong spreadsheet when SpreadsheetId is misconfigured.
Requires the .NET 10 SDK.

    dotnet publish PDF2XLS/PDF2XLS.csproj `
      -c Release `
      -r win-x64 `
      --self-contained true `
      -p:PublishSingleFile=true `
      -p:PublishReadyToRun=true `
      -p:EnableCompressionInSingleFile=true `
      -o publish

The output directory contains `PDF2XLS.exe` and `appsettings.json`.
Pushing a version tag triggers the Release workflow, which:
- Builds a self-contained, single-file `win-x64` executable on `windows-latest`
- Packages `PDF2XLS.exe` and `appsettings.json` into `PDF2XLS-<tag>-win-x64.zip`
- Creates a GitHub release with auto-generated release notes and attaches the zip
To publish a release:

    git tag v1.0.0
    git push origin v1.0.0