Finance × AI: In Practice Post 2 of 5 April 2026

The Architecture: What the Tool Needs to Do Before Writing a Line of Code

Before any Python, before any API calls. The input structure, output spec, and prompt decisions that will determine whether this tool is useful or just plausible-sounding.

Build Log · Variance Commentary Generator

Before I wrote a single line of Python, I spent two days staring at blank docs trying to answer one question: what does this tool actually need to do?

Not "build a variance commentary generator." That's the label. What does it need to do at a level specific enough that I can test whether it worked?

That question forced a set of smaller ones. What does the input look like? What counts as good output? What does the prompt need to know, and what should it never be asked to guess? This post is the architecture work that happened before any code. No Python yet. Just the decisions that will determine whether the thing is useful or not.

Post 2 spec
Input: Flat Excel table, one row per P&L line item, pre-calculated variances
Output: Structured commentary, one block per line item, no invented context
Stack: Python · openpyxl · Claude API, using claude-sonnet-4-20250514
This post: Input design, output spec, prompt strategy. Before any code runs.

What the input needs to look like

Finance teams produce variance data in a hundred different formats. The question isn't what formats exist out there; it's the minimum structure I need to make the prompt useful.

My answer: a flat Excel table. One row per line item. Eight columns. No formulas in the file.

#   Column                        Type      Example value
1   Line Item                     Input     Revenue
2   Budget (€)                    Input     1,200,000
3   Actual (€)                    Input     1,340,000
4   Prior Year (€)                Input     1,150,000
5   Variance vs Budget (€)        Pre-calc  +140,000
6   Variance vs Budget (%)        Pre-calc  +11.7%
7   Variance vs Prior Year (€)    Pre-calc  +190,000
8   Variance vs Prior Year (%)    Pre-calc  +16.5%

openpyxl reads it flat. Python calculates nothing. The variances arrive pre-calculated: if the input is already clean, the model can focus entirely on language rather than on arithmetic it might get wrong.

The deliberate constraint: no chart of accounts, no entity hierarchy, no dimension tables. This is a line-item tool. A standard P&L has twelve to twenty items. Flat is readable, flat is testable, and flat means I can validate inputs without a schema validator. Aggregation is a post-v1 problem.1
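This is concrete enough to sketch. A minimal version of the parse-and-validate step, assuming openpyxl's `iter_rows(values_only=True)` as the row source — the function name and the exact validation rules are my illustration, not a fixed schema:

```python
# Sketch: turning openpyxl's row tuples into validated dicts.
# rows_to_records() and its rules are illustrative, not a final schema.

EXPECTED_COLUMNS = [
    "Line Item",
    "Budget (€)",
    "Actual (€)",
    "Prior Year (€)",
    "Variance vs Budget (€)",
    "Variance vs Budget (%)",
    "Variance vs Prior Year (€)",
    "Variance vs Prior Year (%)",
]

def rows_to_records(rows):
    """rows: iterable of tuples, as ws.iter_rows(values_only=True) would yield.
    The first tuple is the header; every later tuple is one P&L line item."""
    rows = iter(rows)
    header = list(next(rows))
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected header: {header}")
    records = []
    for row in rows:
        record = dict(zip(header, row))
        # Flat-file validation: every column after Line Item must be numeric.
        for col in EXPECTED_COLUMNS[1:]:
            if not isinstance(record[col], (int, float)):
                raise ValueError(f"{record['Line Item']}: {col} is not numeric")
        records.append(record)
    return records
```

Because the file is flat, this loop plus one header comparison is the whole validator. No schema library, no dimension lookups.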

What "finance-grade" actually means

This is the part that matters more than any technical choice.

When I say finance-grade commentary, I mean three specific things.

First, directional specificity. Not "revenue increased" but "Revenue was €140k above budget, up 11.7%." The number has to be in the sentence. The direction has to be explicit.

Second, no hallucinated context. The model can only speak to what's in the data. If I give it no reason for a variance, it cannot invent one. "Higher headcount" is not a valid output if headcount isn't in the input. This is the biggest failure mode in naive prompting, and the one I'm most worried about.

Third, consistent structure. Each commentary block should follow the same logic: what moved, by how much, in which direction, against which benchmark. Finance people don't want creative variation in structure. They want something they can scan.2

Here's what that looks like in practice. Same data point, two outputs.

✗  Naive output: plausible but useless

Revenue showed positive performance this period, exceeding expectations. The strong results reflect effective commercial strategies and favourable market conditions.

✓  Finance-grade target

Revenue of €1,340k was €140k (11.7%) above budget and €190k (16.5%) above prior year. No driver detail is available in the input; editorial context to be added by reviewer.

The bad version could describe any positive variance on any line item in any company. The good version is boring, and that's correct. It's a starting point, not a story.

What finance-grade is not: sophisticated narrative. That requires business context this tool doesn't have. What it produces is a defensible first draft, faster to edit than to write from scratch.
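The target is structural enough that its first sentence can be rendered deterministically. In the tool itself Claude writes the sentence; this sketch just pins down the shape it has to match (the function name and formatting choices are mine):

```python
# Sketch: the structural target for the first sentence of each commentary
# block, rendered deterministically. Claude produces the real text; this
# is the reference shape it has to hit.

def target_first_sentence(item, actual, var_bud, var_bud_pct, var_py, var_py_pct):
    def side(value):
        # Direction must be explicit: above or below the benchmark.
        return "above" if value >= 0 else "below"
    return (
        f"{item} of €{actual/1000:,.0f}k was €{abs(var_bud)/1000:,.0f}k "
        f"({abs(var_bud_pct):.1f}%) {side(var_bud)} budget and "
        f"€{abs(var_py)/1000:,.0f}k ({abs(var_py_pct):.1f}%) {side(var_py)} prior year."
    )
```

Feeding in the Revenue row from the input table reproduces the finance-grade example above word for word, which is the point: what moved, by how much, in which direction, against which benchmark.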

Prompt design decisions

I spent more time on the prompt than on any other part of the build so far. A few decisions that shaped it.

The model is a finance analyst, not a writing assistant. The system prompt establishes this explicitly: "You are a financial analyst preparing internal management commentary for review." Not "You are a helpful assistant." The framing changes the output register. Tested both. The difference in formality and structure is noticeable.

Numbers must come from the input only. Explicit instruction in the prompt: do not reference any figures not present in the data provided. This addresses hallucination risk directly. Whether Claude follows it consistently across edge cases (missing values, zero variances, prior year not available) is something I'll test in Post 3 with real data.

Sentence-level directives beat vague quality adjectives. "Write two sentences per line item" produces more consistent output than "write concise commentary." "Include the absolute variance in euros and the percentage variance in the first sentence" is more reliable than "be specific." The more structural the instruction, the less the model has to interpret. That's what you want here.
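Those decisions combine into the per-row prompt roughly like this. The exact directive wording is illustrative; what matters is that every instruction is structural ("two sentences", "euros and percentage in the first sentence"), not adjectival ("concise", "specific"):

```python
# Sketch: assembling the per-row prompt from structural directives.
# Directive wording is illustrative, not the final prompt.

SYSTEM_PROMPT = (
    "You are a financial analyst preparing internal management "
    "commentary for review."
)

DIRECTIVES = [
    "Write exactly two sentences for this line item.",
    "Include the absolute variance in euros and the percentage variance in the first sentence.",
    "State the direction of each variance explicitly (above or below the benchmark).",
    "Do not reference any figures not present in the data provided.",
    "If no driver context is given, say so rather than inferring one.",
]

def build_user_prompt(record: dict) -> str:
    # One "key: value" line per column, then the directive list.
    data_lines = "\n".join(f"{k}: {v}" for k, v in record.items())
    directive_lines = "\n".join(f"- {d}" for d in DIRECTIVES)
    return f"Data:\n{data_lines}\n\nInstructions:\n{directive_lines}"
```

Note that the hallucination guard and the missing-driver instruction live in the directives, not in the model choice.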

Why claude-sonnet-4-20250514. Fast enough for a P&L-sized input, accurate enough for structured text generation. Extended thinking mode isn't worth it for this use case because the structure is doing the heavy lifting. The model's job is to fill a template reliably, not to reason through ambiguity. If the input is clean, Sonnet is the right call.3
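For the call itself, the plan is the Anthropic Python SDK's Messages API. A sketch that keeps the request construction separate from the network call, so the parameters can be tested without an API key (the `max_tokens` value is my assumption, sized for two sentences per line item):

```python
# Sketch: the request as I plan to send it via the Anthropic Python SDK.
# Building the kwargs separately means they can be asserted on in tests
# before any network call happens.

MODEL = "claude-sonnet-4-20250514"

def build_request(system_prompt: str, user_prompt: str) -> dict:
    return {
        "model": MODEL,
        "max_tokens": 300,        # two sentences per line item needs little room
        "system": system_prompt,  # analyst framing, not "helpful assistant"
        "messages": [{"role": "user", "content": user_prompt}],
    }

# The live call, once the key is in the environment:
# import anthropic
# client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
# response = client.messages.create(**build_request(system_prompt, user_prompt))
# commentary = response.content[0].text
```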

The flow

Data flow / v1 architecture:

Excel input (openpyxl) → structured data → Python formatter (row context) → Claude API (Sonnet, commentary generation) → terminal output (v1, plain text) → Power BI (post-v1, later).

The key design choice: Python handles data, Claude handles language. There is no blurring of those responsibilities. The script does not summarise or interpret. It formats and passes. The model does not calculate or restructure. It writes.

What success looks like before writing a line of code

I defined the success criteria upfront so I'd know what to test against, not just whether it ran without crashing.

For the input: the script reads a properly formatted Excel file without crashing and without me touching the data manually before running it.

For the output: each line item gets a one-to-two sentence commentary that contains the correct number, the correct direction, and nothing invented.

For the prompt: running the same input twice produces structurally similar output. Variance in phrasing is fine. Factual inconsistency is not.
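That consistency criterion is checkable mechanically. A sketch of the test I have in mind: extract every euro figure and percentage from two runs and compare the sets, ignoring phrasing (the regexes are my rough cut, not a finished parser):

```python
# Sketch: a factual-consistency check for the "same input twice" criterion.
# Phrasing may differ between runs; the extracted figures must not.
import re

def extract_figures(text: str) -> set:
    # Euro amounts like €1,340k or €140,000 and percentages like 11.7%.
    euros = re.findall(r"€[\d,.]+k?", text)
    pcts = re.findall(r"\d+(?:\.\d+)?%", text)
    return set(euros) | set(pcts)

def factually_consistent(run_a: str, run_b: str) -> bool:
    return extract_figures(run_a) == extract_figures(run_b)
```

Two runs that phrase the same €140k / 11.7% variance differently pass; a run that drifts to €1,350k fails, which is exactly the distinction the success criterion draws.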

That's the bar. Not "does it look impressive" but "does it produce output a finance analyst could use as a starting point without needing to correct the facts first?"

Repository · Built in public
variance-commentary-generator
github.com/alloutofxps/variance-commentary-generator · code committed post by post
Notes
1 The decision to keep v1 flat is partly pragmatic, partly philosophical. Flat inputs are easier to validate and easier to explain. Once the core loop works, adding dimensional support (entity, period, segment) is additive rather than architectural. Start with a problem you can define cleanly.
2 The hallucination risk in LLM-generated commentary isn't random fabrication, it's plausible fabrication. "Volume growth driven by new customer wins" is believable. It's also completely invented if the input doesn't include CRM data. The fix isn't model selection, it's prompt design: explicitly forbid inference beyond the data provided, and instruct the model to flag where driver context is missing rather than fill the gap.
3 Model documentation: Anthropic model overview. claude-sonnet-4-20250514 is the Sonnet generation used at time of writing.

Next up: building it. The actual Python, the actual prompt in full, and whatever breaks first when I point it at real numbers. Find me on LinkedIn if you want to follow along or flag something I've got wrong.

Pratik Parashar

Senior Auditor at a Big-4 firm in the Netherlands. Writing about AI, finance, and what happens when the two collide. Building a Variance Commentary Generator in public. Open to new roles in Finance Transformation, FP&A, and Financial Control. LinkedIn →