How we taught the AI to do the math first, talk after
Language models are confidently wrong with numbers — which is poison for an app about your money. Here's how we stopped ours from guessing: it never does the arithmetic itself. It writes a query, we run it, and only then does it talk.
Ask a language model how much you spent on groceries last month and it will give you an answer — a specific one, to the dollar, in a perfectly confident sentence. The trouble is that it might be made up.
In an app about your money, that's a disaster. A number that's almost right — $312 instead of $340 — is worse than no number at all, because you'll believe it. And the first time you catch your finance app in a confident lie, the trust doesn't come back.
Language models can't actually do math
It's worth being clear about why this happens. A language model doesn't calculate; it predicts the next plausible token. Asked to total forty-odd transactions, it produces something that looks like the sum — the right number of digits, a believable magnitude — without ever adding anything up. Sometimes it nails it. Sometimes it's thirty dollars off. You can't tell which from the outside, and neither can it.
For a while we tried to fix this with better prompting — "be careful," "double-check your math," "show your work." It helped at the margins and solved nothing. You cannot prompt a model into being a calculator.
Math first, talk after
So we stopped asking it to. The shift took us the better part of three months to fully commit to, which is longer than we'd like to admit. It comes down to this: the model's job was never to compute the answer. Its job is to decide what question to ask the data.
Now it runs in two stages. When you ask "how much did I spend on dining in May," the model doesn't answer. It writes a query against your actual transactions and hands it back to us:
select sum(amount)
from transactions
where category = 'dining'
and month = '2026-05';
We run that as real SQL against the database — a machine that does, in fact, do arithmetic — get back a real number, and pass it to the model. Only then does it speak, wrapping that ground-truth figure in a sentence.
The model decides what to count. The database does the counting. The model never touches the number until it's already correct.
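Here is a minimal sketch of that two-stage flow, not our production code. It assumes an in-memory SQLite database with a made-up `transactions` table, and it replaces the model with two hypothetical stub functions (`write_query` and `phrase_answer`) so the example runs on its own. The point is the ordering: the sentence is only built after the database has returned the figure.

```python
import sqlite3


def write_query(question: str) -> str:
    """Hypothetical stand-in for stage one: the model returns SQL, not an answer."""
    return (
        "select sum(amount) from transactions "
        "where category = 'dining' and month = '2026-05'"
    )


def phrase_answer(question: str, total: float) -> str:
    """Hypothetical stand-in for stage two: the model wraps a computed figure in prose."""
    return f"You spent ${total:.2f} on dining in May."


def answer(conn: sqlite3.Connection, question: str) -> str:
    sql = write_query(question)            # stage one: decide what to count
    row = conn.execute(sql).fetchone()     # the database does the counting
    total = row[0] or 0.0                  # sum() is NULL when no rows match
    return phrase_answer(question, total)  # stage two: talk, using the real number


# Illustrative data only; the schema is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("create table transactions (amount real, category text, month text)")
conn.executemany(
    "insert into transactions values (?, ?, ?)",
    [
        (42.50, "dining", "2026-05"),
        (18.25, "dining", "2026-05"),
        (99.00, "groceries", "2026-05"),
        (30.00, "dining", "2026-04"),
    ],
)

print(answer(conn, "how much did I spend on dining in May"))
# prints: You spent $60.75 on dining in May.
```

Only the two May dining rows are summed, so the figure in the sentence comes straight from the query result.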
The part that took three months
The idea is simple; making it reliable was not. The whole thing only works if the model actually asks instead of answering from its gut — and a fluent model's first instinct, every single time, is to just answer. A lot of the work was making the data step non-optional: when a question is about your numbers, querying isn't a choice the model gets to skip.
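One way to picture "non-optional" is a gate on the first stage's output: if the model hands back prose instead of a query, the reply is rejected before it can reach the user. The sketch below is an illustration under that assumption, not a description of our actual checks. The names `QueryRequired` and `require_select` are hypothetical, and the check is deliberately conservative: it accepts only a single statement that starts with `select`.

```python
import re


class QueryRequired(Exception):
    """Raised when stage one produced something other than a single SELECT."""


def require_select(model_output: str) -> str:
    """Return the query if the output is one read-only SELECT, otherwise raise."""
    sql = model_output.strip().rstrip(";").strip()
    if ";" in sql:
        # Conservative: this also rejects a semicolon inside a string literal.
        raise QueryRequired("expected exactly one statement")
    if not re.match(r"(?i)select\b", sql):
        raise QueryRequired("expected a SELECT query, got prose or another statement")
    return sql


print(require_select("select sum(amount) from transactions;"))

try:
    require_select("You spent about $312 on dining.")
except QueryRequired as err:
    print("rejected:", err)
```

A caller could respond to `QueryRequired` by asking the model again or by declining to answer. Either way, a guessed figure never gets through.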
The rest was the long tail — questions that needed more than one query, questions that looked like they needed data but didn't, and keeping all of it fast enough to happen on every message without you feeling the wait. None of it was one clever fix. It was months of closing the gap between "usually right" and "right," because for money those are completely different products.
The rule we ended on is one sentence: the model is never allowed to invent a number. Every figure you see in a conversation was computed from your real transactions, not generated by something that's good at sounding right. The AI gets to be warm and fluent and conversational — all the things it's genuinely good at — precisely because we took the one job it's bad at away from it.
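A rule like that can also be checked after the fact. As a rough sketch, and only as one possible approach, the function below scans a drafted reply for dollar amounts and flags any that do not match a value the database actually returned. The function name, the regular expression, and the choice to compare at cent precision are all assumptions made for illustration.

```python
import re
from decimal import Decimal

# Matches dollar amounts such as $340, $340.00, or $1,240.50.
DOLLAR_AMOUNT = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)")
CENTS = Decimal("0.01")


def invented_figures(reply: str, computed: list[Decimal]) -> list[str]:
    """Return every dollar amount in the reply that is not among the computed values."""
    allowed = {Decimal(value).quantize(CENTS) for value in computed}
    flagged = []
    for match in DOLLAR_AMOUNT.finditer(reply):
        amount = Decimal(match.group(1).replace(",", "")).quantize(CENTS)
        if amount not in allowed:
            flagged.append(match.group(0))
    return flagged


draft = "You spent $340.00 on groceries, up from $312."
print(invented_figures(draft, [Decimal("340")]))
# prints: ['$312']
```

Here `$340.00` matches the computed total, while `$312` has no source in the query results and gets flagged. The sketch only looks at dollar amounts, so figures written without a `$` would need their own check.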