FoxBurrowAI takes raw invoice and purchase-order exports and turns them into rich, sensible spend categories (aligned to UNSPSC), ready for analysis and dashboards.
The underlying research ran FoxBurrowAI’s workflow against a large public invoice dataset and compared it to a careful human baseline.
Most organisations have years of invoices, POs, and supplier payments - but very few have a clear view of what that money is actually buying. When thousands of lines sit in generic, inconsistent, or overly broad categories, opportunities remain hidden. Good categorisation turns raw transactions into usable insight.
Clean categorisation often pays for itself quickly: even a 1–3% improvement in visibility across a multi-million-pound spend base translates into material savings, reduced waste, and better decision-making.
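As a rough illustration of that claim, a back-of-the-envelope sketch (the spend base below is a hypothetical figure, not client data):

```python
# Hypothetical worked example: what a 1-3% visibility improvement is worth
# on a multi-million-pound spend base.
spend_base = 10_000_000  # assumed £10m annual addressable spend

for improvement in (0.01, 0.02, 0.03):
    saving = spend_base * improvement
    print(f"{improvement:.0%} improvement -> £{saving:,.0f} recovered")
```

Even at the conservative end, the recovered value dwarfs the cost of categorising the lines in the first place.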
Without it, organisations end up running procurement with blurred vision - and miss the very areas where efficiencies and improved value lie.
Categorisation is only useful if it holds up under real scrutiny — especially at deeper levels like L3/L4 where mistakes are easy to hide. Below are two practical comparisons.
On a real public-sector invoice dataset with existing manual categories, FoxBurrowAI produced results that were comparable to — and in places more consistent than — a diligent but non-specialist human analyst working line-by-line.
In practice, this is the difference between “months of effort” and “review and refine”.
A useful baseline is what happens when you rely on a general-purpose model, steered only by prompt tweaks, to categorise items through its chat front end. Reported results in the literature show that while L1 accuracy can be passable, it collapses as you go deeper, which is exactly where procurement teams need precision.
| Approach | L1 match | L2 match | L3 match | L4 match |
|---|---|---|---|---|
| Prompt-engineered GPT-4 (cleaned dataset) | 54.59% | 40.31% | 29.01% | 10.80% |
At L4, roughly 10% matching is not fit for purpose: it forces heavy manual correction, destroys trust, and makes dashboards misleading. FoxBurrowAI is built to behave like an analyst, producing consistent, reviewable decisions rather than a one-shot guess.
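To make that concrete, a quick sketch of the operational cost (the 10.8% figure comes from the table above; the row count is a hypothetical example):

```python
# Rough cost of a ~10.8% L4 match rate on a typical invoice export.
rows = 50_000            # hypothetical invoice-line count
l4_match_rate = 0.108    # L4 match rate from the table above

needs_review = round(rows * (1 - l4_match_rate))
print(f"{needs_review:,} of {rows:,} lines need manual correction at L4")
# -> 44,600 of 50,000 lines: roughly nine lines in ten checked by hand
```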
Many published tests use cleaned, cherry-picked, or “best case” rows. In real procurement exports, descriptions are inconsistent, columns go missing, and suppliers use vague language. That’s why a production approach needs guardrails, consistency, and review support — not just a clever prompt.
In one public dataset, a line labelled by the original team as “green beans” turned out – on closer inspection of the description – to be garden peas.
Most likely the original analyst was simply mistaken, or too time-constrained to categorise every row carefully. That is not uncommon in procurement classification, especially when teams are asked to go all the way down to L4. Across thousands of rows these shortcuts add up, and at the deeper category levels they make it impossible to know whether you are overpaying for a specific vegetable.
FoxBurrowAI focuses on the actual text on the line – “peas”, not just “fresh veg” – and works systematically, so you can finally see where spend is really going.
Unlike traditional auto-categorisation scripts and methodologies, we do not rely entirely on past categorisations and fuzzy matches. If everyday errors like those in the Californian government dataset above are propagated into your future processes, you will quickly find your categorisation accuracy dropping.
This is also a major concern for narrowly specialised machine-learning models trained on large historical datasets. Many companies see a long data history as an invaluable resource, but unless you are entirely confident in that data’s quality, the “garbage in, garbage out” effect quickly applies to AI as well.
Advanced reasoning at the point of discovery proves to be far more effective on unseen data than simply copying old patterns forward.
Many descriptions have more than one reasonable home. In those cases, FoxBurrowAI can provide a ranked set of category options per line rather than a single forced decision.
The workflow can propose several plausible paths through the hierarchy and attach a brief justification for each, so review teams see the candidate categories and the reasoning side by side.
This keeps control firmly with you while removing the heavy lifting of thinking up categories from scratch.
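One way to picture that ranked output (a minimal sketch; the field names and scores are illustrative, not FoxBurrowAI’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class CategoryOption:
    """One candidate path through the hierarchy, with a short rationale."""
    path: list          # L1 -> L2 -> L3 -> L4 labels
    confidence: float   # relative ranking score, 0-1
    justification: str  # brief reason a reviewer can accept or reject

# A line described as "garden peas 1kg" might come back with several
# ranked options rather than a single forced decision:
options = [
    CategoryOption(
        path=["Food", "Fresh food", "Vegetables", "Peas"],
        confidence=0.86,
        justification="Description names 'peas' explicitly.",
    ),
    CategoryOption(
        path=["Horticulture", "Seeds", "Vegetable seeds", "Pea seeds"],
        confidence=0.31,
        justification="'garden' could indicate seeds for planting.",
    ),
]

best = max(options, key=lambda o: o.confidence)
print(best.path[-1], "-", best.justification)
```

A reviewer can accept the top option, pick an alternative, or override entirely; every choice stays auditable.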
Some organisations are happy for data to leave their network; others aren’t. FoxBurrowAI supports both.
Either way, the goal is the same: reliable categorisation that your teams can trust, with appropriate guardrails and auditability.
Real procurement data rarely comes as a neat, standardised table. You get multi-sheet workbooks, varying column names, different date formats, free-text descriptions, and the occasional “mystery” column that still turns out to matter.
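The first-pass cleanup this implies can be sketched with nothing but the standard library (the alias table below is illustrative, not the full mapping FoxBurrowAI applies):

```python
import csv
import io

# Map the many names suppliers use for the same column onto one canonical set.
COLUMN_ALIASES = {
    "supplier": "supplier",
    "vendor name": "supplier",
    "item description": "description",
    "line desc": "description",
    "net amount": "amount",
    "total (gbp)": "amount",
}

def normalise_header(raw_name: str) -> str:
    """Lower-case, trim, and map known aliases; keep unknown columns as-is."""
    key = raw_name.strip().lower()
    return COLUMN_ALIASES.get(key, key)

# A toy export with non-standard headers, as often arrives in practice.
raw_export = io.StringIO(
    "Vendor Name,Line Desc,Total (GBP)\n"
    "Acme Catering,garden peas 1kg,42.50\n"
)
reader = csv.DictReader(raw_export)
rows = [{normalise_header(k): v for k, v in row.items()} for row in reader]
print(rows[0])
# -> {'supplier': 'Acme Catering', 'description': 'garden peas 1kg', 'amount': '42.50'}
```

Unknown columns pass through untouched, so the occasional “mystery” column survives for a human to inspect rather than being silently dropped.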
Want to see how your own data behaves? Share a sample file and we’ll run it through the process, returning a categorised extract and an example dashboard.
The dashboards on this site are built with Plotly Dash and designed from the ground up for procurement use.
As part of a paid plan, FoxBurrowAI can provide self-contained, offline dashboards tailored to your data. They open in a web browser, require no user accounts, and don’t depend on shared cloud servers.
That makes them ideal for sharing with non-technical stakeholders or in environments where traditional BI tools and user management are a constant friction.
Whether it’s a year of invoices or a multi-year, multi-entity dataset, FoxBurrowAI can turn it into clean, sensible categories and ready-to-use dashboards. Pricing typically starts from around £0.075 per line for categorisation, with options for one-off projects or ongoing refreshes.