There was a stretch around late 2024 when, every Monday morning, somebody on the team would land in Slack saying we should "rewrite the whole reporting layer with an LLM." Six months later, almost none of those rewrites had shipped. The ones that did have completely changed how we work.
So here is the honest version: what worked, what did not, and what we are still arguing about.
The thing nobody warns you about
Models hallucinate. Everybody knows that. What nobody warns you about is that they hallucinate confidently, and most of your team will believe them the first three or four times. We had a junior engineer paste a "fixed migration" suggested by an LLM straight into production. The model had invented a Postgres function that does not exist. The migration ran fine in dev because nothing there ever exercised the call to that function.
This is a category of bug I had never seen before. The code looks right. It even type-checks. It just refers to a reality that does not exist.
The fix was not to ban AI. It was to change our review rule: if a diff is AI-assisted, the human on the PR has to explain, in writing, why each non-trivial line is correct. The act of explaining is what catches the hallucinations.
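A rule like this is easy to forget under deadline pressure, so it may help to back it with a small CI check. What follows is only a minimal sketch under stated assumptions: the `ai-assisted` label, the "Why this is correct" heading, and the 30-word threshold are hypothetical conventions invented for illustration, and the PR labels and description are assumed to arrive as plain Python values from whatever your code host provides.

```python
# Hypothetical conventions: a team would pick its own label and heading.
AI_LABEL = "ai-assisted"
SECTION_HEADING = "why this is correct"
MIN_WORDS = 30  # arbitrary floor to reject one-line explanations


def explanation_words(description: str) -> int:
    """Count the words written under the explanation heading."""
    count = 0
    in_section = False
    for line in description.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Any heading either opens or closes the explanation section.
            title = stripped.lstrip("#").strip().lower()
            in_section = title == SECTION_HEADING
            continue
        if in_section:
            count += len(stripped.split())
    return count


def review_gate(labels: list[str], description: str) -> tuple[bool, str]:
    """Return (passes, reason) for one pull request."""
    if AI_LABEL not in labels:
        return True, "not marked AI-assisted"
    words = explanation_words(description)
    if words < MIN_WORDS:
        return False, f"explanation has {words} words, need at least {MIN_WORDS}"
    return True, "explanation present"
```

A word count cannot tell whether the explanation is right. All a check like this does is make the written step impossible to skip silently; the reviewer still has to read it.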
Where models actually earn their keep
In our work, three places pay off consistently:
- Translating between formats. SQL to Eloquent, CSV to JSON schema, a client's spreadsheet into a clean migration. Mechanical, verifiable, fast. A junior dev who used to spend a day on a data import job now does it before lunch.
- First-draft anything. Test scaffolding, API client wrappers, regex (especially regex), commit messages. The first draft is never the final draft, but starting from a draft is a different cognitive task than starting from nothing.
- Rubber-ducking. Half the time we ask the model a question, we figure out the answer ourselves while typing the question. That is not nothing.
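To make "mechanical, verifiable" concrete, here is a minimal sketch of the CSV to JSON Schema case from the first bullet, the kind of draft a model produces in seconds and a human can check by eye. The inference rules (integer, then number, then string; a column is required only if no sampled row leaves it blank) are assumptions chosen for illustration, not a description of any particular import job.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Pick the narrowest JSON Schema type that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "string"  # no evidence either way, so stay permissive

    def all_parse(cast) -> bool:
        for v in non_empty:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


def csv_to_json_schema(csv_text: str) -> dict:
    """Build a JSON Schema object describing one row of the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    properties = {}
    required = []
    for col in columns:
        # DictReader fills missing trailing fields with None.
        values = [row[col] or "" for row in rows]
        properties[col] = {"type": infer_type(values)}
        # Required only if no sampled row left the column blank.
        if rows and all(v.strip() != "" for v in values):
            required.append(col)
    return {"type": "object", "properties": properties, "required": required}
```

The point is less the code than the review: the output is small enough that checking it against a handful of rows takes a minute. Python's `int` and `float` are more lenient than a real importer should be (for example, `float` accepts "nan"), which is exactly the sort of thing the human pass is for.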
Where they do not
- Anything that touches money. We do not let a model decide pricing, run a quote, or generate an invoice. The downside of a wrong number is huge and the cost of an extra human step is small.
- Schema decisions. Models will happily suggest a fourteen-column table when four columns and a join would do. They optimise for the prompt, not for the next engineer who has to live with the table.
- Anything novel. If a problem has a thousand Stack Overflow answers, models are great. If it is specific to your business, they confidently make up plausible-looking nonsense. Healthcare scheduling rules, telecom billing edge cases, construction project sequencing: domain-specific stuff is still a human problem.
A note on fine-tuning
We tried fine-tuning a small open model on our own technical docs. The output looked great in our QA harness and was useless in production. Turns out the model learned our writing style without learning our knowledge. We replaced the whole experiment with a much simpler RAG setup: chunk the docs, retrieve the top-k passages for a question, and hand the model those passages along with the question. Less impressive on paper, more useful in the field.
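The three steps of that RAG setup (chunk, retrieve top-k, prompt with the passages) fit in a few dozen lines. This is a rough sketch rather than a production pipeline: it swaps in plain bag-of-words cosine similarity so it runs on the standard library alone, where a real setup would more likely use an embedding model and a vector store. The function names, the chunk size, and the prompt wording are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 80) -> list[str]:
    """Split a document into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def vectorize(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Put the retrieved passages in front of the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

In use, you would run `chunk` over every doc once, call `retrieve` per question, and send the result of `build_prompt` to whichever model you use. The knowledge lives in the retrieved passages rather than in the model's weights, which is why this approach does not hit the "learned the style, not the knowledge" problem in the same way.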
If you are reaching for a custom model and are not already doing RAG, do RAG first. The number of organisations that need their own fine-tune is much smaller than the number of organisations that say they need their own fine-tune.
What I would tell a team starting today
Pick three tasks where the cost of being wrong is low and the cost of doing it manually is high. Build a workflow around those. Measure. Ignore anyone telling you AI is "the new web." It might be, eventually. But your job this quarter is to ship something that works.
The unglamorous truth is that AI is mostly a productivity tool. Treat it like one, audit it like one, and you will get more from it than the people building "AI-first" platforms in their pitch decks.