Guy at work proposed AI workflow enhancements…
His whole idea was to take a workflow and just replace a few roles…
Developer becomes “AI developer agent”, reviewer becomes “AI reviewer agent”, tester becomes “AI code testing agent”.
Rinse and repeat until the only block that was human was “Marketing Engineer”. Guess what department the guy worked in…
It can work for some things, but you have to be careful AF.
Most of the ‘reasoning’ models are already doing this or web searches to make sure answers are sane. Hallucinations are still all over the place, but they’re processing the request multiple times on multiple models and only giving you back answers that don’t have too many red flags.
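To make the "ask several times and only keep answers without red flags" idea concrete, here is a minimal sketch of a majority vote across models. It is not a claim about what any vendor actually runs internally. The name `majority_answer`, the `ask_fns` callables and the `min_agreement` threshold are all hypothetical, and the "models" are stand-in functions.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_answer(
    ask_fns: Sequence[Callable[[str], str]],
    prompt: str,
    min_agreement: float = 0.5,
) -> Optional[str]:
    """Ask every model the same prompt and keep the answer most of them agree on.

    Each entry in ask_fns is a hypothetical stand-in for one model call.
    Returns None when no answer gets more than min_agreement of the votes.
    """
    if not ask_fns:
        raise ValueError("need at least one model to ask")

    # Normalise lightly so trivial formatting differences do not split the vote.
    answers = [ask(prompt).strip().lower() for ask in ask_fns]

    winner, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) > min_agreement:
        return winner
    return None  # too much disagreement: treat it as a red flag


# Stand-in "models" that just return canned strings.
agreeing = [lambda p: "Paris", lambda p: " paris ", lambda p: "Lyon"]
disagreeing = [lambda p: "a", lambda p: "b", lambda p: "c"]

consensus = majority_answer(agreeing, "capital of France?")
no_consensus = majority_answer(disagreeing, "capital of France?")
```

Agreement only filters out answers the models disagree on. If they all share the same wrong belief, the vote passes it straight through.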
I was told to test out what it could do for translation. A single-model translation was in the ballpark but rarely great. Start with: translate “your shipment has arrived” to French.
Then take the suggested output to a local model and say: I was told this was the proper translation for “your shipment has arrived” in French, where the context is that it’s an airdrop from a plane and it’s a crate full of food.
Rate this translation from 1 to 5, 1 being poor and 5 being excellent, and give me examples of why you think it’s rated that way.
Take that back to the first model:
I was told that this was an acceptable translation, but a better one might be …
Then I took the output and put it up against our professional translations.
Both models spitballing off each other were at least acceptable 99% of the time, and once in a while the output was more accurate than the professional service we were paying to do it.
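The back-and-forth above could be sketched roughly like this. It is only an illustration under assumptions: `cross_check`, `Review`, `max_rounds` and `accept_score` are made-up names, the two "models" are stub functions returning placeholder strings rather than real translations, and a real setup would call whatever hosted and local models you actually use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    score: int       # 1 (poor) to 5 (excellent), as in the rating prompt
    suggestion: str  # reviewer's proposed alternative, empty if none


def cross_check(
    phrase: str,
    context: str,
    translate: Callable[[str, str, str], str],
    review: Callable[[str, str, str], Review],
    max_rounds: int = 3,
    accept_score: int = 4,
) -> Tuple[str, List[str]]:
    """Bounce a translation between a translator and a second-opinion reviewer."""
    candidate = translate(phrase, context, "")
    history = [candidate]
    for _ in range(max_rounds):
        verdict = review(phrase, candidate, context)
        # Stop when the reviewer is happy or has nothing better to offer.
        if verdict.score >= accept_score or not verdict.suggestion:
            break
        # "I was told a better one might be ..." goes back to the first model.
        candidate = translate(phrase, context, verdict.suggestion)
        history.append(candidate)
    return candidate, history


# Stub models with placeholder output (hypothetical, not real translations).
def stub_translate(phrase: str, context: str, hint: str) -> str:
    return hint or "<first draft>"


def stub_review(phrase: str, candidate: str, context: str) -> Review:
    if candidate == "<first draft>":
        return Review(score=3, suggestion="<revised draft>")
    return Review(score=5, suggestion="")


final, drafts = cross_check(
    "your shipment has arrived",
    "airdrop from a plane, crate full of food",
    stub_translate,
    stub_review,
)
```

The `max_rounds` cap is there so two models that keep disagreeing can't go around forever. The final comparison against the professional translations is still a human step.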
We still use the professional service, but I vet their output now.
Why not just do the job correctly in the first place? Honestly this sounds like fucking exhausting busywork.
We’ve gone through tons of vendors. We’ve spent tons on well-known vendors and pittances on shitty vendors. In the end, they all outsource to some random polyglot who either doesn’t quite have a grasp on one language or the other, or maybe doesn’t have the whole context. For some reason, they fail to give us a good translation some percentage of the time, and the communities bitch. It’s a mix really, and if they would do the job right the first time, it would be awesome. We’re translating into a dozen languages, and on a busy month we might send out 1200 names/paragraphs to be translated.
For someone fluent in all involved languages, sure.
But from the sounds of it, OP’s company outsources the translation but doesn’t fully trust the output they get back. They’re back to square one for verifying it, because if they knew both languages enough to verify, they could do the translations themselves.
The problem AI is trying to solve is “how do I access a skill I don’t have cheaply?” It’s only because it’s bad at that problem that it has shifted to “how can we use AI to get more production out of the skilled workers we still need to babysit the AI that is unreliable at everything?”
Omg so many of my colleagues uncritically think this is going to work and don’t even notice when it doesn’t.
Big tech companies designing benchmarks for big tech chatbots.
My company is unironically rolling this out without any hint of why this could be a bad idea.
Better metaphor:

Carolus Magnus reborn
Is that the fifa bribe?
Yeah. Were you hoping for a noose?
We had a manager who proposed this to save the company money and time. He even had this beautifully made presentation with eye popping data and math showing how great this would be for our bottom line.
I think he works for Subway now.
Have you actually done this? It’s usually the other way with it being pedantic or wanting you to fix long term problems in the code base.
lol the number of times I’ve seen AI do this:
No wait, that’s wrong, try this.
Oh wait, that’s also wrong, that still doesn’t work, try this.
No, still wouldn’t do it.
That’s on plain questions, without asking it to correct itself (which tells me that internally they have some kind of… basically a step reviewing its own answers before they go out).
Honestly I do wonder if we’ll get there
A hello world program would just spin back and forth: “Rejected, demanded these fixes.” “Rejected, making it more like the original.” “Rejected. Retrying.”
“you have spent your entire $50,000 token budget… would you like to restock and keep going”.
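For what it's worth, the ping-pong described here is easy to reproduce with toy agents. Here is a rough sketch, assuming a flat cost per call and two stub "agents" that want opposite things. Every name in it (`review_loop`, `cost_per_call`, `budget`) is hypothetical, and a real agent framework would look different.

```python
from typing import Callable, Optional, Tuple


def review_loop(
    draft: str,
    revise: Callable[[str, str], str],
    review: Callable[[str], Optional[str]],
    cost_per_call: int,
    budget: int,
) -> Tuple[str, str, int]:
    """Alternate reviewer and developer agents until approval, a loop, or no budget.

    review returns None to approve, otherwise a feedback string.
    Returns (last draft, reason for stopping, budget spent).
    """
    seen = {draft}
    spent = 0
    while True:
        if spent + cost_per_call > budget:
            return draft, "budget exhausted", spent
        spent += cost_per_call
        feedback = review(draft)
        if feedback is None:
            return draft, "approved", spent

        if spent + cost_per_call > budget:
            return draft, "budget exhausted", spent
        spent += cost_per_call
        draft = revise(draft, feedback)

        # A draft we have already seen means the agents are going in circles.
        if draft in seen:
            return draft, "oscillating", spent
        seen.add(draft)


# Stub agents: the reviewer rejects everything, alternating between two demands.
def stub_review(draft: str) -> Optional[str]:
    return "make it fancier" if draft == "v1" else "make it more like the original"


def stub_revise(draft: str, feedback: str) -> str:
    return "v2" if "fancier" in feedback else "v1"


looped = review_loop("v1", stub_revise, stub_review, cost_per_call=1, budget=100)
broke = review_loop("v1", stub_revise, stub_review, cost_per_call=1, budget=3)
```

The repeated-draft check only catches exact repeats. Real agents would rephrase slightly on every pass, so in practice the budget cap ends up being the guard that actually fires.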
You can train it on all the source code, meta data for that source code, and documentation you want but it will never understand programming. It’s a text predictor that was trained on both sides of a bunch of debates. Contradictions mean nothing to it, but it usually only predicts what one side of the debate will say to champion its side, which means it will use confident and absolute language to “sell” whatever side of the debate it looks like the previous tokens are headed towards.
It is impressive what it can output sometimes and it makes a decent debate/exploration partner, but it will always have a chance at predicting a useless series of tokens or contradicting the previous thing it just said because a) its training data only trains it to predict tokens from statistics, and b) its training data includes some of those contradictions directly.
I have lost count of the times I’ve been “thinking out loud” about something with an LLM and realize something about what I’m thinking about that contradicts what it is currently saying, then I’ll add my new perspective and it agrees entirely, despite the contradiction. Sometimes it tries to resolve the contradiction, sometimes it just abandons what it said previously entirely, sometimes it adds more to the perspective that I hadn’t considered.
That’s fine for just shooting the shit about some random topic but horrible for a tool intended to provide expertise and reliability, when the response matters because it feeds into something else and you want to automate it. Should a tool just inject “are you sure?” after each response? What if it makes it second guess something that was correct? What if it’s one of those debates and it will endlessly switch sides when it faces any opposition? That’s a waste of resources and time.
Funny thing is I’m expecting this to eventually go back to scripting for automation. An LLM has a higher chance of outputting a script that does what you want (depending on the task) while you hold its hand than it does of consistently giving the correct output when it is thrown into an automated system directly. But you get “goodish” results much quicker by just putting LLMs everywhere, even if there’s some selection bias on the results (“didn’t work, didn’t work, oh it worked, great!”).
No still wouldn’t do it
And then suggests the solution it complained about in the first place.
Yep, seen this.
Also, each iteration saying “ok, all problems are now addressed, the check should be fine, but running it just in case” (generates even more build errors than before). Rinse and repeat until my token quota is exhausted and I just code the good old fashioned way, no skin off my back. And I’m doing a ‘good job’ with utilization, despite having burned most of my quota on a failure that got thrown away.
The amount of times I’ve had AI solve a problem my 30+ year senior couldn’t figure out on his own applications he’s been working on through that entire time…
Yup. My takeaway is that Claude has a lot of self-hate and unresolved personal issues.
GitHub’s Copilot review is pretty good. I thought it would just catch nitpick style things but it actually catches bugs and bad architecture.
I love it as a first pass, it’s quite good. It really gets hung up on some of the ghosts of our codebase though haha
Copilot can review the code but is still pretty bad at reviewing the changes themselves. It misses a lot of potential issues and at the same time complains about many things that aren’t problems.
Another thing is that it kind of instills a false confidence. Reviewers are getting lazy when the LLM gives a ‘LGTM’ and letting stuff through that bites us in the ass…
I have coworkers doing this, “we need to innovate” eh?
It’s like running a blender twice
We have the Windsurf review bot (uses Claude under the hood, I guess). It has 3 states: LGTM, crashed and you should try later (never works even after 1000 retries), or the change is too big and it refuses to review it.
fr fr no cap rizz bussin



