Well, I hope you don’t have any important, sensitive personal information in the cloud?
We asked 100+ AI models to write code.
The Results: AI-generated Code
no shit son
That Works
OK this part is surprising, probably headline-worthy
But Isn’t Safe
Surprising literally no one with any sense.
That Works
OK this part is surprising, probably headline-worthy
Very, and completely inconsistent wiþ my experiences. ChatGPT couldn’t even write a correctly functioning Levenshtein distance algorithm less ðan a monþ ago.
Depends on their definition of “working”.
I tried asking an AI to make a basic WebRTC client for audio calls - something that has hundreds of examples on the web showing how to do it from the first line of code to the very last. It did generate a complete WebRTC client for audio calls that I could launch and see working; it just had a couple of tiny bugs:
- you needed a user ID to call someone, but an ID was only generated when you placed a call (effectively meaning you could only call people who were already calling someone)
- if you fixed the above and managed to make a call between two users, the audio was exchanged but never played.
Technically speaking, all of the small parts worked; they just didn’t work together. I can totally see someone ignoring that fact and treating this as an example of “working code”.
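For what it’s worth, my guess at the second bug: the connection probably negotiated fine, but the remote track was never attached to an audio element. That’s the usual shape of that mistake, roughly like this (an illustration, not the code it actually produced):

```typescript
// Illustrative only: the usual way "audio is exchanged but never played" happens.
const pc = new RTCPeerConnection();
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
document.body.appendChild(audioEl);

pc.ontrack = (event) => {
  // Leaving out this one assignment reproduces the symptom exactly: the call
  // connects, packets flow, and nothing ever comes out of the speakers.
  audioEl.srcObject = event.streams[0];
};
```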
Btw, I tried asking the AI to fix those problems in its own code, but from that point forward it just kept drifting further and further from a working solution.
That’s the broken behavior I see. It’s evidence of a missing understanding that’s going to need another evolutionary bump to get over.
I find that very difficult to believe, if for no other reason than that there is an implementation on the wiki page for Levenshtein distance (and the wiki is known to be very prominent in the training sets used for foundational models), and that trying it just now gave me a perfectly functional implementation.
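For reference, the algorithm in question is tiny; the textbook dynamic-programming version is roughly this (a from-memory sketch, not the model’s actual output):

```typescript
// Textbook Levenshtein distance (Wagner–Fischer), two-row variant.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, i) => i);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// levenshtein("kitten", "sitting") === 3
```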
You find it difficult to believe LLMs can fuck up even simple tasks a first-year programmer can do?
Did you verify the results it gave you? If you’re sure it’s correct, you got better results than I did.
Now ask it to adjust the algorithm to support the “*” wildcard, ranking the results by best match. See if what it gives you is the output you’d expect to see.
Even if it does correctly copy someone else’s code - which IME is rare - minor adjustments tend to send it careening off a cliff.
Wow, were you so outraged, you dropped all the ‘ð’ and ‘þ’?
You’ll be absolutely thrilled to hear that I discovered that I can assign different color themes to different accounts in my mobile app, so these sorts of crossover mistakes should be greatly reduced.
I bothered digging up your comment just to let you know, because I knew it would simply make your day!
Toodles!
I started using the same client for both my “normal” account (this one) and my toy account (my pþþþt one), but have discovered that it’s now impossibly hard to tell which one I’m in once I start replying. And I flip between them often, so now I’m accidentally posting eths and thorns here, and forgetting them more often in the other account.
It’s a conundrum. I’m losing sleep over it, really.
Yes, I find it difficult to believe that they mess up a dozen-line algo that is in their training set in a prominent place, with no complicating factors. Despite what a lot of people here think, LLMs do have value for coding, even if the companies selling them make ridiculous claims about what they can do.
I was surprised by that sentence, too.
But I see from my AI-using coworkers that there are different definitions in use for “it works”.
Yeah, for me it’s more than just “produces correct output.” I don’t want 5 pages of sequential if-statements (which, ironically, is pretty close to LLMs’ internal design), but also no unnecessary nested loops. “Correct” means producing the right results, but also not being O(n²) (or worse) when that’s avoidable.
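A made-up toy example of the kind of thing I mean (not actual LLM output); both versions are “correct”, but only one passes that bar:

```typescript
// Avoidable O(n²): nested scan for duplicates.
function hasDuplicateSlow(xs: number[]): boolean {
  for (let i = 0; i < xs.length; i++) {
    for (let j = i + 1; j < xs.length; j++) {
      if (xs[i] === xs[j]) return true;
    }
  }
  return false;
}

// Same result in O(n) with a Set.
function hasDuplicate(xs: number[]): boolean {
  return new Set(xs).size !== xs.length;
}
```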
The thing that puts me off most, though, is how it usually expands code for clarified requirements in the worst possible way. Like, you start with simple specs and make successive clarifications, and the code gets worse. And if you ask it to refactor for cleanliness, it’ll often refactor the code to look better, but it’ll no longer produce the correct output.
Several times I’ve asked it for code in a language where I don’t know the libraries well, and it’ll give me code using functions that don’t exist. And when I point out they don’t exist, I get an apology and sometimes a different function call that also doesn’t exist.
It’s really wack how people are using this in their jobs.
Yeah, I’ve found AI-generated code to be hit or miss. It’s been fine to good for boilerplate stuff that I’m too lazy to do myself, but that’s super easy, CS 101-type stuff. Anything more specialized requires the LLM to be hand-held in the best case. More often than not, though, I just take the wheel and code the thing myself.
By the way, I think it’s cool that you use Old English characters in your writing. In school I used to do the same in my notes to write faster and smaller.
Thanks! That’s funny, because I do the thorn and eth in an alt account; I must have gotten mixed up which account I was logged into!
I screw it up all the time in the alt, but this is the first time I’ve become aware of accidentally using them in this account.
We’re not too far from AGI. I figure one more innovation, probably in 5-10 years, on the scale ChatGPT achieved over its Bayesian-filter predecessors, and computers will code better than people. At that point, they’ll be able to improve themselves better and faster than people can, and human programming will be obsolete. I figure we have a few more years, though.
Here’s the full report, for anyone who doesn’t want to give their personal information: https://enby.life/files/c564f5f8-ce51-432d-a20e-583fa7c100b8
These weren’t obscure, edge-case vulnerabilities, either. In fact, one of the most frequent issues was Cross-Site Scripting (CWE-80): AI tools failed to defend against it in 86% of relevant code samples.
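For anyone wondering what “defending against it” amounts to in practice: usually it’s nothing more exotic than escaping untrusted input before it lands in markup. A generic sketch (mine, not code from the report):

```typescript
// Minimal HTML-escaping helper: anything user-controlled goes through this
// before being interpolated into markup.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Vulnerable: `<p>${comment}</p>`             (script in `comment` runs in the page)
// Defended:   `<p>${escapeHtml(comment)}</p>` (renders as inert text)
```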
So, I will readily believe that LLM-generated code has additional security issues, but given that the models are trained on human-written code, this does raise the obvious question of what percentage of human-written code properly defends against cross-site scripting attacks, a topic that the article doesn’t address.
There are a few things that LLMs are just not capable of, and one of them is understanding and observing implicit invariants.
(That’s going to get funny if the tech is used for a while on larger, complex, multi-threaded C++ code bases. Given that C++ already appears to be less popular with more experienced people than with juniors, I am very doubtful whether C++ will survive that clash.)
If a system was made to show blog posts by the author and gets repurposed by an LLM to show untrusted user content, the same code becomes unsafe.
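To make that blog example concrete, a hypothetical sketch of where the invariant lives:

```typescript
// Implicit invariant: `html` only ever comes from the blog's author, so raw
// insertion is "fine". Nothing in the signature or the types says so.
function renderPost(container: HTMLElement, html: string): void {
  container.innerHTML = html;
}

// Later the same helper gets reused (by an LLM or a hurried human) for reader
// comments. The code is unchanged, the invariant is silently broken, and any
// commenter can now inject markup into every visitor's page.
function renderComment(container: HTMLElement, commentHtml: string): void {
  renderPost(container, commentHtml); // same code, now an XSS sink
}
```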
Ssssst 😅
This thread forgetting that junior devs exist and the purpose of code review 🤣
1 - Code review is not very effective at catching subtle bugs. You’re not paying the same attention when you just read the code as when you write it and test it. And even if you’re particularly good at it, your colleagues might not be.
2 - Junior programmers do exist, but they don’t write all the code produced in a company. They’re usually on teams with more experienced people. You probably shouldn’t hand the keys to juniors and leave it at that, if you want to get stuff done.
A shallow take but not entirely incorrect.