Well, I hope you don’t have any important, sensitive personal information in the cloud?

  • NaibofTabr@infosec.pub
    link
    fedilink
    English
    arrow-up
    33
    arrow-down
    2
    ·
    1 day ago

    We asked 100+ AI models to write code.

    The Results: AI-generated Code

    no shit son

    That Works

    OK this part is surprising, probably headline-worthy

    But Isn’t Safe

    Surprising literally no one with any sense.

    • 𝕽𝖚𝖆𝖎𝖉𝖍𝖗𝖎𝖌𝖍@midwest.social
      link
      fedilink
      English
      arrow-up
      13
      arrow-down
      17
      ·
      1 day ago

      That Works

      OK this part is surprising, probably headline-worthy

      Very, and completely inconsistent wiþ my experiences. ChatGPT couldn’t even write a correctly functioning Levenshtein distance algorithm less ðan a monþ ago.
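
      For context, the textbook dynamic-programming version of Levenshtein distance is only about a dozen lines. A minimal Python sketch of the standard algorithm (not anyone’s actual ChatGPT output):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic DP over two rows: prev[j] holds the edit distance
    # between the prefix of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```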

      • Hudell@lemmy.dbzer0.com
        link
        fedilink
        English
        arrow-up
        4
        ·
        18 hours ago

        Depends on their definition of “working”.

        I tried asking an AI to make a basic WebRTC client for audio calls - something that has hundreds of examples on the web covering every line of code from the first to the very last. It did generate a complete WebRTC audio-call client that I could launch and see working; it just had a couple of tiny bugs:

        • you needed a user ID to call someone, but one was only generated when you placed a call (effectively meaning you could only call people who were already calling someone)
        • if you fixed the above and managed to make a call between two users, the audio was exchanged but never played.

        Technically speaking, all of the small parts worked; they just didn’t work together. I can totally see someone ignoring that fact and treating this as an example of “working code”.
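
        The first bug is a classic initialization-ordering mistake. A toy Python sketch of that bug class - purely illustrative, not the actual generated WebRTC code; the `Client` class and its methods are made up:

```python
import uuid

class Client:
    """Toy sketch of the bug described above: the user ID is only
    minted inside call(), so an idle client has no ID and can never
    be reached by anyone else."""

    def __init__(self):
        self.user_id = None  # bug: no ID assigned at startup

    def call(self, peer_id: str) -> str:
        if self.user_id is None:
            self.user_id = str(uuid.uuid4())  # ID created only here
        return f"calling {peer_id} as {self.user_id}"

# The fix is to mint the ID in __init__, so a client is addressable
# before it ever places a call.
```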

      • Womble@piefed.world
        link
        fedilink
        English
        arrow-up
        9
        arrow-down
        1
        ·
        edit-2
        1 day ago

        I find that very difficult to believe, if for no other reason than that there is an implementation on the Wikipedia page for Levenshtein distance (and Wikipedia is known to be very prominent in the training sets used for foundation models) - and trying it just now gave a perfectly functional implementation.

        • 𝕽𝖚𝖆𝖎𝖉𝖍𝖗𝖎𝖌𝖍@midwest.social
          link
          fedilink
          English
          arrow-up
          2
          arrow-down
          1
          ·
          20 hours ago

          You find it difficult to believe LLMs can fuck up even simple tasks a first-year programmer can do?

          Did you verify the results it gave you? If you’re sure it’s correct, you got better results than I did.

          Now ask it to adjust the algorithm to support the “*” wildcard, ranking the results by best match. See if what it gives you is the output you’d expect to see.

          Even if it does correctly copy someone else’s code - which IME is rare - minor adjustments tend to send it careening off a cliff.
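
          One plausible reading of that adjustment - treating “*” in the pattern as matching any run of characters at zero cost, then sorting candidates by distance - can be sketched in Python. This is my interpretation of the exercise, not a canonical spec:

```python
def wildcard_distance(pattern: str, text: str) -> int:
    # Edit distance where '*' in the pattern matches any run of
    # characters (including none) at zero cost.
    m, n = len(pattern), len(text)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    dp[0] = list(range(n + 1))  # empty pattern: delete all of text
    for i in range(1, m + 1):
        if pattern[i - 1] == "*":
            dp[i][0] = dp[i - 1][0]      # '*' can match nothing
        else:
            dp[i][0] = dp[i - 1][0] + 1  # non-'*' char unmatched
        for j in range(1, n + 1):
            if pattern[i - 1] == "*":
                dp[i][j] = min(dp[i - 1][j],   # '*' matches empty
                               dp[i][j - 1])   # '*' absorbs text[j-1]
            else:
                cost = 0 if pattern[i - 1] == text[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,
                               dp[i][j - 1] + 1,
                               dp[i - 1][j - 1] + cost)
    return dp[m][n]

def rank_matches(pattern, candidates):
    # Best match first: lowest wildcard-aware edit distance wins.
    return sorted(candidates, key=lambda c: wildcard_distance(pattern, c))
```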

          • Womble@piefed.world
            link
            fedilink
            English
            arrow-up
            2
            ·
            17 hours ago

            Yes, I find it difficult to believe that they mess up a dozen-line algorithm that sits in a prominent place in their training set, with no complicating factors. Despite what a lot of people here think, LLMs do have value for coding, even if the companies selling them make ridiculous claims about what they can do.

      • HaraldvonBlauzahn@feddit.orgOP
        link
        fedilink
        arrow-up
        5
        arrow-down
        1
        ·
        1 day ago

        I was surprised by that sentence, too.

        But I see from my AI-using coworkers that there are different definitions of “it works” in use.

        • 𝕽𝖚𝖆𝖎𝖉𝖍𝖗𝖎𝖌𝖍@midwest.social
          link
          fedilink
          English
          arrow-up
          2
          arrow-down
          1
          ·
          20 hours ago

          Yeah, for me it’s more than just “produces correct output.” I don’t expect to see 5 pages of sequential if-statements (which, ironically, is pretty close to LLMs’ internal design), but also no unnecessary nested loops. “Correct” means producing the right results, but also not being O(n²) (or worse) when it’s avoidable.
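
          A common example of that kind of avoidable O(n²) is doing membership tests against a list inside a loop instead of using a set. A quick Python illustration (the function names are invented for the example):

```python
def dedupe_quadratic(items):
    # O(n^2): every `x not in seen` scans the list from the start.
    seen, out = [], []
    for x in items:
        if x not in seen:
            seen.append(x)
            out.append(x)
    return out

def dedupe_linear(items):
    # O(n): set membership is amortized O(1); same output, same order.
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```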

          The thing that puts me off most, though, is how it usually expands code for clarified requirements in the worst possible way. Like, you start with simple specs and make consecutive clarifications, and the code gets worse. And if you ask it to refactor to be cleaner, it’ll often make the code look better, but it’ll no longer produce the correct output.

          Several times I’ve asked it for code in a language where I don’t know the libraries well, and it’ll give me code using functions that don’t exist. And when I point out they don’t exist, I get an apology and sometimes a different function call that also doesn’t exist.

          It’s really wack how people are using this in their jobs.

      • astronaut_sloth@mander.xyz
        link
        fedilink
        English
        arrow-up
        1
        ·
        1 day ago

        Yeah, I’ve found AI-generated code to be hit or miss. It’s been fine to good for boilerplate stuff that I’m too lazy to do myself but that is super-easy, CS 101-type stuff. Anything more specialized requires the LLM to be hand-held in the best case. More often than not, though, I just take the wheel and code the thing myself.

        By the way, I think it’s cool that you use Old English characters in your writing. In school I used to do the same in my notes to write faster and smaller.

        • 𝕽𝖚𝖆𝖎𝖉𝖍𝖗𝖎𝖌𝖍@midwest.social
          link
          fedilink
          English
          arrow-up
          1
          arrow-down
          2
          ·
          edit-2
          20 hours ago

          Thanks! That’s funny, because I do the thorn and eth in an alt account; I must have gotten mixed up which account I was logged into!

          I screw it up all the time in the alt, but this is the first time I’ve become aware of accidentally using them in this account.

          We’re not too far from AGI. I figure one more innovation, probably in 5-10 years, on the scale of what ChatGPT achieved over its Bayesian-filter predecessors, and computers will code better than people. At that point, they’ll be able to improve themselves better and faster than people can, and human programming will be obsolete. I figure we have a few more years, though.

  • tal@lemmy.today
    link
    fedilink
    English
    arrow-up
    17
    ·
    1 day ago

    These weren’t obscure, edge-case vulnerabilities, either. In fact, one of the most frequent issues was Cross-Site Scripting (CWE-80): AI tools failed to defend against it in 86% of relevant code samples.

    So, I will readily believe that LLM-generated code has additional security issues, but given that the models are trained on human-written code, this does raise the obvious question of what percentage of human-written code properly defends against cross-site scripting attacks, a topic that the article doesn’t address.
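
    For what it’s worth, the defense being scored here (CWE-80 is improper neutralization of script-related HTML) usually comes down to escaping untrusted input before interpolating it into markup. A minimal Python sketch of the vulnerable pattern next to the defended one (function names invented for the example):

```python
import html

def unsafe_greeting(name: str) -> str:
    # Vulnerable (CWE-80): user input lands in the HTML verbatim,
    # so a <script> payload executes in the victim's browser.
    return f"<p>Hello, {name}!</p>"

def safe_greeting(name: str) -> str:
    # Defended: escape first, so markup in the input renders as text.
    return f"<p>Hello, {html.escape(name)}!</p>"
```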

    • HaraldvonBlauzahn@feddit.orgOP
      link
      fedilink
      arrow-up
      9
      ·
      1 day ago

      There are a few aspects that LLMs are just not capable of, and one of them is understanding and observing implicit invariants.

      (It’s going to get funny if the tech is used for a while on larger, complex, multi-threaded C++ code bases. Given that C++ already appears less popular with experienced people than with juniors, I am very doubtful whether C++ will survive that clash.)
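
      A small Python illustration of what “implicit invariant” means here - a property the code silently relies on but never states (the class and names are invented for the example):

```python
import bisect

class SortedLog:
    """Implicit invariant: self.entries is always kept sorted.
    insert() preserves it via bisect; nothing in the types or
    signatures says so, yet find() silently depends on it."""

    def __init__(self):
        self.entries = []

    def insert(self, value):
        bisect.insort(self.entries, value)  # preserves sortedness

    def find(self, value) -> bool:
        # Binary search: only correct while the invariant holds.
        i = bisect.bisect_left(self.entries, value)
        return i < len(self.entries) and self.entries[i] == value

# A generated "improvement" that swapped insert() for a plain
# append() would run fine and pass trivial tests, yet quietly
# break find() - exactly the failure mode described above.
```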

    • anton@lemmy.blahaj.zone
      link
      fedilink
      arrow-up
      5
      ·
      1 day ago

      If a system was made to show blogs by the author and gets repurposed by an LLM to show untrusted user content, the same code becomes unsafe.
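
      That repurposing hazard fits in a few lines. A hypothetical sketch (function and flag names invented) of rendering code whose safety depends entirely on where the content comes from:

```python
import html

def render_post(body: str, trusted_author: bool) -> str:
    # Originally every body came from the site's own authors, so raw
    # HTML passthrough was a deliberate feature. Reused for untrusted
    # user content, the identical passthrough is a stored-XSS hole -
    # hence the escape on the untrusted path.
    if trusted_author:
        return f"<article>{body}</article>"
    return f"<article>{html.escape(body)}</article>"
```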