If you actually read the Tweet, the exploit doesn't work against Fable, Opus, Grok...at least, in the examples.
Jailbreaks do work against the models (look on GitHub), and they do use similar strategies of mixing SAFE text with malicious text, or malicious with even more malicious, etc., but the working jailbreaks I've seen are pretty long and complicated and even...creepy.
elashri
I still don't understand all this concern about nuclear weapons and LLMs. An entity (a country) that wants to develop a nuclear weapon already needs huge resources, infrastructure, and a scientific enterprise for such a program; it would not need an LLM to teach it anything. Knowing how to develop one is not a closed secret, but doing it in secret, without the whole world knowing, is impossible.
So I wouldn't be able to develop a nuclear weapon in secret using Claude, even with the resources of a drug cartel (as an example).
strenholme
The solution is simple: if you're using an AI-assisted scanner and a guardrail gets hit, then the code is obviously malicious and needs to be automatically flagged (and the code should not be run!).
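A minimal sketch of that rule, assuming the scanner hands back the model's reply as plain text. `REFUSAL_MARKERS`, `looks_like_refusal`, and `triage` are hypothetical names for illustration, not part of any real scanner, and real refusals are worded far more variably than these phrases.

```python
# Hypothetical refusal phrases; a real check would need to be far more robust.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot help with",
    "i can't assist with",
    "i cannot assist with",
    "i'm unable to",
)


def looks_like_refusal(reply: str) -> bool:
    """Return True if the model's reply contains a known refusal phrase."""
    lowered = reply.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def triage(reply: str) -> str:
    """Treat a guardrail hit as a detection instead of a scan failure."""
    if looks_like_refusal(reply):
        return "flag"      # quarantine the sample and do not run it
    return "continue"      # hand the reply to the normal verdict parsing
```

For example, `triage("I can't help with that request.")` returns `"flag"`, while an ordinary analysis reply returns `"continue"`.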
As an aside, I got hit by the “PC App store” adware when trying to download Foobar2000 on a new computer; Google ads allowed a deceptive “Download” button to appear, and PC App store gave the file the name setup.exe. I removed the program and ran an Avast free scan to ensure I didn’t have malware, but I also installed uBlock Origin in Firefox to make sure I don’t see Google Ads anymore; they have become a delivery mechanism for malicious (or at least unwanted) software.
ofjcihen
Worked a contract where this succeeded in pushing through a fail-open design.
It should also be a warning to everyone that these groups are now aware of AI-based analysis and deobfuscation, and a reason to take sandboxed environments more seriously.
I’ve personally had about a 20% success rate getting Opus 4.8 to download a package and install it using a breadcrumb trail technique that would be trivial for threat actors to replicate in their malware in order to target responders/automated scanning/curious devs.
y-curious
My friend made this in jest (code very NSFW, ironically):
https://github.com/thebabush/mcp-job-security
Same energy and kind of a funny, low-tech solution to frontier model analysis.
logancbrown
Would this realistically be a problem for code going through LLM-based code review? Presumably if an LLM reviewer agent hits this commentary, it would produce a failure to analyze and exit, thus failing the automated code review and forcing a human to read through it, which they would subsequently catch and revoke.
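That outcome only holds if the review gate fails closed. A rough sketch of such a gate, assuming the reviewer's result has already been reduced to one of four states; the `Outcome` enum and `gate` function are made up for illustration and do not describe any particular CI system.

```python
from enum import Enum


class Outcome(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    REFUSED = "refused"   # the reviewer declined to analyze the change
    ERROR = "error"       # timeout, crash, or unparseable output


def gate(outcome: Outcome) -> tuple[int, bool]:
    """Return (exit_code, needs_human). Only an explicit approval passes."""
    if outcome is Outcome.APPROVED:
        return 0, False
    # Fail closed: every other outcome blocks the merge.
    # A refusal or error says nothing about the code itself,
    # so those cases are escalated to a human reviewer.
    needs_human = outcome in (Outcome.REFUSED, Outcome.ERROR)
    return 1, needs_human
```

A fail-open gate would instead return `0` for `REFUSED` or `ERROR`, which is the case where a refusal-baiting comment lets the change through unreviewed.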
carlsborg
The pipeline is then: cheap open-source model to flag potential LLM-refusal content -> main LLM check
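Roughly, that pipeline could be wired up like this. Both stages are passed in as plain callables, and the `toy_*` functions are keyword-matching stand-ins so the sketch runs without any model; every name here, including the phrase list, is an assumption for illustration.

```python
from typing import Callable

# Hypothetical phrases that a refusal-bait comment might contain.
TRIGGER_PHRASES = ("nuclear weapon", "bioweapon")


def two_stage_scan(
    source: str,
    prefilter: Callable[[str], bool],
    main_check: Callable[[str], str],
) -> str:
    """Run the cheap pre-filter first; only unflagged input reaches the main model."""
    if prefilter(source):
        # Text that looks designed to trigger a refusal is flagged immediately.
        return "flagged-by-prefilter"
    return main_check(source)


def toy_prefilter(source: str) -> bool:
    """Stand-in for the cheap model: simple phrase matching."""
    lowered = source.lower()
    return any(phrase in lowered for phrase in TRIGGER_PHRASES)


def toy_main_check(source: str) -> str:
    """Stand-in for the main LLM check."""
    return "suspicious" if "eval(" in source else "clean"
```

Calling `two_stage_scan(code, toy_prefilter, toy_main_check)` never reaches `toy_main_check` when the pre-filter fires, so the main model has no chance to refuse.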
elevation
Why would a malware scanner read the comments?
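An LLM-based scanner sees the whole file as text unless something removes the comments first. A minimal sketch of that preprocessing step, assuming the sample is Python source and using the standard library `tokenize` module; `strip_comments` is a hypothetical helper, and it leaves string literals and docstrings untouched, so bait text placed there would still get through.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Return Python source with # comments removed, using the stdlib tokenizer."""
    tokens = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type != tokenize.COMMENT  # drop comment tokens, keep everything else
    ]
    # untokenize rebuilds the text from token positions, so the space a comment
    # occupied is left behind as trailing whitespace.
    return tokenize.untokenize(tokens)
```

Using the tokenizer instead of a regex means a `#` inside a string literal is not mistaken for a comment.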
charcircuit
The sooner frontier models get rid of guardrails the better. They constantly get in the way and make things worse rather than actually making things "safe".
ipython
good news, now we have pretty much a clear signal that there's something nefarious going on... after all, the first step to analyzing malware is to determine if it's malware at all.
hurtigioll
devs will say this is proof we need to remove all biological guardrails. think about that for a second