Pretty fascinating research using OpenAI's GPT-4 to generate code to bypass another AI model's protections. This could be used to trick a classifier into thinking an image of something dangerous – like a gun – is harmless.
AI-Guardian attempts to prevent such scenarios by building a backdoor in a given machine learning model to identify and block adversarial input – images with suspicious blemishes and other artifacts that you wouldn't expect to see in a normal picture.
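To make "adversarial input" concrete, here is a minimal sketch of the general idea. It assumes a made-up four-pixel linear classifier. The weights, the pixel values, and the names `score`, `classify` and `perturb` are all hypothetical, and none of this is AI-Guardian's code or Carlini's attack. It only shows how a small, bounded change to each pixel can flip a label.

```python
# Toy illustration only: a hypothetical linear "classifier", not AI-Guardian.
WEIGHTS = [0.5, -0.25, 1.0, 0.75]  # assumed weights, one per pixel
BIAS = -0.5                        # assumed bias term


def score(pixels):
    """Weighted sum of the pixels plus the bias."""
    return sum(w * p for w, p in zip(WEIGHTS, pixels)) + BIAS


def classify(pixels):
    """Label the input 'dangerous' when the score is positive."""
    return "dangerous" if score(pixels) > 0 else "harmless"


def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def perturb(pixels, epsilon):
    """Shift each pixel by at most epsilon, in the direction that lowers the score."""
    return [p - epsilon * sign(w) for w, p in zip(WEIGHTS, pixels)]


original = [0.6, 0.2, 0.5, 0.4]
adversarial = perturb(original, epsilon=0.25)

print(classify(original), classify(adversarial))
```

In this toy setup the original input is labelled dangerous and the perturbed copy is labelled harmless, even though no pixel moved by more than 0.25. Real attacks on image classifiers work against far larger, non-linear models, and a defense like AI-Guardian aims to spot and reject such altered inputs.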
Zhang said he and his co-authors worked with Carlini, providing him with their defense model and source code. Later, they helped verify the attack results and discussed possible defenses in the interest of helping the security community.
Zhang said Carlini's contention that the attack breaks AI-Guardian is true for the prototype system described in their paper, but it comes with several caveats, and the attack may not work against improved versions.
Anyway, here's how GPT-4 described the proposed attack on AI-Guardian when prompted by Carlini to produce the explanatory text:
Carlini said he chose to attack AI-Guardian because the scheme outlined in the original paper was obviously insecure. His work, however, is intended more as a demonstration of the value of working with an LLM coding assistant than as an example of a novel attack technique.
Carlini's assessment of the merits of GPT-4 as a co-author and collaborator echoes – with the addition of cautious enthusiasm – the sentiment of actor Michael Biehn when warning actor Linda Hamilton about a persistent cyborg in a movie called The Terminator: "The Terminator is out there. It can't be bargained with. It can't be reasoned with. It doesn't feel pity or remorse or fear. And it absolutely will not stop, ever, until you are dead."
"GPT-4 is much faster at writing code than humans – once the prompt has been specified. Each of the prompts took under a minute to generate the corresponding code."
"GPT-4 does not get distracted, does not get tired, does not have other duties, and is always available to perform the user's specified task."