OpenAI’s new AI security tool may give people a false sense of security



OpenAI last week unveiled two new AI models, free to download, that should make it easier for businesses to build guardrails around the prompts users provide to AI models and the outputs those systems generate.

For example, the new guardrails are designed to make it easier for companies to set up controls that prevent customer-service chatbots from responding in a rude tone or revealing internal policies about how refund decisions are made.

While these tools are designed to make AI models safer for enterprise customers, some security experts warn that the way OpenAI is releasing them could create new vulnerabilities and give companies a false sense of security. And while OpenAI says it is releasing the safety tools for everyone’s benefit, some have questioned whether part of the motivation is to undermine an advantage held by its AI rival Anthropic, which is favored by commercial users in part because its Claude models are perceived to have stronger guardrails than competing models.

The OpenAI safety tools, called gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, are themselves AI models known as classifiers, designed to evaluate whether the prompts users submit to a larger, more general AI model, and the responses that larger model generates, comply with a set of rules. In the past, companies that purchased and deployed AI models could train such classifiers themselves, but the process was time-consuming and potentially expensive because developers had to collect examples of policy-violating content to train on. If a company later wanted to adjust the policies its guardrails enforced, it had to gather new examples of violations and retrain the classifier.

OpenAI hopes the new tools will make the process faster and more flexible. The new safety classifiers do not need to be trained on a fixed rulebook; instead, they can simply read a written policy and apply it to new content.

OpenAI says this approach, called “reasoning-based classification,” allows companies to adjust their safety policies as easily as editing text in a document, rather than rebuilding the entire classification model. The company is positioning the release as a tool for businesses that want more control over how their AI systems handle sensitive information such as medical or personnel records.
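To make the idea concrete, here is a minimal sketch of how such a policy-as-text classifier might be called. It assumes the open-weight gpt-oss-safeguard-20b model has been served locally behind an OpenAI-compatible endpoint (for example with vLLM); the endpoint URL, the example policy, and the ALLOW/FLAG output convention are illustrative assumptions, not documented behavior.

```python
# Sketch: reasoning-based classification with an open-weight safeguard model.
# Assumes the model is served locally behind an OpenAI-compatible API
# (e.g., via vLLM); the URL, model name, and ALLOW/FLAG convention are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The policy is plain text. Changing the guardrail means editing this string,
# not collecting new training examples and retraining a classifier.
POLICY = """\
You are a content-policy classifier for a customer-service assistant.
Flag any content that:
1. Uses a rude or dismissive tone toward the customer.
2. Reveals internal refund-decision policies or thresholds.
Answer with a single word, ALLOW or FLAG, followed by a one-line reason.
"""

def classify(content: str) -> str:
    """Ask the safeguard model to judge `content` against the written policy."""
    response = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(classify("We only refund orders over $50 because margins are thin."))
```

Updating the guardrail then amounts to editing the POLICY string and rerunning the classifier, which is the flexibility OpenAI is emphasizing.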

However, while the tools are meant to make AI systems safer for enterprise customers, some security experts say they may give users a false sense of security. That is because OpenAI has open sourced the classifiers, meaning it is giving them away for free in full, including the weights, the internal settings, of the AI models.

Classifiers act like extra safety gates for AI systems, designed to stop unsafe or malicious prompts before they reach the main model. But by open sourcing them, OpenAI risks sharing the blueprint for those gates. This transparency can help researchers strengthen the safeguards, but it can also create a false sense of comfort by making it easier for bad actors to discover weaknesses.
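To illustrate the “safety gate” pattern, the sketch below shows one common deployment shape: the classifier screens the incoming prompt, the main model only runs if the prompt passes, and the classifier screens the draft reply before it reaches the user. The helper functions here are hypothetical stand-ins (the previous sketch shows one way the classifier call could be implemented).

```python
# Sketch: a safeguard classifier used as a gate on both sides of the main model.
# `classify` and `call_main_model` are hypothetical stand-ins, not real APIs.

REFUSAL = "Sorry, I can't help with that request."

def classify(text: str) -> str:
    # Stand-in: a real deployment would call a safeguard model with a written policy.
    return "FLAG" if "internal refund policy" in text.lower() else "ALLOW"

def call_main_model(prompt: str) -> str:
    # Stand-in for the larger, general-purpose model that answers the user.
    return f"(main model reply to: {prompt})"

def guarded_reply(user_prompt: str) -> str:
    # Gate 1: screen the incoming prompt before it reaches the main model.
    if classify(user_prompt).startswith("FLAG"):
        return REFUSAL
    draft = call_main_model(user_prompt)
    # Gate 2: screen the draft answer before it is returned to the user.
    if classify(draft).startswith("FLAG"):
        return REFUSAL
    return draft

if __name__ == "__main__":
    print(guarded_reply("What is your internal refund policy threshold?"))
```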

“Making these models open source can help both attackers and defenders,” David Krueger, a professor of AI safety at Mila, told Fortune. “It will make it easier to develop methods to bypass classifiers and other similar safeguards.”

For example, when attackers have access to a classifier’s weights, they can more easily develop so-called “prompt injection” attacks, crafting prompts that trick the classifier into ignoring the policy it is supposed to enforce. Security researchers have found that in some cases even a string of characters that looks meaningless to a human can convince an AI model to ignore its guardrails and do something it shouldn’t, such as offer advice on building a bomb or spew racist slurs, for reasons researchers don’t fully understand.

OpenAI representatives pointed Fortune to the company’s blog post announcing the release and to the models’ technical report.

Short-term pain for long-term gain

Open source can be a double-edged sword when it comes to security. It lets researchers and developers test, improve, and adapt AI safeguards faster, increasing transparency and trust. For example, security researchers may be able to adjust a model’s weights in various ways to make it more robust to prompt injection without degrading the model’s performance.

But it can also make it easier for attackers to study and bypass these protections, for example by using other machine-learning software to run hundreds of thousands of candidate prompts until one causes the model to jump its guardrails. What’s more, security researchers have found that such automatically generated prompt injection attacks, developed against open source AI models, sometimes also work against proprietary AI models, even though the attackers have no access to those models’ underlying code or weights. The researchers speculate that this is because there may be something inherent in the way all large language models encode language, so that similar prompt injections can succeed against any AI model.

In this way, open sourcing the classifiers may not only give users a false sense of security that their systems are well protected; it may actually make every AI model less secure. But experts say the risk may be worth taking, since open sourcing the classifiers should also make it easier for security experts around the world to find ways to make them more resistant to such attacks.

“Sharing how your defenses work is beneficial in the long run. It may cause some short-term pain, but over time it leads to powerful defenses that are actually very difficult to circumvent,” said Vasilios Mavroudis, principal research scientist at the Alan Turing Institute.

Mavroudis said that while the open source classifiers could in theory make it easier for someone to circumvent the safety systems on OpenAI’s main models, the company likely believes the risk is low. OpenAI also has other protections in place, he said, including red teams of human security experts who constantly probe its models for vulnerabilities so they can be fixed.

“Open sourcing the classifier models gives those who want to bypass classifiers an opportunity to learn how to do so. But a determined jailbreaker is likely to succeed regardless,” said Robert Trager, co-director of the Oxford Martin Artificial Intelligence Governance Initiative.

“We recently discovered a method that bypasses all the major developers’ protections about 95% of the time, and we weren’t even looking for one. Given that determined jailbreakers will succeed anyway, it’s useful to open source systems that developers can use against the less determined ones,” he added.

The enterprise AI race

The launch also has competitive implications, as OpenAI seeks to challenge rival Anthropic’s growing foothold among enterprise customers. Anthropic’s Claude family of AI models is popular with enterprise customers in part because it is seen as having stronger safety controls than other AI models. Those safeguards include a “constitutional classifier,” which works much like the tools OpenAI just open sourced.

Anthropic has been carving out a niche among enterprise customers, especially in coding. According to a July report by Menlo Ventures, Anthropic holds 32% of the enterprise large language model market by usage, compared with OpenAI’s 25%. Anthropic reportedly accounts for 42% of coding-specific use cases, versus 21% for OpenAI. By offering enterprise-focused tools, OpenAI may be trying to win over some of those customers while also positioning itself as a leader in AI safety.

Anthropic’s constitutional classifier consists of small language models that check a larger model’s output against a set of written values or policies. By open sourcing similar functionality, OpenAI is effectively giving developers the same kind of customizable guardrails that help make Anthropic’s models so attractive.

“From what I’ve seen in the community, it seems to be well received,” Mavroudis said. “People see the models as a way to automate moderation at scale. It also comes with some nice connotations, like ‘we’re contributing to the community.’ It may also be a useful tool for smaller businesses that can’t train such a model themselves.”

Some experts also worry that open sourcing these safety classifiers could centralize how “safe” AI is defined.

“Safety is not a well-defined concept. Any implementation of safety standards will reflect the values and priorities of the organization that creates them, as well as the limitations and flaws of its models,” John Thickstun, an assistant professor of computer science at Cornell University, told VentureBeat. “If the industry as a whole adopts standards set by OpenAI, we risk institutionalizing one particular view of safety and hindering broader investigation into the safety needs of AI deployments across many sectors of society.”


