Last updated: 2024-08-25.

Below are some areas in the field of cybersecurity in AI where I would like to see progress. By “cybersecurity in AI”, I mean one of two things: cybersecurity from AI (e.g. how to deal with deep fakes making phishing easier), or cybersecurity for AI (e.g. how to make ML supply chains trustworthy). (There’s also cybersecurity with AI, e.g. using AI to improve cybersecurity products, which I don’t consider here.) The following items are at least partly research questions.

  • confidential computing: third-party model evaluations are an important component of AI risk mitigation. Today, there is a trust issue: evaluators don’t want the AI labs to know which evaluations they’re running (because the labs could game them), and labs don’t want evaluators to have access to their model weights. In practice, third-party evaluators have so far been trusting the labs not to look at the evaluations, for lack of better options, and are limited to black-box investigations on frontier models. Confidential computing could change both of these things: an evaluation could run inside an attested environment that neither leaks the evaluator’s prompts to the lab nor the weights to the evaluator (a simplified sketch of the evaluator-side check appears after this list). There are software-based solutions (which tend to have significant overhead) and hardware-based solutions (example company in this space: Mithril Security).
  • widespread cryptographic proofs of content authenticity, or “HTTPS for media”: deep fakes have the potential to make phishing easier (by spoofing voice and video) and to undermine images and videos as trusted media. However, public-key cryptography can be leveraged to prove the authenticity of content, in the same way that HTTPS is now a widespread proof of the authenticity of websites. To prove that a raw picture or video is genuine, it’s roughly enough to have the camera contain a private key, ideally in a Trusted Platform Module (TPM), and use it to digitally sign all its outputs (see the signing sketch after this list). This notably requires strong protections to prevent people with physical access to a camera from signing arbitrary images, as well as progress in the infrastructure needed to make verification of all media online the default, in the same way that HTTPS has, after decades of effort, become a universal standard, with browsers now raising warnings on non-HTTPS content. (Thanks to Manuel Reinsperger for discussion on this point and for making me realize that this idea of “HTTPS for content” is harder than I thought.)
  • secure AI assistants: imagine you had a trusted, always-on AI assistant on your devices. Among many benefits, this could drastically reduce the widespread damage from phishing and scams, essentially addressing the issue that “humans don’t get upgraded”. But there are a few security issues to solve first, among which prompt injection and, more generally, adversarial robustness, as well as the more standard security and privacy issues of a privileged process with access to everything on a user’s computer (as an example of how not to approach this, see the Microsoft Recall fiasco).
  • trustworthy ML supply chain: backdoors are easier to create than to detect, and when the weights of a model are published, we would ideally want stronger guarantees on these weights than just trusting the model developer. Work in this domain could include cryptographic proofs that a model was indeed pretrained on the alleged training data, post-trained with the alleged method and data… (as a point of comparison, the sketch after this list shows the much weaker, trust-based baseline of a manifest signed by the developer). In the future, a powerful model trained to be honest and unbiased would be valuable, and this would also require similar proofs for everyone to agree that the model isn’t e.g. pushing a hidden agenda.
  • securing source code: work to improve the performance of AI vulnerability detection in source code (e.g. something like the eyeballvul benchmark) would be valuable¹. Reducing false positives should be relatively easy, by spawning agents that investigate each lead in detail (see the triage sketch after this list). Reducing false negatives (finding more, and harder, vulnerabilities) is more challenging. There’s a lot of work to be done in integrating frontier LLMs with the usual tools used by vulnerability researchers.
  • good benchmarking of offensive hacking capabilities: we don’t want to be caught by surprise by sudden jumps in capabilities there. In fact, when agents are able to significantly uplift offensive hacking teams, it will probably be time to start restricting access to these capabilities (from the current default of everyone having access to SOTA models, protected only by “alignment” guardrails that are not adversarially robust, plus some amount of monitoring). If all models were served through a monitored API, labs could potentially wait until they detect threat actors using these hacking capabilities in concerning ways before restricting access. While this might work, there is no such backpedaling mechanism for open-weights models, so measures should be taken before they reach dangerous levels of hacking capabilities.
  • securing AI labs against state-level actors: this is really hard. Even the NSA’s hacking arsenal was stolen and published in 2016/2017. Leopold Aschenbrenner has a great overview of the problem; see also the RAND report Securing AI Model Weights.
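
To make the confidential computing item more concrete, here is a minimal sketch of the evaluator-side check in a hardware-based setup, assuming a much-simplified attestation scheme: the enclave produces a report containing a measurement (hash) of the code it loaded, and the hardware vendor signs that report. Real TEEs (SGX, SEV-SNP, etc.) use certificate chains and their own report formats; the JSON report, the single vendor key and the placeholder measurement below are simplifications, not an existing API.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

# Hash of the audited evaluation harness, agreed on by the lab and the
# evaluator in advance (placeholder value).
EXPECTED_MEASUREMENT = "sha256-of-the-audited-harness"

def evaluator_trusts_enclave(report: bytes, signature: bytes,
                             vendor_key: Ed25519PublicKey) -> bool:
    """Return True iff the attestation report is genuine and the enclave runs the agreed harness."""
    try:
        # The hardware vendor's key vouches that the report was produced by a
        # genuine enclave (real schemes use a certificate chain, not one key).
        vendor_key.verify(signature, report)
    except InvalidSignature:
        return False
    # The report includes a measurement (hash) of the code loaded in the
    # enclave. It must match the harness both parties audited, so the lab
    # can't read the evaluator's prompts and the evaluator can't exfiltrate
    # the weights.
    return json.loads(report)["measurement"] == EXPECTED_MEASUREMENT
```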
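
For the “HTTPS for media” item, here is a minimal sketch of the signing and verification steps using Ed25519 via the cryptography package. In a real deployment the private key would live in the camera’s secure hardware and its public key would be certified by the manufacturer (the analogue of a TLS certificate chain); the in-memory image bytes are a stand-in for the raw sensor output.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Inside the camera: a device key generated once at manufacture (never exported).
camera_key = Ed25519PrivateKey.generate()
image_bytes = b"raw sensor output"         # stand-in for the captured file
signature = camera_key.sign(image_bytes)   # shipped in the file's metadata

# On the verifier's side (browser, platform, fact-checker): check the
# signature against the camera's certified public key.
public_key = camera_key.public_key()
try:
    public_key.verify(signature, image_bytes)
    print("genuine, unmodified capture from this camera")
except InvalidSignature:
    print("modified, or not produced by this camera")
```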
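
For the ML supply chain item, the sketch below shows the weak, trust-based baseline referred to in that item: a provenance manifest that binds the published weights to hashes of the training data and post-training setup, signed by the developer. This only proves what the developer claims; the open problem is producing proofs that don’t require trusting the signer. The artifact contents and the “RLHF” label are placeholders.

```python
import hashlib
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-ins for the real artifacts (read from disk in practice).
weights = b"model weights"
pretraining_shard_0 = b"pretraining data, shard 0"
preference_data = b"post-training preference data"

manifest = {
    "weights_sha256": sha256_hex(weights),
    "pretraining_data_sha256": [sha256_hex(pretraining_shard_0)],  # one entry per shard
    "post_training": {"method": "RLHF", "data_sha256": sha256_hex(preference_data)},
}

developer_key = Ed25519PrivateKey.generate()  # in practice, a long-lived published key
payload = json.dumps(manifest, sort_keys=True).encode()
signature = developer_key.sign(payload)       # published alongside the weights
```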
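
Finally, for the false-positive reduction mentioned in the securing-source-code item, here is a sketch of the triage loop: each candidate finding from a cheap first pass is handed to a more thorough agent that reads the relevant code and returns a verdict. `run_agent`, the `Lead` shape and the verdict format are assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Lead:
    file: str
    line: int
    description: str  # e.g. "possible SQL injection in build_query()"

def triage(leads: list[Lead], run_agent: Callable[[str], str]) -> list[Lead]:
    """Keep only the leads that a detailed investigation still considers real."""
    confirmed = []
    for lead in leads:
        # Hand each lead to an agent with code-reading tools (hypothetical).
        verdict = run_agent(
            "Investigate this candidate vulnerability in detail, then answer on "
            "the first line with exactly CONFIRMED or FALSE POSITIVE:\n"
            f"{lead.file}:{lead.line} {lead.description}"
        )
        if verdict.strip().upper().startswith("CONFIRMED"):
            confirmed.append(lead)
    return confirmed
```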

Footnotes

  1. Isn’t vulnerability detection dual-use? Yes, but see the discussion in section 6 of the eyeballvul paper for why I believe that vulnerability detection in source code, using simple and universal tooling, in the absence of an implementation overhang, should empower defenders disproportionately over attackers.