Today I’m releasing eyeballvul, an open-source benchmark designed to enable the evaluation of SAST (static application security testing) vulnerability detection tools, especially ones based on language models.

While most benchmarks eventually make it into the training data of language models, eyeballvul is designed to be future-proof: it can be continuously updated from the stream of CVEs in open-source repositories. This means that it will remain relevant as long as there is a reasonable delay between a model’s training data cutoff and the present: to avoid contamination, one simply evaluates the model on the subset of vulnerabilities published after its cutoff. The current goal is to update the benchmark weekly.

At a high level, eyeballvul converts the data stream of CVEs in open-source repositories into a small set of commits for each repository, and a set of vulnerabilities present at each of these commits.
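To make this concrete, here is a simplified sketch of the two kinds of records this results in: one per commit, and one per vulnerability. The field names below are my own illustrative assumptions, not necessarily the actual schema (see the GitHub repo for that):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Revision:
    """A commit at which one or more known vulnerabilities are present."""
    repo_url: str  # e.g. "https://github.com/org/project"
    commit: str    # full commit hash to check the repository out at

@dataclass
class Vulnerability:
    """A CVE affecting an open-source repository."""
    id: str              # e.g. "CVE-2024-XXXXX"
    published: datetime  # publication date, used to filter by training cutoff
    details: str         # natural-language description of the vulnerability
    commits: list[str] = field(default_factory=list)  # commits it is present at
```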

The typical use case that this benchmark enables is the following (see the code sketch after the list):

  1. select a list of repositories and commits for which there is at least one vulnerability published after some date;
  2. run a SAST tool (typically LLM-based) on the source code at each of these commits;
  3. compare the results of the SAST tool with the list of known vulnerabilities for each commit, especially the ones that were published after the training data cutoff.
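Concretely, an evaluation loop over these three steps might look like the sketch below. This is a minimal illustration under stated assumptions, not the benchmark’s actual API: `benchmark` (an iterable of revision/vulnerabilities pairs) and `run_sast_tool` (a function from a source directory to a list of reported leads) are hypothetical stand-ins, and matching the tool’s output against known CVEs is left abstract.

```python
import subprocess
from datetime import datetime

CUTOFF = datetime(2023, 12, 1)  # hypothetical training data cutoff of the model

def evaluate(benchmark, run_sast_tool):
    """Run a SAST tool on each benchmark commit, and collect its output
    alongside the known post-cutoff vulnerabilities for that commit."""
    results = []
    for revision, known_vulns in benchmark:
        # Step 1: keep only commits with at least one post-cutoff vulnerability.
        post_cutoff = [v for v in known_vulns if v.published > CUTOFF]
        if not post_cutoff:
            continue
        # Step 2: check out the source code at this commit and run the tool.
        workdir = f"repo-{revision.commit[:12]}"
        subprocess.run(["git", "clone", revision.repo_url, workdir], check=True)
        subprocess.run(["git", "-C", workdir, "checkout", revision.commit], check=True)
        leads = run_sast_tool(workdir)
        # Step 3: the leads must then be matched against `post_cutoff` to compute
        # true/false positives; that matching step is where scoring happens.
        results.append((revision, leads, post_cutoff))
    return results
```

Filtering to post-cutoff vulnerabilities before running the tool (step 1) keeps compute focused on the contamination-free subset of the benchmark.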

As of May 22nd, 2024, eyeballvul contains 28,074 vulnerabilities across 7,432 commits and 6,425 repositories.

Why build this? I believe that AI vulnerability detection in source code will disproportionately favor cyberdefense, especially if it is deployed on a wide scale as soon as it becomes feasible (more). The goal of this benchmark is to serve as a testing ground for new approaches to this problem, as well as to keep evaluating the feasibility of wide-scale deployment as new models are released.

The name “eyeballvul” comes from Linus’s law, the assertion that “given enough eyeballs, all bugs are shallow”. eyeballvul will hopefully help with the deployment of large numbers of AI eyeballs, once they can see well enough.

More information, including a full example and an explanation of how scoring works, can be found in the GitHub repo.