I’m implementing generative AI into my company’s cybersecurity product. Here’s what I’ve learned.

Eshel Yaron


Software engineer


AI is on everyone's mind today, from large corporations to middle school classrooms. And it's no wonder: this technology is transformative in the speed of creation and innovation it enables.

When ChatGPT came out, I was amazed at how well it fared with complex (and sometimes bizarre) tasks I threw at it. One of the first things I asked it to do was to implement a Lisp interpreter in Prolog. That's something I had done myself some time before, and let me tell you: there's not a lot of reference material about doing something like that on the web. Especially not up-to-date content (we're talking about implementing a 60-year-old programming language in a 50-year-old language).

To my surprise, ChatGPT didn't flinch and started outlining a Lisp interpreter in Prolog. That was a wow moment for me. I had played with chatbots before, but this was clearly something else. It didn't get everything right (as many of us know and have experienced), but I saw huge potential.

The Over-Optimism Complex

One of my first findings in the quest to implement generative AI is that we face a problem of over-optimism with LLMs. When people first try out new, innovative tools, they're delighted and surprised to see them produce results that go beyond expectations.

Yet if you consult these tools in an area where you lack the expertise to scrutinize their output, you'll often get bad results.

For us, a good example of this is code fixes, or so-called "self-healing code." Initially our team was very enthusiastic about it. We generated templated fixes for many issues, and it was my job to validate them before we integrated them into our product.

Sadly, almost all of the fix templates that GPT generated were, in practice, not applicable. GPT struggled particularly with infrastructure as code and cloud infrastructure misconfigurations.

When it comes to computer programs, we have a good understanding of what's possible and what's not. We're already quite good at taking computational tasks and classifying them by how difficult they are. But for some problems, there will never be an algorithm that solves them. We've known this since 1937, when Turing showed that you can't solve the halting problem: it's not possible to write a program that takes another program and its input and determines whether the given program will ever halt.

But for the over-optimist, in this new world AI can solve the halting problem: just give it a program and it'll tell you whether it halts. Unfortunately, it's easy to trick AI into getting it wrong. This kind of issue is pervasive. LLMs complete prompts; unlike (most of) us, they do not rely on a logically consistent model of the world.
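Turing's argument is worth keeping in mind, because it doesn't depend on the technology at hand. A minimal sketch of the classic diagonalization (my own illustration, not code from our product): suppose a `halts` oracle existed, and build a program that does the opposite of whatever the oracle predicts about it.

```python
def halts(f, x):
    # Hypothetical oracle deciding whether f(x) eventually halts.
    # Turing's proof shows no correct implementation can exist for all f, x.
    raise NotImplementedError("no algorithm can decide this for all f, x")

def paradox(f):
    # Do the opposite of what the oracle predicts about running f on itself.
    if halts(f, f):
        while True:  # oracle said "halts", so loop forever
            pass
    return "halted"  # oracle said "loops forever", so halt immediately

# Consider paradox(paradox): if halts(paradox, paradox) returns True,
# paradox loops forever; if it returns False, paradox halts at once.
# Either answer contradicts the oracle, so `halts` cannot exist.
```

No amount of model scale changes this argument; an LLM that confidently answers halting questions is completing a prompt, not deciding the problem.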

We approximate undecidable problems by choosing a different problem that IS decidable and coming up with an appropriate algorithm that solves it. However, this substitute is usually a weaker, less powerful problem.

Static analysis for finding vulnerabilities is a good example. Instead of verifying that a program is safe, which is undecidable in general, we verify a much stricter criterion. Perhaps a person could prove that a concrete program is safe, but no algorithm can decide safety for every possible program. All static analysis tools rely on this type of approximation and are therefore overly restrictive: we want to guarantee safety properties that we can't decide with complete accuracy but CAN estimate.
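To make the trade-off concrete, here is a toy over-approximating check (my own example, not a real analyzer): instead of the undecidable question "can attacker-controlled data ever reach `eval()`?", we decide the stricter question "does the code call `eval()` at all?".

```python
# Over-approximating static check: flag any syntactic call to eval(),
# whether or not it is actually reachable or dangerous.
import ast

def flags_eval(source: str) -> bool:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            return True
    return False

print(flags_eval("x = eval(user_input)"))        # True: a real risk
print(flags_eval("if False:\n    eval('1+1')"))  # True: unreachable, flagged anyway
print(flags_eval("x = 1 + 2"))                   # False
```

The check is decidable and sound for the property it actually verifies, but overly restrictive: it flags dead code too. That is the price of swapping the real question for one an algorithm can answer.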

Keep in mind: AI and LLMs are still bound by these same fundamental restrictions, and we should apply them accordingly. When a dev works on a code project, it's extremely helpful to have suggestions from LLMs, since what the dev is doing is likely something other people have already done, and LLMs are trained on that data. Code review processes and automatic, deterministic tools like static analysis foster confidence in the same way: if it's been done right before, it can be done right now.

In the 1920s, David Hilbert imagined we could build machines, or come up with an algorithm, that would mechanically decide whether theorems are true or false. It's a beautiful dream: if two people ever disagreed, then after a few words exchanged, one would tell the other "let us calculate" (calculemus), and they would sit down with the abacus and simply solve the problem. But a few years later, Kurt Gödel proved this would never be possible. Any sufficiently powerful, consistent formal system is incomplete: there are true statements that you will never be able to prove mechanically. That's the kind of over-optimism we're living through right now, and we'll have to learn the same lesson again.

Security problems that LLMs are not well-suited to address directly

We're well aware that LLMs are driving a level of digital transformation that is unparalleled. But what are they not able to do well? Let's dig in.

  1. Root cause analysis - Tying runtime security findings back to their root cause requires a ton of context that differs greatly from organization to organization. It's not enough to know "how people usually go about solving this issue" (which LLMs know very well). It requires knowing how this particular resource has been configured, built, deployed, and managed, and by whom. It requires access to VCS logs, CI/CD logs, CSP APIs, and other online data sources. We have very powerful techniques for RCA that get us the right answer without fail, given access to the relevant data. That data is often bespoke and sometimes real-time, so leveraging LLMs for this task is difficult and not very promising given their inherent limitations.
  2. Identifying novel application weaknesses - Analyzing compiled code is notoriously difficult for humans, and it turns out that LLMs struggle with it just as much. In fact, application security review is an area where LLMs offer little benefit as of today. This is because of the complex semantics we must model in order to properly understand code flow, especially in lower-level machine code. We need to keep track of variables, registers, and function calls, all while running experiments and getting feedback. Here too, LLMs can be of great help in brainstorming and tapping into details of known vulnerabilities, but the LLMs we have today are very far from being able to find new vulnerabilities themselves.
  3. Suggesting vulnerability fixes in highly specialized domains/frameworks - Let's say we got our security analysts to roll up their sleeves and find a vulnerability in our system. Suppose we're doing post-quantum cryptography, building a new crypto library that provides secure operations that quantum algorithms can't overcome. Our analysts found a weakness in our encryption that allows them to guess a few plaintext bits with high confidence by analyzing large amounts of ciphertext. That's a crypto weakness par excellence; now we face the question of remediating it. Can we use an LLM to do that? Well, no. It would need an understanding of the mathematical principles on top of which we build our crypto framework, and of their specific application in our implementation. We'd have to fine-tune the LLM, feeding it our code base and the recent literature on post-quantum cryptography. And even then, since we're the first team working with this framework and the first to face this vulnerability, the training set for the LLM contains hardly any relevant explicit data. Such scenarios are not uncommon: while many of the vulnerabilities in modern code bases are not that novel, we all specialize in something. In the areas where we're at the very front of innovation, LLMs can do far less for us.

(Here we see ChatGPT trying to deal with an unknown code framework, and resorting to very broad, sometimes plain wrong, suggestions)

What needs to be in place for these tools to be useful? Most important is context: setting LLMs up to successfully complete the prompt. It's all about integrating at the right places, and there are many places where these tools can streamline work in cybersecurity, and in remediation specifically.

The right places to integrate

  • Exploration. LLMs are amazing partners for brainstorming: they help you get a high-level understanding of your risks, explore different security strategies, run a soft prioritization process, and get a bird's-eye view of your security posture. It's a very different kind of use case compared to the extreme precision required when you fix or change code. When I go to Google and type some text, the engine suggests several options, and I don't mind if a few of them are a bit off the mark. Unlike with, say, a compiler, I'm not completely dependent on the system for exact results; I'll get where I need to go. But the better the results are, the easier it'll be.
  • Contextualizing security findings. Another great place to integrate LLMs in security is contextualizing findings that you don't exactly know how to deal with. Questions like "How do others perceive this issue?" and "What kind of common actions do people take in response to it?" are exactly what these models were made to answer best. Even without perfect knowledge of the specifics of your case, for most security findings you'll get a pretty good ballpark, since many of the issues we face every day are issues that somebody has already solved, and your friendly LLM may very well have been trained on that somebody's blog post about it.
  • Complex query language. In the Dazz platform we have several built-in AI-powered features, but one of my favorites is our natural language interface for filtering security findings. We ingest a variety of security findings and expose them through a rich, uniform data scheme that users work with in the Dazz console and via our API. Sometimes users create very complex filters in our console to focus on a precise set of security findings. The Dazz console makes it quite intuitive to slice and dice your findings, but our natural language filter model truly makes crafting these complex queries a breeze. You describe the findings you want to see in plain English, and Dazz invokes an AI model to translate your description into a filter and apply it to our uniform findings data scheme. This saves a lot of time tracking down and spelling out technical details, such as specific CVE numbers, and lets you start instead from higher-level concepts, like "exploitable Windows Server vulnerabilities." The reason I think this feature works so well, and why I think of it as putting AI to good use, is that the natural language interface works great for the users, and the model is well suited to producing useful filters. These filters need not be perfect; the user can further tweak them to their heart's content, which is yet another reason LLMs work so well for this task.
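The general pattern behind such a feature can be sketched in a few lines. This is my own simplified illustration with a made-up filter schema, not Dazz's actual implementation: the model translates plain English into a structured filter, and the application validates it against the schema before applying it, so an imperfect translation fails safely or can be tweaked by the user.

```python
# Validate a model-produced filter against a (hypothetical) findings schema.
import json

FILTER_FIELDS = {"severity", "os", "exploitable", "cve"}

def parse_model_filter(model_output: str) -> dict:
    """Parse the model's JSON and reject any fields outside the schema."""
    filt = json.loads(model_output)
    unknown = set(filt) - FILTER_FIELDS
    if unknown:
        raise ValueError(f"model produced unknown filter fields: {unknown}")
    return filt

# For "exploitable Windows Server vulnerabilities" the model might emit:
model_output = '{"os": "windows_server", "exploitable": true}'
print(parse_model_filter(model_output))
# {'os': 'windows_server', 'exploitable': True}
```

The validation step is what makes the fuzziness of the model acceptable: a wrong filter is caught or corrected, never silently trusted.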

In conclusion, there are several essential security tasks where LLMs fall short. It's not always easy to recognize the tasks that are likely to benefit from LLM-based solutions, but there's a large class of problems for which it's clear that LLMs aren't a silver bullet. It depends on the context required to accomplish the task and on how bespoke and real-time that context tends to be. It also depends on the computational nature of the task.

So before spending a sprint or two integrating ChatGPT with your cybersecurity product, ask yourself whether it's really the right tool for the task at hand. Crucially, have a clear understanding of the task itself, whether you throw AI at it or rely on your own people.

For more information: 

See Dazz for yourself.

Get a demo