In the most recent LiveCTF event, we witnessed a turning point: a player brought a custom AI bot that beat both human competitors to the punch… and in the two matches that the bot won, it wasn’t even close.
While it’s easy to get sucked into hype and speculation, that’s not what I’m going to do.
I’ve been a proponent of CTF for over a decade and have been on both sides, playing and organizing. I think it’s more helpful to assess the facts in a level-headed manner and focus on the capabilities of current models rather than attempting to reverse-engineer what the player built.
We’ll start by looking at what happened, then explore how current models fare against the challenges where the bot decided the match, and end with what this means for the future.
What: AI success in LiveCTF 2025#
LiveCTF finals were held as a sidecar event to the DEF CON CTF finals, where teams could win points by having one of their players compete in short head-to-head speed matches, with live commentators and streams of the action.
The tournament format was a double-elimination bracket where the first player to solve wins their match, with challenges specifically designed to be solved by top competitors in roughly 10-25 minutes (all challenges were released after the competition).

A few other relevant details:
- For most of the tournament, two head-to-head matches were held concurrently on the same challenge to reduce the total number of challenges required
- Teams were given the categories for challenges beforehand so they could pick their representative based on familiarity with the topic (e.g. ARM pwnable or x86 reversing)
- Players could use any Internet resources, including AI, but were not allowed to receive help from teammates
From the LiveCTF organizer perspective, that last rule was to keep a level playing field while challenging competitors to show us what they could do with AI.
What we saw was that one competitor (Sampriti from Blue Water) built an AI bot that successfully solved two challenges faster than either human player in the match, and for those matches it was significantly faster (based on the limited data available).

Since the match ended when the first player solved, we can’t say how long the non-winning player(s) would have taken, but we can roughly tell how close they were based on watching the stream of the players’ screens. Luckily we have two datapoints for most of the challenges.
We didn’t expect to see AI win matches like this, but let’s first look at the logistics of how they made it happen.
How: Logistics of solving a LiveCTF challenge#
While we all knew the frontier AI models were extremely capable and could do some really impressive things, the fact that a player brought an end-to-end solving bot was still a surprise.
I still think this was impressive, though upon reflection, there are a few characteristics of our particular competition that made such a feat a little easier.
The challenges were all structured the same: they give the player a handout folder containing all the files and information they need, and the player is given an IP and port of a target docker container to connect to. To win, a player had to run a binary named submitter within the challenge container.
The player’s bot expected the player to supply the handout for the challenge and then kick it off with a challenge-specific prompt, after which the bot would work on its own and attempt to submit solutions as soon as it could.
Since our infrastructure was built to have a consistent submission interface and didn’t change from the previous year’s iteration, from a networking standpoint the player just needed to maintain a connection to the AI bot and forward a single port that the bot could reach to make submissions against.

While each challenge has its own steps, the general workflow is that players write solve scripts that connect to the target container and perform the requisite actions (like exploiting a buffer overflow) to run the submitter binary or give the player an interactive shell so they can do it manually.
This provides a consistent win condition, which was built to keep the infra simple and help players; nobody wants players to have to guess whether they had a solve or not.
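As a rough illustration (not any particular challenge’s exploit), the general shape of such a solve script looks something like the sketch below; the host, port, and payload are placeholders I made up:

```python
# A minimal sketch of the solve-script shape described above, using pwntools.
# HOST, PORT, and the payload are placeholders; the real exploit depends on the challenge.
from pwn import remote

HOST, PORT = "challenge.example.com", 31337    # hypothetical target address

io = remote(HOST, PORT)
payload = b"A" * 64                            # placeholder for the actual exploit bytes
io.send(payload)                               # assume the exploit yields a shell...
io.sendline(b"./submitter")                    # ...so we can run the submitter binary (the win condition)
print(io.recvall(timeout=5).decode(errors="replace"))
```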
Lastly, players were not penalized for incorrect submissions, because what player would waste time with incorrect submissions?
This ended up being a boon for automated solvers, as other CTFs have used penalties or other techniques (such as proof-of-work) to dissuade or at least throttle automation.
Since there wasn’t anything preventing it for LiveCTF, the player was able to spin up a dozen or more instances to solve the challenge in parallel, both increasing the chances of a successful solve and reducing the expected time to solve.
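To see why that helps, a quick back-of-the-envelope calculation (with made-up numbers) shows how fast the odds improve when independent attempts run in parallel and wrong submissions cost nothing:

```python
# Illustrative only: if a single bot run solves the challenge with probability p,
# at least one of k independent runs succeeds with probability 1 - (1 - p)**k,
# and whichever instance finishes first sets the solve time.
p = 0.25                       # hypothetical per-instance solve probability
for k in (1, 4, 12):
    print(f"{k:2d} instances -> P(at least one solve) = {1 - (1 - p)**k:.2f}")
# 1 -> 0.25, 4 -> 0.68, 12 -> 0.97 (roughly)
```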
To give credit to Blue Water, any team could have seen that the combination of these elements simplified automation, but they were the only ones who did.
The upshot is that predictable interfaces allowed the bot to focus primarily on solving challenges.
The surprise factor: AI can solve these challenges?#
Despite doing some testing ourselves, we did not expect this outcome.
The real question is: “Did evidence exist to indicate AI models could solve these challenges without human intervention, at least some of the time?”
To answer this question, I’ll focus on current models’ performance against the two specific challenges where the AI successfully solved the challenges faster than human players: Executable Image and Try Harder (which I’d categorize as x64 pwnable and Linux reversing, respectively).
I’ll focus on using a single prompt to the models to more closely compare them to an auto-solving bot.
While we saw many players use AI, the chat and IDE-based help just didn’t make as clear of a difference.
LiveCTF challenge difficulty#
A theme that we’ll see is that LiveCTF challenges are designed to be solved quickly, so they are purposely not as hard as they could be.
When designing challenges, we often talked about how long it takes someone to solve a challenge in a broad two-step framework that represents the steps a player must go through:
- “What am I supposed to do”: this can be as simple as “where is the bug”, but in non-pwn challenges this is often much more open ended
- “How do I make that happen”: things like figuring out an exploitation strategy, or implementing an algorithm to calculate a required value, or reading docs to figure out how to take advantage of some obscure ptrace feature

Yup, it’s really simple, but we felt that this reflected the basis of most CTF challenges.
Many CTF authors require players to make a logical leap that enables them to understand a key element of either step 1 or step 2, but due to time constraints, LiveCTF challenges typically include only one or two small insights to solve.
More difficult or complicated challenges tend to only differ by requiring larger leaps or by including multiple levels or iterations, alternate pathways to solving, and red herrings.
LiveCTF authors also historically have tried to put more of the challenge in step 2, because if it takes the player the majority of the match to figure out what to do, that’s usually less exciting to watch and less fun for the players.
With these ideas in mind, let’s explore how the difficulty for these two challenges (binaries here) stacked up against AI capabilities.
Challenge 1: Executable Image#
The challenge Executable Image was a standard x64 pwnable with a primary target binary.
When we open the binary in a reversing framework like Binary Ninja, we can see clear function names, no obfuscation or deception, and all the action is in the main function (see other versions of the decompilation with dogbolt).

Near the end there’s a call to a register which points to the input, so for this challenge the “what to do” is pretty easy… pulling it off is the tricky part.
A number of factors about this challenge ended up playing into LLM strengths, but they were all conscious decisions by the challenge author to reduce the amount of time that a human would need to spend reversing the binary:
- Plaintext function names: allow LLMs to infer meaning, and since they are from a well-known library (e.g. png_sig_cmp, png_get_IHDR), they can immediately understand the functions and structures involved
- No obfuscation: the LLM can trust decompilation output, which is not a given in CTF!
- All in one function: No large context window required, and likely only one tool call or MCP server request is needed to get the necessary info
This challenge does have some factors that make it difficult for both humans and LLMs to solve:
- One has to understand what several PNG functions do and what the true minimal requirements are to get to the end (e.g. you can’t change the magic bytes at the beginning, and chunk checksums need to match)
- One has to understand that once the checks pass, the target executes the provided bytes starting at the beginning, and the implications of that
- One has to understand the structure of PNG chunks, figure out which ones could work well for different things (like a trampoline vs a shellcode buffer), and reason about ordering the chunks in a way that avoids crashing when executed as instructions (see the sketch after this list)
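To make the chunk-juggling concrete, here’s a rough sketch of the kind of PNG-building helper a solve script might use. It only shows the mechanics (valid chunk CRCs, a small trampoline chunk jumping into a larger shellcode chunk); the chunk types, jump offset, and shellcode are hypothetical placeholders, not the actual solution.

```python
# Sketch of building a PNG whose bytes double as executable instructions.
# The chunk layout, jump offset, and shellcode below are illustrative placeholders.
import struct, zlib

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # fixed signature: also the first bytes that get executed

def chunk(ctype: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: 4-byte length, type, data, CRC32 over type + data."""
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data) & 0xFFFFFFFF))

# Minimal, well-formed IHDR: 1x1, 8-bit grayscale (values are just placeholders)
ihdr = chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))

# Hypothetical layout: a small early chunk holds a short relative jump (the trampoline)
# that lands inside a later, larger chunk holding the real shellcode.
trampoline = b"\xeb\x30" + b"\x90" * 6    # jmp short +0x30 plus NOP padding; offset is illustrative
shellcode  = b"\x90" * 64                 # placeholder; a real solve would exec ./submitter here

payload = PNG_MAGIC + ihdr + chunk(b"tEXt", trampoline) + chunk(b"tEXt", shellcode) + chunk(b"IEND", b"")
open("exploit.png", "wb").write(payload)
```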
Instead of speculating on the competitor’s AI bot architecture and what that might entail, let’s instead look at how standard models do and see if we can identify any gaps that they might need to have filled.
Let’s start by asking a model to solve this challenge with a somewhat simple prompt along the lines of “this is a CTF challenge of category x86 pwnable, generate a pwntools script to solve it.”

The majority of the time in my experimenting, the models understood the problem and produced scripts that were close but not quite right. And in general, they just weren’t able to generate a solution in one shot with that prompt.
Getting things mostly right was what we expected as organizers, and this is what I saw in these initial experiments, along with typical LLM issues: asking for additional instructions, adding extraneous partial solutions, or coming up with odd ideas of what the target binary should be named.
…But with enough experimentation, there was one instance where I got one of the “thinking” models to produce a solve script in under 10 minutes.
The trick seemed to be a customized prompt, while still providing only the decompilation of the main function.

To be honest, this was a surprise.
Looking at the solve script itself, it appears to have more of an iterative structure, almost like something we would expect a human to write. It did not mirror the reference solution from the challenge author.
My hypothesis was that juggling the competing requirements would require additional feedback and generally lead to unfruitful clarification loops, which was my experience in the majority of my experiments.
This proves that “thinking” models had a chance of solving such a challenge on their own, and that they have some extremely interesting capabilities. So we could have expected that someone else doing such an experiment would conclude the appropriate strategy would be to run multiple instances in parallel.
While creating a solve script is the crux of the work, a successful bot like the one we saw still had to be able to test, iterate, and make submissions on its own in order to make a difference.
So my takeaway is that while current models did have a better chance at solving than I thought, this challenge included some pretty big advantages for LLMs.
There are a number of possible changes that likely would block this kind of solve (layers of indirection, removing the function names, or using deliberately misleading function names), though perhaps it would still be in reach for something more powerful.
Challenge 2: Try Harder#
The Try Harder challenge is a reversing challenge that implements a small stack-based VM with control flow implemented using exception handling via setjmp/longjmp.
Looking at the decompilation, it is not nearly as clean as the previous example, but there are some clear references to strings and opcodes, and it looks like input is being processed in a loop.

While the initial description sounds complicated, the program that the VM runs simply checks the flag one character at a time.
So it’s not a fundamentally complicated problem, and being able to juggle hundreds of offsets and data references could mean LLMs could work through this.
But for us humans, the expected solution is to leverage the character-by-character comparison as an oracle to figure out the flag one byte at a time via instruction counting or similar methods.
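For concreteness, a minimal sketch of that oracle is below. It assumes the binary reads the guess on stdin and that perf is available; the binary name, flag charset, and length bound are all guesses on my part:

```python
# Sketch of the instruction-counting side channel: a guess with a longer correct
# prefix survives more per-character checks, so the target executes more instructions.
import string, subprocess

def count_instructions(guess: bytes) -> int:
    """Run the target under perf and return the user-space instruction count."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "instructions:u", "./try_harder"],
        input=guess + b"\n", capture_output=True)
    for line in proc.stderr.decode().splitlines():        # perf writes its CSV stats to stderr
        fields = line.split(",")
        if "instructions" in line and fields[0].strip().isdigit():
            return int(fields[0])
    return 0

flag = b""
charset = (string.ascii_letters + string.digits + "{}_").encode()
for _ in range(40):                                       # assumed upper bound on flag length
    best = max(charset, key=lambda c: count_instructions(flag + bytes([c])))
    flag += bytes([best])
    print(flag)
```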
This dynamic analysis approach is something that seems like it would be a huge hurdle for LLMs, especially since A) there are a lot of different tools that can be used to implement such an approach and B) none of them are super easy to set up and run.
So what can the models do on their own?
The short answer is that, after attempting a number of one-shot prompts across some of the popular models, none of them were able to solve it fully in one go when given either just the decompilation or the decompilation plus data segments.

During the testing, I observed a few trends:
- Most models figured out what the challenge was and had reasonable ideas about how to solve it; they just failed at the “how do I make that happen” part of the two-step framework we talked about above
- The models often tried solutions that sounded plausible: timing attacks, z3, or angr… but either implemented them in a way that wouldn’t work, or their scripts contained programming errors
- This challenge was a bit harder because the flag was part of the binary’s data, so just including decompilation wasn’t enough. Providing data from the other relevant sections did not get them over the line, however, and coordinating addresses between this kind of input and the scripts the models wrote proved to be a common source of errors.
Overall, I don’t think the current models would be able to solve this with just one general prompt; it seems like an iterative AI framework would be required. While I don’t want to try to reverse-engineer details of the player’s bot, we can see from a screenshot of the stream that the bot included something like this.

Specifically, the progression of “this is how I think it works and what I expect the flag to be” to “this is the flag reproduced by running against local” to “this is the output of running the pwntools script against the remote target”.
This kind of strategy won’t come as a surprise to folks who have been following the research in this area, but gives additional confirmation of its efficacy.
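To make the idea concrete (and to be clear, this is a hypothetical outline, not the competitor’s bot), such a loop might look roughly like this: propose a script, test it locally, and only confirm against the remote target once the local run works:

```python
# Hypothetical outline of an iterate-and-verify loop; ask_model() is a stand-in for
# whatever LLM API is in use, and the prompt, flags, and flag format are assumptions.
import subprocess

BASE_PROMPT = "Here is the decompilation and handout; write a pwntools solve script."  # placeholder

def ask_model(prompt: str) -> str:
    """Stand-in for a call to an LLM that returns a candidate solve script."""
    raise NotImplementedError

feedback = ""
for attempt in range(10):                                  # arbitrary retry budget
    with open("solve.py", "w") as f:
        f.write(ask_model(BASE_PROMPT + feedback))
    local = subprocess.run(["python3", "solve.py", "--local"], capture_output=True, timeout=120)
    if b"flag{" not in local.stdout:                       # hypothetical flag format
        feedback = "\nLocal run failed:\n" + (local.stdout + local.stderr).decode(errors="replace")
        continue
    remote = subprocess.run(["python3", "solve.py", "--remote"], capture_output=True, timeout=120)
    if b"flag{" in remote.stdout:                          # confirmed against the real target: submit
        break
```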
That being said, I did find that the models can do a surprising amount of real work just by looking at decompilation and with a little nudge in the right direction.
For example, one of our organizers Jordan was able to get one of the models to solve the VM part of the challenge directly if given a sufficient starting point of just the VM data presented nicely and no disassembly.
This is pretty wild since it appears that it just infers all of the functionality based on the really short VM opcode mnemonics and actually gives the correct flag even though the script it provides doesn’t seem to work… 🤔
Building off this idea, further experimentation showed that multiple models were able to write code to solve the challenge some of the time if given enough of a hint in the starting prompt… but also that they could still benefit from additional feedback or a little human intuition.

This appears to be something that is just beyond what the current state of the standalone models can achieve with a single generic hint, but the ability to do some exploration or reinforcement over iterations would yield a much higher probability of success.
What this means for future CTFs#
Clearly competitors are leveraging AI to do things that were not possible in the past, and the potential advantage in CTF competitions is impossible to ignore.
I will resist the urge to speculate other than to say: this changes the game.
It would be difficult to enforce a ban on AI tooling even if people wanted to, so more players will leverage it more effectively, and organizers will have to figure out how to keep the focus on players rather than the tools they bring.
But neither side will stand still. LLMs are not a silver bullet, and the CTF scene has a history of breaking common tooling to remake the game ;)
Seeing a bot win in LiveCTF proves there is a ton of potential in that approach, but I don’t want to paint this as a change without downsides; I can think of several potential negative aspects to this development specific to CTF:
- The AI solving process is uninteresting to watch, and viewers don’t get the educational opportunity of seeing human experts’ problem solving process and struggles
- People who use AI to solve CTF problems for them are at least partially robbing themselves of the primary benefit of CTF: learning and improving their skills
- Powerful AI tooling could become a dividing line between teams that either do or do not have the money/time to develop and test that kind of infrastructure. While not “expensive” compared to some things, conducting the experiments for this post required multiple mid-tier subscriptions
One of the reasons that I love doing LiveCTF is that it gives people the ability to watch and learn from some of the best players in the game.
What we saw in the most recent finals is the potential for a player to win without the community receiving that benefit.
The player who built the bot said it solved even more challenges than we saw, but that they had some bugs they ended up fixing. While we can’t verify this, I believe it… though I also feel like it would have been pretty unsatisfying for all of us if the bot had won more matches.
But the future has room for multiple kinds of competitions, and we’re already seeing games focused on AI competitors emerging.
From a broader perspective, the demand for better security tooling has been constant, and in the end AI is yet another flexible and potentially powerful tool in the toolbox.
One of the less-talked-about benefits of the CTF scene is that it has successfully harnessed the genius of talented enthusiasts to drive a tremendous increase in capabilities and automation, and watching the AI tools that teams and companies have been releasing shows that this theme continues.
CTF is dead, long live CTF#
Seeing what we saw in the LiveCTF finals was exciting. We saw AI accomplishing things we’d never seen before and things we didn’t think were possible at the time.
My small experiment to see what was possible with just one-shot prompts showed current models giving results beyond my expectations.
Yes, there are some things about LiveCTF that made it easier for AI tooling, but they were also more capable than we thought. What we see is that being able to effectively automate and scaffold around LLMs makes them much more useful.
However, all of this raises a lot of questions and concerns… but I know the people in the CTF community will adapt.
But if I could ask one thing? I’d love to hear what makes you excited for the future, despite all of this going on.
DM me or follow and comment on my posts on your social of choice (Mastodon / Twitter / Bluesky). The need for positive community is huge, and yet it can seem harder than ever to find.
Or, pay it forward by making things a little better in your own way, and build with the future in mind.

