Researcher tests LLM pentesting on BookNook
Security researcher Kasra Rahjerdi evaluated the penetration testing capabilities of 14 large language models against BookNook, a deliberately vulnerable React Native app. GPT-5.5 achieved the highest success rate with 7/10 solves, while cheaper models such as DeepSeek V4 Pro succeeded at a fraction of the cost. Several models failed because of late-stage security refusals.
Guardrail design in some mainstream LLMs undermines their usefulness for legitimate penetration testing, while unrestricted or cheaper models are becoming viable, cost-effective security auditing agents.
- GPT-5.5 demonstrated superior strategic focus, bypassing minor API vulnerabilities to directly exploit exposed Firebase configurations.
- High cost and late-stage security refusals (e.g., in Claude Opus and Gemini 3.5 Flash) represent major bottlenecks for developers using LLMs for authorized vulnerability scanning.
- DeepSeek V4 Pro costs far less per solve ($0.62) than Claude Sonnet 4.6 ($45.75), suggesting that the economics of automated vulnerability exploitation favor smaller or open-weights providers.
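To put the cost-per-solve gap in perspective, the short Python sketch below works through the arithmetic using only the two figures quoted above. The function names and the $100 budget are hypothetical, chosen for illustration, and the budget calculation assumes the cost per solve stays constant across runs, which the experiment does not necessarily guarantee.

```python
import math

# Cost-per-solve figures quoted in the summary (USD).
COST_PER_SOLVE_USD = {
    "DeepSeek V4 Pro": 0.62,
    "Claude Sonnet 4.6": 45.75,
}


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more expensive one solve is than another."""
    if cheap <= 0:
        raise ValueError("cost per solve must be positive")
    return expensive / cheap


def solves_within_budget(budget: float, cost_per_solve: float) -> int:
    """Whole solves affordable for a budget, ASSUMING a constant cost per solve."""
    if cost_per_solve <= 0:
        raise ValueError("cost per solve must be positive")
    return math.floor(budget / cost_per_solve)


ratio = cost_ratio(
    COST_PER_SOLVE_USD["Claude Sonnet 4.6"],
    COST_PER_SOLVE_USD["DeepSeek V4 Pro"],
)
print(f"Claude Sonnet 4.6 costs about {ratio:.1f}x more per solve")

# Hypothetical budget, chosen only for illustration.
BUDGET_USD = 100.0
for model, cost in COST_PER_SOLVE_USD.items():
    print(f"{model}: {solves_within_budget(BUDGET_USD, cost)} solves for ${BUDGET_USD:.0f}")
```

On these figures, one Claude Sonnet 4.6 solve costs roughly 74 times as much as one DeepSeek V4 Pro solve. Cost per solve says nothing about how often each model succeeds, so it complements the success rate but does not replace it.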
DISCOVERED
2026-06-04
PUBLISHED
2026-06-04
AUTHOR
jc4p