AI's New Role in Cybersecurity: Can Large Language Models Truly Score Software Flaws?
The Promise and Peril of Automated Vulnerability Assessment
New research explores whether LLMs can move beyond chat to critical security scoring
The cybersecurity landscape is perpetually stretched thin, with human analysts drowning in a sea of software vulnerabilities. Could the same artificial intelligence powering chatbots and content creation become a reliable partner in triaging these threats? According to a new study highlighted by helpnetsecurity.com, large language models (LLMs) show a surprising aptitude for assisting with Common Vulnerability Scoring System (CVSS) assessments, a critical framework for rating the severity of software flaws. The research, conducted by academics from the University of Bologna and Sapienza University of Rome, suggests these AI systems can generate plausible CVSS vectors—the strings of metrics that define a vulnerability's characteristics. But the findings come with a stark and familiar caveat: context is everything, and the AI's performance hinges entirely on the quality and completeness of the information it's given.
Imagine a junior analyst, incredibly fast and knowledgeable, but prone to confident mistakes if given vague instructions. That, in essence, is the current state of LLMs in this domain. The study's core revelation is that these models can function as a powerful, preliminary scoring engine, potentially accelerating the initial stages of vulnerability management. However, their output is not a final verdict but a draft—one that requires expert human review to anchor it in the real-world technical and environmental context that the AI might miss.
Decoding the CVSS: More Than Just a Number
Why automated scoring is a complex linguistic and technical challenge
To understand the significance of this research, one must first grasp what a CVSS assessment entails. It is far more than assigning a single severity number on the 0.0-to-10.0 scale. The CVSS vector is a structured string of abbreviated codes, each capturing a distinct metric; the Base metrics alone evaluate everything from the attack vector (e.g., does an exploit require network access or local privileges?) to the impact on the confidentiality, integrity, and availability of data. Crafting an accurate vector demands a deep, nuanced understanding of a technical vulnerability description.
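To make this concrete: a v3.1 base vector such as `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H` encodes eight metrics, and the base score is a deterministic function of them. The sketch below parses such a vector and computes its score using the metric weights and formulas published in the FIRST.org CVSS v3.1 specification (the study itself is about generating the vector, not this arithmetic):

```python
def parse_vector(vector: str) -> dict:
    """Split 'CVSS:3.1/AV:N/...' into {'AV': 'N', ...}."""
    parts = vector.split("/")
    if not parts[0].startswith("CVSS:3"):
        raise ValueError("not a CVSS v3 vector")
    return dict(p.split(":") for p in parts[1:])

# Numeric weights for the eight base metrics, per the FIRST.org v3.1 spec.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {  # Privileges Required weights depend on whether Scope changes
        "U": {"N": 0.85, "L": 0.62, "H": 0.27},
        "C": {"N": 0.85, "L": 0.68, "H": 0.5},
    },
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}

def roundup(value: float) -> float:
    """Spec-defined round-up to one decimal, avoiding float artifacts."""
    i = round(value * 100000)
    return i / 100000 if i % 10000 == 0 else (i // 10000 + 1) / 10

def base_score(vector: str) -> float:
    m = parse_vector(vector)
    scope_changed = m["S"] == "C"
    # Impact Sub-Score: combined loss of confidentiality/integrity/availability
    iss = 1 - (1 - WEIGHTS["C"][m["C"]]) * (1 - WEIGHTS["I"][m["I"]]) \
            * (1 - WEIGHTS["A"][m["A"]])
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * WEIGHTS["PR"][m["S"]][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10))

print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```

Getting from free-text advisory to that vector string is exactly the comprehension task the researchers put to the LLMs; the numeric score then falls out mechanically.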
The challenge for an LLM, then, is a sophisticated text comprehension and generation task. It must parse dense, technical prose describing a software flaw, interpret the implications of that description against a rigid standardized framework, and then output the correct sequence of codes. The researchers tested this capability by feeding models like GPT-4 and Llama 2 with descriptions from real-world vulnerabilities. The goal was to see if the AI could correctly generate the corresponding Base Score vector, which reflects the intrinsic qualities of a flaw, independent of any specific environment.
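The study's exact prompts are not reproduced in the coverage, but the shape of the task is easy to sketch: frame the advisory text so the model returns only a base vector, then extract a syntactically well-formed vector from whatever free-form text comes back. The function names and prompt wording below are illustrative assumptions, and the model call itself is elided so the sketch stays self-contained:

```python
import re

# Matches a syntactically valid CVSS v3.x base vector (illustrative;
# the researchers' actual prompt and parsing logic are not public here).
BASE_VECTOR_RE = re.compile(
    r"CVSS:3\.[01]/AV:[NALP]/AC:[LH]/PR:[NLH]/UI:[NR]/S:[UC]"
    r"/C:[HLN]/I:[HLN]/A:[HLN]"
)

def build_prompt(description: str) -> str:
    """Frame the scoring task so the model returns only a base vector."""
    return (
        "You are a vulnerability analyst. Read the advisory below and "
        "respond with only the CVSS v3.1 base vector, nothing else.\n\n"
        f"Advisory: {description}"
    )

def extract_vector(model_output: str):
    """Pull the first well-formed base vector out of free-form model text."""
    match = BASE_VECTOR_RE.search(model_output)
    return match.group(0) if match else None

# Any chat-completion API would slot in between these two steps.
reply = ("Based on the advisory, I'd score this "
         "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H.")
print(extract_vector(reply))  # CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
```

Note that a regex like this only checks form, not substance, which is precisely the gap the researchers observed: models readily produce vectors that look right.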
Benchmarking AI Performance on Security Tasks
So, how did the models actually perform? According to the report on helpnetsecurity.com, the results were a mix of promising competence and instructive limitations. The LLMs demonstrated a strong ability to generate syntactically valid CVSS vectors—the output looked correct in form and structure. In many cases, the AI-produced scores were plausible and closely aligned with human-assessed scores, suggesting the models had effectively learned the patterns and relationships within the CVSS framework from their training data.
However, accuracy was inconsistent. The study found that the models' performance was highly sensitive to the wording and detail present in the vulnerability description they were given. Vague or incomplete descriptions led to greater divergence from expert scores. This variability underscores a fundamental truth about current LLMs: they are brilliant pattern matchers and synthesizers, but their "understanding" is bounded by the information provided in their prompt. They lack the external, tacit knowledge a human analyst brings to the table, such as knowing how certain software is typically deployed or the relative value of different asset types within an organization.
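The report does not spell out how divergence from expert scores was measured, but one simple, illustrative way to quantify it is a metric-by-metric comparison of the AI-generated vector against the expert one, which also shows where the model went wrong rather than just how far:

```python
def metric_diff(ai_vector: str, expert_vector: str) -> dict:
    """Report per-metric disagreements between two CVSS base vectors
    as {metric: (ai_value, expert_value)}. Illustrative sketch only."""
    parse = lambda v: dict(p.split(":") for p in v.split("/")[1:])
    ai, expert = parse(ai_vector), parse(expert_vector)
    return {k: (ai.get(k), expert[k]) for k in expert if ai.get(k) != expert[k]}

# Example: a vague advisory led the model to underrate the flaw, assuming
# user interaction was needed and confidentiality impact was only partial.
diff = metric_diff(
    "CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:L/A:N",   # hypothetical AI draft
    "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:L/A:N",   # hypothetical expert score
)
print(diff)  # {'UI': ('R', 'N'), 'C': ('L', 'H')}
```

Aggregated over many vulnerabilities, a diff like this separates models that are merely fluent in CVSS syntax from models that actually read the advisory correctly.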
The Indispensable Ingredient: Contextual Awareness
Why environmental and temporal metrics remain a human domain
This is where the research draws its most critical line. While LLMs showed potential with the Base Score, the more complex aspects of CVSS scoring proved far more challenging. The Environmental and Temporal metrics, which modify the base score to reflect a specific organizational context or the changing threat landscape, were largely beyond reliable automation by the tested models. The Environmental score requires intimate knowledge of an organization's unique security controls, asset criticality, and mitigations. Does the vulnerable system house public marketing data or sensitive patient health records? The AI, without being fed this proprietary context, cannot know.
Similarly, Temporal metrics account for factors like the availability of a proof-of-concept exploit or an official patch. This information is fluid, changing daily in the real world. An LLM's knowledge is static, frozen at its last training cut-off. It cannot inherently know if an exploit was released yesterday or if a patch was made available an hour ago. As the report states, this reliance on perfect, contextual information is the primary limitation. The AI can assist, but it cannot assume responsibility for the final, actionable risk assessment that drives business decisions.
A Tool, Not a Replacement: Envisioning the Human-AI Workflow
The practical implication of this study is a blueprint for collaboration, not replacement. The envisioned role for LLMs in vulnerability management is that of a force multiplier for human experts. In a typical workflow, an analyst faced with a stack of new vulnerability advisories could use an LLM to generate preliminary CVSS base vectors for each. This automated first pass would instantly provide a structured, standardized starting point, saving the analyst from manual data entry and initial categorization.
The human expert's role then evolves to higher-value tasks: validation, contextualization, and decision-making. They would review the AI-generated vectors, correct any misinterpretations based on their deeper technical knowledge, and crucially, apply the Environmental and Temporal scores based on intelligence unique to their company. This hybrid approach leverages the AI's speed and consistency in processing standardized information while retaining the human's strategic judgment, experience, and contextual awareness. It turns the analyst from a scorer into a validator and strategic advisor.
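The workflow described above implies a simple state machine: an advisory enters as an unverified AI draft, an analyst validates or corrects the base vector, and only then is organizational context applied. A minimal sketch of what that record-keeping might look like (all names and states here are hypothetical; the study prescribes no schema):

```python
from dataclasses import dataclass, field
from enum import Enum

class ReviewStatus(Enum):
    DRAFT = "ai_draft"            # straight from the model, unverified
    VALIDATED = "validated"       # base vector confirmed or corrected by analyst
    CONTEXTUALIZED = "finalized"  # environmental/temporal context applied

@dataclass
class TriageRecord:
    """One advisory moving through a hybrid AI/human triage pipeline."""
    advisory_id: str
    ai_vector: str                     # LLM-generated base vector (draft only)
    status: ReviewStatus = ReviewStatus.DRAFT
    analyst_vector: str = ""           # human-confirmed/corrected base vector
    notes: list = field(default_factory=list)

    def validate(self, corrected_vector: str, note: str = "") -> None:
        """Analyst confirms or corrects the AI draft; draft is never final."""
        self.analyst_vector = corrected_vector
        self.status = ReviewStatus.VALIDATED
        if note:
            self.notes.append(note)

rec = TriageRecord("ADV-0001", "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
rec.validate("CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H",
             note="Exploit requires an authenticated session; PR raised to L.")
print(rec.status.value)  # validated
```

The point of the explicit `DRAFT` state is the guardrail the next section argues for: nothing AI-generated is actionable until a human has moved it forward.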
Navigating the Risks of AI Hallucination in Security
Integrating LLMs into a high-stakes field like cybersecurity is not without significant risk. The phenomenon of "hallucination"—where an AI generates confident but incorrect or fabricated information—takes on dangerous dimensions here. A mistakenly low CVSS score generated by an AI could lead a security team to deprioritize a critical patch, potentially leaving a door open for attackers. Conversely, an inflated score could trigger unnecessary panic and divert limited resources from more pressing threats.
The research implicitly calls for robust guardrails. Any production use of LLMs for vulnerability scoring would necessitate strict human-in-the-loop verification protocols. The output must be treated as an unverified suggestion, not an authoritative rating. Furthermore, the prompts fed to the model must be carefully engineered with rich, detailed, and unambiguous vulnerability descriptions to maximize accuracy. Trust in this system must be earned through rigorous, continuous testing and validation against ground-truth expert assessments.
The Future Roadmap for AI in Vulnerability Management
From text analysis to integrated threat intelligence platforms
Looking forward, the research points to several avenues for evolution. The next generation of tools might not rely solely on a general-purpose LLM prompted with text. Instead, we could see specialized security AI models fine-tuned exclusively on vulnerability databases, threat reports, and patch notes. These models could be integrated into Security Orchestration, Automation, and Response (SOAR) platforms, where they automatically ingest vulnerability feeds, generate draft assessments, and even suggest remediation steps based on playbooks.
The ultimate goal is a continuously learning system. Imagine an AI assistant that not only scores a new vulnerability but also cross-references it against the organization's asset inventory, checks for existing mitigations in place, and reviews past incident data for similar flaws. It could then present the human analyst with a consolidated dossier: the draft CVSS score, a list of potentially affected systems, suggested patch procedures, and an assessment of exploit likelihood based on real-time threat intelligence feeds. This moves the technology from a scoring assistant to a comprehensive analysis partner.
A Measured Step Forward for Cybersecurity AI
The study, as covered by helpnetsecurity.com on December 26, 2025, provides a sober and evidence-based perspective on a hyped technology. It tempers the excitement around AI's disruptive potential with a clear-eyed analysis of its current capabilities and boundaries. The conclusion is not that LLMs are ready to take over vulnerability management, but that they are becoming sophisticated enough to be genuinely useful within a carefully designed process.
The message to cybersecurity professionals is one of cautious optimism. The tedious, repetitive task of initial vulnerability parsing and scoring is ripe for augmentation. By offloading this work to an AI, analysts can reclaim time for the complex, contextual, and strategic work that machines cannot do. In an industry facing a chronic talent shortage and an ever-expanding attack surface, that is not a minor improvement. It represents a pragmatic path toward greater resilience, where artificial intelligence handles the volume, and human intelligence ensures the verdict is correct.
#Cybersecurity #AI #LLM #Vulnerability #CVSS

