Research | 4 mins
Can LLMs improve subdomain enumeration?
Subdomain enumeration is a critical step in penetration testing, particularly during the reconnaissance phase. By identifying all the subdomains associated with a target domain, security professionals can locate potential vulnerabilities and evaluate the security of an organization’s infrastructure. The more comprehensive the subdomain enumeration, the more complete the map of an organization’s attack surface.
Hadrian has found that large language models (LLMs) can improve subdomain enumeration, uncovering as many as 10% more subdomains. By uncovering servers, web applications, and other domains that could serve as entry points for exploitation, organizations can take steps to mitigate threats.
What is subdomain enumeration?
As a brief recap, both passive and active techniques are used to map the subdomains belonging to an organization.
Passive Reconnaissance
Adversaries often search through DNS data, which includes records such as registered name servers and subdomain addressing details. MITRE classifies this technique as Search Open Technical Databases: DNS/Passive DNS (T1596.001). Passive DNS data, which is stored in centralized repositories, can be particularly useful.
By querying DNS servers or inspecting passive DNS logs, threat actors can uncover valuable details about an organization's network. This method is favored because it doesn't directly interact with the target’s servers, making it almost impossible to detect.
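As a rough illustration of the passive approach, the sketch below filters a list of record names (for example, lines exported from a passive DNS source) down to the unique subdomains of a target. The function name `subdomains_from_records` and the input format are assumptions made for this example, not part of any particular passive DNS service.

```python
from typing import Iterable, List


def subdomains_from_records(record_names: Iterable[str], domain: str) -> List[str]:
    """Return the unique subdomains of `domain` found in `record_names`.

    Hypothetical helper: assumes each input item is a bare DNS name,
    possibly with a trailing dot or a leading wildcard label.
    """
    domain = domain.lower().rstrip(".")
    suffix = "." + domain
    found = set()
    for name in record_names:
        host = name.strip().lower().rstrip(".")
        if host.startswith("*."):
            host = host[2:]  # drop the wildcard label, keep the rest
        # Keep only names strictly below the target domain.
        if host.endswith(suffix):
            found.add(host)
    return sorted(found)
```

Because the function only processes data that has already been collected, nothing in it touches the target's servers.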
Active Reconnaissance
Active techniques, which MITRE categorizes as Active Scanning: Wordlist Scanning (T1595.003), involve brute-forcing and crawling to discover infrastructure and content. This approach interacts with the target, probing it iteratively for valuable information.
Threat actors, such as Volatile Cedar, use tools like DirBuster and GoBuster to brute-force web directories and DNS subdomains. To aid security researchers, Hadrian recently released SanicDNS, an ultrafast tool that can resolve up to 5 million domain names per second. These tools use generic or custom wordlists to prepend common words to a domain.
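The wordlist approach can be sketched in a few lines of Python. This is a minimal, single-threaded illustration using only the standard library; it does not reflect how DirBuster, GoBuster, or SanicDNS are implemented, and the function names are chosen for this example.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def resolve(hostname: str) -> Optional[str]:
    """Return an IPv4 address for `hostname`, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except (OSError, UnicodeError):
        return None


def candidates(domain: str, wordlist: Iterable[str]) -> List[str]:
    """Prepend each non-empty word in the wordlist to the domain."""
    hosts = []
    for word in wordlist:
        word = word.strip().lower()
        if word:
            hosts.append(f"{word}.{domain}")
    return hosts


def enumerate_subdomains(
    domain: str,
    wordlist: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Dict[str, str]:
    """Map every candidate hostname that resolves to its address."""
    found = {}
    for host in candidates(domain, wordlist):
        address = resolver(host)
        if address is not None:
            found[host] = address
    return found
```

The resolver is passed in as a parameter so that the same loop can be driven by a faster resolver, or by a stub when testing without network access.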
Subdomain enumeration in practice
Many threat actors rely on a combination of passive and active methods. They often start with public passive sources such as DNS records, followed by active interaction with the target’s infrastructure.
Naturally, the wordlists and tools selected are extremely important to the efficacy of active scanning. Wordlists of the most common subdomains are available in varying lengths. The longer the wordlist, the longer enumeration takes to complete, but the greater the chance of finding more subdomains.
Adversaries often use target-specific tools to improve the accuracy of subdomain discovery. For example, when enumerating cloud storage resources, they may use tools like s3recon and GCPBucketBrute.
To further increase the number of subdomains found, permutation techniques are also employed. These extend the list of candidate subdomains and further increase the chance of discovering hidden domains or services.
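One simple form of permutation is to combine labels that are already known with extra words, and to count upward from labels that end in a number. The sketch below shows that idea; the function name, the separators, and the default number range are illustrative assumptions rather than the behavior of any specific tool.

```python
import re
from typing import Iterable, List, Sequence


def permute_labels(
    known_labels: Iterable[str],
    extra_words: Sequence[str],
    separators: Sequence[str] = ("-", ""),
    number_range: int = 3,
) -> List[str]:
    """Generate new candidate labels from labels that are already known."""
    known = list(dict.fromkeys(known_labels))  # de-duplicate, keep order
    seen = set(known)
    results: List[str] = []

    def add(label: str) -> None:
        if label not in seen:
            seen.add(label)
            results.append(label)

    for label in known:
        # Combine the label with each extra word, in both orders.
        for word in extra_words:
            for sep in separators:
                add(f"{label}{sep}{word}")
                add(f"{word}{sep}{label}")
        # If the label ends in a number, try the numbers that follow it,
        # keeping the same zero padding (web01 -> web02, web03, ...).
        match = re.fullmatch(r"(.*?)(\d+)", label)
        if match:
            stem, digits = match.groups()
            start = int(digits)
            for n in range(start + 1, start + 1 + number_range):
                add(f"{stem}{str(n).zfill(len(digits))}")
    return results
```

For example, `permute_labels(["api", "web01"], ["dev"], separators=("-",), number_range=2)` returns `["api-dev", "dev-api", "web01-dev", "dev-web01", "web02", "web03"]`. The candidate list grows quickly with each extra word and separator, which is the same trade-off between scan time and coverage that applies to longer wordlists.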
Can LLMs be used for subdomain enumeration?
Hadrian’s ethical hacking and AI teams have been investigating new techniques for uncovering hidden domains that may have been missed otherwise. One of the areas of research has been the use of LLMs to generate novel wordlists.
It is important to address the elephant in the room when it comes to LLMs, namely that they are overhyped and their applications have been limited.
LLM use cases are not fully understood
LLMs are powerful tools because they are contextual and pattern-driven, but their use cases have so far been limited. Many cybersecurity firms have been using LLMs, such as Microsoft Copilot, to make the complex output of their tools more actionable.
Microsoft even goes so far as to state that “AI for cybersecurity uses AI to analyze and correlate event and cyber threat data across multiple sources, turning it into clear and actionable insights that security professionals use for further investigation, response, and reporting.”
This definition is unfortunately limited; AI can also be used in offensive security use cases, not just reactive response. Beyond that, in subdomain enumeration the LLM does not interact with a human at all! This is just one of the many ways that LLMs and AI can be used to conduct reconnaissance.
Building an LLM for reconnaissance
Hadrian’s researchers have found that LLMs can be trained to recognize patterns in subdomain names, such as common co-occurring words, number iterations, and permutations. LLMs can then generate wordlists of predicted subdomains that can be tested with active scanning.
Initial testing indicates that including an LLM-generated wordlist increases the number of subdomains found by as much as 10%.
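To make the idea of learning from co-occurring words concrete, here is a deliberately simple stand-in. It is not an LLM and not Hadrian’s method: it only counts which words appear in the first and second position of observed two-part labels, then proposes unseen combinations ranked by those counts. The function name and the scoring rule are assumptions made for this illustration.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List


def predict_labels(observed: Iterable[str], limit: int = 10) -> List[str]:
    """Propose unseen two-part labels built from words seen in observed labels.

    A frequency-based stand-in for a learned model, for illustration only.
    """
    observed_set = set(observed)
    firsts: Counter = Counter()
    seconds: Counter = Counter()
    for label in observed_set:
        parts = label.split("-")
        if len(parts) == 2 and all(parts):
            firsts[parts[0]] += 1
            seconds[parts[1]] += 1

    scored = []
    for first, second in product(firsts, seconds):
        candidate = f"{first}-{second}"
        if candidate not in observed_set:
            # Words seen more often in each position score higher.
            scored.append((firsts[first] * seconds[second], candidate))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [candidate for _, candidate in scored[:limit]]
```

Given `["api-dev", "api-staging", "portal-dev"]`, the function proposes `["portal-staging"]`, a label that was never observed but fits the pattern. The predicted labels would then be fed into active scanning like any other wordlist.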
An LLM tool for security researchers
One of the challenges with LLMs is the computing power required to run them. Models that are expensive to run would be impractical for security researchers. To enable organizations to build accurate maps of their attack surface, the LLM must be lightweight.
To follow this project and learn more about Hadrian’s LLM subdomain enumeration tool, follow us on LinkedIn.