Predicting web pages’ age with AI: A new weapon in the fight against cyberattacks

What does the age of your web page say about your vulnerabilities?

The average website is attacked 172 times per day, making websites one of the most commonly used attack vectors. Web pages built on older technologies are more likely to contain vulnerabilities that threat actors can exploit. Manually reviewing all of an organization’s web pages is a time-consuming and never-ending task. As a result, organizations are often unaware that they have old, vulnerable pages that can be easily exploited.

Furthermore, knowing the age of a page informs which vulnerabilities should be tested. Speed is critical, because vulnerabilities in web pages must be identified before a threat actor finds them. The team at Hadrian has developed an AI model that completes 8 hours of work in just 2.16 seconds.

Older web pages are more vulnerable

A 2021 study on website security reports that:

  • 95% of web pages run on outdated software with known vulnerabilities.
  • The average software product is approximately 48 months behind the latest update – that’s four years.

Consequently, during reconnaissance, threat actors are more likely to focus on web pages running on legacy technology. “When assessing a potential target, a good way to figure out if something is vulnerable is its age,” explains Olivier Beg, Cofounder and Head of Hacking at Hadrian. “Pages built on old technologies are prone to vulnerabilities, indicating that more time should be spent on those assets.”

However, manually checking an organization’s site for old pages and technologies is a time-consuming task. The wide range of technologies that a page can use makes the task harder still, and has thwarted previous attempts to automate it.

Klaas Meinke, a machine learning engineer at Hadrian, has developed a machine learning algorithm to identify the age of web pages. While AI is ideal for solving this type of technical problem, engineers must carefully choose and utilize different techniques to create robust and reliable algorithms. 

Building an accurate and reliable AI algorithm

Building and deploying an AI model for practical use is a challenging task. Below is a brief outline of the steps taken to produce Hadrian’s web page age assessment algorithm.

Data collection

Hadrian collected historical web pages from the Internet Archive, sampling from the Majestic Million, a list of the most-connected websites. This yielded a data set of thousands of websites spanning a range of industries, geographies, and time periods. Training and testing on such a broad data set is essential for any algorithm intended for real-world application.
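
The article does not say how snapshots were retrieved or labelled, so the following is only a minimal sketch of one way to do it. It assumes the Internet Archive's public Wayback Availability endpoint is queried for the snapshot closest to a target year, and that the snapshot's timestamp serves as the age label. The function names and the example domain list are hypothetical, and no network request is made here.

```python
import random
from datetime import datetime
from typing import List, Sequence
from urllib.parse import urlencode

# Public Wayback Availability endpoint; using it here is an assumption.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def sample_domains(domains: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random sample of domains (e.g. from a top-sites list)."""
    return random.Random(seed).sample(list(domains), k)


def availability_query(domain: str, year: int) -> str:
    """Build a query for the archived snapshot closest to the middle of `year`."""
    params = {"url": domain, "timestamp": f"{year}0701"}
    return f"{WAYBACK_AVAILABILITY}?{urlencode(params)}"


def snapshot_year(timestamp: str) -> int:
    """Turn a Wayback timestamp (YYYYMMDDhhmmss) into a year label."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S").year


# Hypothetical input list; a real run would read the sampled ranking file.
domains = ["example.com", "example.org", "example.net"]
chosen = sample_domains(domains, k=2)
queries = [availability_query(d, year=2010) for d in chosen]
```

Sampling with a fixed seed keeps the data set reproducible, and spreading the target years across the archive's history is one way to cover many time periods.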

Tokenization of data

To turn the HTML body into a feature vector that can be fed into the neural network, Hadrian’s team first had to train a new tokenizer, because HTML patterns differ from natural language and common tokenizers are therefore unsuitable for the task. Part of the challenge is the use of Unicode in web pages, a character set with more than 149,000 unique symbols, which adds significant complexity.

The solution was to represent every character as a combination of 256 single-byte values (extended ASCII), enabling the page’s HTML to be tokenized quickly. Finally, the HTML bodies were encoded as binary vectors, where each value represents the presence or absence of a token in the HTML body.
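
The two steps described above can be sketched as follows. This is an illustration under stated assumptions, not Hadrian's tokenizer: UTF-8 is assumed as the way characters are reduced to 256 byte values, the five-token vocabulary is hypothetical (a trained tokenizer would learn its own, much larger one), and a simple substring check stands in for real tokenization.

```python
from typing import List, Sequence

# Hypothetical vocabulary; a trained tokenizer would learn thousands of tokens.
VOCAB: List[bytes] = [b"<table", b"<font", b"<center", b"<section", b"<!DOCTYPE html>"]


def to_byte_values(html: str) -> List[int]:
    """Map any Unicode text onto integers in the range 0..255 via UTF-8."""
    return list(html.encode("utf-8"))


def presence_vector(html: str, vocab: Sequence[bytes]) -> List[int]:
    """Return 1 for each vocabulary token found in the page, else 0."""
    data = html.encode("utf-8")
    return [1 if token in data else 0 for token in vocab]


# A small page containing two non-ASCII characters (e-acute and a coffee cup).
page = '<table><font color="red">caf\u00e9 \u2615</font></table>'
byte_values = to_byte_values(page)       # every value fits in one byte
features = presence_vector(page, VOCAB)  # fixed-length binary feature vector
```

Whatever characters a page contains, the input alphabet stays at 256 values, and the feature vector has the same length for every page, which is what a fixed-size network input requires.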

Model architecture

After trying several models, such as logistic regression and support vector regression, we found that a four-layer Multi-Layer Perceptron (MLP) performed best at determining the age of a web page. The model was trained by minimizing the mean squared error, with sample weighting applied to counteract any bias introduced by the data set or the tokenization process. To further improve reliability and prevent overfitting (when a machine learning algorithm overspecializes on its training set), dropout layers with a dropout rate of 20% were used.
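
To make these ingredients concrete, here is a dependency-free sketch of a forward pass through a small MLP with 20% dropout, plus a sample-weighted mean squared error. Several details are assumptions made for illustration: "four-layer" is read as three hidden layers and one output layer, the layer widths are arbitrary, Softplus is used as the hidden activation, and the weight initialization is a simple uniform scheme. Training (backpropagation) is omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights[n_out][n_in], biases)


def softplus(x: float) -> float:
    # log(1 + e^x), written in a numerically stable form.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dense(inputs: Sequence[float], weights, biases) -> List[float]:
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def dropout(values: Sequence[float], rate: float, rng: random.Random,
            training: bool) -> List[float]:
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def init_mlp(sizes: Sequence[int], rng: random.Random) -> List[Layer]:
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers: Sequence[Layer], features: Sequence[float],
            rng: random.Random, training: bool = False, rate: float = 0.2) -> float:
    h = list(features)
    for weights, biases in layers[:-1]:          # hidden layers
        h = [softplus(z) for z in dense(h, weights, biases)]
        h = dropout(h, rate, rng, training)
    weights, biases = layers[-1]                 # linear output: predicted age
    return dense(h, weights, biases)[0]


def weighted_mse(preds, targets, sample_weights) -> float:
    total = sum(w * (p - t) ** 2
                for p, t, w in zip(preds, targets, sample_weights))
    return total / sum(sample_weights)


rng = random.Random(0)
layers = init_mlp([8, 16, 16, 16, 1], rng)       # arbitrary illustrative sizes
prediction = forward(layers, [1, 0, 0, 1, 1, 0, 1, 0], rng, training=False)
```

Dropout is only active when `training=True`; at inference time the network is deterministic. In the weighted loss, under-represented samples (for example, rare page ages) can be given larger weights so that they count for more in the error being minimized.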

The challenge of machine learning

When using ReLU activation functions, we encountered a challenge with "dead neurons" in the model, which caused an unrepresentative distribution of predictions. After logging the outputs of all layers, we found that a large number of neurons were dead: their output was stuck at zero, so they received no gradient and could no longer learn. After performing hyperparameter optimization, we found that a Softplus activation function fixed the dead neuron problem, leading to a twofold increase in accuracy.
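
The difference between the two activations can be seen in their gradients. The sketch below uses the standard definitions of ReLU and Softplus; the `dead_fraction` helper and the logged values are hypothetical, meant only to show how logged layer outputs could be checked for dead neurons.

```python
import math
from typing import Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    # Zero gradient for non-positive inputs: nothing flows back to the weights.
    return 1.0 if x > 0 else 0.0


def softplus(x: float) -> float:
    # log(1 + e^x), written in a numerically stable form.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x: float) -> float:
    # The derivative of Softplus is the logistic sigmoid, which is
    # mathematically positive for every finite input.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dead_fraction(pre_activations: Sequence[Sequence[float]]) -> float:
    """Fraction of neurons whose ReLU output is zero for every logged sample.

    pre_activations[i] holds the logged pre-activation values of neuron i.
    """
    dead = sum(1 for values in pre_activations
               if all(relu(v) == 0.0 for v in values))
    return dead / len(pre_activations)


# Made-up logged values for four neurons over two samples.
logged = [[-1.0, -2.0], [0.5, -1.0], [-0.1, -0.3], [2.0, 1.0]]
share_dead = dead_fraction(logged)
```

A neuron whose pre-activation is negative for every input has a ReLU gradient of exactly zero, so gradient descent can never move it back. With Softplus the gradient is small but non-zero for negative inputs, so such a neuron can still recover during training.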

To learn more about the methodology outlined above you can request a copy of our academic paper on the subject, or attend CSCML 2023, where it will be presented by Klaas Meinke.

Comparing AI to humans

Hadrian’s AI model can process 10,000 web pages in 2.16 seconds. A human would need up to 8 hours, an entire workday, of scanning to achieve the same result, time that is better spent on tasks that demand more skill. The algorithm can be used to streamline the workflow of penetration testers or red teams. It can also inform the decision-making process of Continuous Autonomous Red Teaming solutions, decreasing the impact on your infrastructure by reducing brute forcing.
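
For a sense of scale, the figures quoted above can be turned into per-page rates. This is plain arithmetic on the article's own numbers, taking the upper bound of 8 hours for the human; the variable names are illustrative.

```python
pages = 10_000
model_seconds = 2.16
human_seconds = 8 * 3600  # 8 hours expressed in seconds

model_pages_per_second = pages / model_seconds  # roughly 4,630 pages per second
human_seconds_per_page = human_seconds / pages  # 2.88 seconds per page
speedup = human_seconds / model_seconds         # roughly 13,333 times faster
```

In other words, the comparison assumes a person spends just under three seconds on each page, while the model handles several thousand pages in each second.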

Age is one piece of the puzzle

Age is just one of many factors threat actors consider before attacking a target. Organizations must understand how a threat actor thinks to get a comprehensive picture of cyberattacks.

Hadrian’s Orchestrator AI can provide 24x7x365 risk insights like a real-life threat actor, and continuously prioritize risks based on the likelihood of exploitation and potential impact. Get in touch with Hadrian today to get a scan of all your external-facing assets, misconfigurations, exposed secrets, and vulnerabilities in real time.
