Boris Rozenfeld

Best WAF Solutions in 2024-2025: Real-World Comparison

Introduction

This article describes how we tested the efficacy of several leading WAF solutions in real-world conditions for the second year in a row (see the 2023-2024 blog here). We conducted an in-depth test, sending both malicious and legitimate web requests to each WAF and measuring the results.


Many WAF solutions in the market are based on ModSecurity engines and use the OWASP Core Rule Set (CRS) signatures. This year marks a significant change in these foundational components: ModSecurity reached its End of Life (EOL), and the OWASP CRS received a major update from version 3.x.x to 4.x.x – the first major update in 8 years.


We anticipated that these developments would lead to substantial improvements across all WAFs that depend on the OWASP CRS. In practice, however, while some WAFs showed notable improvements, others performed worse than before, and some have not even updated to the latest CRS version.


The two most important parameters when selecting a Web Application Firewall are:

  • Security Quality (True Positive Rate) – the WAF's ability to correctly identify and block malicious requests. This is crucial in today's threat landscape: a WAF must preemptively block zero-day attacks as well as effectively tackle known attack techniques used by attackers.

  • Detection Quality (False Positive Rate) – the WAF's ability to correctly allow legitimate requests. This is equally critical, because any interference with valid requests can lead to significant business disruption and an increased workload for administrators, as extensive tuning is then required.


A very comprehensive data set was used to test the products:

  • 1,040,242 legitimate HTTP requests from 692 real websites in 14 categories

  • 73,924 malicious payloads from a broad spectrum of commonly experienced attack vectors


True to the spirit of open source, we provide in this GitHub repository all the details of the testing methodology, the testing datasets, and the open-source tools required to validate and reproduce this test, and we welcome the community's feedback.


Products Tested and Results

This year's test was conducted in October-November 2024, and compared the following popular WAF solutions:


  • Microsoft Azure WAF – OWASP CRS 3.2 ruleset

  • AWS WAF – AWS managed ruleset

  • AWS WAF – AWS managed ruleset and F5 Ruleset

  • Cloudflare WAF – Managed and OWASP Core Rulesets

  • F5 NGINX App Protect WAF – Default profile

  • F5 NGINX App Protect WAF – Strict profile

  • NGINX ModSecurity – OWASP CRS 4.3.0 (updated from previously tested version 3.3.4)

  • open-appsec / CloudGuard WAF – Default configuration (High Confidence)

  • open-appsec / CloudGuard WAF – Critical Confidence configuration


This year we also added the following WAF players:

  • Imperva Cloud WAF – Default configuration

  • F5 BIG-IP Advanced WAF – Rapid Deployment Policy

  • Fortinet FortiWeb – Default configuration

  • Google Cloud Armor – Preconfigured ModSecurity rules (Sensitivity level 2)


The two charts below summarize the main findings.

Security Quality and Detection Quality are often a tradeoff within security products. The first chart shows visually how different products perform in each category.

The test reveals significant differences in product performance. For example:

  • Imperva's WAF has a near-perfect Detection Quality (False Positive Rate) of 0.009%, correctly allowing almost all legitimate traffic. Surprisingly, however, it achieves a Security Quality (True Positive Rate) of only 11.97%, missing 88.03% of actual threats and undermining its overall security effectiveness.

  • Azure WAF offers very high Security Quality (97.526%) but has an extremely high False Positive Rate of 54.242%, potentially blocking many legitimate requests and disrupting normal operations. These results suggest that some products may pose security risks due to missed detections, or require substantial tuning to balance security effectiveness with usability.


To provide strong security with minimal administration overhead, the optimal WAF solution should strike a balance, performing well on both Security Quality and Detection Quality. This is aptly represented by a metric called Balanced Accuracy – the arithmetic mean of the True Positive and True Negative rates.

  

For the second year in a row, open-appsec / CloudGuard WAF leads in balanced accuracy, achieving the highest scores with 99.139% in the Critical Profile and 98.966% in the Default Profile. This performance outpaces all other WAF solutions, with NGINX AppProtect in the Default Profile following at 88.046%. 


 

Methodology

Datasets

Each WAF solution was tested against two large data sets: Legitimate and Malicious.


Legitimate Requests Dataset

The Legitimate Requests Dataset is carefully designed to test WAF behaviors in real-world scenarios. This year, we have updated the dataset to include 1,040,242 HTTP requests from 692 real-world websites.


The data set was recorded by browsing real-world websites and performing various operations on each site (for example, signing up, selecting products and placing them in a cart, searching, and uploading files), ensuring that all requests are 100% legitimate.

The selection of real-world websites of different types is essential because WAFs must examine all components of an HTTP request – headers, URL, and body – as well as complex request structures such as large JSON documents or other complex body types. This allows for accurate testing, as these elements are often the source of False Positives in real-world applications, and synthetic datasets frequently overlook them.
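
To illustrate why this real-world complexity matters, here is a hypothetical example (not taken from the dataset) of a perfectly legitimate e-commerce request whose JSON body happens to contain words like "select", "drop-down", and "table" in ordinary prose – exactly the kind of content that overly aggressive signatures can misclassify as an attack. The endpoint URL and payload are invented for illustration only.

```python
# Hypothetical illustration (not part of the published dataset): a legitimate
# e-commerce request whose body resembles patterns that naive signature
# matching may flag, e.g. SQL-like keywords embedded in free text.
import json
import requests  # third-party: pip install requests

# Hypothetical WAF-protected endpoint, used for illustration only.
WAF_ENDPOINT = "https://waf-under-test.example.com/api/reviews"

legitimate_review = {
    "product_id": 18342,
    "title": "Great desk - easy to select and order",
    "body": (
        "The drop-down let me select the 160x80 cm top. "
        "The assembly instructions include a table of screw sizes."
    ),
    "attachments": ["manual_page_12.pdf"],
}

resp = requests.post(
    WAF_ENDPOINT,
    headers={"Content-Type": "application/json", "User-Agent": "Mozilla/5.0"},
    data=json.dumps(legitimate_review),
    timeout=10,
)

# A well-tuned WAF should allow this request; a block here would be
# counted as a False Positive in the test.
print(resp.status_code)
```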


The dataset in this test allows us to challenge the WAF systems by examining their responses to a range of website functionalities. For example, many HTTP requests are traffic to e-commerce websites. These websites often employ more intricate logic, making them an ideal ground for rigorous testing. Features such as user login processes, complex inventory systems equipped with search and filter functionalities, dynamic cart management systems, and comprehensive checkout processes are common in e-commerce sites. The dataset also includes file uploads and many other types of web operations. Incorporating these features allows us to simulate a wide range of scenarios, enabling an exhaustive evaluation of the efficiency and reliability of WAF systems under diverse conditions.


The distribution of site categories in the dataset is as follows:

Category | Websites | Examples
E-Commerce | 404 | eBay, Ikea
Travel | 75 | Booking, Airbnb
Information | 59 | Wikipedia, Daily Mail
Food | 40 | Wolt, Burger King
Search Engines | 24 | DuckDuckGo, Bing
Social media | 17 | Facebook, Instagram
File uploads | 16 | Adobe, Shutterfly
Content creation | 13 | Office, Atlassian
Games | 13 | Roblox, Steam
Videos | 8 | YouTube, Twitch
File downloads | 7 | Google, Dropbox
Applications | 7 | IBM Quantum Simulator, Planner 5D
Streaming | 6 | Spotify, YouTube Music
Technology | 3 | Microsoft, Lenovo
Total | 692 |


The Legitimate Requests Dataset, including all HTTP requests, is available here. We believe it is a valuable resource for both users and the industry, and we plan to continue updating it every year.


Malicious Requests Dataset

The Malicious Requests Dataset includes 73,924 malicious payloads covering a broad spectrum of commonly experienced attack vectors (an illustrative sample payload for each category is shown after the list):


  • SQL Injection

  • Cross-Site Scripting (XSS)

  • XML External Entity (XXE)

  • Path Traversal

  • Command Execution

  • Log4Shell

  • Shellshock


The malicious payloads were sourced from the WAF Payload Collection GitHub page that was assembled by mgm security partners GmbH from Germany. This repository serves as a valuable resource, providing payloads specifically created for testing Web Application Firewall rules.


As explained on the GitHub page, mgm collected the payloads from many sources, such as SecLists, Foospidy's Payloads, PayloadsAllTheThings, Awesome-WAF, WAF Efficacy Framework, WAF community bypasses, GoTestWAF, and Payloadbox, among others. It even includes the Log4Shell Payloads from Ox4Shell and Tishna. Each of these sources offers a wealth of real-world, effective payloads and provides a comprehensive approach to testing WAF solutions.


For an in-depth view of each malicious payload utilized in this study, including specific parameters and corresponding attack types, refer to this link.


Combined, the Legitimate and Malicious Requests datasets present a detailed perspective on how each WAF solution handles traffic in the real world, thereby providing valuable insights into their efficacy and Detection Quality.


Tooling

As before, to ensure transparency and reproducibility, the tool is made available to the public here.


During the initial phase, the tool conducts a dual-layer health check for each WAF: it first validates connectivity to the WAF, and then verifies that the WAF is set to prevention mode, confirming its ability to actively block malicious requests.
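
The sketch below shows, in simplified Python, what such a dual-layer health check can look like. The endpoint URL, the SQL-injection probe, and the expected block status (HTTP 403) are assumptions made for illustration; the actual implementation is in the published test tool.

```python
# Minimal sketch of a dual-layer WAF health check, under the assumptions above.
import requests

def health_check(waf_url: str) -> bool:
    # Layer 1: connectivity - a plain benign request must be reachable
    # and must NOT be blocked.
    benign = requests.get(waf_url, timeout=10)
    if benign.status_code >= 500 or benign.status_code == 403:
        return False

    # Layer 2: prevention mode - an obviously malicious probe must be blocked,
    # proving the WAF is not running in detection-only mode.
    probe = requests.get(
        waf_url,
        params={"q": "' OR 1=1 --"},  # classic SQL-injection probe
        timeout=10,
    )
    return probe.status_code == 403  # assumed block response

if __name__ == "__main__":
    print(health_check("https://waf-under-test.example.com/"))
```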

 

The responses from each request sent by the test tool to the WAFs were systematically logged in a dedicated database for further analysis.


The database used for this test was SQLite. However, the test tool is designed to be flexible: readers can configure it to work with any SQL database of their choice by adjusting the settings in the config.py file.


Following the data collection phase, the performance metrics – including False Positive rates, False Negative rates, and Balanced Accuracy – were calculated by executing SQL queries against the data in the database.
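
As a rough sketch of this step, assuming a hypothetical results table with is_malicious and was_blocked columns (the real schema and queries are defined in the published tool), the metrics can be derived as follows:

```python
# Sketch of the post-run metric calculation over a hypothetical results table
# with columns is_malicious (0/1) and was_blocked (0/1).
import sqlite3

conn = sqlite3.connect("waf_results.db")  # hypothetical database file

def rate(query: str) -> float:
    return conn.execute(query).fetchone()[0] * 100.0

# True Positive Rate: fraction of malicious requests that were blocked.
tpr = rate("SELECT AVG(was_blocked) FROM results WHERE is_malicious = 1")
# True Negative Rate: fraction of legitimate requests that were allowed.
tnr = rate("SELECT AVG(1 - was_blocked) FROM results WHERE is_malicious = 0")

print(f"Security Quality (TPR): {tpr:.3f}%")
print(f"False Positive Rate   : {100 - tnr:.3f}%")
print(f"Balanced Accuracy     : {(tpr + tnr) / 2:.3f}%")
```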


Based on user feedback, we have updated the testing tool this year to improve usability:

  1. Enhanced Error Reporting: If the pre-run test fails, the tool now provides a reproducible curl command instead of throwing an error. This helps users diagnose and resolve issues more efficiently.

  2. Bug Fixes: We've fixed a few bugs to improve the overall user experience.

  3. Shared NGINX Configuration: We are now sharing the NGINX configuration used for creating the simple upstream in our testing environment. More information can be found here.


Comparison Metrics

To quantify the efficacy of each WAF, we use statistical measures. These include Security Quality (also known as Sensitivity or True Positive Rate), Detection Quality (also known as Specificity or True Negative Rate), and Balanced Accuracy.

Security Quality, also known as the true positive rate (TPR), measures the proportion of actual positives correctly identified. In other words, it assesses the WAF's ability to correctly detect and block malicious requests.


Detection Quality, or the true negative rate (TNR), quantifies the proportion of actual negatives correctly identified. This pertains to the WAF's capacity to correctly allow legitimate traffic to pass.


Balanced Accuracy (BA), an especially crucial metric in this study, combines both measures. It is calculated as the arithmetic mean of TPR and TNR, giving equal weight to True Positives and True Negatives irrespective of their proportions in the data sets.
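
As a quick sanity check of this definition, the snippet below computes Balanced Accuracy from a True Positive Rate and a False Positive Rate (with TNR = 100% - FPR) and reproduces the Azure WAF figure reported later in this article.

```python
# Balanced Accuracy = (TPR + TNR) / 2, where TNR = 100% - FPR.
def balanced_accuracy(tpr_pct: float, fpr_pct: float) -> float:
    tnr_pct = 100.0 - fpr_pct
    return (tpr_pct + tnr_pct) / 2.0

# Azure WAF figures from this test: TPR 97.526%, FPR 54.242%.
print(balanced_accuracy(97.526, 54.242))  # -> 71.642, matching the reported result
```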


This choice of metrics is fundamental, as we aim to assess not just the WAF's ability to block malicious traffic but also to allow legitimate traffic. Most importantly, we want to evaluate the overall balance between these two abilities, given that both are critical for a real-world production system.


Thus, we not only examine the number of attacks each WAF can correctly identify and block but also scrutinize the number of legitimate requests it correctly allows. A WAF with high TPR but low TNR might block most attacks but at the cost of blocking too many legitimate requests, leading to poor user experience. Conversely, a WAF with high TNR but low TPR might allow most legitimate requests but fail to block a significant number of attacks, compromising the security of the system. Therefore, the optimal WAF solution should strike a balance, exhibiting high performance on both TPR and TNR, which is aptly represented by Balanced Accuracy.


Test Environment

The test includes both products that are deployed as standard software and products available as SaaS. The standard software products were staged within Amazon Web Services (AWS). The main testing apparatus was an AWS EC2 instance, housed in a separate VPC. This facilitated the simulation of a real-world production environment while keeping the testing isolated from external influences.


To maintain the integrity of the test and ensure that performance wasn't a distorting factor, all embedded WAF solutions were hosted on AWS t3.xlarge instances. These instances are equipped with 4 virtual CPUs and 16GB of RAM, providing ample computational power and memory resources well beyond what is typically required for standard operations. This configuration was deliberately chosen to eliminate any possibility of hardware constraints influencing the outcome of the WAF comparison, thereby ensuring the results accurately reflect the inherent capabilities of each solution.



Findings

In this section, we describe the configuration of each product tested, along with its Security Quality or True Positive Rate score (higher is better), Detection Quality or False Positive Rate score (lower is better), and Balanced Accuracy score (higher is better).

In the test we used the Default Profile settings of each product without any tuning and, when available, also an additional profile that provides the highest Security Quality offered by the product.


Microsoft Azure WAF

Azure WAF is a cloud-based service implementing ModSecurity with OWASP Core RuleSet. The Microsoft Azure WAF was configured with the Default suggested OWASP CRS 3.2 ruleset.


Microsoft Azure introduced a redesigned WAF interface this year. Despite the major release of OWASP CRS 4.x, Azure WAF still uses the 3.2 ruleset, with no updates. Surprisingly, the detection rate decreased slightly compared to last year, even though the rule set remained the same.

The results were as follows:

Security Quality (True Positive Rate): 97.526%

Detection Quality (False Positive Rate): 54.242%

Balanced Accuracy: 71.642%


Cloudflare WAF

Cloudflare WAF is a cloud-based service based on ModSecurity with the OWASP Core Rule Set. Cloudflare provides a Managed Ruleset as well as a full OWASP Core Ruleset. We tested the product with both rulesets activated.


The results were as follows:

Security Quality (True Positive Rate): 69.3%

Detection Quality (False Positive Rate): 0.062%

Balanced Accuracy: 84.619%


AWS WAF

AWS WAF is a cloud-based service implementing ModSecurity. AWS provides a default Managed ruleset and optional additional paid-for rulesets from leading vendors such as F5.


We tested the service in two configurations.


AWS WAF – AWS managed ruleset:

The results were as follows:

Security Quality (True Positive Rate): 79.751%

Detection Quality (False Positive Rate): 5.8%

Balanced Accuracy: 86.976%


AWS WAF – AWS managed ruleset plus F5 Rules:

The results were as follows:

Security Quality (True Positive Rate): 80.372%

Detection Quality (False Positive Rate): 5.879%

Balanced Accuracy: 87.246%


NGINX ModSecurity

ModSecurity has been the most popular open-source WAF engine on the market for 20 years and is signature-based. Many of the solutions tested here use it as a base.

ModSecurity has reached its End-of-Life as of July 2024. This year, the OWASP Core Rule Set (CRS) received a major update, which impacted effectiveness. The updated CRS focused more on security, leading to an improvement in Security Quality but a decrease in Detection Quality, reflecting a shift in balance towards blocking threats over allowing legitimate traffic.

We tested NGINX with ModSecurity and the OWASP Core Rule Set v4.3.0 (the latest version at the time of testing) using its default settings.


The results were as follows:

Security Quality (True Positive Rate): 92.028%

Detection Quality (False Positive Rate): 17.523%

Balanced Accuracy: 87.253%


F5 NGINX AppProtect

NGINX AppProtect WAF is a paid add-on to NGINX Plus and NGINX Plus Ingress based on the traditional F5 signature-based WAF solution. The AppProtect WAF comes with two policies - Default and Strict. The Default policy provides OWASP-Top-10 protection. The Strict policy is recommended by NGINX for “protecting sensitive applications that require more security but with a higher risk of false positives." It includes over 6000 signatures.

We tested the product with both policies.


NGINX AppProtect was first tested with the Default policy.

The results were as follows:

Security Quality (True Positive Rate): 77.9%

Detection Quality (False Positive Rate): 1.808%

Balanced Accuracy: 88.046%


NGINX AppProtect was then tested with the Strict policy.

The results were as follows:

Security Quality (True Positive Rate): 97.849%

Detection Quality (False Positive Rate): 22.084%

Balanced Accuracy: 86.882%


open-appsec / CloudGuard WAF

open-appsec/CloudGuard WAF by Check Point is a machine-learning-based WAF that uses supervised and unsupervised machine-learning models to determine whether traffic is malicious.

We tested the product in two configurations (Default and Critical) using out-of-the-box settings with no learning period.


This year, open-appsec / CloudGuard WAF achieved even better results than last year thanks to the newly released v2.0 of the ML Engine, which features an updated supervised-learning methodology alongside a modified scoring system, specifically designed to provide greater accuracy across both small and large indicator sets.


  • In the Default Profile, Balanced Accuracy improved from 97.32% to 98.966% (False Positive Rate improved from 4.253% to 1.436% and Security Quality improved from 98.895% to 99.368%)

  • In the Critical Profile, Balanced Accuracy improved from 96.8% to 99.139% (the False Positive Rate stayed the same and Security Quality improved from 94.405% to 99.087%)


Default profile – protections are activated when Confidence is set to “High and above”.

The results were as follows:

Security Quality (True Positive Rate): 99.368%

Detection Quality (False Positive Rate): 1.436%

Balanced Accuracy: 98.966%


Critical profile – protections are activated only when Confidence is set to “Critical”.

The results were as follows:

Security Quality (True Positive Rate): 99.087%

Detection Quality (False Positive Rate): 0.81%

Balanced Accuracy: 99.139%

 

Imperva Cloud WAF (New)

Imperva Cloud WAF is one of the well-known WAF solutions in the industry. We tested it in the Default configuration.


The results were as follows:

Security Quality (True Positive Rate): 11.97%

Detection Quality (False Positive Rate): 0.009%

Balanced Accuracy: 55.981%


F5 BIG-IP Advanced WAF (New)

We tested F5 BIG-IP Advanced WAF using the Rapid Deployment Policy configuration. The solution was deployed on a virtual machine within AWS.

The results were as follows:

Security Quality (True Positive Rate): 78.89%

Detection Quality (False Positive Rate): 2.8%

Balanced Accuracy: 88.045%

 

Fortinet FortiWeb (New)

Fortinet FortiWeb offers several deployment options. For this test, we used the SaaS deployment option to evaluate its performance and protection capabilities directly from the cloud environment.

The results were as follows:

Security Quality (True Positive Rate): 68.971%

Detection Quality (False Positive Rate): 20.925%

Balanced Accuracy: 74.023%


Google Cloud Armor (New)

Google Cloud Armor is a cloud-based WAF solution that utilizes ModSecurity with the OWASP Core Rule Set (CRS). For this test, we enabled all available rules with the base sensitivity set to level 2.

The results were as follows:

Security Quality (True Positive Rate): 83.537%

Detection Quality (False Positive Rate): 50.283%

Balanced Accuracy: 66.627%


Analysis

Understanding the metrics used in this comparison is key to interpreting the results accurately. Below we delve deeper into each one of these metrics.


Security Quality (True Positive Rate)

The True Positive Rate gauges the WAF's ability to correctly detect and block malicious requests. A higher TPR is desirable as it suggests a more robust protection against attacks.


In the test, the highest True Positive Rate (TPR) was achieved by open-appsec/CloudGuard WAF, registering a TPR of 99.368% with the out-of-the-box Default profile. It was followed by F5 NGINX App Protect with the Strict policy, with a TPR of 97.849%, and Microsoft Azure WAF, with a TPR of 97.526%.


The remaining WAFs showed variable results, with Imperva demonstrating the lowest Security Quality at 11.97% using default settings. The results of many products are discouraging, as most of the attacks tested are well known and have been published for a long time.


Detection Quality (False Positive Rate)

The False Positive Rate measures the WAF's ability to correctly identify and allow legitimate requests. A lower FPR means the WAF is better at recognizing correct traffic and letting it pass, which is critical to avoid unnecessary business disruptions and administration overhead.


In the test, the lowest False Positive Rate (FPR) was achieved by Imperva, registering an almost perfect 0.009%, with Cloudflare close behind at 0.062%. open-appsec/CloudGuard WAF followed with an FPR of 0.81% using the Critical Profile.


Google Cloud Armor presented a very high False Positive Rate of 50.283%, and Microsoft Azure WAF with the default rule set had the highest False Positive Rate at 54.242%. These products require very heavy tuning, both initially and on an ongoing basis, before they can be used in real-world environments.


Below is a graphical representation of Security Quality and Detection Quality results. The visualization provides an immediate, intuitive understanding of how well each WAF solution achieves the dual goals of blocking malicious requests and allowing legitimate ones. WAF solutions that appear towards the top right of the graph have achieved a strong balance between these two objectives.

Balanced Accuracy

Balanced Accuracy provides a more holistic view of the WAF's performance, considering both Security Quality and Detection Quality. Higher balanced accuracy indicates an optimal WAF solution that balances attack detection and legitimate traffic allowance.


The open-appsec/CloudGuard WAF - Critical Profile led the pack with a BA of 99.139%, closely followed by the open-appsec/CloudGuard WAF - Default Profile, registering a BA of 98.966%. Imperva had the lowest BA, standing at 55.981%.


Summary

For WAF products to deliver on their promise of protecting web applications and APIs, they must excel in both Security Quality and Detection Quality. This article provides a lab-based comparison that shows how leading WAF solutions perform against real-world traffic. It reveals significant differences in product performance, with some products posing a substantial security risk or requiring very heavy tuning, both initially and on an ongoing basis, before they can be used in real-world environments.


We hope that by sharing not just the findings of our test, but also the methodology, datasets, and tooling, we enable users to test the solutions they use and contribute to much-needed transparency in the industry. We were truly surprised by some of the results and welcome readers to reproduce them and ask questions.


Finally, we are proud that open-appsec / CloudGuard WAF proves once again that the best way to implement web application security is by using a combination of supervised and unsupervised machine-learning engines, as they provide not just the best Security Quality and Detection Quality but also the best protection against zero-day attacks. The product was the only one that blocked zero-day attacks such as Log4Shell, Spring4Shell, Text4Shell, and the Claroty WAF bypass.


 

To learn more about how open-appsec works, see this White Paper and the in-depth Video Tutorial. You can also experiment with deployment in the free Playground.

Experiment with open-appsec for Linux, Kubernetes or Kong using a free virtual lab
