Qualys Blog


Finding HTML Injection Vulns, Part I

Text and strings form the building blocks of web apps. Developers and content creators mix text with other media, code, and HTML to produce all kinds of apps for our browsers. However, when developers mix text with code or carelessly place strings inside HTML, they expose the app to one of the most common web-related vulns: HTML injection, a.k.a. Cross-Site Scripting (XSS). One way this happens is when developers use string concatenation to piece together a web page from static HTML and user-supplied data. For example, think of a site’s search function. When you submit a search request, the site responds with something like, "Here are the results for XYZ," and lists whatever might have matched. HTML injection occurs when the search term contains markup instead of simple text, and the app writes it into the page unmodified, like this:

<span>Here are the results for "<script>alert(9)</script>"</span>
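
To make the concatenation problem concrete, here is a minimal Python sketch. The function names `render_results_unsafe` and `render_results_safe` are hypothetical, and the standard library's `html.escape` stands in for whatever output encoding a real app's framework would provide.

```python
import html

def render_results_unsafe(term: str) -> str:
    # String concatenation: the search term is copied into the markup verbatim,
    # so any tags it contains become part of the page.
    return '<span>Here are the results for "' + term + '"</span>'

def render_results_safe(term: str) -> str:
    # html.escape turns &, <, > and (by default) quote characters into entities,
    # so the browser renders the term as text instead of parsing it as markup.
    return '<span>Here are the results for "' + html.escape(term) + '"</span>'

payload = "<script>alert(9)</script>"
print(render_results_unsafe(payload))
print(render_results_safe(payload))
```

The first line printed matches the vulnerable response shown above; in the second, the angle brackets arrive as `&lt;` and `&gt;`, so the term stays inert text.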

Security researchers have discussed and demonstrated HTML injection vulns since the HTML spec’s first draft roughly 20 years ago. The root cause of the problem hasn’t changed much, but the techniques for exploiting it have. Early examples of HTML injection and XSS talked about stealing session IDs from the document.cookie object, or showed how to steal passwords from a login form. Today’s exploits leverage HTML5 features and have been integrated into sophisticated exploit frameworks. HTML injection vulns affect all kinds of sites. They have appeared in search engines, social media, banks, web-based email, even security companies. (Even a book about web security can cause web security problems.) Sometimes the flaws are so obvious that you have to wonder how developers missed the problem in the first place. The vulns seem easy to find, but the process is tedious and time-consuming. In other words, it’s an ideal candidate for automation.


We designed Qualys Web Application Scanning (WAS) to accurately identify several types of HTML injection flaws. The easiest one to start with is called reflected XSS. This happens when the web app receives a request with a test payload and responds with HTML that contains the payload written in a way that changes the document’s structure. The reflected search term we mentioned previously is a prime example of this.

But a good scanner needs to be careful about how it inspects a response. It can’t rely solely on pattern matching to see if bits of a payload like <script> or alert() show up in the HTML. After all, the payload might have been treated as an innocuous string rather than as markup or a JavaScript function call. The following example shows a payload that has landed inside the quoted value for a form field. We can’t be sure if this site is vulnerable without doing some further analysis.

<input type="text" name="email" value="<script>alert(9)</script>">

Our scanner constructs different payloads to test these kinds of scenarios. In the previous example, the web site might not strip or encode quotation marks, which would allow an attacker to close the value attribute early, manipulate the form field’s markup, and inject arbitrary JavaScript. That’s just what we want the scanner to be able to figure out on its own.
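
The form field case can be sketched in Python as follows. This is an illustration under stated assumptions, not how the scanner itself works: `render_field`, `input_attr_names`, and the `onfocus` payload are all hypothetical, and the standard library's `html.parser` is used only to show which attributes a parser would see on the resulting tag.

```python
from html import escape
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Records the attribute names found on the first <input> start tag."""
    def __init__(self):
        super().__init__()
        self.attr_names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input" and not self.attr_names:
            self.attr_names = [name for name, _ in attrs]

def input_attr_names(markup: str) -> list:
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attr_names

def render_field(value: str, encode: bool) -> str:
    # Hypothetical server-side template for the email field.
    if encode:
        value = escape(value, quote=True)  # quotes become &quot; and &#x27;
    return '<input type="text" name="email" value="' + value + '">'

# A payload built for the attribute context: it closes the value early
# and adds an event handler instead of relying on a <script> tag.
payload = '" onfocus="alert(9)" autofocus="'

print(input_attr_names(render_field(payload, encode=False)))
print(input_attr_names(render_field(payload, encode=True)))
```

Without encoding, the parser reports `onfocus` and `autofocus` alongside the original three attributes, which means the payload rewrote the tag. With quotes encoded, only `type`, `name`, and `value` remain and the payload stays inside the value.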

Some web apps try naive (and ultimately futile) countermeasures like looking for "typical" attacks that have words like "<script>" or "alert" within them. Most of the time it’s possible to bypass such weak filters by slightly altering the payload. In fact, just being able to create a nonsense tag like <abcd> indicates the app isn’t handling user-supplied data securely; it’d be a bug that should be fixed. So, the scanner goes through various payloads to see if one might work; it doesn’t just stop at the first failure.
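
As a rough illustration of why such filters fail, here is a hypothetical deny-list in Python. The `naive_filter` function and the sample payloads are assumptions made up for this sketch; real-world filters differ in detail but tend to fail in similar ways.

```python
def naive_filter(value: str) -> str:
    # Hypothetical countermeasure: delete a few "typical" attack strings.
    # The match is case-sensitive and runs only once over the input.
    for word in ("<script>", "</script>", "alert"):
        value = value.replace(word, "")
    return value

candidates = [
    "<script>alert(9)</script>",      # the textbook payload: blocked
    "<ScRiPt>confirm(9)</ScRiPt>",    # mixed case slips past the literal match
    "<scr<script>ipt>",               # removal reassembles a <script> tag
    "<img src=x onerror=prompt(9)>",  # no banned words at all
    "<abcd>",                         # nonsense tag still reaches the page
]

for candidate in candidates:
    print(repr(candidate), "->", repr(naive_filter(candidate)))
```

Only the first payload is neutralized. Every other candidate leaves angle brackets intact in the output, which is the underlying bug regardless of which words were banned.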

The scanner doesn’t just stop at pattern matching, either. Every time it detects a test payload in the response, it also makes sure the payload affects the page in a way that would be exploitable. In other words, the payload actually has to modify the document’s structure in such a way that it would create a new element, download a resource, or execute JavaScript within a browser. Avoiding false positives requires more than looking for a reflected string.
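
The difference between a reflected string and a structural change can be sketched like this. It is a simplified stand-in, not the scanner's actual logic: the marker tag `zqx9`, the `payload_created_element` helper, and the three sample responses are hypothetical, and Python's `html.parser` is far less faithful to browser behavior than the parsing a production scanner would need.

```python
from html.parser import HTMLParser

class ElementFinder(HTMLParser):
    """Sets `found` when the parser emits a start tag with the marker name."""
    def __init__(self, tag_name: str):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.found = True

def payload_created_element(response_html: str, tag_name: str) -> bool:
    finder = ElementFinder(tag_name)
    finder.feed(response_html)
    finder.close()
    return finder.found

MARKER = "zqx9"  # hypothetical nonsense tag sent as the payload <zqx9>

responses = {
    "raw in element content": '<span>Here are the results for "<zqx9>"</span>',
    "entity-encoded": '<span>Here are the results for "&lt;zqx9&gt;"</span>',
    "inside quoted attribute": '<input type="text" name="email" value="<zqx9>">',
}

for label, body in responses.items():
    reflected = MARKER in body                        # plain pattern match
    structural = payload_created_element(body, MARKER)  # new element created?
    print(f"{label}: reflected={reflected} structural={structural}")
```

All three responses contain the marker string, so pattern matching alone would flag each of them. Only the first one produces a new element when parsed; the other two would be false positives for this particular payload and call for a payload built for that context.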

But if this were where we stopped testing for HTML injection, we’d miss a huge number of possible vulns. In the next part we’ll look at how a scanner avoids false negatives by paying attention to detail and using techniques other than just checking single request/response pairs.
