If you need to let users enter HTML for display on your web site or application, you're asking for trouble in the form of a Cross-Site Scripting (XSS) attack. The attack is pretty simple. Imagine that you have a text input field and you display the value that was entered back to the user. For example, an error message might echo the value the user entered. The user enters something like <script>javascript:alert('hello, from a hacker');</script> and suddenly they can run JavaScript in a page served from your site.
The first level of response is to HtmlEncode anything that came from the user before you display it. This is what many of the built-in ASP.NET controls do for you and something that you should always do in ASP.NET MVC.
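As a rough illustration of what encoding buys you, here is a minimal sketch in Python (chosen only for brevity; it is not ASP.NET code). The function name `render_error` is made up for this example, and `html.escape` from the Python standard library plays the role that HtmlEncode plays in .NET.

```python
import html


def render_error(user_value: str) -> str:
    # Encode the untrusted value right before it is written into the page,
    # so the browser shows it as text instead of interpreting it as markup.
    return "<p>Invalid value: " + html.escape(user_value, quote=True) + "</p>"


payload = "<script>javascript:alert('hello, from a hacker');</script>"
print(render_error(payload))
```

The angle brackets and quotes in the payload come out as entities, so the script tag never reaches the browser as a tag.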
But when you encode everything the user enters, you can't let the user create a bold tag, <b>Name</b>, which is perfectly safe. How do you allow safe HTML to be posted without encoding while making sure everything else is safe?
One approach taken by most bulletin boards is to adopt a special language, like BBCode, which is a limited grammar of acceptable tags. BBCode is used by our DotNetBB software right now and looks something like this:
[b]This is bold[/b]
[url=http://www.bvsoftware.com]Link to BV Software[/url]
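To show the idea behind this kind of markup, here is a small Python sketch of a BBCode-style renderer. It is not DotNetBB's code: the function name `bbcode_to_html`, the two supported tags, and the rule that links must start with http:// or https:// are all assumptions made for the example. Everything is encoded first, and only the known tags are then turned into HTML.

```python
import html
import re

# Only two tags are handled in this sketch; a real BBCode dialect has more.
_BOLD = re.compile(r"\[b\](.*?)\[/b\]", re.DOTALL)
# Assumption: links must be http or https and contain no whitespace or "]".
_URL = re.compile(r"\[url=(https?://[^\]\s]+)\](.*?)\[/url\]", re.DOTALL)


def bbcode_to_html(text: str) -> str:
    # Encode first, so any raw HTML the user typed is rendered as text.
    encoded = html.escape(text, quote=True)
    # Then translate the known BBCode tags into their HTML equivalents.
    encoded = _BOLD.sub(r"<b>\1</b>", encoded)
    encoded = _URL.sub(r'<a href="\1">\2</a>', encoded)
    return encoded


print(bbcode_to_html("[b]This is bold[/b]"))
print(bbcode_to_html("[url=http://www.bvsoftware.com]Link to BV Software[/url]"))
```

Because the square-bracket tags are the only way to produce markup, anything written with angle brackets stays encoded.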
This type of language works well, but it isn't good enough in my opinion. Why should users need to learn a new language instead of HTML, which is already a standard? Furthermore, if I'm a designer working on a nicely formatted post in DreamWeaver, I have to convert it from HTML to BBCode before posting.
So, I decided to build an HTML sanitizer that will allow a safe subset of code to be posted while encoding everything else. I'm not the first person to try this, and I looked over a lot of community code for ideas. What I found was that most of the scrubbers used regular expressions to match potentially dangerous scripts and then tried to remove or encode them. Here's one from Jeff Atwood of Coding Horror and here's one by Rob Conery of SubSonic fame.
I have a love/hate relationship with regular expressions. They can be huge time savers and can present a simple solution to complex problems. They can also end up many lines long and so unreadable that you never have any hope of debugging the code. The regular expressions I found in the other scrubber code were just that: long, complicated and not error proof at all.
Here are some examples that you can use to test for XSS attacks. When you see the huge variety of attacks possible, you'll realize that a simple regular expression isn't going to cut it. If you look at Rob's code, you'll notice that he took a different approach. His choice was to "white list" the safe tags and encode everything else.
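To make the problem with pattern matching concrete, here is a deliberately naive scrubber sketched in Python. It is not taken from any of the scrubbers mentioned above; the pattern and the name `naive_scrub` are invented for this illustration. It strips script blocks, which handles the obvious case and misses two others.

```python
import re

# Deliberately naive blacklist: remove <script>...</script> blocks and nothing else.
SCRIPT_BLOCK = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_scrub(text: str) -> str:
    return SCRIPT_BLOCK.sub("", text)


# The obvious attack is removed.
print(naive_scrub("<script>alert(1)</script>hello"))
# An event handler attribute contains no script tag, so it passes untouched.
print(naive_scrub('<img src="x" onerror="alert(1)" />'))
# Removing the inner block stitches the outer pieces into a new script tag.
print(naive_scrub("<scr<script></script>ipt>alert(1)</script>"))
```

Each bypass can be patched with another pattern, but the list of patterns never ends, which is the argument for listing what is allowed instead of what is forbidden.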
I also took the "white list" approach but after reviewing possible attacks decided that I needed an extra step. Instead of just allowing safe tags through, I would parse the tags and rewrite the safe ones with a subset of tag attributes that are also safe or easy to check.
Step 1: Tokenize the text to find all of the HTML tags. This was a simple matter of splitting the string on the "<" character, since every opening and closing tag has to start with it.
Step 2: Walk through the tokens and do a basic parser routine. When we're not parsing a tag, HtmlEncode everything else. When we are parsing a tag, get the start tag.
Step 3: When parsing the start tag, check to see if it's an allowed tag. If not, HtmlEncode it. If it is, check to see if it's a self-closing tag like <br/>. If it's self-closing, rewrite it in a safe manner. If not, keep reading tokens until you find the end.
Step 4: If you haven't found a valid tag that is closed just HtmlEncode everything you have and dump it.
Step 5: Rewriting tags. When you do find a valid tag (self-closed or not), parse out the name and attributes from the tag. Look over the attributes in a name/value list and only write back out the attributes you've selected as safe.
Step 6: Some attributes, like SRC and HREF, require extra attention. They are vulnerable to javascript: and vbscript: URLs in the attribute value.
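The steps above can be sketched roughly as follows. This is a simplified Python illustration, not the actual implementation, which isn't shown in this post. The tag list is the one given further down in this post; the per-tag attribute list, the rule that URLs must start with http:// or https://, and all of the names (`sanitize`, `ALLOWED_ATTRS` and so on) are assumptions for the example. One deliberate difference from step 4: an allowed tag that is never closed gets a closing tag appended at the end instead of being encoded.

```python
import html
import re

ALLOWED_TAGS = {"b", "i", "u", "em", "strong", "h1", "h2", "h3", "h4", "h5", "h6",
                "div", "span", "p", "blockquote", "ol", "ul", "li", "address",
                "strike", "a", "img", "sup", "sub", "hr"}
VOID_TAGS = {"img", "hr"}  # must be written self-closed, e.g. <hr/>
# Hypothetical attribute allow-list; the post does not say which ones it keeps.
ALLOWED_ATTRS = {"a": {"href", "title"}, "img": {"src", "alt"}}
URL_ATTRS = {"href", "src"}

# Strict shapes only: lower-case names, double-quoted attribute values.
OPEN_TAG = re.compile(r'([a-z][a-z0-9]*)((?:\s+[a-z]+="[^"]*")*)\s*(/?)')
CLOSE_TAG = re.compile(r'/([a-z][a-z0-9]*)')
ATTR = re.compile(r'([a-z]+)="([^"]*)"')


def _safe_attrs(tag, raw_attrs):
    # Steps 5 and 6: keep only allow-listed attributes and check URL values.
    out = []
    for name, raw in ATTR.findall(raw_attrs):
        if name not in ALLOWED_ATTRS.get(tag, set()):
            continue
        value = html.unescape(raw)  # decode entities before inspecting the value
        if name in URL_ATTRS and not value.lower().startswith(("http://", "https://")):
            continue  # drops javascript:, vbscript: and any other scheme
        out.append(' %s="%s"' % (name, html.escape(value, quote=True)))
    return "".join(out)


def sanitize(text):
    pieces = text.split("<")              # step 1: tokenize on "<"
    result = [html.escape(pieces[0])]     # text before the first "<"
    open_tags = []                        # allowed tags still waiting for a close
    for token in pieces[1:]:              # step 2: walk the tokens
        body, gt, rest = token.partition(">")
        rewritten = None
        if gt:
            opening = OPEN_TAG.fullmatch(body)
            closing = CLOSE_TAG.fullmatch(body)
            if opening and opening.group(1) in ALLOWED_TAGS:   # step 3
                tag, attrs, slash = opening.groups()
                if tag in VOID_TAGS and slash:
                    rewritten = "<%s%s />" % (tag, _safe_attrs(tag, attrs))
                elif tag not in VOID_TAGS and not slash:
                    open_tags.append(tag)
                    rewritten = "<%s%s>" % (tag, _safe_attrs(tag, attrs))
            elif closing and open_tags and open_tags[-1] == closing.group(1):
                open_tags.pop()
                rewritten = "</%s>" % closing.group(1)
        if rewritten is None:
            result.append(html.escape("<" + token))   # not a valid tag: encode it
        else:
            result.append(rewritten + html.escape(rest))
    # Simplification: close anything left open rather than encoding it (see above).
    result.extend("</%s>" % tag for tag in reversed(open_tags))
    return "".join(result)


print(sanitize('<b>Name</b> <a href="javascript:alert(1)" onclick="x()">hi</a>'))
print(sanitize("<script>alert('hello, from a hacker');</script>"))
```

The important design choice is that tags are never passed through as typed. Every tag that survives is written out again from its parsed name and approved attributes, so anything the parser did not understand cannot reach the page as markup.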
I've taken all of the examples from an XSS sample site and my code has safely taken care of everything I can throw at it. I don't want to get cocky about it because someone could find an exploit tomorrow.
Some other things to note:
I had to choose a subset of HTML that I thought was safe and appropriate for users to enter: b, i, u, em, strong, h1, h2, h3, h4, h5, h6, div, span, p, blockquote, ol, ul, li, address, strike, a, img, sup, sub and hr.
I had to be VERY strict on the formatting I allowed for HTML. All tags must be lower case. All tags must be closed. All tag attributes must be wrapped in double quotes, and so on. This strict XHTML formatting makes it easier to parse out the safe tags and is generally a good idea. Less sophisticated users may not understand why the HTML of their Word doc didn't come out exactly as they expected, but I'm okay with that.
It's not a simple problem but I think my solution is working well so far. This code will be included in the new version of DotNetBB and some other projects I'm working on.
Is this something that you need for your projects? Should we consider wrapping it into a nice library at an attractive price?