Converting Strings into Hypertext Markup Language (HTML)


A commonly used web attack is Cross-Site Scripting (XSS). For example, a user enters malicious data, such as JavaScript code, into a web form; at some point the web page outputs this information verbatim, without proper escaping. Standard examples are web guest books and discussion forums, where people enter text for others to see.

Escaping Strings for HTML

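  // User-supplied input containing an XSS attempt and a non-ASCII character
  $input = '<script>alert("I have a bad Föhnwelle, ' .
           'therefore I crack websites.");</script>';
  // Escapes only characters with a special meaning in HTML, such as < and >
  echo htmlspecialchars($input) . '<br />';
  // Additionally converts every character that has a named HTML entity
  echo htmlentities($input);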

Here, it is important to remove HTML markup or render it harmless. To make a long story short: It is almost impossible to catch every attempt to inject JavaScript into data. Injection is not always done using the <script> tag; it also works through other HTML elements, such as <img onabort="badCode()" />. Therefore, in most cases, all HTML must be removed or escaped.

The easiest way to do so is to call htmlspecialchars(); this escapes the characters that have a special meaning in HTML, replacing < and > with &lt; and &gt;, & with &amp;, and double quotes with &quot;. Another option is to call htmlentities(), which replaces every character that has a named HTML entity with that entity. The preceding code shows the difference between the two functions: The German ö (o umlaut) is not converted by htmlspecialchars(); htmlentities(), however, replaces it with its entity &ouml;.

With htmlspecialchars() and htmlentities(), the browser displays exactly what the user entered. So if the user entered HTML markup, this very markup is shown as text rather than being rendered. In other words, htmlspecialchars() and htmlentities() please the browser, but might not please the user.
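To make the difference concrete, the following minimal sketch compares both functions on a short string. It passes the flags and the character set explicitly (ENT_QUOTES and 'UTF-8'); this is an assumption made here so that the result does not depend on the defaults of a particular PHP version. The sample input and the "\u{00F6}" escape (which requires PHP 7 or later) are likewise illustrative.

```php
<?php
// Illustrative input: markup, an ampersand, double quotes, and an o umlaut.
// "\u{00F6}" (PHP 7+) yields a UTF-8 o umlaut regardless of the file's encoding.
$input = "<b>F\u{00F6}hn & \"Welle\"</b>";

// Flags and charset are given explicitly (an assumption for this sketch)
// so the result does not depend on version-specific defaults.
$special  = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // &lt;b&gt;Föhn &amp; &quot;Welle&quot;&lt;/b&gt;
echo $entities, "\n";  // &lt;b&gt;F&ouml;hn &amp; &quot;Welle&quot;&lt;/b&gt;
```

Both calls escape the markup, the ampersand, and the quotes in the same way; only the umlaut is treated differently.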

If, however, you want to prepare strings for use within URLs, you have to use urlencode() to properly encode special characters, such as the space character, that must not appear unencoded in URLs.
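As a brief sketch of this, the following code builds a query string from a search term. The script name search.php, the parameter name q, and the search term are placeholders chosen for illustration; the term is assumed to be UTF-8 encoded.

```php
<?php
// Hypothetical search term containing spaces, an ampersand, and an umlaut.
$term = "bad F\u{00F6}hnwelle & more";

// urlencode() encodes the space as "+" and other reserved or non-ASCII
// bytes as %XX, so the value cannot break out of the query string.
// "search.php" and the parameter name "q" are placeholders.
$url = 'search.php?q=' . urlencode($term);

echo $url, "\n"; // search.php?q=bad+F%C3%B6hnwelle+%26+more

// When the URL is written into an HTML page, it still has to be
// escaped for HTML, like any other output.
echo '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">Search</a>', "\n";
```

Note that URL encoding and HTML escaping solve different problems, which is why the link in the last line applies both.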

The function strip_tags(), by contrast, completely gets rid of all HTML elements. If you want to keep some elements (for example, limited formatting with <b>, <i>, and <br /> tags), you provide a list of allowed tags in the second parameter of strip_tags(). The following script shows this; the accompanying figure depicts its output. As you can see, all unwanted HTML tags have been removed; their contents, however, are still there.

Removing All HTML Tags

  $input = 'My parents <i>hate</i> me, <br />' .
    'therefore I <b>crack</b> websites. ' .
    '<script>alert("Nice try!");</script>' .
    '<img src="explicit.jpg" />';
  echo strip_tags($input, '<b><br><i>');

Some HTML tags were stripped, but not all.
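
One caveat is worth illustrating: strip_tags() leaves the attributes of allowed tags untouched, so an event handler on a permitted <b> element survives. The sketch below uses a made-up input to show this and, as one possible approach rather than the only one, falls back to stripping every tag and escaping the rest when the input is not trusted.

```php
<?php
// Made-up input: an allowed tag that carries an event-handler attribute.
$input = '<b onmouseover="badCode()">bold</b><script>alert(1);</script>';

// The <b> tag is kept together with its attribute; only the <script>
// tags are removed, while their text content stays.
$kept = strip_tags($input, '<b>');
echo $kept, "\n"; // <b onmouseover="badCode()">bold</b>alert(1);

// A stricter variant: allow no tags at all, then escape what is left.
$plain = htmlspecialchars(strip_tags($input), ENT_QUOTES, 'UTF-8');
echo $plain, "\n"; // boldalert(1);
```

If formatting from untrusted users has to be preserved, a whitelist of tags alone is therefore not sufficient; the attributes need to be checked as well.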