Converting Strings into HTML

A common web attack is Cross-Site Scripting (XSS). For example, a user enters malicious data, such as JavaScript code, into a web form; the web page later outputs this data verbatim, without proper escaping. Standard targets are a blog's comments section or discussion forums.

Escaping Strings for HTML

 $input = '<script>alert("I have a bad Föhnwelle...");</script>';

 echo htmlspecialchars($input);
 // Output: &lt;script&gt;alert(&quot;I have a bad Föhnwelle...&quot;);&lt;/script&gt;

 echo htmlentities($input);
 // Output: &lt;script&gt;alert(&quot;I have a bad F&ouml;hnwelle...&quot;);&lt;/script&gt;

Here, it is important to disarm any HTML markup in the input. To make a long story short: It is almost impossible to catch every attempt to inject JavaScript into data. Injection does not always use the <script> tag; it can also hide in other HTML elements, such as <img onabort="badCode()" />. Therefore, in most cases, all HTML must be escaped or removed, not just selected tags.

The easiest way to do so is to call htmlspecialchars(), which escapes the characters that have a special meaning in HTML: < and > become &lt; and &gt;, & becomes &amp;, and double quotation marks become &quot;. Another option is to call htmlentities(), which additionally replaces every character that has a named HTML entity. The preceding code shows the difference between the two functions: the German ö (o umlaut) is not converted by htmlspecialchars(), whereas htmlentities() replaces it with its entity &ouml;.
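To make the escaping step concrete, here is a minimal sketch that wraps htmlspecialchars() in a small helper. The helper name e() and the sample comment are assumptions for illustration, not part of the original listing; the flags are passed explicitly so the behavior does not depend on the defaults of the PHP version in use.

```php
<?php
// Hypothetical helper (name chosen for this example): escapes a string for
// output in HTML text or inside a quoted attribute value.
function e(string $value): string
{
    // ENT_QUOTES escapes single quotes as well as double quotes;
    // ENT_SUBSTITUTE replaces invalid byte sequences instead of
    // returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$comment = '<img src="x.jpg" onabort="badCode()" />';

echo '<p>', e($comment), '</p>', PHP_EOL;
// The markup reaches the browser as inert text:
// <p>&lt;img src=&quot;x.jpg&quot; onabort=&quot;badCode()&quot; /&gt;</p>
```

Because the whole string is escaped, the event handler attribute never becomes part of the page's markup.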

With htmlspecialchars() and htmlentities(), the browser displays exactly what the user entered. So if the user entered HTML markup, this very markup is shown as text. In other words, htmlspecialchars() and htmlentities() please the browser, but might not please the user.

If, however, you want to prepare strings for use within URLs, you have to call urlencode() to properly encode special characters, such as the space character, that must not appear unencoded in URLs.
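A short sketch of URL encoding follows. The script name search.php and the parameter q are made up for the example; the point is only how urlencode() treats a query-string value.

```php
<?php
$query = 'Föhnwelle & more';

// urlencode() is meant for query-string values: spaces become "+",
// other reserved or non-ASCII bytes become %XX sequences.
$url = 'search.php?q=' . urlencode($query);

// The finished URL still has to be HTML-escaped when it is written
// into an attribute.
echo '<a href="', htmlspecialchars($url), '">Search</a>', PHP_EOL;
```

If spaces should be encoded as %20 rather than +, rawurlencode() is the related function to look at.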

Removing All HTML Tags

 strip_tags(string $string,
           array|string|null $allowed_tags = null
          ): string

The function strip_tags() removes all HTML elements from a string. If you want to keep some elements (for example, limited formatting with <b>, <i>, and <br> tags), you provide a list of allowed tags in the second parameter of strip_tags().
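The signature above accepts the allowed tags either as a string or as an array. The following sketch, with a made-up sample string, shows both forms side by side; the array form is available as of PHP 7.4.

```php
<?php
$text = 'Some <b>bold</b>, <i>italic</i> and <u>underlined</u> text.<br>';

// Allowed tags as a string of tag names in angle brackets.
$fromString = strip_tags($text, '<b><i><br>');

// As of PHP 7.4, the same list can be passed as an array of bare tag names.
$fromArray = strip_tags($text, ['b', 'i', 'br']);

echo $fromString, PHP_EOL;
// Some <b>bold</b>, <i>italic</i> and underlined text.<br>
```

Both calls give the same result here: the <u> tags are dropped, while their content stays in place.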

The following script shows this; the figure depicts its output. As you can see, all unwanted HTML tags have been removed; however, their contents are still there.

 $text = 'A commonly used web <i>attack</i> is called<br>
Cross-Site Scripting <b>XSS</b>.<br>
For example:<br>
<script>alert("Nice try!");</script>
<img src="explicit.jpg">';

echo strip_tags($text, '<br><i><b>');

Figure: PHP strip_tags example: Some HTML tags were stripped, but not all.
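One caveat is worth checking when tags are allowed: strip_tags() leaves the attributes of allowed tags untouched. The sketch below uses a made-up comment string and the placeholder function name badCode() to illustrate this.

```php
<?php
$comment = '<b onmouseover="badCode()">Hover me</b><script>badCode()</script>';

// The <script> tags are removed (their content stays as plain text), but
// the allowed <b> tag keeps every attribute it came with.
echo strip_tags($comment, '<b>'), PHP_EOL;
// <b onmouseover="badCode()">Hover me</b>badCode()
```

For untrusted input, escaping everything with htmlspecialchars() is therefore usually the safer default; allowing tags calls for additional attribute filtering.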
