What is unbescape?

unbescape is a Java library aimed at performing fully-featured and high-performance escape and unescape operations for:

  • HTML (HTML5 and HTML 4)
  • XML (XML 1.0 and XML 1.1)
  • JavaScript
  • JSON
  • URI / URL (both paths and query parameters)
  • CSS (both identifiers and string literals)
  • CSV (Comma-Separated Values)
  • Java literals
  • Java .properties files (both keys and values)

Its goals are:

  • To be easy to use. Few lines of code needed. No additional dependencies.
  • To be fast. Faster and lighter than most other options available in Java.
  • To be versatile. Provides different escaping types and levels in order to better adapt to different scenarios and contexts.
  • To be feature-complete. Includes full HTML5 support, careful implementation of the JavaScript, JSON, Java, etc. specifications, advanced format tweaks...

What does it look like? Why should I use it?

Performing an unescape operation on HTML code is as easy as:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

Note this includes support for unescaping HTML5's more than 2,300 entities or named character references (like &tcedil; = 'ţ'). And also decimal/hexadecimal references (like &#225; or &#xE1; = 'á'). And the full Unicode character set, including codepoints > U+FFFF (like &#x20000; = '𠀀'). And even performing a good amount of magic included in the HTML5 specification (like allowing some references not to end in a semicolon, like &aacute = 'á'). So this text:

In order to have full HTML support, escape tools should provide
complete support for characters like the Czech letter '&Tcaron;'
or the Chinese symbol '&#x200AD;'. And implement some legacy
compatibility rules like displaying an “A with an acute
accent” like '&aacute' or an EURO symbol like '&#x80;'.

...will be unescaped as:

In order to have full HTML support, escape tools should provide
complete support for characters like the Czech letter 'Ť' or the
Chinese symbol '𠂭'. And implement some legacy-compatibility rules
like displaying an “A with an acute accent” like 'á' or an EURO
symbol like '€'.

(If you don't believe the code above is valid HTML5, just click View Source on your HTML5-enabled browser and look for "HTML5-SAMPLE" ;))

Escape operations are also extremely powerful, allowing you to (optionally) determine the type and level of escape you want. For example, this:

final String escapedText =
        HtmlEscape.escapeHtml(
                unescapedText,
                HtmlEscapeType.HTML4_NAMED_REFERENCES_DEFAULT_TO_DECIMAL,
                HtmlEscapeLevel.LEVEL_1_ONLY_MARKUP_SIGNIFICANT);

...will perform an HTML escape operation using the HTML 4 set of entities (named character references), defaulting to decimal escapes when needed. But because the established level is 1, only the markup-significant characters will be escaped: <, >, &, " and ' (there is no &apos; in HTML 4, only in HTML5).

Easy, preconfigured methods are already provided; in fact, the above call is equivalent to:

final String escapedText = HtmlEscape.escapeHtml4Xml(unescapedText);

This kind of escape is called HTML4-XML because it mimics the behaviour of the escapeXml attribute in JSP <c:out /> tags.
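
To make the level 1 behaviour concrete, here is a minimal, library-free sketch of what escaping only the markup-significant characters amounts to. The class and method names are hypothetical, and it assumes the apostrophe falls back to the decimal reference &#39; (as the "default to decimal" type suggests); it is not unbescape's actual implementation.

```java
final class HtmlLevel1Sketch {

    // Hypothetical helper: escapes only the five markup-significant characters.
    static String escapeMarkupSignificant(final String text) {
        final StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            final char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // assumption: no &apos; in HTML 4, so use decimal
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        // Prints: &lt;a href=&#39;x&#39;&gt;Tom &amp; &quot;Jerry&quot;&lt;/a&gt;
        System.out.println(escapeMarkupSignificant("<a href='x'>Tom & \"Jerry\"</a>"));
    }
}
```

Everything else, including non-ASCII characters, passes through untouched at this level.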

Want to ensure your escaped text is ASCII-only? Easy: use level 2 and the results will be entirely composed of characters in the U+0000 to U+007F range (bye-bye, complex encoding issues!). Let's try this time with HTML5:

final String escapedText =
        HtmlEscape.escapeHtml(
                unescapedText,
                HtmlEscapeType.HTML5_NAMED_REFERENCES_DEFAULT_TO_DECIMAL,
                HtmlEscapeLevel.LEVEL_2_ALL_NON_ASCII_PLUS_MARKUP_SIGNIFICANT);

Which in fact is the same as the preconfigured:

final String escapedText = HtmlEscape.escapeHtml5(unescapedText);

But of course, unbescape is much more than HTML: XML, JavaScript, JSON, CSS... Have a look at the feature list below to learn more.

The features

High performance
  • No unneeded String or char[] objects are created, and specific optimizations are applied in order to provide maximum performance and reduce Garbage Collector latency (e.g. if a String has the same content after escaping/unescaping, exactly the same String object is returned, no copy is made).
  • See (and execute) the benchmark.sh script in the unbescape-tests repository for specific figures.
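
The "same String object is returned" optimization mentioned above can be sketched as follows. This is only an illustration of the general pattern (allocate the output buffer lazily, on the first character that needs escaping), written with hypothetical names and a deliberately tiny escape set; it is not taken from unbescape's source.

```java
final class LazyEscapeSketch {

    // Escapes '&' and '<' only; returns the very same instance when nothing needs escaping.
    static String escape(final String text) {
        StringBuilder out = null; // created only when the first escapable character is found
        for (int i = 0; i < text.length(); i++) {
            final char c = text.charAt(i);
            final String replacement = (c == '&') ? "&amp;" : (c == '<') ? "&lt;" : null;
            if (replacement == null) {
                if (out != null) {
                    out.append(c);
                }
                continue;
            }
            if (out == null) {
                out = new StringBuilder(text.length() + 16);
                out.append(text, 0, i); // copy the untouched prefix once
            }
            out.append(replacement);
        }
        return (out == null) ? text : out.toString();
    }

    public static void main(final String[] args) {
        final String plain = "nothing to escape here";
        System.out.println(escape(plain) == plain); // true: no copy was made
        System.out.println(escape("a<b&c"));        // a&lt;b&amp;c
    }
}
```
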
Highly configurable
  • Most escaped languages allow specifying the type of escape to be performed: based on literals, on decimal numbers, hexadecimal, octal, etc.
  • Most escaped languages allow specifying the level of escape to be performed: only escape the basic set, escape all non-ASCII characters, escape all non-alphanumeric, etc.
  • Provides sensible defaults and pre-configured, easy-to-use methods.
Documented API
  • Includes full JavaDoc API documentation for all public classes, explaining each escape and unescape operation in detail.
  • See the JavaDoc API Documentation.
Unicode
  • All escape and unescape operations support the whole Unicode character set: U+0000 to U+10FFFF, including characters not representable by only one char in Java (>U+FFFF).
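
Why does the > U+FFFF case deserve its own mention? Because such characters take two chars (a surrogate pair) in a Java String, so an escaper has to walk the text by code point rather than by char. A rough sketch, with hypothetical names and assuming decimal references as the output format:

```java
final class AsciiOnlySketch {

    // Replaces every code point above U+007F with a decimal character reference.
    // Markup-significant characters are NOT handled here, to keep the sketch short.
    static String escapeNonAscii(final String text) {
        final StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            final int codePoint = text.codePointAt(i); // joins surrogate pairs into one code point
            if (codePoint > 0x7F) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.append((char) codePoint);
            }
            i += Character.charCount(codePoint); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        // U+20000 is stored as the surrogate pair D840 DC00, yet yields a single reference.
        System.out.println(escapeNonAscii("a\uD840\uDC00b")); // a&#131072;b
    }
}
```
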
HTML
  • Whole HTML5 NCR (Named Character Reference) set supported, if required: &rsqb;, &NewLine;, etc. (HTML 4 set available too).
  • Mixed named and numerical (decimal or hexa) character references supported.
  • Ability to default to numerical (decimal or hexa) references when an applicable NCR does not exist (depending on the selected operation level).
  • Support for unescape of double-char NCRs in HTML5: &fjlig; → fj.
  • Support for a set of HTML5 unescape tweaks included in the HTML5 specification:
    • Unescape of numerical character references not ending in semi-colon (e.g. &#x23ac).
    • Unescape of specific NCRs not ending in semi-colon (e.g. &aacute).
    • Unescape of specific numerical character references wrongly specified by their Windows-1252 codepage code instead of the Unicode one (e.g. &#x80; for € (&euro;) instead of &#x20ac;).
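
Two of the numeric-reference tweaks above (optional semicolon, Windows-1252 remapping) can be sketched with nothing but the JDK. The names below are hypothetical and the code leans on the JDK's windows-1252 charset for the remapping; named references and the remaining HTML5 parsing rules are left out, so treat it as an approximation rather than as what unbescape does internally.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericReferenceSketch {

    // Decimal or hexadecimal reference; the trailing semicolon is optional.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?");
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    static String unescapeNumeric(final String text) {
        final Matcher matcher = NUMERIC_REF.matcher(text);
        final StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start());
            final int codePoint = remapLegacy(parse(matcher.group(1), matcher.group(2)));
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(matcher.group()); // leave unparseable references untouched
            }
            last = matcher.end();
        }
        return out.append(text, last, text.length()).toString();
    }

    private static int parse(final String hex, final String decimal) {
        try {
            return (hex != null) ? Integer.parseInt(hex, 16) : Integer.parseInt(decimal);
        } catch (final NumberFormatException e) {
            return -1; // too large to be a code point
        }
    }

    // Values 0x80 to 0x9F are reinterpreted as Windows-1252 bytes when that byte is defined.
    private static int remapLegacy(final int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint;
        }
        final char decoded = new String(new byte[] {(byte) codePoint}, WINDOWS_1252).charAt(0);
        return (decoded == '\uFFFD') ? codePoint : decoded;
    }
}
```

With this sketch, both "&#x80;" and "&#x20ac;" come out as the euro sign, and "&#x23ac" is resolved even without its semicolon.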
XML
  • Support for both XML 1.0 and XML 1.1 escape/unescape operations.
  • No support for DTD-defined or user-defined entities. Only the five predefined XML character entities are supported: &lt;, &gt;, &amp;, &quot; and &apos;.
  • Automatic escaping of allowed control characters.
JavaScript
  • Support for the JavaScript basic escape set: \0, \b, \t, \n, \v, \f, \r, \", \', \\. Note that \v (U+000B) will not be used in escape operations (only unescape) because it is not supported by Microsoft Internet Explorer versions < 9.
  • Automatic escape of / (as \/ if possible) when it appears after <, as in </something>.
  • Support for escaping non-displayable, control characters: U+0001 to U+001F and U+007F to U+009F.
  • Support for X-based hexadecimal escapes (a.k.a. hexadecimal escapes) both in escape and unescape operations: \xE1.
  • Support for U-based hexadecimal escapes (a.k.a. unicode escapes) both in escape and unescape operations: \u00E1.
  • Support for Octal escapes, though only in unescape operations: \071. Not supported in escape operations (octal escapes were deprecated in version 5 of the ECMAScript specification).
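
As a rough illustration of how the two hexadecimal forms relate: an X-based escape only has room for two hex digits, so anything above U+00FF needs the U-based form, and anything above U+FFFF needs two of them (one per surrogate). The helper below is hypothetical and only shows that selection logic; which form unbescape picks depends on the escape type you configure.

```java
final class JsHexEscapeSketch {

    // Picks the shortest hexadecimal escape able to represent the code point.
    static String hexEscape(final int codePoint) {
        if (codePoint <= 0xFF) {
            return String.format("\\x%02X", codePoint); // X-based form, two hex digits
        }
        final StringBuilder out = new StringBuilder();
        for (final char c : Character.toChars(codePoint)) { // 1 char, or 2 for a surrogate pair
            out.append(String.format("\\u%04X", (int) c));  // U-based form, four hex digits
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(hexEscape(0xE1));
        System.out.println(hexEscape(0x20AC));
        System.out.println(hexEscape(0x20000));
    }
}
```
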
JSON
  • Support for the JSON basic escape set: \b, \t, \n, \f, \r, \", \\.
  • Automatic escape of / (as \/ if possible) when it appears after <, as in </something>.
  • Support for escaping non-displayable, control characters: U+0000 to U+001F and U+007F to U+009F.
  • Support for U-based hexadecimal escapes (a.k.a. unicode escapes) both in escape and unescape operations: \u00E1.
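
Put together, the JSON rules above boil down to something like the following sketch. Names are hypothetical and there is no notion of escape levels here; it simply applies the basic escape set, the "/ after <" rule and the control-character ranges listed above.

```java
final class JsonEscapeSketch {

    static String escapeJson(final String text) {
        final StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            final char c = text.charAt(i);
            switch (c) {
                case '\b': out.append("\\b");  break;
                case '\t': out.append("\\t");  break;
                case '\n': out.append("\\n");  break;
                case '\f': out.append("\\f");  break;
                case '\r': out.append("\\r");  break;
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '/':
                    // Escaped only right after '<', so a closing tag inside the data
                    // cannot terminate an enclosing <script> element.
                    out.append((i > 0 && text.charAt(i - 1) == '<') ? "\\/" : "/");
                    break;
                default:
                    if (c <= 0x1F || (c >= 0x7F && c <= 0x9F)) {
                        out.append(String.format("\\u%04X", (int) c)); // control character
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(escapeJson("</p> is not a/b"));
    }
}
```
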
URI / URL
  • Support for escape operations using percent-encoding (%HH).
  • Escape URI paths, path fragments, query parameters and fragment identifiers.
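
Percent-encoding itself is simple: take the UTF-8 bytes of the text and write every byte that is not allowed as %HH. The sketch below is deliberately conservative (it keeps only RFC 3986 "unreserved" characters, which is a safe choice for a query parameter value); the class name is hypothetical, and the real set of allowed characters differs for each URI part.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncodeSketch {

    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Percent-encodes every UTF-8 byte that is not an RFC 3986 unreserved character.
    static String encodeQueryParam(final String text) {
        final StringBuilder out = new StringBuilder(text.length() + 16);
        for (final byte b : text.getBytes(StandardCharsets.UTF_8)) {
            final int unsigned = b & 0xFF;
            if (unsigned < 0x80 && UNRESERVED.indexOf(unsigned) >= 0) {
                out.append((char) unsigned);
            } else {
                out.append('%').append(HEX[unsigned >> 4]).append(HEX[unsigned & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(encodeQueryParam("a b&c=\u00E9")); // a%20b%26c%3D%C3%A9
    }
}
```
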
CSS
  • Complete set of CSS Backslash Escapes supported (e.g. \+, \;, \(, \), etc.).
  • Full set of escape syntax rules supported, both for CSS identifiers and CSS Strings (or literals).
  • Non-standard tweaks supported: \: not used because of lacking support in Internet Explorer < 8, \_ escaped at the beginning of identifiers for better Internet Explorer 6 support, etc.
  • Hexadecimal escapes (a.k.a. unicode escapes) are supported both in escape and unescape operations, and both in compact (\E1 ) and six-digit forms (\0000E1).
  • Support for unescaping Unicode characters > U+FFFF both when represented in standard form (a single escape, \20000) and in non-standard form (a surrogate pair, \D840\DC00, used by older WebKit browsers).
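
A small sketch of the two hexadecimal forms and of the surrogate-pair case may help here. The trailing space in the compact form is what ends the escape, so a following hex digit is not swallowed by it. All names are hypothetical; this only illustrates the notation.

```java
import java.util.Locale;

final class CssHexEscapeSketch {

    // Compact form: as few hex digits as needed, terminated by a single space.
    static String compact(final int codePoint) {
        return "\\" + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT) + " ";
    }

    // Six-digit form: always zero-padded, so no terminating space is required.
    static String sixDigit(final int codePoint) {
        return String.format("\\%06X", codePoint);
    }

    // Non-standard form: two escapes holding a surrogate pair are merged into one code point.
    static int fromSurrogateEscapes(final int high, final int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(final String[] args) {
        System.out.println("[" + compact(0xE1) + "]");
        System.out.println(sixDigit(0xE1));
        System.out.println(Integer.toHexString(fromSurrogateEscapes(0xD840, 0xDC00))); // 20000
    }
}
```
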
CSV (Comma-Separated Values)
  • Works according to the rules specified in RFC 4180 (there is no CSV standard as such).
  • Encloses escaped values in double-quotes ("value") if they contain any non-alphanumeric characters.
  • Escapes double-quote characters (") by writing them twice: "".
  • Honors rules for maximum compatibility with Microsoft Excel.
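
The quoting rules above fit in a handful of lines. In this sketch the class name is hypothetical and "alphanumeric" is taken to mean Character.isLetterOrDigit, which is an assumption; the library may draw that line differently.

```java
final class CsvEscapeSketch {

    // Quotes the value if it contains any non-alphanumeric character, doubling inner quotes.
    static String escapeCsv(final String value) {
        boolean needsQuotes = false;
        for (int i = 0; i < value.length(); i++) {
            if (!Character.isLetterOrDigit(value.charAt(i))) { // assumption: JDK notion of alphanumeric
                needsQuotes = true;
                break;
            }
        }
        if (!needsQuotes) {
            return value; // nothing to do: hand back the same instance
        }
        return '"' + value.replace("\"", "\"\"") + '"';
    }

    public static void main(final String[] args) {
        System.out.println(escapeCsv("abc123"));       // abc123
        System.out.println(escapeCsv("a,b"));          // "a,b"
        System.out.println(escapeCsv("say \"hi\""));   // "say ""hi"""
    }
}
```
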
Java literals
  • Support for the Java basic escape set: \b, \t, \n, \f, \r, \", \', \\. Note \' will not be used in escaping levels < 3 (= all but alphanumeric) because escaping the apostrophe is not really required in Java String literals (only in Character literals).
  • Support for escaping non-displayable, control characters: U+0001 to U+001F and U+007F to U+009F.
  • Support for U-based hexadecimal escapes (a.k.a. unicode escapes) both in escape and unescape operations: \u00E1.
  • Support for Octal escapes, though only in unescape operations: \071. Not supported in escape operations (use of octal escapes is not recommended by the Java Language Specification).
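
Octal escapes are the odd one out because their length varies: one to three digits, and three only when the first digit is 0 to 3 (so the value fits in U+0000 to U+00FF). A sketch of just that part of unescaping, with hypothetical names; every other escape sequence is left as it is.

```java
final class OctalUnescapeSketch {

    static String unescapeOctal(final String text) {
        final StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            final char c = text.charAt(i);
            final boolean hasNext = i + 1 < text.length();
            if (c == '\\' && hasNext && text.charAt(i + 1) == '\\') {
                out.append("\\\\"); // an escaped backslash is not the start of an octal escape
                i += 2;
            } else if (c == '\\' && hasNext && isOctal(text.charAt(i + 1))) {
                final int maxDigits = (text.charAt(i + 1) <= '3') ? 3 : 2;
                int value = 0;
                int j = i + 1;
                while (j < text.length() && j - (i + 1) < maxDigits && isOctal(text.charAt(j))) {
                    value = value * 8 + (text.charAt(j) - '0');
                    j++;
                }
                out.append((char) value);
                i = j;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isOctal(final char c) {
        return c >= '0' && c <= '7';
    }

    public static void main(final String[] args) {
        System.out.println(unescapeOctal("\\071")); // 9 (octal 71 = decimal 57 = '9')
    }
}
```
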
Java .properties files
  • Support for the Java Properties basic escape set: \t, \n, \f, \r, \\. When escaping .properties keys (not values) \ , \: and \= will be applied too.
  • Support for escaping non-displayable, control characters: U+0001 to U+001F and U+007F to U+009F.
  • Support for U-based hexadecimal escapes (a.k.a. unicode escapes) both in escape and unescape operations: \u00E1.
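
If you want to see what these escapes mean in practice, the JDK's own java.util.Properties parser understands the same sequences. The snippet below uses only the JDK (not unbescape) to load one escaped line and expose the unescaped key and value; the wrapper class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PropertiesUnescapeSketch {

    // Parses .properties content held in a String, letting the JDK do the unescaping.
    static Properties parse(final String content) {
        final Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (final IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a String
        }
        return props;
    }

    public static void main(final String[] args) {
        // The key contains an escaped space, colon and equals sign;
        // the value contains a U-based escape and an escaped tab.
        final Properties props = parse("my\\ key\\:a\\=b = caf\\u00E9\\tok\n");
        System.out.println(props.stringPropertyNames()); // [my key:a=b]
    }
}
```
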

How is it distributed?

unbescape is Open Source Software, and it is distributed under the terms of the Apache License 2.0.

Project status

unbescape is stable and production-ready. Current version is 1.1.0.RELEASE.