Monday, May 25, 2009

Java/JSP/JSF and JavaScript

Introduction

Today I used Google Analytics to investigate the search keywords used to find this blog, so that I could get some blogging inspiration based on the "missing hits" (i.e. searches for which my blog incorrectly showed up among the results). Just for fun I entered "javascript" in the Analytics keyword-search tool to see what would happen. To my surprise, quite a lot of Google searches were phrased along the following lines:

  • access jsf managed bean in javascript
  • javascript pass variable to jsp
  • check jsf facesmessages in javascript
  • call java method in javascript
  • set javascript variable in jsp
  • communicate jsf and javascript

There's apparently still a lot of confusion and unawareness among (young) web developers with regard to Java/JSP/JSF and JS. Well, let's write a blog article to clear it all up.


Server side and client side

You have probably heard of "server side" and "client side" in terms of web development. There are technologies and programming languages specific to the server side and others specific to the client side (well, on the client side it's more scripting and markup than programming, but OK, let's keep it simple).

When developing websites, you would often use one and the same physical machine (PC, laptop, whatever) to develop and test them on. I.e., both the webserver and the webbrowser run on physically the same machine. This would never, I repeat, never occur in the real production world. The server machine and the client machine are always physically different machines, connected to each other by some kind of network (usually the Internet).

Java/JSP/JSF are server side technologies. They run on the server machine and produce HTML/CSS/JS output based on an HTTP request from the client machine. This output is sent over the network from the server side to the client side as part of the HTTP response. Open up such a JSP/JSF page in your favourite webbrowser and choose the "View Source" option (or something similar). What do you see? Right, it's all plain vanilla HTML/CSS/JS. If the code at the server side has done its job the right way, you should not see a single line of Java/JSP/JSF code in the generated output.

When the server application (in this case, the webserver) has sent the output, it has finished its part of the HTTP request/response cycle and is waiting for the next HTTP request from the client side. When the client application (in this case, the webbrowser) has received the HTML/CSS/JS output, displayed the HTML, applied the CSS and executed any "onload" JS, it has finished its task of showing the result of some client interaction (entering a URL in the address bar, clicking a link, submitting a form, etcetera) and is waiting for the next client interaction for which to fire a new HTTP request.

JS is a client side technology. It runs entirely on the client machine and only has direct access to the HTML DOM tree (the stuff which is accessible through the document object). It gets executed directly when the client application (the webbrowser) encounters inline JS code or onload events while loading the page for display. It can also get executed later when the client application receives specific events which have a JS function attached, such as onclick, onkeypress, onmouseover, etcetera. All of those events can be specified as HTML element attributes.

You should now realize that the only way to let Java/JSP/JSF at the server side pass variables to or execute something at the client side is simply to generate the HTTP response in such a way that the desired JS action will be taken. You should also realize that the only way to let JS at the client side pass variables to or execute something at the server side is simply to fire a new HTTP request to the server side in such a way that the desired Java/JSP/JSF action will be taken.


Pass variables from server side to client side

As you might understand by now, this can only happen by writing inline JS code or onload events accordingly. Let's first check what the generated HTML output should look like:

<!doctype html>
<html>
    <head>
        <title>Test</title>
        <script type="text/javascript">
            // Do something inline with variable from server.
            var variableFromServer = 'variableFromServer';
            doSomethingInline(variableFromServer);
            function doSomethingInline(variable) {
                alert('doSomethingInline: ' + variable);
            }

            // Do something onload with variable from server.
            function doSomethingOnload(variable) {
                alert('doSomethingOnload: ' + variable);
            }
        </script>
    </head>
    <body onload="doSomethingOnload('variableFromServer');">
        <h1>Test</h1>
    </body>
</html>

The above is just a basic example; you could for example also use an onclick event instead. OK, in place of variableFromServer we thus want to set a variable from the server side. You can do that by simply writing the server side code so that the generated HTML output is exactly the same as above. Let's suppose that the variable has to be retrieved as a bean property. Here's an example with JSP:

<!doctype html>
<html>
    <head>
        <title>Test</title>
        <script type="text/javascript">
            // Do something inline with variable from server.
            var variableFromServer = '${someBean.someProperty}';
            doSomethingInline(variableFromServer);
            function doSomethingInline(variable) {
                alert('doSomethingInline: ' + variable);
            }

            // Do something onload with variable from server.
            function doSomethingOnload(variable) {
                alert('doSomethingOnload: ' + variable);
            }
        </script>
    </head>
    <body onload="doSomethingOnload('${someBean.someProperty}');">
        <h1>Test</h1>
    </body>
</html>

Fairly simple, isn't it? Take note that you still need those single quotes around the variable; not for Java/JSP, but for JavaScript! Only pure number or boolean variables can be left unquoted.

In JSF the principle is not much different. You could use EL the same way, but if the value has to be retrieved as a managed bean property and you're using JSF on JSP, you can just use h:outputText to output it. Here's an example:

<%@taglib uri="http://java.sun.com/jsf/core" prefix="f" %>
<%@taglib uri="http://java.sun.com/jsf/html" prefix="h" %>

<!doctype html>
<f:view>
    <html>
        <head>
            <title>Test</title>
            <script type="text/javascript">
                // Do something inline with variable from server.
                var variableFromServer = '<h:outputText value="#{someBean.someProperty}" />';
                doSomethingInline(variableFromServer);
                function doSomethingInline(variable) {
                    alert('doSomethingInline: ' + variable);
                }

                // Do something onload with variable from server.
                function doSomethingOnload(variable) {
                    alert('doSomethingOnload: ' + variable);
                }
            </script>
        </head>
        <body onload="doSomethingOnload('<h:outputText value="#{someBean.someProperty}" />');">
            <h1>Test</h1>
        </body>
    </html>
</f:view>

Run the page and check the generated HTML output. Do you see it? That's the whole point! This way you can let JS "access" the server side variables. When you're using JSF on Facelets, you can just leave the h:outputText out and use #{someBean.someProperty} directly, since Facelets supports unified EL in template text.

In the code examples the JS code will be executed twice: the first time when the line with the inline JS function call is interpreted by the client application, and the second time when the page has loaded completely (the body onload attribute). The biggest difference is that the second function will only be executed once the entire HTML DOM tree has been built up; this is important if the function requires HTML DOM elements which are to be obtained through the document reference in JS.


Pass variables from client side to server side

As you probably already know, you can do that with plain HTML by just clicking a link with query parameters, or by submitting a form with (hidden) input parameters. Links send HTTP GET requests only, while forms are also capable of sending HTTP POST requests. In JS, on the other hand, there are in general three ways to accomplish this:

  1. The first way is to simulate invocation of an existing link/button/form. Here are three examples:
    <a id="linkId" href="http://www.google.com/search?q=balusc">Link</a>
    
    ...
    
    <script type="text/javascript">
        document.getElementById('linkId').click();
    </script>
    
    <form action="http://www.google.com/search">
        <input type="text" id="inputId" name="q">
        <input type="submit" id="buttonId" value="Button">
    </form>
    
    ...
    
    <script type="text/javascript">
        document.getElementById('inputId').value = 'balusc';
        document.getElementById('buttonId').click();
    </script>
    
    <form id="formId" action="http://www.google.com/search">
        <input type="text" id="inputId" name="q">
    </form>
    
    ...
    
    <script type="text/javascript">
        document.getElementById('inputId').value = 'balusc';
        document.getElementById('formId').submit();
    </script>
    
    In case of JSF, keep in mind that in JS you must use the IDs of the JSF-generated HTML elements (the JSF client IDs). Those are not necessarily the same as the JSF component IDs! So, view the generated HTML output when writing JS code against JSF output. For example, the following JSF code..
    <h:form id="formId">
        <h:inputText id="inputId" value="#{bean.property}" />
        <h:commandButton id="buttonId" value="Button" action="#{bean.action}" />
    </h:form>
    
    ..would produce roughly the following HTML output..
    <form id="formId" action="current request URL">
        <input type="text" id="formId:inputId" name="formId:inputId" />
        <input type="submit" id="formId:buttonId" name="formId:buttonId" value="Button" />
    </form>
    
    ..for which you should write JS like the following:
    <script type="text/javascript">
        document.getElementById('formId:inputId').value = 'foo';
        document.getElementById('formId:buttonId').click();
        // Or: document.getElementById('formId').submit();
    </script>
    
  2. The second way is to use window.location to fire a plain GET request. For example:
    <script type="text/javascript">
        var search = 'balusc';
        window.location = 'http://www.google.com/search?q=' + search;
    </script>
    
  3. The third way is to use the XMLHttpRequest object to fire an asynchronous request and process the results. This technique is the basic idea behind "Ajax". Here's a Firefox compatible example:
    <script type="text/javascript">
        function getUrl(search) {
            var xhr = new XMLHttpRequest();
            xhr.onreadystatechange = function() {
                if (xhr.readyState == 4) {
                    var responseJson = eval('(' + xhr.responseText + ')');
                    var url = responseJson.responseData.results[0].unescapedUrl;
                    var link = document.getElementById('linkId');
                    link.href = link.firstChild.nodeValue = url;
                    link.onclick = null;
                }
            }
            var google = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q='
            xhr.open('GET', google + search, true);
            xhr.send(null);
        }
    </script>
    
    ...
    
    <p>My homepage is located at: <a id="linkId" href="#" onclick="getUrl('balusc')">click me!</a></p>
    
    Funny, isn't it? You don't see a "flash of content" (reload/refresh) because it all happens fully in the background. That's the nice thing about Ajax. For compatibility with certain webbrowsers which use proprietary APIs (cough) you'll need to write some supplementary code. Also see the Ajax tutorial at w3schools.

    In case of JSP, it's better to start with a widely used Ajaxical API than to reinvent the wheel by writing it all yourself, which would in the end only lead to trouble in terms of maintainability and reusability. I highly recommend taking a look at the great jQuery API. Its replacement of getUrl() would then look something like:
    function getUrl(search) {
        var google = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=';
        $.getJSON(google + search, function(responseJson) {
            var url = responseJson.responseData.results[0].unescapedUrl;
            $('#linkId').attr('href', url).text(url).attr('onclick', null);
        });
    }
    
    On the server side, instead of the Google example you could of course create a Servlet which has doGet() implemented and which returns for example a JSON string, so that JS can process it further. JSON is short for JavaScript Object Notation. It is roughly a blueprint of a JavaScript object in String format. You can even specify and access object properties. In the eye of a Java developer, JSON has a lot of Javabean characteristics. If you're new to JSON (or even JavaScript or HTML DOM), here are more tutorials: JavaScript, HTML DOM and JSON. For converting Java objects to a JSON string which you just write to the response, I can recommend the Gson API.
    
        /**
         * Write given Java object as JSON to the output of the given response.
         * @param response The response to write the given Java object as JSON to its output.
         * @param object Any Java Object to be written as JSON to the output of the given response. 
         * @throws IOException If something fails at IO level.
         */
        public void writeJson(HttpServletResponse response, Object object) throws IOException {
            String json = new Gson().toJson(object);
            response.setContentType("application/json");
            response.setCharacterEncoding("UTF-8");
            response.getWriter().write(json);
        }
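
    To complete the picture, here's a minimal sketch of how such a servlet could look. The class name, URL mapping and the data put into the map are made up for the example; it reuses the writeJson() method from above and should be mapped in web.xml on an url-pattern like /json:

    package test;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import com.google.gson.Gson;

    public class JsonServlet extends HttpServlet {

        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Gather some data, e.g. based on a request parameter.
            String name = request.getParameter("name");
            Map<String, Object> data = new HashMap<String, Object>();
            data.put("name", name);
            data.put("greeting", "Hello " + name);

            // Write it as JSON to the response.
            writeJson(response, data);
        }

        // The writeJson() helper from above.
        public void writeJson(HttpServletResponse response, Object object) throws IOException {
            String json = new Gson().toJson(object);
            response.setContentType("application/json");
            response.setCharacterEncoding("UTF-8");
            response.getWriter().write(json);
        }

    }

    The JS (or jQuery) code could then simply request this servlet's URL instead of the Google service and process the returned JSON object the same way.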
    
    

    In case of JSF, just take a look at Ajaxical component libraries such as JBoss RichFaces. It provides "supplementary" Ajax support in the form of the Ajax4jsf (a4j) components; also see this document.

Hopefully it's now all clear what the whole point of "server side" and "client side" is and what their capabilities are.

Happy coding!


Copyright - No text of this article may be taken over without explicit authorisation. Only the code is free of copyright. You can copy, change and distribute the code freely. Just mentioning this site should be fair.

(C) May 2009, BalusC

Wednesday, May 6, 2009

Unicode - How to get the characters right?

Introduction

Computers understand only bits and bytes. You know, the binary numeral system of zeros and ones. Humans, on the other hand, understand characters only. You know, the building blocks of the natural languages. So, to handle human readable characters using a computer (read, write, store, transfer, etcetera), they have to be converted to bytes. One byte is an ordered collection of eight zeros or ones (bits). The characters are only used for pure presentation to humans. Behind any character you see, there is a certain sequence of bits. For a computer a character is in fact nothing more or less than a simple graphical picture (a font) which has a unique "identifier" in the form of a certain sequence of bits.

To convert between characters and bytes a computer needs a mapping in which every unique character is associated with unique bytes. This mapping is also called the character encoding. A character encoding basically consists of two parts. One is the character set (charset), which represents all of the unique characters. The other is the numeral representation of each of the characters of the charset. The numeral representation is usually presented to humans in hexadecimal, which is in turn easily "converted" to bytes (both are just numeral systems, only with a different base).

Character Encoding

Character set (human presentation)    Numeral representation (computer identification)
A                                     x0041 (01000001)
B                                     x0042 (01000010)
C                                     x0043 (01000011)
D                                     x0044 (01000100)
E                                     x0045 (01000101)
F                                     x0046 (01000110)
G                                     x0047 (01000111)
H                                     x0048 (01001000)
I                                     x0049 (01001001)
J                                     x004A (01001010)
K                                     x004B (01001011)
L                                     x004C (01001100)
M                                     x004D (01001101)
N                                     x004E (01001110)
O                                     x004F (01001111)
...                                   ...
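
Just to see this mapping at work from Java itself, here's a quick snippet (the class name is arbitrary) which prints the numeral representation of the character "A" in the US-ASCII charset:

package test;

public class CharsetMapping {

    public static void main(String... args) throws Exception {
        // The character 'A' converted to bytes using the US-ASCII character encoding.
        byte[] bytes = "A".getBytes("US-ASCII");
        System.out.println(Integer.toHexString(bytes[0]));    // 41
        System.out.println(Integer.toBinaryString(bytes[0])); // 1000001
    }

}

It prints 41 and 1000001, exactly the numeral representation of "A" from the table above.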

Well, where does it go wrong?

The world would be much simpler if only one character encoding existed. That would have been clear enough for everyone. Unfortunately the truth is different. There are a lot of different character encodings, each with its own charset and numeral mappings. So it may be obvious that a character which is converted to bytes using character encoding X may not be the same character when it is converted back from bytes using character encoding Y. That would in turn lead to confusion among humans, because they wouldn't understand the way the computer represented their natural language. Humans would see completely different characters and thus not be able to understand the resulting "language", which is also known as "mojibake". It can also happen that humans would not see any linguistic character at all, because the numeral representation of the character in question isn't covered by the numeral mapping of the character encoding used. It's simply unknown.

How such an unknown character is displayed differs per application which handles the character. In the webbrowser world, Firefox would display an unknown character as a black diamond with a question mark in it, while Internet Explorer would display it as an empty white square with a black border. Both represent the same Unicode character though: xFFFD, which is displayed in your webbrowser as "�". Internet Explorer simply doesn't have a font (a graphical picture) for it, hence the empty square. In the Java/JSP/Servlet world, any unknown character which is passed through the write() methods of an OutputStream (e.g. the one obtained by ServletResponse#getOutputStream()) gets printed as a plain question mark "?". Those question marks can in some cases also be caused by the database. Most database engines replace uncovered numeral representations with a plain question mark during a save (INSERT/UPDATE), which is in turn displayed to the human when the data is later queried and sent to the webbrowser. The plain question marks are thus not necessarily caused by the webbrowser.

Here is a small test snippet which demonstrates the problem. Keep in mind that Java supports and uses Unicode all the time, so the encoding problem which you see in the output is not caused by Java itself, but by using the ISO 8859-1 character encoding to display Unicode characters. The ISO 8859-1 character encoding simply doesn't cover the numeral representations of a large part of the Unicode charset. By the way, the term "Unicode character" is nowhere formally defined, but it is usually used by (unaware) programmers/users who actually mean "any character which is not covered by the ISO 8859 character encoding".

package test;

public class Test {

    public static void main(String... args) throws Exception {
        String czech = "Český";
        String japanese = "日本語";

        System.out.println("UTF-8 czech: " + new String(czech.getBytes("UTF-8")));
        System.out.println("UTF-8 japanese: " + new String(japanese.getBytes("UTF-8")));

        System.out.println("ISO-8859-1 czech: " + new String(czech.getBytes("ISO-8859-1")));
        System.out.println("ISO-8859-1 japanese: " + new String(japanese.getBytes("ISO-8859-1")));
    }

}

UTF-8 czech: Český
UTF-8 japanese: 日本語
ISO-8859-1 czech: ?esk�
ISO-8859-1 japanese: ???

These kinds of problems are often referred to as the "Unicode problem".

Important note: your own operating system should of course have the proper fonts (yes, the human representations) installed, covering the characters of both the Czech and Japanese languages, in order to see the proper characters/glyphs on this webpage :) Otherwise you will see, in for example Firefox, a black-bordered square with the hexcode inside (0-9 and/or A-F), and in most other webbrowsers such as IE, Safari and Chrome just a meaningless empty square with a black border. Below is a screenshot from Chrome which shows the right characters, so you can compare if necessary:

If your operating system for example doesn't have the Japanese glyphs required by this page, then in Firefox you should see three squares with the hexcodes 65E5, 672C and 8A9E. Those hexcodes are actually also called "Unicode codepoints". In Windows, you can view all available fonts and the characters they support using charmap.exe (Start > Run > charmap).

Another important note: if you have problems when copy-pasting the test snippet into your development environment (e.g. you do not see the proper characters, but only empty squares or something like that), then please hold off on experimenting until you have read the entire article, including step 1 of the OK .. So, I have an "Unicode problem", what now? chapter ;)


Unicode, what's it all about?

Let's go back into the history of character encoding. Most of you will be familiar with the term "ASCII". This was more or less the first character encoding ever. In the days when a byte was very expensive and 1MHz was extremely fast, only the characters which appeared on those ancient US typewriters (as well as on the average US International keyboard nowadays) were covered by the charset of the ASCII character encoding. This includes the complete Latin alphabet (A-Z, in both lowercase and uppercase), the numeral digits (0-9), the lexical separators (space, dot, comma, colon, etcetera) and some special characters (the at sign, the hash sign, the dollar sign, etcetera). All those characters fill up the space of 7 bits, half of the room a byte provides, for a total of 128 characters.

Later, the remaining bit of a byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the extra room was used for special characters, such as diacritical characters and line drawing characters. Because everyone used that extra room in their own way (IBM, Commodore, universities, etcetera), it was not interchangeable. Later, ISO came up with standard character encoding definitions for 8 bit ASCII extensions, resulting in the well known ISO 8859 character encoding standards such as ISO 8859-1.

8 bits may be enough for the languages using the Latin alphabet, but it is certainly not enough for the remaining non-Latin languages in the world, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera. Those communities developed their own non-ISO character encodings, which were, again, not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally a new character encoding standard, built on top of ISO 8859-1 (initially 16 bits), was established to cover all of the characters used in the world so that it is interchangeable everywhere: Unicode. You can find all of those linguistic characters here. Unicode also covers many special characters (symbols) such as punctuation and mathematical operators, which you can find here.


OK .. So, I have an "Unicode problem", what now?

To the point: just ensure that you use UTF-8 (a character encoding which conforms to the Unicode standard) all the way. There are other Unicode character encodings as well, but they are used far less often; UTF-8 has more or less become the de facto standard. To solve the "Unicode problem" you need to ensure that every step which involves byte-character conversion uses one and the same character encoding: reading data from the input stream, writing data to the output stream, querying data from the database, storing data in the database, manipulating the data, displaying the data, etcetera. For a Java EE web developer, there are a lot of things to take into account.

  1. Development environment: yes, the development environment has to use UTF-8 as well. By default most text files are saved using the operating system default encoding such as ISO 8859-1, or even a proprietary encoding such as Windows ANSI (also known as CP-1252, which is in turn not interchangeable with non-Windows platforms!). The most basic text editor of Windows, Notepad, uses Windows ANSI by default, but it supports UTF-8 as well. To save a text file containing Unicode characters using Notepad, you need to choose the File » Save As option and select UTF-8 from the Encoding dropdown. The same Save As story applies to many other self-respecting text editors as well, like EditPlus, UltraEdit and Notepad++.


    In an IDE such as Eclipse you can set the encoding in several places. You need to explore the IDE preferences thoroughly to find and change them. In case of Eclipse, just go to Window » Preferences and enter encoding as filter text. In the filtered preferences (Workspace, JSP files, etcetera) you can select the desired encoding from a dropdown. Important note: the Workspace encoding also covers the output console and thus also the outcome of System.out.println(). If you sysout a Unicode character while such a default encoding is used, it will likely be printed as a plain vanilla question mark!


    In the command console, however, this is not really possible:

    C:\Java>java test.Test
    UTF-8 czech: Český
    UTF-8 japanese: 日本語
    ISO-8859-1 czech: ─?esk├╜
    ISO-8859-1 japanese: µ?ѵ?¼Φ¬?

    C:\Java>_

    In theory, in the Windows command prompt you would have to use a font which supports a broad range of Unicode characters. You can set the font by opening the command console (Start > Run > cmd), clicking the small cmd icon at the top left, choosing Properties and finally selecting the Font tab. In a default Windows environment only the Lucida Console font has somewhat decent Unicode support, but it unfortunately still lacks a broad range of Unicode characters.

    The cmd.exe parameter /U and/or the command chcp 65001 (which changes the code page to UTF-8) don't help much if the font doesn't support the desired characters in the first place. You could hack the registry to add more console fonts, but you would still have to find one which supports all of the desired characters. In the end it's better to use Swing to create a command console like UI instead of using the standard command console, especially if the application is intended to be distributed (you don't want to require the enduser to hack/change their environment to get your application to work, do you? ;) ).

  2. Java properties files: as stated in its Javadoc, the load(InputStream) method of the java.util.Properties API uses ISO 8859-1 as the default encoding. Here's an extract from the class' Javadoc:

    .. the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.

    If you have full control over loading of the properties files, then you should use the Java 1.6 load(Reader) method in combination with an InputStreamReader instead:
    
    Properties properties = new Properties();
    properties.load(new InputStreamReader(classLoader.getResourceAsStream(filename), "UTF-8"));
    
    
    If you don't have full control over the loading of the properties files (e.g. they are managed by some framework), then you need the native2ascii tool mentioned in the Javadoc. The native2ascii tool can be found in the /bin folder of the JDK installation directory. When you for example need to maintain properties files with Unicode characters for i18n (internationalization; also known as resource bundles), then it's good practice to have both a UTF-8 properties file and an ISO 8859-1 properties file, plus some batch program to convert the UTF-8 properties file to the ISO 8859-1 properties file. You use the UTF-8 properties file for editing only. You use the converter to regenerate the ISO 8859-1 properties file after every edit. You finally just leave the ISO 8859-1 properties file as it is. In most (smart) IDEs like Eclipse you cannot use the .properties extension for those UTF-8 properties files; the IDE would complain about unknown characters because it is forced to save properties files in ISO 8859-1 format. Name them .properties.utf8 or something else. Here's an example of a simple Windows batch file which does the conversion task:
    
    cd c:\path\to\properties\files
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_cs.properties.utf8 text_cs.properties
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_ja.properties.utf8 text_ja.properties
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_zh.properties.utf8 text_zh.properties
    REM You can add more properties files here.
    
    
    Save it as utf8.converter.bat (or something like that) and run it once to convert all UTF-8 properties files to standard ISO 8859-1 properties files. If you're using Maven and/or Ant, this can even be automated to take place during the build of the project.

    For JSF there are better ways, using the ResourceBundle.Control API. Check this blog article: Internationalization in JSF with UTF-8 properties files.
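
    Just to give an idea of how that works, below is a minimal sketch of such a ResourceBundle.Control subclass which reads the properties files as UTF-8 directly (the class name is just an example; for the full story, see the linked article):

    package test;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.Locale;
    import java.util.PropertyResourceBundle;
    import java.util.ResourceBundle;

    public class UTF8Control extends ResourceBundle.Control {

        @Override
        public ResourceBundle newBundle(String baseName, Locale locale, String format,
                ClassLoader loader, boolean reload)
                throws IllegalAccessException, InstantiationException, IOException {
            String resourceName = toResourceName(toBundleName(baseName, locale), "properties");
            InputStream stream = loader.getResourceAsStream(resourceName);
            if (stream == null) {
                return null;
            }
            try {
                // Read the properties file as UTF-8 instead of the default ISO 8859-1.
                return new PropertyResourceBundle(new InputStreamReader(stream, "UTF-8"));
            } finally {
                stream.close();
            }
        }

    }

    You can then obtain the bundle by ResourceBundle.getBundle("text", locale, new UTF8Control()) and keep the properties files in plain UTF-8, without any native2ascii step in between.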

  3. JSP/Servlet request: during request processing the average application server will by default use the ISO 8859-1 character encoding to URL-decode the request parameters. You need to force the character encoding to UTF-8 yourself. First this: "URL encoding" must not be confused with "character encoding". URL encoding is merely a conversion of characters to their numeral representations in the %xx format, so that special characters can be passed through a URL without problems. The client URL-encodes the characters before sending them to the server. The server should URL-decode the characters using the same character encoding as the client used to URL-encode them. Also see "percent encoding".

    How to configure this depends on the server used, so it's best to refer to its documentation. In case of for example Tomcat you need to set the URIEncoding attribute of the <Connector> element in Tomcat's /conf/server.xml to set the character encoding of HTTP GET requests; also see this document:
    
    <Connector (...) URIEncoding="UTF-8" />
    
    
    In for example Glassfish you need to set the <parameter-encoding> entry in webapp's /WEB-INF/sun-web.xml (or, since Glassfish 3.1, glassfish-web.xml), see also this document:
    
    <parameter-encoding default-charset="UTF-8" />
    
    
    URL-decoding POST request parameters is a different story. The webbrowser is supposed to send the charset used in the Content-Type request header. However, most webbrowsers don't do that. They will just use the same character encoding as the one the page with the form was delivered with, i.e. the charset specified in the Content-Type header of the HTTP response or in the <meta> tag. Only Microsoft Internet Explorer will send the character encoding in the request header when you specify it in the accept-charset attribute of the HTML form. However, this implementation is broken in certain circumstances; e.g. when IE on Windows says "ISO-8859-1", it actually means CP-1252! You should really avoid relying on it. Just let it go and set the encoding yourself.

    You can solve this by setting the desired character encoding on the ServletRequest object yourself. An easy solution is to implement a Filter for this which is mapped on a url-pattern of /* and which basically contains only the following lines in the doFilter() method:
    
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding("UTF-8");
    }
    chain.doFilter(request, response);
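
    For completeness, a fully worked out filter class could then look something like the sketch below (the class name is made up); map it in web.xml on the url-pattern /* as mentioned above:

    package test;

    import java.io.IOException;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class CharacterEncodingFilter implements Filter {

        public void init(FilterConfig config) throws ServletException {
            // Nothing to initialize.
        }

        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            // Force UTF-8 for URL-decoding of the request parameters, unless the
            // request itself already specifies a character encoding.
            if (request.getCharacterEncoding() == null) {
                request.setCharacterEncoding("UTF-8");
            }
            chain.doFilter(request, response);
        }

        public void destroy() {
            // Nothing to clean up.
        }

    }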
    
    
    Note: setting the request character encoding the above way is not necessary when you're using Facelets instead of JSP, as it already defaults to UTF-8. It's also not necessary when you're using Glassfish, as the <parameter-encoding> entry also takes care of this.

    Here's a test snippet which demonstrates what exactly happens behind the scenes when it all fails:
    package test;
    
    import java.net.URLDecoder;
    import java.net.URLEncoder;
    
    public class Test {
    
        public static void main(String... args) throws Exception {
            String input = "日本語";
            System.out.println("Original input string from client: " + input);
    
            String encoded = URLEncoder.encode(input, "UTF-8");
            System.out.println("URL-encoded by client with UTF-8: " + encoded);
    
            String incorrectDecoded = URLDecoder.decode(encoded, "ISO-8859-1");
            System.out.println("Then URL-decoded by server with ISO-8859-1: " + incorrectDecoded);
    
            String correctDecoded = URLDecoder.decode(encoded, "UTF-8");
            System.out.println("Server should URL-decode with UTF-8: " + correctDecoded);
        }
    
    }
    
    Original input string from client: 日本語
    URL-encoded by client with UTF-8: %E6%97%A5%E6%9C%AC%E8%AA%9E
    Then URL-decoded by server with ISO-8859-1: æ—¥æœ¬èªž
    Server should URL-decode with UTF-8: 日本語

  4. JSP/Servlet response: during response processing the average application server will by default use ISO 8859-1 to encode the response outputstream. You need to force the response encoding to UTF-8 yourself. If you use JSP as the view technology, then adding the following line to the top (yes, as the first line) of your JSP ought to be sufficient:
    
    <%@ page pageEncoding="UTF-8" %>
    
    
    This will set the response outputstream encoding to UTF-8 and set the HTTP response content-type header to text/html;charset=UTF-8. To apply this setting globally so that you don't need to edit every individual JSP, you can also add the following entry to your /WEB-INF/web.xml file:
    
    <jsp-config>
        <jsp-property-group>
            <url-pattern>*.jsp</url-pattern>
            <page-encoding>UTF-8</page-encoding>
        </jsp-property-group>
    </jsp-config>
    
    
    Note: this is not necessary when you're using Facelets instead of JSP as it defaults to UTF-8 already.

    The HTTP content-type header actually does nothing at the server side, but it instructs the webbrowser at the client side which character encoding to use for display. The webbrowser must give it precedence over any HTML meta content-type header, as specified by the w3 HTML spec chapter 5.2.2. In other words, the HTML meta content-type header is totally ignored when the page is served over HTTP. But when the enduser saves the page locally and views it from the local disk file system, then the meta content-type header will be used. To cover that case as well, you should add the following HTML meta content-type header to your JSP anyway:

    
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    
    
    Note: whether you write lowercase utf-8 or uppercase UTF-8 doesn't really matter; charset names are case insensitive.

    If you (ab)use an HttpServlet instead of a JSP to generate HTML content using out.write(), out.print() statements and so on, then you need to set the encoding on the ServletResponse object itself inside the servlet method block, before you call getWriter() or getOutputStream() on it:
    
    response.setCharacterEncoding("UTF-8");
    
    
    You could also do that in the aforementioned Filter, but this can lead to problems if you have servlets in your webapplication which use the response for something other than generating HTML content. After all, there shouldn't be any need for this: use JSP to generate HTML content, that's what it is for. But when generating plain text content other than HTML, such as XML, CSV, JSON, etcetera, you do need to set the response character encoding the above way.
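
    For example, here's a minimal sketch of a servlet returning plain text (the class name and the content are made up for the example):

    package test;

    import java.io.IOException;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class TextServlet extends HttpServlet {

        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            response.setContentType("text/plain");
            // Must be called before getWriter(), otherwise ISO 8859-1 will be used.
            response.setCharacterEncoding("UTF-8");
            response.getWriter().write("Český 日本語");
        }

    }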

  5. JSF/Facelets request/response: JSF/Facelets already uses UTF-8 by default for all HTTP requests and responses. You only need to configure the server to use the same encoding, as described in the JSP/Servlet request section.

    Only when you're using a custom filter or a 3rd party component library which calls request.getParameter() or any other method which implicitly needs to parse the request body in order to extract the data, there's a chance that it's too late for JSF/Facelets to set the UTF-8 character encoding before the request body has been parsed for the first time. PrimeFaces 3.2 for example is known to do that. In that case, you'd still need a custom filter as described in the JSP/Servlet request section.

  6. Databases: the database also has to take the character encoding into account. In general you need to specify it in the CREATE statements, if necessary also in ALTER statements, and in some cases you also need to specify it in the connection string or the connection parameters. The exact syntax depends on the database used; it's best to refer to its documentation using the keywords "character set". In for example MySQL you can use the CHARACTER SET clause as pointed out here:
    
    CREATE DATABASE db_name CHARACTER SET utf8;
    CREATE TABLE tbl_name (...) CHARACTER SET utf8;
    
    
    Usually the database's JDBC driver is smart enough to use the encoding specified for the database and/or table when querying and storing the data. But in the worst case you have to specify the character encoding in the connection string as well. This is true for the MySQL JDBC driver, because it does not use the database-specified encoding but the client-specified encoding. How to configure this should be answered in the JDBC driver documentation; in for example MySQL you can read about it here:
    
    jdbc:mysql://localhost:3306/db_name?useUnicode=true&characterEncoding=UTF-8
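
    To illustrate, here's a minimal sketch which stores a Unicode value through JDBC using such a connection string (the table and column names are made up; the MySQL JDBC driver is assumed to be on the classpath):

    package test;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class JdbcTest {

        public static void main(String... args) throws Exception {
            // Not necessary anymore with a JDBC 4 capable driver.
            Class.forName("com.mysql.jdbc.Driver");

            String url = "jdbc:mysql://localhost:3306/db_name?useUnicode=true&characterEncoding=UTF-8";
            Connection connection = DriverManager.getConnection(url, "user", "pass");
            try {
                PreparedStatement statement = connection.prepareStatement(
                    "INSERT INTO tbl_name (name) VALUES (?)");
                statement.setString(1, "日本語"); // Should arrive in the database as-is, not as "???".
                statement.executeUpdate();
                statement.close();
            } finally {
                connection.close();
            }
        }

    }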
    
    

  7. Text files: when reading/writing a text file with Unicode characters using a Reader/Writer, you need java.io.InputStreamReader/java.io.OutputStreamWriter, in which you can specify the UTF-8 encoding in one of the constructors:
    
    Reader reader = new InputStreamReader(new FileInputStream("c:/file.txt"), "UTF-8");
    Writer writer = new OutputStreamWriter(new FileOutputStream("c:/file.txt"), "UTF-8");
    
    

    Otherwise the operating system default encoding will be used.
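
    A complete example which writes a line with Unicode characters and then reads it back, using UTF-8 both times (the file name is just an example):

    package test;

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class TextFileTest {

        public static void main(String... args) throws Exception {
            // Write a string with Unicode characters using UTF-8.
            Writer writer = new OutputStreamWriter(new FileOutputStream("c:/file.txt"), "UTF-8");
            writer.write("Český 日本語");
            writer.close();

            // Read it back using the same encoding.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("c:/file.txt"), "UTF-8"));
            System.out.println(reader.readLine()); // Český 日本語
            reader.close();
        }

    }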


  8. Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:
    
    byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
    byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
    String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes becomes "?".
    String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.
    
    

    Otherwise the platform default encoding will be used, which can be that of the underlying operating system or even of the IDE(!).

Summarized: wherever you have the possibility to specify the character encoding, you should make use of it and set it to UTF-8.


References

Here are some very useful references.

Last but not least, as Java supports and uses Unicode all the time, also internally in the compiler, it's cool to know that it's possible to have a class like this in Java:

\u0070\u0075\u0062\u006C\u0069\u0063\u0020\u0020\u0020\u0063\u006C\u0061\u0073\u0073\u0020\u0020
\u0055\u006E\u0069\u0063\u006F\u0064\u0065\u0020\u007B\u0020\u0070\u0075\u0062\u006C\u0069\u0063
\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063\u0020\u0020\u0076\u006F\u0069\u0064\u0020\u0020
\u006D\u0061\u0069\u006E\u0020\u0028\u0020\u0053\u0074\u0072\u0069\u006E\u0067\u0020\u005B\u005D
\u0061\u0072\u0067\u0073\u0020\u0029\u0020\u007B\u0020\u0053\u0079\u0073\u0074\u0065\u006D\u002E
\u006F\u0075\u0074\u002E\u0070\u0072\u0069\u006E\u0074\u006C\u006E\u0028\u0022\u0049\u0022\u002B
\u0022\u0020\u2665\u0020\u0055\u006E\u0069\u0063\u006F\u0064\u0065\u0022\u0029\u003B\u007D\u007D

Save it unchanged as Unicode.java (without package), compile it and run it ;)


Copyright - None of this article may be taken over without explicit authorisation.

(C) May 2009, BalusC