Showing posts with label Tomcat. Show all posts

Monday, July 4, 2016

Integrating Tomcat 8.5.x and TomEE 7.x in Eclipse

Are you also seeing the error below when trying to integrate Tomcat 8.5.x or TomEE 7.x using the Tomcat v8.0 Server plugin in Eclipse?

For searchbots, the error says: The Apache Tomcat installation at this directory is version 8.5.3. A Tomcat 8.0 installation is expected. This error still occurs in the current Eclipse Neon release. Perhaps they'll fix it in Neon SR1 by adding a new Tomcat v8.5 Server plugin, but for now and for older Eclipse versions a workaround is needed.

The Eclipse built-in Tomcat server plugin basically detects the server version based on the server.info property in the org/apache/catalina/util/ServerInfo.properties file inside Tomcat's /lib/catalina.jar, which looks like below in case of Tomcat 8.5.3:

server.info=Apache Tomcat/8.5.3
server.number=8.5.3.0
server.built=Jun 9 2016 11:16:29 UTC

All we need to do is edit the version in the server.info property so that it starts with 8.0. Any ZIP or JAR aware tool should be able to edit it on the fly. I'm myself using WinRAR for the job.

server.info=Apache Tomcat/8.0.8.5.3
server.number=8.5.3.0
server.built=Jun 9 2016 11:16:29 UTC

Finally, it works.

INFO: Starting Servlet Engine: Apache Tomcat/8.0.8.5.3

The same procedure applies to TomEE 7.x which is based on Tomcat 8.5.x.
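In case a GUI archive tool is not at hand, the on-the-fly edit can also be scripted with the JDK's zip filesystem support. Below is a minimal sketch; the patchLine helper and the hardcoded 8.0. prefix are mine, not part of any existing tool.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PatchServerInfo {

    // Prefix the version with "8.0." so the plugin's version check passes.
    public static String patchLine(String line) {
        return line.startsWith("server.info=Apache Tomcat/8.5")
                ? line.replace("Apache Tomcat/8.5", "Apache Tomcat/8.0.8.5")
                : line;
    }

    public static void main(String[] args) throws IOException {
        Path jar = Paths.get(args[0]); // e.g. apache-tomcat-8.5.3/lib/catalina.jar
        URI uri = URI.create("jar:" + jar.toUri());

        // Mount the JAR as a zip filesystem and rewrite the properties file in place.
        try (FileSystem zipfs = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
            Path props = zipfs.getPath("org/apache/catalina/util/ServerInfo.properties");
            List<String> patched = new ArrayList<>();
            for (String line : Files.readAllLines(props)) {
                patched.add(patchLine(line));
            }
            Files.write(props, patched);
        }
    }
}
```

Run it with the path to catalina.jar as the sole argument while the server is shut down.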

The difference between Tomcat 8.0 and 8.5 is the integration of JASPIC, which is the first step towards standardizing Java EE authentication as part of JSR 375. You probably already know that the ways to configure Java EE container managed authentication are totally scattered across different servers, making third party libraries such as Shiro and Spring Security more attractive. Each server needed its own way of configuring "realms" or "identity stores" to manage the database of users, passwords and roles. With JSR 375 this will be unified using annotations and CDI, provided via the standard javax.security.* API. See also a.o. the question Java EE authentication: how to capture login event?

Friday, October 30, 2015

The empty String madness

Introduction

When we submit an HTML form with empty input fields which are bound to non-primitive bean properties, we'd rather keep them null instead of having them polluted with empty strings or zeroes. This is very significant as to validation constraints such as @NotNull in Bean Validation and NOT NULL in relational databases. Across the years and JSF/EL versions this turned out to be troublesome, as the implementations didn't agree with each other. I sometimes even got momentarily confused myself as to when it would work and when not. I can imagine that a lot of other JSF developers have the same feeling. So let's do some digging in history and list all the facts and milestones in one place for the best overview, along with a useful summary table with the correct solutions.

JSF 1.0/1.1 (2004-2006)

Due to the nature of HTTP, empty input fields arrive as empty strings instead of null. The underlying servlet request.getParameter(name) call returns an empty string for empty input fields. There's nothing we can do against that; it's just how HTTP and Servlets work. A value of null represents the complete absence of the request parameter, which is also very significant (e.g. the servlet could this way check whether a certain form button was pressed or not, irrespective of its value/label, which could be i18n'ed). So we can't fix this on the HTTP/Servlet side and have to do it on the MVC framework's side. To avoid the model being polluted with empty strings, you would in JSF 1.0/1.1 need to create a custom Converter like below, which you explicitly register on the inputs tied to a java.lang.String typed model value.

import javax.faces.component.EditableValueHolder;
import javax.faces.component.UIComponent;
import javax.faces.context.FacesContext;
import javax.faces.convert.Converter;

public class EmptyToNullStringConverter implements Converter {

    @Override
    public Object getAsObject(FacesContext facesContext, UIComponent component, String submittedValue) {
        if (submittedValue == null || submittedValue.isEmpty()) {
            if (component instanceof EditableValueHolder) {
                ((EditableValueHolder) component).setSubmittedValue(null);
            }

            return null;
        }

        return submittedValue;
    }

    @Override
    public String getAsString(FacesContext facesContext, UIComponent component, Object modelValue) {
        return (modelValue == null) ? "" : modelValue.toString();
    }

}

Which is registered in faces-config.xml as below:

<converter>
    <converter-id>emptyToNull</converter-id>
    <converter-class>com.example.EmptyToNullStringConverter</converter-class>
</converter>

And used as below:

<h:inputText value="#{bean.string1}" converter="emptyToNull" />
<h:inputText value="#{bean.string2}" converter="emptyToNull" />
<h:inputText value="#{bean.string3}" converter="emptyToNull" />

The converter-for-class was not supported on java.lang.String until JSF 1.2.

Non-primitive numbers weren't a problem in JSF 1.x itself; they only became a problem in specific server/EL versions. See later.

JSF 1.2 (2006-2009)

Since JSF 1.2, the converter-for-class finally supports java.lang.String (see also spec issue 131). So you can simply register the above converter as below and it'll get automatically applied on all inputs tied to java.lang.String typed model value.

<converter>
    <converter-for-class>java.lang.String</converter-for-class>
    <converter-class>com.example.EmptyToNullStringConverter</converter-class>
</converter>
<h:inputText value="#{bean.string1}" />
<h:inputText value="#{bean.string2}" />
<h:inputText value="#{bean.string3}" />

Tomcat 6.0.16 - 7.0.x (2007-2009)

Someone reported Tomcat issue 42385 wherein EL failed to set an empty String value representing an integer into a primitive int bean property. This uncovered a longtime RI bug which violated section 1.18.3 of the EL 2.1 specification.

1.18.3 Coerce A to Number type N

  • If A is null or "", return 0.
  • ...

In other words, when the model type is a number and the submitted value is an empty string or null, then EL should coerce all integer based numbers int, long, Integer, Long and BigInteger to 0 (zero) before setting the model value. The same applies to decimal based numbers float, double, Float, Double and BigDecimal, which should then be coerced to 0.0. This was done correctly in neither the Oracle (Sun) nor the Apache EL implementation at the time. They both just set null into the number/decimal typed model value, and only Apache EL failed on primitives whereas Oracle EL properly set the default value of zero (hence that Tomcat issue report).
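The mandated coercion can be sketched in plain Java as follows. This is a hypothetical illustration of the spec rule, not the actual Apache or Oracle EL code:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class CoerceDemo {

    // Illustration of EL 2.1 section 1.18.3: null or "" coerces to zero,
    // regardless of whether the target is a primitive or a wrapper type.
    public static Object coerceToNumber(Object value, Class<?> targetType) {
        if (value == null || "".equals(value)) {
            if (targetType == int.class || targetType == Integer.class) return 0;
            if (targetType == long.class || targetType == Long.class) return 0L;
            if (targetType == float.class || targetType == Float.class) return 0.0f;
            if (targetType == double.class || targetType == Double.class) return 0.0d;
            if (targetType == BigInteger.class) return BigInteger.ZERO;
            if (targetType == BigDecimal.class) return BigDecimal.ZERO;
        }
        if (value instanceof String && (targetType == int.class || targetType == Integer.class)) {
            return Integer.valueOf((String) value);
        }
        // remaining conversions omitted for brevity
        return value;
    }
}
```

Note how an empty submitted value for an Integer property ends up as 0 rather than null, which is exactly the complaint discussed next.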

Since Tomcat 6.0.16, Apache EL started setting all number/decimal typed model values to 0 and 0.0 respectively. That's okay for primitive types like int, long, float and double, but it's absolutely not okay for non-primitive types like Integer, Long, Float, Double, BigInteger and BigDecimal. They should stay null when the submitted value is empty or null. The same applies to Boolean fields, which got a default value of false, and Character fields, which got a default value of \u0000.

So I created JSP spec issue 184 for that (EL was then still part of JSP). This coercion doesn't make sense for non-primitives. The issue got a lot of recognition and votes. After complaints from JSF users, since Tomcat 6.0.17 a new VM argument was added to disable this Apache EL behavior on non-primitive number/decimal types.

-Dorg.apache.el.parser.COERCE_TO_ZERO=false

It became the most famous Tomcat-specific setting among JSF developers. It even worked in JBoss and all other servers using the Apache EL parser (WebSphere among others). It could even be set programmatically with the help of a ServletContextListener.

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

@WebListener
public class Config implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent event) {
        System.setProperty("org.apache.el.parser.COERCE_TO_ZERO", "false");
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
        // NOOP.
    }

}

JSF 2.x (2009-current)

To reduce the EmptyToNullStringConverter boilerplate, JSF 2.0 introduced a new context param with a rather long name which should achieve exactly the desired behavior of interpreting empty string submitted values as null.

<context-param>
    <param-name>javax.faces.INTERPRET_EMPTY_STRING_SUBMITTED_VALUES_AS_NULL</param-name>
    <param-value>true</param-value>
</context-param>

To avoid non-primitive number/decimal typed model values being set with zeroes, on Tomcat and clones you still need the VM argument for the Apache EL parser as explained in the previous section. See also a.o. the Communication in JSF 2.0 article here.

EL 3.0 (2013-current)

And then EL 3.0 was introduced as part of Java EE 7 (which also covers JSF 2.2). With this version, the aforementioned JSP spec issue 184 was finally fixed. The EL specification no longer requires coercing non-primitive number/decimal types to zero. The Apache EL parser was fixed in this regard. -Dorg.apache.el.parser.COERCE_TO_ZERO=false is now the default behavior and the VM argument became superfluous.

However, the EL guys went a bit overboard with fixing issue 184. They also treated java.lang.String the same way as a primitive! See also sections 1.23.1 and 1.23.2 of the EL 3.0 specification (emphasis mine):

1.23.1 To Coerce a Value X to Type Y

  • If X is null and Y is not a primitive type and also not a String, return null.
  • ...

1.23.2 Coerce A to String

  • If A is null: return “”
  • ...

They didn't seem to realize that coercion works in two directions: when performing a "get" and when performing a "set". Coercion from null string to empty string definitely makes sense when invoking the getter (you don't want to see "null" being printed all over the place in the HTML output, right?). It just doesn't make sense when invoking the setter (as the model would be polluted with empty strings all over the place).
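The asymmetry can be illustrated in plain Java. This is a hypothetical sketch of the two directions, not actual EL implementation code:

```java
public class CoercionDirectionDemo {

    // Output direction: model -> view. Rendering null as "" is desirable,
    // otherwise "null" would literally appear in the HTML output.
    public static String render(String modelValue) {
        return (modelValue == null) ? "" : modelValue;
    }

    // Input direction: view -> model. Applying the same EL 3.0 rule here
    // undoes JSF's empty-string-to-null conversion right before the setter.
    public static String applyEl30SetCoercion(String submittedValue) {
        return (submittedValue == null) ? "" : submittedValue;
    }

    public static void main(String[] args) {
        String submitted = null; // JSF already interpreted "" as null, as instructed
        String storedInModel = applyEl30SetCoercion(submitted);
        System.out.println("'" + storedInModel + "'"); // model is polluted with "" again
    }
}
```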

And suddenly, the javax.faces.INTERPRET_EMPTY_STRING_SUBMITTED_VALUES_AS_NULL didn't have any effect anymore. Even when JSF changes the empty string submitted value to null as instructed, EL 3.0 will afterwards coerce the null string back to empty string again right before invoking the model value setter. This was first noticeable in Oracle EL (WildFly, GlassFish, etc) and only later in Apache EL (see next chapter). This was discussed in JSF spec issue 1203 and JSF issue 3071, and finally EL spec issue 18 was created to point out this mistake in EL 3.0.

Until they fix it, this can be worked around with a custom ELResolver for the common property type java.lang.String like below, which utilizes the ELResolver#convertToType() method newly introduced in EL 3.0. The remainder of the methods is not relevant.

import java.beans.FeatureDescriptor;
import java.util.Iterator;

import javax.el.ELContext;
import javax.el.ELResolver;

public class EmptyToNullStringELResolver extends ELResolver {

    @Override
    public Class<?> getCommonPropertyType(ELContext context, Object base) {
        return String.class;
    }

    @Override
    public Object convertToType(ELContext context, Object value, Class<?> targetType) {
        if (value == null && targetType == String.class) {
            context.setPropertyResolved(true);
        }

        return value;
    }

    @Override
    public Iterator<FeatureDescriptor> getFeatureDescriptors(ELContext context, Object base) {
        return null;
    }

    @Override
    public Class<?> getType(ELContext context, Object base, Object property) {
        return null;
    }

    @Override
    public Object getValue(ELContext context, Object base, Object property) {
        return null;
    }

    @Override
    public boolean isReadOnly(ELContext context, Object base, Object property) {
        return true;
    }

    @Override
    public void setValue(ELContext context, Object base, Object property, Object value) {
        // NOOP.
    }

}

Which is registered in faces-config.xml as below:

<application>
    <el-resolver>com.example.EmptyToNullStringELResolver</el-resolver>
</application>

This was finally fixed in Oracle EL 3.0.1-b05 (July 2014). It is shipped as part of a.o. GlassFish 4.1 and WildFly 8.2, so the above custom ELResolver is unnecessary on those servers. Do note that you still need to keep The Context Param With The Long Name in EL 3.0 regardless of the fix and the custom ELResolver!

Tomcat 8.0.7 - 8.0.15 (2014)

Apache EL 3.0 worked flawlessly until someone reported Tomcat issue 56522, complaining that it didn't comply with the new EL 3.0 requirement of coercing null string to empty string, even though that new requirement didn't make sense. So since Tomcat 8.0.7, Apache EL also suffered from this EL 3.0 problem of unnecessarily coercing null string to empty string while setting the model value. Moreover, the above EmptyToNullStringELResolver workaround still failed in all Tomcat versions until 8.0.15, because Tomcat didn't take any custom ELResolver into account during this coercion. See also Tomcat issue 57309. This was fixed in Tomcat 8.0.16.

If upgrading to at least Tomcat 8.0.16 in order to utilize the EmptyToNullStringELResolver is not an option, then the only way to get it to work is to replace Apache EL with Oracle EL in Tomcat-targeted JSF web applications. This can be achieved by dropping the current latest release in the webapp's /WEB-INF/lib (which is javax.el-3.0.1-b08.jar at the time of writing) and adding the below context parameter to web.xml to tell Mojarra to use that EL implementation instead:

<context-param>
    <param-name>com.sun.faces.expressionFactory</param-name>
    <param-value>com.sun.el.ExpressionFactoryImpl</param-value>
</context-param>

Or when you're using MyFaces:

<context-param>
    <param-name>org.apache.myfaces.EXPRESSION_FACTORY</param-name>
    <param-value>com.sun.el.ExpressionFactoryImpl</param-value>
</context-param>

Of course, this is also a good alternative to the custom EmptyToNullStringELResolver in its entirety. Also here, you still need to keep The Context Param With The Long Name.

Summary

Here's a summary table which should help you figure out what to do in order to keep non-primitive bean properties null when the submitted value is empty or null (so, to avoid polluting the model with empty strings or zeroes all over the place).

Note: Tomcat and JBoss use Apache EL, and GlassFish and WildFly use Oracle EL. Other servers (mainly the closed source ones such as WebSphere, WebLogic, etc) are not covered as I can't tell the exact versions being affected, but generally the same rules apply depending on the EL implementation being used.

        | Tomcat                                                                   | JBoss AS                    | WildFly            | GlassFish
JSF     | 5.5.x-6.0.15 | 6.0.16 | 6.0.17+ | 7.0.x | 8.0.0-6 | 8.0.7-15 | 8.0.16+ | 4.x/5.0 | 5.1-2 | 6.x/7.x | 8.0-1 | 8.2/9.0+ | 3.x | 4.0   | 4.1+
1.0-1.1 | MC           | UT     | MC,CZ   | MC,CZ | MC      | MC,UE    | MC,ER   | MC      | MC,CZ | MC,CZ   | MC,ER | MC       | MC  | MC,ER | MC
1.2     | AC           | UT     | AC,CZ   | AC,CZ | AC      | AC,UE    | AC,ER   | AC      | AC,CZ | AC,CZ   | AC,ER | AC       | AC  | AC,ER | AC
2.0-2.1 | JF           | UT     | JF,CZ   | JF,CZ | JF      | JF,UE    | JF,ER   | JF      | JF,CZ | JF,CZ   | JF,ER | JF       | JF  | JF,ER | JF
2.2     | n/a          | n/a    | n/a     | JF,CZ | JF      | JF,UE    | JF,ER   | n/a     | n/a   | JF,CZ   | JF,ER | JF       | JF  | JF,ER | JF

  • MC: manually register EmptyToNullStringConverter all over the place via <h:inputXxx converter>.
  • AC: automatically register EmptyToNullStringConverter on the java.lang.String class.
  • UT: upgrade Tomcat to at least 6.0.17; version 6.0.16 introduced the broken behavior on non-primitive number/decimal types and the VM argument was only added in 6.0.17.
  • CZ: add the -Dorg.apache.el.parser.COERCE_TO_ZERO=false VM argument.
  • JF: add the javax.faces.INTERPRET_EMPTY_STRING_SUBMITTED_VALUES_AS_NULL=true context param.
  • ER: register EmptyToNullStringELResolver, or alternatively, just do UE.
  • UE: migrate/upgrade to the Oracle EL implementation version 3.0.1-b05 or newer.
  • n/a: this JSF version is not supported on this server anyway.

Monday, October 14, 2013

How to install CDI in Tomcat?

Introduction

JSF is moving towards CDI for bean management. Since JSF 2.2, as part of Java EE 7, there's the new CDI compatible @ViewScoped and there's the CDI-only @FlowScoped which doesn't have an equivalent for @ManagedBean. Since JSF 2.3, as part of Java EE 8, the @ManagedBean and associated scopes from javax.faces.bean package are deprecated in favor of CDI.

Now, there are some JSF users using Tomcat, which, being a barebones JSP/Servlet container, does not support CDI out of the box (nor JSF, you know; you had to supply the JSF JARs yourself). If you intend to use CDI on Tomcat, the most straightforward step would be to upgrade to TomEE. It's exactly like Tomcat, but with among others OpenWebBeans, Apache's CDI implementation, on top of it. TomEE installs as easily as Tomcat: just download the ZIP and unzip it. TomEE integrates in Eclipse as easily as Tomcat: just use the existing Tomcat server plugin. As a bonus, TomEE also comes with EJB and JPA, making services and DB interaction a breeze.

However, perhaps you just have no control over upgrading the server. In that case, you'd like to supply CDI along with the webapp itself, in the form of some JARs and additional configuration entries/files. So far, there are two major CDI implementations: Weld (the reference implementation) and OpenWebBeans. We'll treat them both in this article.

Install Weld in Tomcat 10+ (last updated: 13 March 2021)

Tomcat 10 is the first version to be "Jakartified", i.e. it uses the jakarta.* package instead of the javax.* package for the API classes. Weld 4 was in turn the first Weld version to be Jakartified. Perform the following steps:

  1. Drop weld-servlet-shaded.jar in webapp's /WEB-INF/lib. In case you're using Maven, this is the coordinate:
    <dependency>
        <groupId>org.jboss.weld.servlet</groupId>
        <artifactId>weld-servlet-shaded</artifactId>
        <version>4.0.0.Final</version>
    </dependency>
    
  2. Create /META-INF/context.xml file in webapp's web content with following content (or, if you already have one, add just the <Resource> entry to it):
    <Context>
        <Resource name="BeanManager" 
            auth="Container"
            type="jakarta.enterprise.inject.spi.BeanManager"
            factory="org.jboss.weld.resources.ManagerObjectFactory" />
    </Context>
    
    This will register Weld's BeanManager factory in Tomcat's JNDI. This cannot be performed programmatically by Weld because Tomcat's JNDI is strictly read-only. This step is not necessary for Mojarra and OmniFaces because both libraries are able to find it in ServletContext instead. However, there may be other libraries which still expect to find BeanManager in JNDI, so you'd then best keep this configuration file anyway for those libraries.
  3. Create an (empty) /WEB-INF/beans.xml file (no, not in /META-INF! That's only for inside JAR files such as OmniFaces).
  4. Optionally: if you also want to use JSR-303 Bean Validation (@NotNull and friends), then drop jakarta.validation-api.jar and hibernate-validator.jar in webapp's /WEB-INF/lib, or use below Maven coordinate:
    <dependency>
        <groupId>org.hibernate.validator</groupId>
        <artifactId>hibernate-validator</artifactId>
        <version>7.0.1.Final</version>
    </dependency>
    

Now your webapp is ready for CDI in Tomcat 10+ via Weld!

Install Weld in Tomcat 9- (last updated: 13 March 2021)

The difference from Tomcat 10+ is that Tomcat 9- still uses the old javax.* package instead of the new jakarta.* package. This is not compatible with Weld 4+; you need Weld 3 instead. Perform the following steps:

  1. Drop weld-servlet-shaded.jar in webapp's /WEB-INF/lib. In case you're using Maven, this is the coordinate:
    <dependency>
        <groupId>org.jboss.weld.servlet</groupId>
        <artifactId>weld-servlet-shaded</artifactId>
        <version>3.1.6.Final</version>
    </dependency>
    
  2. Create /META-INF/context.xml file in webapp's web content with following content (or, if you already have one, add just the <Resource> entry to it):
    <Context>
        <Resource name="BeanManager" 
            auth="Container"
            type="javax.enterprise.inject.spi.BeanManager"
            factory="org.jboss.weld.resources.ManagerObjectFactory" />
    </Context>
    
    This will register Weld's BeanManager factory in Tomcat's JNDI. This cannot be performed programmatically by Weld because Tomcat's JNDI is strictly read-only. This step is not necessary if you're targeting at least Mojarra 2.2.11 and/or OmniFaces 2.4 or newer; both are able to find it in the ServletContext instead. However, there may be other libraries which still expect to find the BeanManager in JNDI, so you'd best keep this configuration file anyway for those libraries.
  3. Create an (empty) /WEB-INF/beans.xml file (no, not in /META-INF! That's only for inside JAR files such as OmniFaces).
  4. Only if your web.xml is declared conforming to Servlet version 4.0 instead of 3.1 do you also need to put the @javax.faces.annotation.FacesConfig annotation on an arbitrary CDI managed bean somewhere in the project (usually the one representing the "application-wide config" would be OK).
    package com.example;
    
    import javax.enterprise.context.ApplicationScoped;
    import javax.faces.annotation.FacesConfig;
    
    @FacesConfig
    @ApplicationScoped
    public class Config {
    
    }
    It is indeed utterly unnecessary, but it is what it is.
  5. Optionally: if you also want to use JSR-303 Bean Validation (@NotNull and friends), then drop jakarta.validation-api.jar and hibernate-validator.jar in webapp's /WEB-INF/lib, or use below Maven coordinate:
    <dependency>
        <groupId>org.hibernate.validator</groupId>
        <artifactId>hibernate-validator</artifactId>
        <version>6.2.0.Final</version>
    </dependency>
    

Now your webapp is ready for CDI in Tomcat 9- via Weld! Note that in previous Weld versions you needed to register a <listener> in web.xml. This is not necessary anymore with at least Weld 2.2.0 on a "recent" Tomcat 9- version!

Install OpenWebBeans in Tomcat 9- (last updated: 3 January 2021)

The difference from Tomcat 10+ is that Tomcat 9- still uses the old javax.* package instead of the new jakarta.* package. Perform the following steps:

  1. This is easiest with Maven as OpenWebBeans has quite some sub-dependencies. Here are the coordinates (do note that it also includes JSR-303 Bean Validation API as without it OpenWebBeans would unexpectedly fail deployment with java.lang.TypeNotPresentException: Type javax.validation.ConstraintViolation not present caused by java.lang.ClassNotFoundException: javax.validation.ConstraintViolation):
    <dependency>
        <groupId>javax.enterprise</groupId>
        <artifactId>cdi-api</artifactId>
        <version>2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.openwebbeans</groupId>
        <artifactId>openwebbeans-jsf</artifactId>
        <version>2.0.20</version>
    </dependency>
    <dependency>
        <groupId>jakarta.validation</groupId>
        <artifactId>validation-api</artifactId>
        <version>2.0.2</version>
    </dependency>
    
  2. Create /META-INF/context.xml file in webapp's web content with following content (or, if you already have one, add just the <Resource> entry to it):
    <Context>
        <Resource name="BeanManager" 
            auth="Container"
            type="javax.enterprise.inject.spi.BeanManager"
            factory="org.apache.webbeans.container.ManagerObjectFactory" />
    </Context>
    
    This will register OpenWebBeans' BeanManager factory in Tomcat's JNDI. This cannot be performed programmatically by OpenWebBeans because Tomcat's JNDI is strictly read-only.
  3. Add the below <listener> entry to webapp's web.xml:
    <listener>
        <listener-class>org.apache.webbeans.servlet.WebBeansConfigurationListener</listener-class>
    </listener>
    
    This will make sure that OpenWebBeans initializes before OmniFaces, otherwise you may face a java.lang.IllegalStateException: It's not allowed to call getBeans(Type, Annotation...) before AfterBeanDiscovery.
  4. Create an (empty) /WEB-INF/beans.xml file (no, not in /META-INF! That's only for JARs such as OmniFaces).
  5. Optionally: if you also want to use JSR-303 Bean Validation (@NotNull and friends), add the below Maven coordinate:
    <dependency>
        <groupId>org.hibernate.validator</groupId>
        <artifactId>hibernate-validator</artifactId>
        <version>6.2.0.Final</version>
    </dependency>
    

Now your webapp is ready for CDI in Tomcat 9- via OpenWebBeans!

Thursday, September 10, 2009

Webapplication performance tips and tricks

Introduction

Yahoo has a great performance analysis tool in the form of a Firefox addon: YSlow (yes, you need to install the also-great Firebug addon first). The YSlow site has already explained all of the best practices in detail here.

Yahoo's explanations are in general clear enough for the average Java EE web application developer, but when YSlow's Server category comes into the picture, Yahoo unfortunately only gives examples based on the Apache HTTP server and PHP, and in a few cases also IIS. In this article I'll "translate" the relevant subcategories into the Java EE approach based on Apache Tomcat 6.0. As a bonus, a few more best practices are added and explained in detail.


Use a Content Delivery Network

This is the first rule of YSlow's Server category. Well, the idea is nice, but in my opinion this is not a "must". Having a secondary domain (no, not a subdomain) for pure static content is a more general practice to gain performance in serving static content. A web browser is namely restricted to a certain maximum number of simultaneous open connections per domain. In older browser versions this is usually limited to 2 and nowadays it ranges around 10-15 connections. This can also be changed using a simple regedit (MSIE) or by editing about:config (Firefox), but those kinds of tweaks are usually only done by the more advanced users with an above average knowledge of the software they use.

So, to give a broader range of visitors a better performance experience, it may be better to have a secondary domain for pure static content only. E.g. onedomain.com for JSP files and anotherdomain.com for CSS/JS/Flash/etc files. Or of course such a CDN as suggested by Yahoo, but again, a CDN for private static data is in my opinion a bit nonsensical. After all, if you respect the performance rules for static content the correct way, then the static content will actually only be requested whenever really needed, making a secondary domain or CDN rather superfluous. Unless, of course, you have a web application which needs to serve a lot of non-layout-related images, such as photography.

For 3rd party public static content, it's however definitely worth the effort to link it to a CDN provided by the third party themselves, if any. For example, jQuery offers several CDN hosts. It's a win-win situation for both your server and the client.


Add an Expires or a Cache-Control Header

This is the second rule of YSlow's Server category. A very good point. The Expires header prevents the browser from re-requesting the same static content (JS/CSS/images/etc) every time, which is only a waste of the available time, connections and bandwidth. When you're serving static content from public web content in Tomcat, the DefaultServlet is responsible for serving the content. It unfortunately does nothing with the Expires header. Although it supports the Last-Modified header, this effectively costs a HEAD request, which is already one connection and request too many when the content actually hasn't changed at all. You can however override the DefaultServlet with your own implementation as outlined here. How to do it effectively is already covered by the earlier FileServlet article on this blog. That servlet is a well suited solution for the second, third as well as the fourth rule of YSlow's Server category.

About the Cache-Control header for dynamic content: the general practice is that we just want to avoid caching of dynamic content, especially pages containing forms or pages in a restricted area. You can do that by adding the following response headers in the base controller Servlet or Filter of your web application:


    ...

    response.setHeader("Cache-Control", "no-cache, no-store, must-revalidate"); // HTTP 1.1.
    response.setHeader("Pragma", "no-cache"); // HTTP 1.0.
    response.setDateHeader("Expires", 0); // Proxies.

    ...

There is a little story behind the no-store and must-revalidate attributes of the Cache-Control header: some web browsers (including Firefox) still cache the page when those attributes are omitted! According to the HTTP specification the no-cache attribute alone should have been sufficient. But OK, now we at least have the 'magic' three headers which should work for all decent web browsers and proxies.


Use Query String with a timestamp to force re-request

The Expires header is useful, but .. with a (too) far-future Expires header, the client won't check for any updates on the static resource anymore until the expiry date has passed, or you clear the browser cache, or you do a hard refresh (CTRL+F5)! A common practice is therefore to append a unique query string to the URL of the static content, denoting the timestamp of the last file modification or the server startup time, so that the browser is forced to re-request it whenever the query string changes.

Determining the last modification time on every request is more expensive than determining the server startup time just once in the application's lifetime, so the latter is generally sufficient. Whenever the server restarts, the browser will send a HEAD request to check if there are any updates. Assuming that your server doesn't restart every minute or so, this doesn't harm that much. Here's an example of how to do it using a ServletContextListener:

package mypackage;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

/**
 * Configure the webapplication context. This is to be placed in the application scope.
 * As far now this example only sets the startup time.
 * @author BalusC
 * @see http://balusc.blogspot.com/2009/09/webapplication-performance-tips-and.html
 */
public class Config implements ServletContextListener {

    // Constants ----------------------------------------------------------------------------------

    private static final String CONFIG_ATTRIBUTE_NAME = "config";

    // Properties ---------------------------------------------------------------------------------

    private long startupTime;

    // Actions ------------------------------------------------------------------------------------

    /**
     * Obtain startup time and put Config itself in the application scope.
     * @see ServletContextListener#contextInitialized(ServletContextEvent)
     */
    public void contextInitialized(ServletContextEvent event) {
        this.startupTime = System.currentTimeMillis() / 1000;
        event.getServletContext().setAttribute(CONFIG_ATTRIBUTE_NAME, this);
    }

    /**
     * @see ServletContextListener#contextDestroyed(ServletContextEvent)
     */
    public void contextDestroyed(ServletContextEvent event) {
        // Nothing to do here.
    }

    // Getters ------------------------------------------------------------------------------------

    /**
     * Returns the startup time associated with this configuration.
     * @return The startup time associated with this configuration.
     */
    public long getStartupTime() {
        return this.startupTime;
    }

}

Just add it as a listener to the web.xml the usual way:


    ...

    <listener>
        <listener-class>mypackage.Config</listener-class>
    </listener>

    ...

Here is an example of how to use it in JSP:


        ...

        <link rel="stylesheet" type="text/css" href="/static/style.css?${config.startupTime}">
        <script type="text/javascript" src="/static/script.js?${config.startupTime}"></script>

        ...

As a side note, if you're using the aforementioned FileServlet as well, then you can in theory postpone the default expire time even more. For example to 1 year (365 days):


    ...

    private static final long DEFAULT_EXPIRE_TIME = 31536000000L; // ..ms = 365 days.

    ...

Back to top

Add LastModified timestamp to CSS background images

Appending a timestamp query string to static CSS files is nice, but .. this doesn't cover the CSS background images! Each of those counts as a separate request. If you don't append a timestamp query string to them as well, then they won't be checked for any updates. How to handle this may differ per environment, so I'll only describe my general approach to give you the idea; you might need to finetune it further to suit your environment. I myself use a batch job based on YUI Compressor (yes, it has a Java API!) to minify all CSS and JS files before deploy. After getting the minified result, a regex is used to find all background images in the CSS source, File#lastModified() is used to get the last modification timestamp of each, and finally the originals are replaced. Here's a basic example of the Minifier; keep in mind that it may need to be modified to suit your environment:

package mypackage;

import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.yahoo.platform.yui.compressor.CssCompressor;

/**
 * The Minifier.
 * @author BalusC
 * @see http://balusc.blogspot.com/2009/09/webapplication-performance-tips-and.html
 */
public class Minifier {

    // Actions ------------------------------------------------------------------------------------

    /**
     * Minify all CSS files on given basePath + cssPath to the given basePath + minPath and append
     * lastmodified timestamps to CSS background images relative to the given basePath.
     * @param basePath The base path of static content.
     * @param cssPath The path of all CSS files, relative to the given basePath.
     * @param minPath The path of all minified CSS files, relative to the given basePath.
     * @throws IOException If something fails at I/O level.
     */
    public static void minifyCss(String basePath, String cssPath, String minPath) throws IOException {
        for (File cssFile : new File(basePath + cssPath).listFiles()) {
            if (cssFile.isFile()) {
                File minFile = new File(basePath + minPath, cssFile.getName());
                minifyCss(basePath, cssFile, minFile);
            }
        }
    }

    /**
     * Minify given cssFile to the given minFile and append lastmodified timestamps to CSS
     * background images relative to the given basePath.
     * @param basePath The base path of static content.
     * @param cssFile The CSS file to be minified.
     * @param minFile The minified CSS file.
     * @throws IOException If something fails at I/O level.
     */
    public static void minifyCss(String basePath, File cssFile, File minFile) throws IOException {
        Reader reader = null;
        Writer writer = null;

        try {
            // Read original CSS file.
            reader = new InputStreamReader(new FileInputStream(cssFile), "UTF-8");

            // Minify original CSS file.
            StringWriter stringWriter = new StringWriter();
            new CssCompressor(reader).compress(stringWriter, -1);
            String line = stringWriter.toString();

            // Find all CSS background images.
            Matcher matcher = Pattern.compile("url\\([\'\"]?([/\\w\\.]*)[\'\"]?\\)").matcher(line);
            Set<String> imagePaths = new HashSet<String>();
            while (matcher.find()) {
                imagePaths.add(matcher.group(1));
            }

            // Append lastmodified timestamps to CSS background images and replace originals.
            for (String imagePath : imagePaths) {
                long lastModified = new File(basePath + imagePath).lastModified() / 1000;
                line = line.replace(imagePath, imagePath + "?" + lastModified);
            }

            // Write minified CSS file.
            writer = new OutputStreamWriter(new FileOutputStream(minFile), "UTF-8");
            writer.write(line);
        } finally {
            close(writer);
            close(reader);
        }

        // Dumb sysout, replace by Logger if needed ;)
        System.out.println("Minifying " + cssFile + " to " + minFile + " succeeded!");
    }

    // Helpers ------------------------------------------------------------------------------------

    /**
     * Silently close given resource. Any IOException will be printed to stdout.
     * This global method can easily be extracted into your own "IOUtil" class, if one doesn't already exist.
     * @param resource Resource to be closed.
     */
    private static void close(Closeable resource) {
        if (resource != null) {
            try {
                resource.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Main method --------------------------------------------------------------------------------
    
    /**
     * Just to demonstrate how your batch job thing should use the Minifier.
     */
    public static void main(String... args) throws Exception {
        String basePath = "C:/Workspace/YourProject/WebContent/WEB-INF";
        String cssPath = "/static/css";
        String minPath = cssPath + "/min";
        Minifier.minifyCss(basePath, cssPath, minPath);
    }
    
}
Back to top

Gzip Components

This is the third rule of YSlow's Server category. Yes, that's also a very good point. Gzip is relatively fast and can save up to 70% of the network bandwidth. For static text content you can just use the aforementioned FileServlet article at this blog. For dynamic text content you'll need to configure the application server so that it uses gzip compression. This is usually explained in the documentation of the application server in question. In case of Apache Tomcat 6.0 you can find it here. You need to extend the <Connector> element in Tomcat/conf/server.xml with a compression attribute set to "on". Here's a basic example (note the last attribute):


    ...

    <Connector
        protocol="HTTP/1.1"
        port="80"
        redirectPort="8443"
        connectionTimeout="20000"
        compression="on" />

    ...

That's all! Restart Tomcat and all dynamic responses will be gzipped. And no, this does not affect the aforementioned FileServlet for static content; you can just keep it as is.
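If you want finer control over what gets compressed, the Tomcat 6.0 HTTP connector also exposes the compressionMinSize and compressableMimeType attributes; a sketch with their documented default values is below (check the connector reference of your Tomcat version before relying on them):

```xml
<Connector
    protocol="HTTP/1.1"
    port="80"
    redirectPort="8443"
    connectionTimeout="20000"
    compression="on"
    compressionMinSize="2048"
    compressableMimeType="text/html,text/xml,text/plain" />
```

This way responses smaller than 2KB and non-text responses (e.g. images, which are already compressed) are left alone.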

Back to top

Configure ETags

This is the fourth rule of YSlow's Server category. Again a good point, and again also covered by the aforementioned FileServlet article at this blog. ETags are not needed for dynamic content, as it is usually not supposed to be cached.

Back to top

Flush the Buffer Early

This is the fifth rule of YSlow's Server category. Well, that's also a good point: flushing the response between </head> and <body>. This is one of the 0.01% of cases in which you can't easily avoid a (cough) scriptlet, and thus its use is more or less forgivable.


        ...

    </head>
    <% response.flushBuffer(); %>
    <body>

        ...

However, in case of Apache Tomcat 6.0 the HTTP connector uses a buffer size of 2KB (2048 bytes) by default, which is configurable using the bufferSize attribute. This is generally more than good enough. The average HTML head with the "default" minimum tags (doctype, html, head, meta content type, meta description, base, favicon, CSS file, JS file and title) already accounts for 1 to 1.5KB in size. In any case, in one of my last webapps I used a slightly modified WhitespaceFilter which removes all whitespace inside the <body> and instantly pre-flushes the stream before the <body>.

Back to top

Use NIO

When your webapplication needs to handle more than around 1,000 concurrent connections, or when your webserver is also used for other purposes than only serving the web, then it's generally better to use non-blocking IO streams instead of blocking IO streams. It scales much better because you don't need one implicitly opened thread per opened IO resource anymore; instead, basically all resources are managed by a single thread. This saves the server a lot of threads, the overhead of managing them, and the rapidly growing performance drop when the number of concurrent threads (HTTP connections) gets high. Performance then no longer depends on the number of available threads, but rather on the amount of available heap memory. It can go up to around 20,000 concurrent connections on a single thread instead of around 5,000 concurrent connections on as many threads.

Most decent servers support NIO, as does Apache Tomcat 6.0 in its HTTP connector. Basically all you need to do is to replace the default protocol attribute of "HTTP/1.1" with "org.apache.coyote.http11.Http11NioProtocol". Some full fledged Java EE application servers, like Sun Glassfish (whose NIO connector implementation is known as "Grizzly"), have this turned on by default.


    ...

    <Connector
        protocol="org.apache.coyote.http11.Http11NioProtocol"
        port="80"
        redirectPort="8443"
        connectionTimeout="20000"
        compression="on" />

    ...

That's basically all! Restart Tomcat and it will now use NIO to handle HTTP connections. Just ensure that you give it enough memory (also in the IDE when developing with it). You can start with 512MB, but 1024MB is better.

Back to top

Copyright - No text of this article may be taken over without explicit authorisation. Only the code is free of copyright. You can copy, change and distribute the code freely. Just mentioning this site should be fair.

(C) September 2009, BalusC

Wednesday, May 6, 2009

Unicode - How to get the characters right?

Introduction

Computers understand only bits and bytes: the binary numeral system of zeros and ones. Humans, on the other hand, understand only characters: the building blocks of the natural languages. So, to handle human-readable characters using a computer (read, write, store, transfer, etcetera), they have to be converted to bytes. One byte is an ordered collection of eight bits (zeros or ones). The characters are only used for pure presentation to humans. Behind any character you see, there is a certain order of bits. For a computer, a character is in fact nothing more or less than a simple graphical picture (a glyph from a font) which has a unique "identifier" in the form of a certain order of bits.

To convert between characters and bytes a computer needs a mapping in which every unique character is associated with unique bytes. This mapping is also called the character encoding. A character encoding basically consists of two parts. One is the character set (charset), which represents all of the unique characters. The other is the numeral representation of each of the characters of the charset. The numeral representation is usually presented to humans in hexadecimal, which is in turn easily "converted" to bytes (both are just numeral systems, only with a different base).

Character Encoding

    Character set           Numeral representation
    (human presentation)    (computer identification)
    A                       0x0041 (01000001)
    B                       0x0042 (01000010)
    C                       0x0043 (01000011)
    D                       0x0044 (01000100)
    E                       0x0045 (01000101)
    F                       0x0046 (01000110)
    G                       0x0047 (01000111)
    H                       0x0048 (01001000)
    I                       0x0049 (01001001)
    J                       0x004A (01001010)
    K                       0x004B (01001011)
    L                       0x004C (01001100)
    M                       0x004D (01001101)
    N                       0x004E (01001110)
    O                       0x004F (01001111)
    ...                     ...
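This mapping is directly observable in Java: encoding the character A with the US-ASCII charset yields the single byte 0x41 (binary 01000001), exactly as in the table above. A minimal sketch (the class name is made up for this example):

```java
public class CharsetMappingDemo {

    public static void main(String... args) throws Exception {
        // 'A' maps to the single byte 0x41 (decimal 65, binary 01000001).
        byte[] bytes = "A".getBytes("US-ASCII");
        System.out.println(bytes.length); // 1
        System.out.println(Integer.toBinaryString(bytes[0])); // 1000001
    }

}
```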
Back to top

Well, where does it go wrong?

The world would be much simpler if only one character encoding existed. That would have been clear enough for everyone. Unfortunately the truth is different. There are a lot of different character encodings, each with its own charsets and numeral mappings. So it may be obvious that a character which is converted to bytes using character encoding X may not be the same character when it is converted back from bytes using character encoding Y. That would in turn lead to confusion among humans, because they wouldn't understand the way the computer represented their natural language. Humans would see completely different characters and thus not be able to understand the text; this is also known as "mojibake". It can also happen that humans would not see any linguistic character at all, because the numeral representation of the character in question isn't covered by the numeral mapping of the character encoding used. It's simply unknown.

How such an unknown character is displayed differs per application which handles the character. In the webbrowser world, Firefox displays an unknown character as a black diamond with a question mark in it, while Internet Explorer displays it as an empty white square with a black border. Both represent the same Unicode character though: 0xFFFD, which is displayed in your webbrowser as "�". Internet Explorer simply doesn't have a font (a graphical picture) for it, hence the empty square. In the Java/JSP/Servlet world, any unknown character which is passed through the write() methods of an OutputStream (e.g. the one obtained by ServletResponse#getOutputStream()) gets printed as a plain question mark "?". Those question marks can in some cases also be caused by the database. Most database engines replace uncovered numeral representations with a plain question mark during save (INSERT/UPDATE), which is in turn later displayed to the human when the data is queried and sent to the webbrowser. The plain question marks are thus not necessarily caused by the webbrowser.

Here is a small test snippet which demonstrates the problem. Keep in mind that Java supports and uses Unicode all the time. So the encoding problem which you see in the output is not caused by Java itself, but by using the ISO 8859-1 character encoding to display Unicode characters. The ISO 8859-1 character encoding namely doesn't cover the numeral representations of a large part of the Unicode charset. By the way, the term "Unicode character" is nowhere formally defined, but it is usually used by (unaware) programmers/users who actually mean "any character which is not covered by the ISO 8859 character encoding".

package test;

public class Test {

    public static void main(String... args) throws Exception {
        String czech = "Český";
        String japanese = "日本語";

        System.out.println("UTF-8 czech: " + new String(czech.getBytes("UTF-8")));
        System.out.println("UTF-8 japanese: " + new String(japanese.getBytes("UTF-8")));

        System.out.println("ISO-8859-1 czech: " + new String(czech.getBytes("ISO-8859-1")));
        System.out.println("ISO-8859-1 japanese: " + new String(japanese.getBytes("ISO-8859-1")));
    }

}

UTF-8 czech: Český
UTF-8 japanese: 日本語
ISO-8859-1 czech: ?esk�
ISO-8859-1 japanese: ???

These kinds of problems are often referred to as the "Unicode problem".

Important note: your own operating system should of course have the proper fonts (yes, the human representations) supporting those Unicode characters for both the Czech and Japanese languages installed to see the proper characters/glyphs on this webpage :) Otherwise you will see in for example Firefox a black-bordered square with a hexcode inside (0-9 and/or A-F), and in most other webbrowsers such as IE, Safari and Chrome a nothing-saying empty square with a black border. Below is a screenshot from Chrome which shows the right characters, so you can compare if necessary:

If your operating system for example doesn't have the Japanese glyphs required by this page, then you should in Firefox see three squares with the hexcodes 65E5, 672C and 8A9E. Those hexcodes are actually also called "Unicode codepoints". In Windows, you can view all available fonts and the supported characters using charmap.exe (Start > Run > charmap).

Another important note: if you have problems when copypasting the test snippet into your development environment (e.g. you are not seeing the proper characters, but only empty squares or something like that), then please hold off experimenting until you have read the entire article, including step 1 of the OK .. So, I have an "Unicode problem", what now? chapter ;)

Back to top

Unicode, what's it all about?

Let's go back in the history of character encoding. Most of you may be familiar with the term "ASCII". This was more or less the first character encoding ever. In the days when a byte was very expensive and 1MHz was extremely fast, only the characters which appeared on those ancient US typewriters (as well as on the average US International keyboard nowadays) were covered by the charset of the ASCII character encoding. This includes the complete Latin alphabet (A-Z, in both the lowercase and uppercase flavour), the numeral digits (0-9), the lexical control characters (space, dot, comma, colon, etcetera) and some special characters (the at sign, the sharp sign, the dollar sign, etcetera). All those characters fit in 7 bits, half of the room a byte provides, for a total of 128 characters.

Later the remaining bit of a byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the extra room is used for special characters, such as diacritical characters and line drawing characters. Because everyone used the extra room their own way (IBM, Commodore, universities, etcetera), it was not interchangeable. Later, ISO came up with standard character encoding definitions for 8 bit ASCII extensions, resulting in the well-known ISO 8859 character encoding standards such as ISO 8859-1.

8 bits may be enough for the languages using the Latin alphabet, but it is certainly not enough for the remaining non-Latin languages in the world, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera. These developed their own non-ISO character encodings, which were, again, not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally a new character encoding standard (initially 16 bits), built on top of ISO 8859-1, was established to cover any of the characters used around the world so that it is interchangeable everywhere: Unicode. You can find all of those linguistic characters here. Unicode also covers many special characters (symbols) such as punctuation and mathematical operators, which you can find here.

Back to top

OK .. So, I have an "Unicode problem", what now?

To the point: just ensure that you use UTF-8 (a character encoding which conforms to the Unicode standard) all the way. There are other Unicode character encodings as well, but so far they are used very, very seldom; UTF-8 has effectively become the Unicode standard. To solve the "Unicode problem" you need to ensure that every step which involves byte-character conversion uses one and the same character encoding: reading data from the input stream, writing data to the output stream, querying data from the database, storing data in the database, manipulating the data, displaying the data, etcetera. For a Java EE web developer, there are a lot of things to take into account.

  1. Development environment: yes, the development environment has to use UTF-8 as well. By default most text files are saved using the operating system default encoding such as ISO 8859-1, or even a proprietary encoding such as Windows ANSI (also known as CP-1252, which is in turn not interchangeable with non-Windows platforms!). The most basic text editor of Windows, Notepad, uses Windows ANSI by default, but it supports UTF-8 as well. To save a text file containing Unicode characters using Notepad, you need to choose the File » Save As option and select UTF-8 from the Encoding dropdown. The same Save As story applies to many other self-respecting text editors as well, like EditPlus, UltraEdit and Notepad++.


    In an IDE such as Eclipse you can set the encoding at several places. You need to explore the IDE preferences thoroughly to find and change them. In case of Eclipse, just go to Window » Preferences and enter the filter text encoding. In the filtered preferences (Workspace, JSP files, etcetera) you can select the desired encoding from a dropdown. Important note: the Workspace encoding also covers the output console and thus also the outcome of System.out.println(). If you sysout a Unicode character using the default encoding, it would likely be printed as a plain vanilla question mark!


    In the command console, however, this is not really possible.

    C:\Java>java test.Test
    UTF-8 czech: Český
    UTF-8 japanese: 日本語
    ISO-8859-1 czech: ─?esk├╜
    ISO-8859-1 japanese: µ?ѵ?¼Î¦¬?

    C:\Java>_

    In theory, in the Windows command prompt you have to use a font which supports a broad range of Unicode characters. You can set the font by opening the command console (Start > Run > cmd), then clicking the small cmd icon at the left top, then choosing Properties and finally choosing the Font tab. In a default Windows environment only the Lucida Console font has reasonably good Unicode support, but it unfortunately still lacks a broad range of Unicode characters.

    The cmd.exe parameter /U and/or the command chcp 65001 (which changes the code page to UTF-8) doesn't help much if the font doesn't support the desired characters in the first place. You could hack the registry to add more fonts, but you would still have to find a specific command console font which supports all of the desired characters. In the end it's better to use Swing to create a command console like UI instead of using the standard command console, especially if the application is intended to be distributed (you don't want to require the enduser to hack/change their environment to get your application to work, do you? ;) ).

  2. Java properties files: as stated in its Javadoc the load(InputStream) method of the java.util.Properties API uses ISO 8859-1 as the default encoding. Here's an extract of the class' Javadoc:

    .. the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.

    If you have full control over loading of the properties files, then you should use the Java 1.6 load(Reader) method in combination with an InputStreamReader instead:
    
    Properties properties = new Properties();
    properties.load(new InputStreamReader(classLoader.getResourceAsStream(filename), "UTF-8"));
    
    
    If you don't have full control over loading of the properties files (e.g. they are managed by some framework), then you need the native2ascii tool mentioned in the Javadoc. The native2ascii tool can be found in the /bin folder of the JDK installation directory. When you for example need to maintain properties files with Unicode characters for i18n (internationalization; also known as resource bundles), then it's good practice to have both a UTF-8 properties file and an ISO 8859-1 properties file, plus some batch program to convert the UTF-8 properties file to an ISO 8859-1 properties file. You use the UTF-8 properties file for editing only. You use the converter to convert it to the ISO 8859-1 properties file after every edit. You finally just leave the ISO 8859-1 properties file as it is. In most (smart) IDEs like Eclipse you cannot use the .properties extension for those UTF-8 properties files; the IDE would complain about unknown characters because it is forced to save properties files in ISO 8859-1 format. Name it .properties.utf8 or something else. Here's an example of a simple Windows batch file which does the conversion task:
    
    cd c:\path\to\properties\files
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_cs.properties.utf8 text_cs.properties
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_ja.properties.utf8 text_ja.properties
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_zh.properties.utf8 text_zh.properties
    # You can add more properties files here.
    
    
    Save it as utf8.converter.bat (or something like) and run it once to convert all UTF-8 properties files to standard ISO 8859-1 properties files. If you're using Maven and/or Ant, this can even be automated to take place during the build of the project.

    For JSF there are better ways using ResourceBundle.Control API. Check this blog article: Internationalization in JSF with UTF-8 properties files.
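Coming back to the Unicode escapes: to illustrate what native2ascii effectively produces and how the Properties loader parses it back, here's a minimal self-contained sketch (the key title and the class name are made up for this example):

```java
import java.io.ByteArrayInputStream;
import java.util.Properties;

public class PropertiesEscapeDemo {

    public static void main(String... args) throws Exception {
        // This is what native2ascii would produce for "title=日本語".
        String escaped = "title=\\u65E5\\u672C\\u8A9E";

        // load(InputStream) reads the bytes as ISO 8859-1, but parses the
        // \uXXXX escapes back into the original Unicode characters.
        Properties properties = new Properties();
        properties.load(new ByteArrayInputStream(escaped.getBytes("ISO-8859-1")));

        System.out.println(properties.getProperty("title")); // 日本語
    }

}
```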

  3. JSP/Servlet request: during request processing an average application server will by default use the ISO 8859-1 character encoding to URL-decode the request parameters. You need to force the character encoding to UTF-8 yourself. First this: "URL encoding" must not be confused with "character encoding". URL encoding is merely a conversion of characters to their numeral representations in the %xx format, so that special characters can be passed through a URL without problems. The client will URL-encode the characters before sending them to the server. The server should URL-decode the characters using the same character encoding. Also see "percent encoding".

    How to configure this depends on the server used, so the best is to refer its documentation. In case of for example Tomcat you need to set the URIEncoding attribute of the <Connector> element in Tomcat's /conf/server.xml to set the character encoding of HTTP GET requests, also see this document:
    
    <Connector (...) URIEncoding="UTF-8" />
    
    
    In for example Glassfish you need to set the <parameter-encoding> entry in webapp's /WEB-INF/sun-web.xml (or, since Glassfish 3.1, glassfish-web.xml), see also this document:
    
    <parameter-encoding default-charset="UTF-8" />
    
    
    URL-decoding POST request parameters is a story apart. The webbrowser is namely supposed to send the charset used in the Content-Type request header. However, most webbrowsers don't do this. Those webbrowsers will just use the same character encoding as the page with the form was delivered with, i.e. the same charset as specified in the Content-Type header of the HTTP response or the <meta> tag. Only Microsoft Internet Explorer will send the character encoding in the request header when you specify it in the accept-charset attribute of the HTML form. However, this implementation is broken in certain circumstances; e.g. when IE-win says "ISO-8859-1", it is actually CP-1252! You should really avoid using it. Just let it go and set the encoding yourself.

    You can solve this by setting the same character encoding in the ServletRequest object yourself. An easy solution is to implement a Filter for this which is mapped on an url-pattern of /* and basically contains only the following lines in the doFilter() method:
    
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding("UTF-8");
    }
    chain.doFilter(request, response);
    
    
    Note: URL-decoding POST request parameters the above way is not necessary when you're using Facelets instead of JSP, as it defaults to UTF-8 already. It's also not necessary when you're using Glassfish, as the <parameter-encoding> also takes care of this.

    Here's a test snippet which demonstrates what exactly happens behind the scenes when it all fails:
    package test;
    
    import java.net.URLDecoder;
    import java.net.URLEncoder;
    
    public class Test {
    
        public static void main(String... args) throws Exception {
            String input = "日本語";
            System.out.println("Original input string from client: " + input);
    
            String encoded = URLEncoder.encode(input, "UTF-8");
            System.out.println("URL-encoded by client with UTF-8: " + encoded);
    
            String incorrectDecoded = URLDecoder.decode(encoded, "ISO-8859-1");
            System.out.println("Then URL-decoded by server with ISO-8859-1: " + incorrectDecoded);
    
            String correctDecoded = URLDecoder.decode(encoded, "UTF-8");
            System.out.println("Server should URL-decode with UTF-8: " + correctDecoded);
        }
    
    }
    
    Original input string from client: 日本語
    URL-encoded by client with UTF-8: %E6%97%A5%E6%9C%AC%E8%AA%9E
    Then URL-decoded by server with ISO-8859-1: 日本語
    Server should URL-decode with UTF-8: 日本語

  4. JSP/Servlet response: during response processing an average application server will by default use ISO 8859-1 to encode the response outputstream. You need to force the response encoding to UTF-8 yourself. If you use JSP as the view technology, then adding the following line to the top (yes, as the first line) of your JSP ought to be sufficient:
    
    <%@ page pageEncoding="UTF-8" %>
    
    
    This will set the response outputstream encoding to UTF-8 and set the HTTP response content-type header to text/html;charset=UTF-8. To apply this setting globally so that you don't need to edit every individual JSP, you can also add the following entry to your /WEB-INF/web.xml file:
    
    <jsp-config>
        <jsp-property-group>
            <url-pattern>*.jsp</url-pattern>
            <page-encoding>UTF-8</page-encoding>
        </jsp-property-group>
    </jsp-config>
    
    
    Note: this is not necessary when you're using Facelets instead of JSP as it defaults to UTF-8 already.

    The HTTP content-type header actually does nothing at the server side, but it instructs the webbrowser at the client side which character encoding to use for display. The webbrowser must use it above any specified HTML meta content-type header, as specified by the w3 HTML spec, chapter 5.2.2. In other words, the HTML meta content-type header is totally ignored when the page is served over HTTP. But when the enduser saves the page locally and views it from the local disk file system, then the meta content-type header will be used. To cover that as well, you should add the following HTML meta content-type header to your JSP anyway:

    
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    
    
    Note: lowercase utf-8 or uppercase UTF-8 doesn't matter; charset names are case-insensitive.

    If you (ab)use a HttpServlet instead of a JSP to generate HTML content using out.write(), out.print() statements and so on, then you need to set the encoding on the ServletResponse object itself inside the servlet method block, before you call getWriter() or getOutputStream() on it:
    
    response.setCharacterEncoding("UTF-8");
    
    
    You can do that in the aforementioned Filter, but this can lead to problems if you have servlets in your webapplication which use the response for something else than generating HTML content. After all, there shouldn't be any need to do this: use JSP to generate HTML content, that's what it is for. When generating plain text content other than HTML, such as XML, CSV, JSON, etcetera, then you need to set the response character encoding the above way.

  5. JSF/Facelets request/response: JSF/Facelets uses by default already UTF-8 for all HTTP requests and responses. You only need to configure the server as well to use the same encoding as described in JSP/Servlet request section.

    Only when you're using a custom filter or a 3rd party component library which calls request.getParameter() or any other method which implicitly needs to parse the request body in order to extract the data, then there's a chance that it's too late for JSF/Facelets to set the UTF-8 character encoding before the request body has been parsed for the first time. PrimeFaces 3.2 for example is known to do that. In that case, you'd still need a custom filter as described in the JSP/Servlet request section.

  6. Databases: the database also has to take the character encoding into account. In general you need to specify it in the CREATE statement and, if necessary, also in ALTER statements, and in some cases you also need to specify it in the connection string or the connection parameters. The exact syntax depends on the database used; best is to refer to its documentation using the keywords "character set". In for example MySQL you can use the CHARACTER SET clause as pointed out here:
    
    CREATE DATABASE db_name CHARACTER SET utf8;
    CREATE TABLE tbl_name (...) CHARACTER SET utf8;
    
    
    Usually the database's JDBC driver is smart enough to use the encoding specified for the database and/or table when querying and storing the data. But in the worst case you have to specify the character encoding in the connection string as well. This is true for the MySQL JDBC driver, because it uses the client-specified encoding rather than the database-specified one. How to configure it should be answered in the JDBC driver documentation. For MySQL it looks like this:
    
    jdbc:mysql://localhost:3306/db_name?useUnicode=true&characterEncoding=UTF-8
    
    

  7. Text files: when reading/writing a text file with Unicode characters using a Reader/Writer, you need java.io.InputStreamReader/java.io.OutputStreamWriter, wherein you can specify the UTF-8 encoding in one of the constructors:
    
    Reader reader = new InputStreamReader(new FileInputStream("c:/file.txt"), "UTF-8");
    Writer writer = new OutputStreamWriter(new FileOutputStream("c:/file.txt"), "UTF-8");
    
    

    Otherwise the operating system default encoding will be used.
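As a side note: on Java 7 and newer, a sketch using try-with-resources and java.nio.charset.StandardCharsets achieves the same without the checked UnsupportedEncodingException and without leaking the streams (the file name is just an example):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8FileDemo {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("file.txt"); // Example file name.

        // Write the text as UTF-8; the writer is closed automatically.
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write("I ♥ Unicode");
        }

        // Read it back with the same explicit encoding.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            System.out.println(reader.readLine()); // prints: I ♥ Unicode
        }
    }
}
```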


  8. Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or new String(byte[]), you should use the overloaded method/constructor which takes the character encoding:
    
    byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
    byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
    String stringUsingDefaultEncoding = new String(bytesInUTF8); // Unknown bytes become "?".
    String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.
    
    

    Otherwise the platform default encoding will be used, which can be that of the underlying operating system or even the IDE(!).
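To see why this matters, here's a small self-contained demo (the string content is just an example) which round-trips a string through UTF-8 bytes correctly and then deliberately decodes the same bytes with the wrong charset:

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        String original = "I ♥ Unicode";

        // Encode explicitly as UTF-8: ♥ (U+2665) becomes the 3 bytes E2 99 A5.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the same charset round-trips losslessly.
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(original)); // prints: true

        // Decoding the same bytes as ISO-8859-1 yields mojibake:
        // each of the 3 heart bytes is misread as a separate Latin-1 character.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original)); // prints: false
        System.out.println(garbled.length() - original.length()); // prints: 2
    }
}
```

This is exactly the kind of corruption you see when one layer encodes in UTF-8 and another layer decodes with its platform default.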

Summarized: wherever you have the possibility to specify the character encoding, make use of it and set it to UTF-8.



Last but not least, since Java supports and uses Unicode all the time, even internally in the compiler, it's cool to know that it's possible to have a class like this in Java:

\u0070\u0075\u0062\u006C\u0069\u0063\u0020\u0020\u0020\u0063\u006C\u0061\u0073\u0073\u0020\u0020
\u0055\u006E\u0069\u0063\u006F\u0064\u0065\u0020\u007B\u0020\u0070\u0075\u0062\u006C\u0069\u0063
\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063\u0020\u0020\u0076\u006F\u0069\u0064\u0020\u0020
\u006D\u0061\u0069\u006E\u0020\u0028\u0020\u0053\u0074\u0072\u0069\u006E\u0067\u0020\u005B\u005D
\u0061\u0072\u0067\u0073\u0020\u0029\u0020\u007B\u0020\u0053\u0079\u0073\u0074\u0065\u006D\u002E
\u006F\u0075\u0074\u002E\u0070\u0072\u0069\u006E\u0074\u006C\u006E\u0028\u0022\u0049\u0022\u002B
\u0022\u0020\u2665\u0020\u0055\u006E\u0069\u0063\u006F\u0064\u0065\u0022\u0029\u003B\u007D\u007D

Save it unchanged as Unicode.java (without package), compile it and run it ;)


Copyright - None of this article may be taken over without explicit authorisation.

(C) May 2009, BalusC