
Sunday, December 27, 2009

Uploading files in Servlet 3.0

Introduction

The new Servlet 3.0 specification includes, among other things, support for parsing multipart/form-data requests. All you basically need to do is annotate the servlet with the @MultipartConfig annotation. No need for Apache Commons FileUpload anymore! An interesting detail, however, is that both Oracle Glassfish v3 and Apache Tomcat 7.0 actually silently use Apache Commons FileUpload under the covers to fulfill the new Servlet 3.0 feature!


@MultipartConfig
public class UploadServlet extends HttpServlet {}

This way all multipart/form-data parts are available through HttpServletRequest#getParts(), which returns a collection of Part elements. This is to be used instead of the normal getParameter() calls and so on. The Part API itself is however somewhat limited in its degree of abstraction. To find out whether a part represents a normal text field or a file field, you'll have to parse the content-disposition header yourself and check whether the filename parameter is in there. Also, when you want to get the actual parameter value as a String, you need to read the Part#getInputStream() into a String yourself. You'll also have to collect multiple parameter values together yourself based on the part name, where you could otherwise have used getParameterValues().
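To give an idea of what parsing the content-disposition header involves, here is a minimal sketch which uses plain JDK classes only. The class name ContentDispositionSketch and the method name extractFilename() are made up for this illustration. In a real servlet the header value would come from part.getHeader("content-disposition"). Note that it splits naively on semicolons, so it assumes that the filename itself contains none.

```java
import java.util.Locale;

/** Sketch: extract the filename parameter from a content-disposition header value. */
final class ContentDispositionSketch {

    /**
     * Returns the value of the filename parameter, or null when the header has none,
     * which is the case for normal text fields.
     */
    static String extractFilename(String contentDisposition) {
        if (contentDisposition == null) {
            return null;
        }
        for (String token : contentDisposition.split(";")) {
            String trimmed = token.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("filename=")) {
                // Take everything after the '=' and strip the surrounding quotes.
                return trimmed.substring(trimmed.indexOf('=') + 1).trim().replace("\"", "");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A file field carries a filename parameter, a text field doesn't.
        System.out.println(extractFilename("form-data; name=\"file\"; filename=\"photo.png\""));
        System.out.println(extractFilename("form-data; name=\"text\""));
    }
}
```

A null result thus means "text field", an empty string means "file field which was left empty" and anything else is the filename as sent by the client.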

All that extra work does no harm if you have only one file upload servlet in your web application. But at times you would like to avoid repeating the same code again and again. Or you would like to continue using getParameter() and friends the same way as for normal requests. Or you would like to have all the parts available as HttpServletRequest#getParameterMap() in Expression Language, as you did before with ${param}.


MultipartMap

For that I've created the MultipartMap. It simulates the HttpServletRequest#getParameterXXX() methods to ease the processing in @MultipartConfig servlets. You can access the normal request parameters by getParameter() and you can access multiple request parameter values by getParameterValues().

On creation, the MultipartMap will put itself in the request scope, identified by the attribute name parts, so that you can access the parameters in EL by for example ${parts.fieldname} where you would have used ${param.fieldname}. In case of file fields, the ${parts.filefieldname} returns a File object.

It was a design decision to extend HashMap<String, Object> instead of having just Map<String, String[]> and Map<String, File> properties, because of the accessibility in Expression Language. Also, when the value is obtained by get(), as will happen in EL, then multiple parameter values will be converted from String[] to List<String>, so that you can use it in the JSTL fn:contains function.

/*
 * net/balusc/http/multipart/MultipartMap.java
 *
 * Copyright (C) 2009 BalusC
 *
 * This program is free software: you can redistribute it and/or modify it under the terms of the
 * GNU Lesser General Public License as published by the Free Software Foundation, either version 3
 * of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
 * even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License along with this library.
 * If not, see <http://www.gnu.org/licenses/>.
 */

package net.balusc.http.multipart;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import javax.servlet.MultipartConfigElement;
import javax.servlet.Servlet;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.annotation.MultipartConfig;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.Part;

/**
 * The MultipartMap. It simulates the <code>HttpServletRequest#getParameterXXX()</code> methods to
 * ease the processing in <code>@MultipartConfig</code> servlets. You can access the normal request
 * parameters by <code>{@link #getParameter(String)}</code> and you can access multiple request
 * parameter values by <code>{@link #getParameterValues(String)}</code>.
 * <p>
 * On creation, the <code>MultipartMap</code> will put itself in the request scope, identified by
 * the attribute name <code>parts</code>, so that you can access the parameters in EL by for example
 * <code>${parts.fieldname}</code> where you would have used <code>${param.fieldname}</code>. In
 * case of file fields, the <code>${parts.filefieldname}</code> returns a <code>{@link File}</code>.
 * <p>
 * It was a design decision to extend <code>HashMap&lt;String, Object&gt;</code> instead of having
 * just <code>Map&lt;String, String[]&gt;</code> and <code>Map&lt;String, File&gt;</code>
 * properties, because of the accessibility in Expression Language. Also, when the value is obtained
 * by <code>{@link #get(Object)}</code>, as will happen in EL, then multiple parameter values will
 * be converted from <code>String[]</code> to <code>List&lt;String&gt;</code>, so that you can use
 * it in the JSTL <code>fn:contains</code> function.
 *
 * @author BalusC
 * @link http://balusc.blogspot.com/2009/12/uploading-files-in-servlet-30.html
 */
public class MultipartMap extends HashMap<String, Object> {

    // Constants ----------------------------------------------------------------------------------

    private static final String ATTRIBUTE_NAME = "parts";
    private static final String CONTENT_DISPOSITION = "content-disposition";
    private static final String CONTENT_DISPOSITION_FILENAME = "filename";
    private static final String DEFAULT_ENCODING = "UTF-8";
    private static final int DEFAULT_BUFFER_SIZE = 10240; // 10KB.

    // Vars ---------------------------------------------------------------------------------------

    private String encoding;
    private String location;
    private boolean multipartConfigured;

    // Constructors -------------------------------------------------------------------------------

    /**
     * Construct multipart map based on the given multipart request and the servlet associated with
     * the request. The file upload location will be extracted from <code>@MultipartConfig</code>
     * of the servlet. When the encoding is not specified in the given request, then it will default
     * to <tt>UTF-8</tt>.
     * @param multipartRequest The multipart request to construct the multipart map for.
     * @param servlet The servlet which is responsible for the given request.
     * @throws ServletException If something fails at Servlet level.
     * @throws IOException If something fails at I/O level.
     */
    public MultipartMap(HttpServletRequest multipartRequest, Servlet servlet)
        throws ServletException, IOException
    {
        this(multipartRequest, new MultipartConfigElement(
            servlet.getClass().getAnnotation(MultipartConfig.class)).getLocation(), true);
    }

    /**
     * Construct multipart map based on the given multipart request and file upload location. When
     * the encoding is not specified in the given request, then it will default to <tt>UTF-8</tt>.
     * @param multipartRequest The multipart request to construct the multipart map for.
     * @param location The location to save uploaded files in.
     * @throws ServletException If something fails at Servlet level.
     * @throws IOException If something fails at I/O level.
     */
    public MultipartMap(HttpServletRequest multipartRequest, String location)
        throws ServletException, IOException
    {
        this(multipartRequest, location, false);
    }

    /**
     * Global constructor.
     */
    private MultipartMap
        (HttpServletRequest multipartRequest, String location, boolean multipartConfigured)
            throws ServletException, IOException
    {
        multipartRequest.setAttribute(ATTRIBUTE_NAME, this);

        this.encoding = multipartRequest.getCharacterEncoding();
        if (this.encoding == null) {
            multipartRequest.setCharacterEncoding(this.encoding = DEFAULT_ENCODING);
        }
        this.location = location;
        this.multipartConfigured = multipartConfigured;

        for (Part part : multipartRequest.getParts()) {
            String filename = getFilename(part);
            if (filename == null) {
                processTextPart(part);
            } else if (!filename.isEmpty()) {
                processFilePart(part, filename);
            }
        }
    }

    // Actions ------------------------------------------------------------------------------------

    @Override
    public Object get(Object key) {
        Object value = super.get(key);
        if (value instanceof String[]) {
            String[] values = (String[]) value;
            return values.length == 1 ? values[0] : Arrays.asList(values);
        } else {
            return value; // Can be File or null.
        }
    }

    /**
     * @see ServletRequest#getParameter(String)
     */
    public String getParameter(String name) {
        Object value = super.get(name);
        if (value instanceof File) {
            return ((File) value).getName();
        }
        String[] values = (String[]) value;
        return values != null ? values[0] : null;
    }

    /**
     * @see ServletRequest#getParameterValues(String)
     */
    public String[] getParameterValues(String name) {
        Object value = super.get(name);
        if (value instanceof File) {
            return new String[] { ((File) value).getName() };
        }
        return (String[]) value;
    }

    /**
     * @see ServletRequest#getParameterNames()
     */
    public Enumeration<String> getParameterNames() {
        return Collections.enumeration(keySet());
    }

    /**
     * @see ServletRequest#getParameterMap()
     */
    public Map<String, String[]> getParameterMap() {
        Map<String, String[]> map = new HashMap<String, String[]>();
        for (Entry<String, Object> entry : entrySet()) {
            Object value = entry.getValue();
            if (value instanceof String[]) {
                map.put(entry.getKey(), (String[]) value);
            } else {
                map.put(entry.getKey(), new String[] { ((File) value).getName() });
            }
        }
        return map;
    }

    /**
     * Returns uploaded file associated with given request parameter name.
     * @param name Request parameter name to return the associated uploaded file for.
     * @return Uploaded file associated with given request parameter name.
     * @throws IllegalArgumentException If this field is actually a Text field.
     */
    public File getFile(String name) {
        Object value = super.get(name);
        if (value instanceof String[]) {
            throw new IllegalArgumentException("This is a Text field. Use #getParameter() instead.");
        }
        return (File) value;
    }

    // Helpers ------------------------------------------------------------------------------------

    /**
     * Returns the filename from the content-disposition header of the given part.
     */
    private String getFilename(Part part) {
        for (String cd : part.getHeader(CONTENT_DISPOSITION).split(";")) {
            if (cd.trim().startsWith(CONTENT_DISPOSITION_FILENAME)) {
                return cd.substring(cd.indexOf('=') + 1).trim().replace("\"", "");
            }
        }
        return null;
    }

    /**
     * Returns the text value of the given part.
     */
    private String getValue(Part part) throws IOException {
        BufferedReader reader = 
            new BufferedReader(new InputStreamReader(part.getInputStream(), encoding));
        StringBuilder value = new StringBuilder();
        char[] buffer = new char[DEFAULT_BUFFER_SIZE];
        for (int length = 0; (length = reader.read(buffer)) > 0;) {
            value.append(buffer, 0, length);
        }
        return value.toString();
    }

    /**
     * Process given part as Text part.
     */
    private void processTextPart(Part part) throws IOException {
        String name = part.getName();
        String[] values = (String[]) super.get(name);

        if (values == null) {
            // Not in parameter map yet, so add as new value.
            put(name, new String[] { getValue(part) });
        } else {
            // Multiple field values, so add new value to existing array.
            int length = values.length;
            String[] newValues = new String[length + 1];
            System.arraycopy(values, 0, newValues, 0, length);
            newValues[length] = getValue(part);
            put(name, newValues);
        }
    }

    /**
     * Process given part as File part which is to be saved in temp dir with the given filename.
     */
    private void processFilePart(Part part, String filename) throws IOException {
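        // First fix stupid MSIE behaviour (it passes full client side path along filename).
        // Strip in two separate steps, so that each index is computed on the already stripped name.
        filename = filename.substring(filename.lastIndexOf('/') + 1);
        filename = filename.substring(filename.lastIndexOf('\\') + 1);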

        // Get filename prefix (actual name) and suffix (extension).
        String prefix = filename;
        String suffix = "";
        if (filename.contains(".")) {
            prefix = filename.substring(0, filename.lastIndexOf('.'));
            suffix = filename.substring(filename.lastIndexOf('.'));
        }

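        // Write uploaded file. File#createTempFile() requires a prefix of at least three
        // characters, so pad short names with underscores.
        StringBuilder tempPrefix = new StringBuilder(prefix).append('_');
        while (tempPrefix.length() < 3) {
            tempPrefix.append('_');
        }
        File file = File.createTempFile(tempPrefix.toString(), suffix, new File(location));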
        if (multipartConfigured) {
            part.write(file.getName()); // Will be written to the very same File.
        } else {
            InputStream input = null;
            OutputStream output = null;
            try {
                input = new BufferedInputStream(part.getInputStream(), DEFAULT_BUFFER_SIZE);
                output = new BufferedOutputStream(new FileOutputStream(file), DEFAULT_BUFFER_SIZE);
                byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
                for (int length = 0; ((length = input.read(buffer)) > 0);) {
                    output.write(buffer, 0, length);
                }
            } finally {
                if (output != null) try { output.close(); } catch (IOException logOrIgnore) { /**/ }
                if (input != null) try { input.close(); } catch (IOException logOrIgnore) { /**/ }
            }
        }

        put(part.getName(), file);
        part.delete(); // Cleanup temporary storage.
    }

}

The MultipartMap needs to know the file upload location as well, because it can then make use of File#createTempFile() to create files with a unique filename, which avoids them being overwritten by other files that happen to have the same name. Once you have the uploaded file at hand in the servlet or bean, you can always use File#renameTo() to do a fast rename/move.
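As a small illustration of that idea, the sketch below creates two files from the same name and then renames one of them. The class and method names (TempFileSketch, createUnique(), moveTo()) are made up for this example and it uses the JVM temp directory instead of a real upload folder.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch: unique filenames by File#createTempFile() and a rename by File#renameTo(). */
final class TempFileSketch {

    /** Creates a new empty file with a unique name in the given directory. */
    static File createUnique(File directory, String prefix, String suffix) {
        try {
            return File.createTempFile(prefix, suffix, directory);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Renames the given file within its own directory and returns the resulting file. */
    static File moveTo(File source, String newName) {
        File target = new File(source.getParentFile(), newName);
        // File#renameTo() reports a failure by returning false, not by throwing.
        return source.renameTo(target) ? target : source;
    }

    public static void main(String[] args) {
        File directory = new File(System.getProperty("java.io.tmpdir"));
        File first = createUnique(directory, "report_", ".txt");
        File second = createUnique(directory, "report_", ".txt");
        System.out.println(first.getName() + " versus " + second.getName());

        File moved = moveTo(first, "report-" + System.nanoTime() + ".txt");
        System.out.println("Moved to " + moved.getName());

        // Clean up the demo files.
        moved.delete();
        second.delete();
    }
}
```

Keep in mind that File#renameTo() is only fast when source and target are on the same disk file system, and that you should always check its outcome.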


Basic use example

Here is a basic use example of a servlet and JSP file which demonstrates the working of the MultipartMap.

package net.balusc.example.upload;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

import javax.servlet.ServletException;
import javax.servlet.annotation.MultipartConfig;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import net.balusc.http.multipart.MultipartMap;

@WebServlet(urlPatterns = { "/upload" })
@MultipartConfig(location = "/upload", maxFileSize = 10485760L) // 10MB.
public class UploadServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException
    {
        request.getRequestDispatcher("/WEB-INF/upload.jsp").forward(request, response);
    }

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException
    {
        MultipartMap map = new MultipartMap(request, this);
        String text = map.getParameter("text");
        File file = map.getFile("file");
        String[] check = map.getParameterValues("check");

        // Now do your thing with the obtained input.
        System.out.println("Text: " + text);
        System.out.println("File: " + file);
        System.out.println("Check: " + Arrays.toString(check));

        request.getRequestDispatcher("/WEB-INF/upload.jsp").forward(request, response);
    }

}

That was the UploadServlet. Note the two annotations. The @WebServlet annotation defines, among other things, the URL pattern on which the servlet should listen. The @MultipartConfig annotation defines the location on the local disk file system where uploaded files are to be stored. In this case it is the /upload folder. In Windows environments with the application server running on the C: disk, this location effectively points to C:/upload. Ensure that you have created this folder beforehand!

Here's the JSP file, the /WEB-INF/upload.jsp:

<%@ page pageEncoding="UTF-8" %>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>
<%@ taglib uri="http://java.sun.com/jsp/jstl/functions" prefix="fn" %>

<!doctype html>
<html lang="en">
    <head>
        <title>Servlet 3.0 file upload test</title>
        <style>label { float: left; display: block; width: 75px; }</style>
    </head>
    <body>
        <form action="upload" method="post" enctype="multipart/form-data">
            <label for="text">Text:</label>
            <input type="text" id="text" name="text" value="${parts.text}">
            <br>
            <label for="file">File:</label>
            <input type="file" id="file" name="file">
            <c:if test="${not empty parts.file}">
                File ${parts.file.name} successfully uploaded!
            </c:if>
            <br>
            <label for="check1">Check 1:</label>
            <input type="checkbox" id="check1" name="check" value="check1"
                ${fn:contains(parts.check, 'check1') ? 'checked' : ''}>
            <br>
            <label for="check2">Check 2:</label>
            <input type="checkbox" id="check2" name="check" value="check2"
                ${fn:contains(parts.check, 'check2') ? 'checked' : ''}>
            <br>
            <input type="submit" value="submit">
        </form>
    </body>
</html>

Copy'n'paste the stuff and run it at http://localhost:8080/playground/upload (assuming that your local development server runs at port 8080 and that the context root of your playground web application project is called 'playground') and see it working! And no, you don't need to declare the servlet in web.xml, the servlets are automagically loaded and initialized with help of the new Servlet 3.0 annotations.

Note: this all is developed and tested with Eclipse 3.5 and Glassfish v3.


More abstraction

As you might have noticed, the MultipartMap class above has a second public constructor which takes the file upload location as a String parameter instead of the involved servlet. This is useful in circumstances where you'd like to abstract the entire HttpServletRequest, including the parameter map, away with help of a Filter and an HttpServletRequestWrapper. This way you can just access the request parameters the unchanged EL way by ${param}. This is also useful if you're running an MVC framework on top of the Servlet API which doesn't support the @MultipartConfig annotation, such as JSF 2.0 (here's an article about uploading files in JSF 2.0 + Servlet 3.0). The use of the @MultipartConfig annotation is restricted to servlets only, so with a filter you need to specify the file upload location yourself, hence the second constructor of MultipartMap.

Here's the Filter which could be used to process multipart/form-data request transparently:

/*
 * net/balusc/http/multipart/MultipartFilter.java
 * 
 * Copyright (C) 2009 BalusC
 * 
 * This program is free software: you can redistribute it and/or modify it under the terms of the
 * GNU Lesser General Public License as published by the Free Software Foundation, either version 3
 * of the License, or (at your option) any later version.
 * 
 * This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
 * even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public License along with this library.
 * If not, see <http://www.gnu.org/licenses/>.
 */

package net.balusc.http.multipart;

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.annotation.WebInitParam;
import javax.servlet.http.HttpServletRequest;

/**
 * This filter detects <tt>multipart/form-data</tt> and <tt>multipart/mixed</tt> POST requests and
 * will then replace the <code>HttpServletRequest</code> by a <code>{@link MultipartRequest}</code>.
 * 
 * @author BalusC
 * @link http://balusc.blogspot.com/2009/12/uploading-files-in-servlet-30.html
 */

@WebFilter(urlPatterns = { "/*" }, initParams = {
    @WebInitParam(name = "location", value = "/upload") })
public class MultipartFilter implements Filter {

    // Constants ----------------------------------------------------------------------------------

    private static final String INIT_PARAM_LOCATION = "location";
    private static final String REQUEST_METHOD_POST = "POST";
    private static final String CONTENT_TYPE_MULTIPART = "multipart/";

    // Vars --------------------------------------------------------------------------------------

    private String location;

    // Actions ------------------------------------------------------------------------------------

    @Override
    public void init(FilterConfig config) throws ServletException {
        this.location = config.getInitParameter(INIT_PARAM_LOCATION);
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        if (isMultipartRequest(httpRequest)) {
            request = new MultipartRequest(httpRequest, location);
        }
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
        // NOOP.
    }

    // Helpers ------------------------------------------------------------------------------------

    /**
     * Returns true if the given request is a multipart request.
     * @param request The request to be checked.
     * @return True if the given request is a multipart request.
     */
    public static final boolean isMultipartRequest(HttpServletRequest request) {
        return REQUEST_METHOD_POST.equalsIgnoreCase(request.getMethod())
            && request.getContentType() != null
            && request.getContentType().toLowerCase().startsWith(CONTENT_TYPE_MULTIPART);
    }

}

It is true that the location property is a bit nonsensical, since it is already "hardcoded" by an annotation in the very same filter class. It is however overridable by a real init param in web.xml!

And now the MultipartRequest which the filter needs to replace the request with:

/*
 * net/balusc/http/multipart/MultipartRequest.java
 * 
 * Copyright (C) 2009 BalusC
 * 
 * This program is free software: you can redistribute it and/or modify it under the terms of the
 * GNU Lesser General Public License as published by the Free Software Foundation, either version 3
 * of the License, or (at your option) any later version.
 * 
 * This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
 * even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public License along with this library.
 * If not, see <http://www.gnu.org/licenses/>.
 */

package net.balusc.http.multipart;

import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;
import javax.servlet.http.Part;

/**
 * This class represents a multipart request. It not only abstracts the <code>{@link Part}</code>
 * away, but it also provides direct access to the <code>{@link MultipartMap}</code>, so that one 
 * can get the uploaded files out of it.
 * 
 * @author BalusC
 * @link http://balusc.blogspot.com/2009/12/uploading-files-in-servlet-30.html
 */
public class MultipartRequest extends HttpServletRequestWrapper {

    // Vars ---------------------------------------------------------------------------------------

    private MultipartMap multipartMap;

    // Constructors -------------------------------------------------------------------------------

    /**
     * Construct MultipartRequest based on the given HttpServletRequest.
     * @param request HttpServletRequest to be wrapped into a MultipartRequest.
     * @param location The location to save uploaded files in.
     * @throws IOException If something fails at I/O level.
     * @throws ServletException If something fails at Servlet level.
     */
    public MultipartRequest(HttpServletRequest request, String location)
        throws ServletException, IOException
    {
        super(request);
        this.multipartMap = new MultipartMap(request, location);
    }

    // Actions ------------------------------------------------------------------------------------

    @Override
    public String getParameter(String name) {
        return multipartMap.getParameter(name);
    }

    @Override
    public String[] getParameterValues(String name) {
        return multipartMap.getParameterValues(name);
    }

    @Override
    public Enumeration<String> getParameterNames() {
        return multipartMap.getParameterNames();
    }

    @Override
    public Map<String, String[]> getParameterMap() {
        return multipartMap.getParameterMap();
    }

    /**
     * @see MultipartMap#getFile(String)
     */
    public File getFile(String name) {
        return multipartMap.getFile(name);
    }

}

That should be it. And no, no web.xml modifications are needed here either. The web.xml is pretty superfluous with the new Servlet 3.0 annotations.

When the Filter is in use, then the first lines of UploadServlet#doPost() can now be changed as follows:


    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException
    {
        String text = request.getParameter("text");
        File file = ((MultipartRequest) request).getFile("file");
        String[] check = request.getParameterValues("check");
        ...
    }

This also implies that the @MultipartConfig annotation can be removed from the servlet. You only need to handle file size limits yourself, but that can now be done more nicely (otherwise it would by default abort the entire request and show an HTTP 500 error page, which is not very good for User eXperience). The ${parts} in the EL throughout the JSP file can also be changed back to the normal ${param}, including the ones for the uploaded files.
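How you handle the file size limit is up to you. One possible approach, sketched below with plain JDK streams, is to stop copying as soon as the limit is exceeded, so that the caller can show a friendly message instead of an error page. The names SizeLimitSketch, copyWithLimit() and fitsWithin() are made up for this illustration and are not part of the MultipartMap. In a servlet you could alternatively just compare Part#getSize() against your limit before writing anything.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Sketch: copy an upload stream, but give up once a maximum size is exceeded. */
final class SizeLimitSketch {

    /**
     * Copies input to output and returns true, or stops and returns false as soon as
     * more than maxBytes have been read. The output may then hold partial data.
     */
    static boolean copyWithLimit(InputStream input, OutputStream output, long maxBytes)
        throws IOException
    {
        byte[] buffer = new byte[10240];
        long total = 0;
        for (int length; (length = input.read(buffer)) != -1;) {
            total += length;
            if (total > maxBytes) {
                return false; // Too large: let the caller show a friendly message.
            }
            output.write(buffer, 0, length);
        }
        return true;
    }

    /** Convenience wrapper around in-memory data, only here to demonstrate the behaviour. */
    static boolean fitsWithin(byte[] data, long maxBytes) {
        try {
            return copyWithLimit(new ByteArrayInputStream(data), new ByteArrayOutputStream(), maxBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long maxFileSize = 10485760L; // 10MB, the same limit as in the servlet example.
        System.out.println(fitsWithin(new byte[1024], maxFileSize)); // true
        System.out.println(fitsWithin(new byte[1024], 512)); // false
    }
}
```

When the copy is aborted, don't forget to delete the partially written file.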


Copyright - GNU Lesser General Public License

(C) December 2009, BalusC

Wednesday, May 6, 2009

Unicode - How to get the characters right?

Introduction

Computers understand only bits and bytes. You know, the binary numeral system of zeros and ones. Humans, on the other hand, understand characters only. You know, the building blocks of the natural languages. So, to handle human readable characters using a computer (read, write, store, transfer, etcetera), they have to be converted to bytes. One byte is an ordered collection of eight zeros or ones (bits). The characters are only used for pure presentation to humans. Behind any character you see, there is a certain order of bits. For a computer a character is in fact nothing more or less than a simple graphical picture (a font) which has a unique "identifier" in the form of a certain order of bits.

To convert between chars and bytes a computer needs a mapping where every unique character is associated with unique bytes. This mapping is also called the character encoding. The character encoding basically consists of two parts. One is the character set (charset), which represents all of the unique characters. The other is the numeral representation of each of the characters of the charset. The numeral representation is usually shown to humans in hexadecimal, which is in turn easily "converted" to bytes (both are just numeral systems, only with a different base).

Character Encoding

Character set           Numeral representation
(human presentation)    (computer identification)
A                       x0041 (01000001)
B                       x0042 (01000010)
C                       x0043 (01000011)
D                       x0044 (01000100)
E                       x0045 (01000101)
F                       x0046 (01000110)
G                       x0047 (01000111)
H                       x0048 (01001000)
I                       x0049 (01001001)
J                       x004A (01001010)
K                       x004B (01001011)
L                       x004C (01001100)
M                       x004D (01001101)
N                       x004E (01001110)
O                       x004F (01001111)
...                     ...
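
The rows of this table can be reproduced with a few lines of Java. This is just a sketch to show that hexadecimal and binary are two notations of the very same number. The class name CharTableSketch and the method describe() are made up for this example.

```java
/** Sketch: print characters along with their hexadecimal and binary representation. */
final class CharTableSketch {

    /** Formats a character as in the table, for example "A x0041 (01000001)". */
    static String describe(char c) {
        String hex = String.format("%04X", (int) c);
        // Left-pad the binary digits with zeros up to a full byte of eight bits.
        String bits = String.format("%8s", Integer.toBinaryString(c)).replace(' ', '0');
        return c + " x" + hex + " (" + bits + ")";
    }

    public static void main(String[] args) {
        for (char c = 'A'; c <= 'O'; c++) {
            System.out.println(describe(c));
        }
    }
}
```

The first line of output is "A x0041 (01000001)" and the last one is "O x004F (01001111)".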

Well, where does it go wrong?

The world would be much simpler if only one character encoding existed. That would have been clear enough for everyone. Unfortunately the truth is different. There are a lot of different character encodings, each with its own charset and numeral mapping. So it may be obvious that a character which is converted to bytes using character encoding X may not be the same character when it is converted back from bytes using character encoding Y. That would in turn lead to confusion among humans, because they wouldn't understand the way the computer represented their natural language. Humans would see completely different characters and thus not be able to understand the "language". Such garbled text is also known as "mojibake". It can also happen that humans would not see any linguistic character at all, because the numeral representation of the character in question isn't covered by the numeral mapping of the character encoding used. It's simply unknown.
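A minimal sketch of the "encoding X versus encoding Y" situation, with UTF-8 as X and ISO 8859-1 as Y. The class and method names are made up for this example and the text is written with Unicode escapes so that the outcome doesn't depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: the same bytes read back with the right and with the wrong encoding. */
final class MojibakeSketch {

    /** Encodes the text as UTF-8 and decodes those bytes as UTF-8 again. */
    static String utf8ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes the text as UTF-8, but decodes those bytes as ISO 8859-1. */
    static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String czech = "\u010Cesk\u00FD"; // "Český", 5 characters.

        // Same encoding on both sides: the text survives unchanged.
        System.out.println(utf8ReadAsUtf8(czech).equals(czech)); // true

        // Different encodings: every UTF-8 byte becomes a character of its own.
        System.out.println(utf8ReadAsLatin1(czech).length()); // 7
    }
}
```

The two characters outside the ASCII range take two bytes each in UTF-8, so the wrongly decoded text has seven characters instead of five and none of the replaced ones resemble the original.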

How such an unknown character is displayed differs per application which handles the character. In the web browser world, Firefox would display an unknown character as a black diamond with a question mark in it, while Internet Explorer would display it as an empty white square with a black border. Both represent the same Unicode character though: xFFFD, which is displayed in your web browser as "�". Internet Explorer simply doesn't have a font (a graphical picture) for it, hence the empty square. In the Java/JSP/Servlet world, any unknown character which is passed through the write() methods of an OutputStream (e.g. the one obtained by ServletResponse#getOutputStream()) gets printed as a plain question mark "?". Those question marks can in some cases also be caused by the database. Most database engines replace uncovered numeral representations by a plain question mark during save (INSERT/UPDATE), which is in turn later displayed to the human when the data is queried and sent to the web browser. The plain question marks are thus not per se caused by the web browser.
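Both kinds of replacement can be reproduced in plain Java. The sketch below (class and method names are made up) shows that encoding a character which the target encoding doesn't cover yields a plain question mark, while decoding bytes which are invalid in the source encoding yields xFFFD.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: where the "?" and the xFFFD replacement character come from. */
final class ReplacementSketch {

    /** Encodes the text as ISO 8859-1 and decodes it again with the same encoding. */
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the given raw bytes as UTF-8. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E"; // "日本語"

        // None of these characters is covered by ISO 8859-1, so each one is
        // replaced by a plain question mark during encoding.
        System.out.println(throughLatin1(japanese)); // ???

        // The single byte 0xFD is not valid UTF-8, so it is decoded as xFFFD.
        String decoded = decodeUtf8(new byte[] { (byte) 0xFD });
        System.out.println(decoded.equals("\uFFFD")); // true
    }
}
```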

Here is a small test snippet which demonstrates the problem. Keep in mind that Java supports and uses Unicode all the time. So the encoding problem which you see in the output is not caused by Java itself, but by using the ISO 8859-1 character encoding to represent Unicode characters. The ISO 8859-1 character encoding namely doesn't cover the numeral representations of a large part of the Unicode charset. Note that the output shown below assumes that the platform default encoding, which the String(byte[]) constructor uses to decode the bytes, is UTF-8. By the way, the term "Unicode character" is nowhere strictly defined, but it is usually used by (unaware) programmers/users who actually mean "any character which is not covered by the ISO 8859 character encoding".

package test;

public class Test {

    public static void main(String... args) throws Exception {
        String czech = "Český";
        String japanese = "日本語";

        System.out.println("UTF-8 czech: " + new String(czech.getBytes("UTF-8")));
        System.out.println("UTF-8 japanese: " + new String(japanese.getBytes("UTF-8")));

        System.out.println("ISO-8859-1 czech: " + new String(czech.getBytes("ISO-8859-1")));
        System.out.println("ISO-8859-1 japanese: " + new String(japanese.getBytes("ISO-8859-1")));
    }

}

UTF-8 czech: Český
UTF-8 japanese: 日本語
ISO-8859-1 czech: ?esk�
ISO-8859-1 japanese: ???

These kinds of problems are often referred to as the "Unicode problem".
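The question marks in the output above can be reproduced in isolation. The following is a minimal sketch, not part of the original test class: the class name UnmappableDemo and its two helper methods are made up for illustration, and the sample strings are written with Unicode escapes so that the encoding of the source file does not matter. It shows that characters outside ISO 8859-1 are silently replaced by the byte 63 (a plain "?") during encoding, and that a CharsetEncoder can tell you beforehand whether that will happen.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, only meant to illustrate the replacement behaviour.
final class UnmappableDemo {

    /** Returns true if every character of the text is covered by ISO 8859-1. */
    static boolean fitsInLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    /** Encodes the text as ISO 8859-1; uncovered characters become byte 63 ('?'). */
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String czech = "\u010Cesk\u00FD";        // The Czech sample word from the test snippet.
        String japanese = "\u65E5\u672C\u8A9E";  // The Japanese sample word from the test snippet.

        System.out.println(fitsInLatin1(czech));                  // false
        System.out.println(Arrays.toString(toLatin1(czech)));     // [63, 101, 115, 107, -3]
        System.out.println(Arrays.toString(toLatin1(japanese)));  // [63, 63, 63]
    }
}
```

Once a character has been replaced by byte 63, the original character is lost; no decoding step afterwards can bring it back.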

Important note: your own operating system should of course have the proper fonts (yes, the human representations) supporting those Unicode characters for both the Czech and Japanese languages installed to see the proper characters/glyphs at this webpage :) Otherwise you will see in for example Firefox a black-bordered square with a hexcode inside (0-9 and/or A-F) and in most other webbrowsers such as IE, Safari and Chrome a meaningless empty square with a black border. Below is a screenshot from Chrome which shows the right characters, so you can compare if necessary:

If your operating system for example doesn't have the Japanese glyphs in the font as required by this page, then you should in Firefox see three squares with hexcodes 65E5, 672C and 8A9E. Those hexcodes are actually also called 'Unicode codepoints'. In Windows, you can view all available fonts and the supported characters using 'charmap.exe' (Start > Run > charmap).
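If you want to check which codepoints a given string consists of, Java can print them for you. The sketch below is only an illustration; the class name CodePoints and the method hexCodePoints are invented here. It relies on String#codePoints(), which is available since Java 8.

```java
import java.util.stream.Collectors;

// Hypothetical helper, only meant to show how codepoints can be inspected.
final class CodePoints {

    /** Returns the codepoints of the text as space separated uppercase hexcodes. */
    static String hexCodePoints(String text) {
        return text.codePoints()
            .mapToObj(codePoint -> String.format("%04X", codePoint))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The Japanese sample word, written with escapes to be independent of the file encoding.
        System.out.println(hexCodePoints("\u65E5\u672C\u8A9E")); // 65E5 672C 8A9E
    }
}
```

The printed hexcodes are the same ones which Firefox shows inside the squares when the font lacks the glyphs.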

Another important note: if you have problems when copy-pasting the test snippet into your development environment (e.g. you are not seeing the proper characters, but only empty squares or something similar), then please wait with playing until you have read the entire article, including step 1 of the OK .. So, I have an "Unicode problem", what now? chapter ;)


Unicode, what's it all about?

Let's go back in the history of character encoding. Most of you may be familiar with the term "ASCII". This was more or less the first widely adopted character encoding. In the age when a byte was very expensive and 1MHz was extremely fast, only the characters which appeared on those ancient US typewriters (as well as on the average US International keyboard nowadays) were covered by the charset of the ASCII character encoding. This includes the complete Latin alphabet (A-Z, in both lowercase and uppercase), the digits (0-9), the punctuation characters (space, dot, comma, colon, etcetera), some special characters (the at sign, the sharp sign, the dollar sign, etcetera) and a set of control characters. All those characters fill up the space of 7 bits, half of the room a byte provides, with a total of 128 characters.

Later the remaining bit of a byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the remaining room is used by special characters, such as diacritical characters and line drawing characters. Because everyone used the remaining room their own way (IBM, Commodore, universities, etcetera), it was not interchangeable. Later ISO came up with standard character encoding definitions for 8 bit ASCII extensions, resulting in the known ISO 8859 character encoding standards such as ISO 8859-1.

8 bits may be enough for the languages using the Latin alphabet, but it is certainly not enough for the remaining non-Latin languages and scripts in the world, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera. Their users developed their own non-ISO character encodings which were -again- not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally a new character encoding standard was established to cover any of the characters used in the world so that it is interchangeable everywhere: Unicode. It was originally designed as a 16 bit standard whose first 256 codepoints match ISO 8859-1, and it was later extended beyond 16 bits. You can find all of those linguistic characters here. Unicode also covers many special characters (symbols) such as punctuation and mathematical operators, which you can find here.


OK .. So, I have an "Unicode problem", what now?

To the point: just ensure that you use UTF-8 (a character encoding which conforms to the Unicode standard) all the way. There are other Unicode character encodings as well, such as UTF-16 and UTF-32, but on the web they are used very, very seldom. UTF-8 is the de facto standard Unicode encoding. To solve the "Unicode problem" you need to ensure that every step which involves byte-character conversion uses one and the same character encoding: reading data from an input stream, writing data to an output stream, querying data from a database, storing data in a database, manipulating the data, displaying the data, etcetera. For a Java EE web developer, there are a lot of things you have to take into account.

  1. Development environment: yes, the development environment has to use UTF-8 as well. By default most text files are saved using the operating system default encoding such as ISO 8859-1 or even a proprietary encoding such as Windows ANSI (also known as CP-1252, which is in turn not interchangeable with non-Windows platforms!). The most basic text editor of Windows, Notepad, uses Windows ANSI by default, but Notepad supports UTF-8 as well. To save a text file containing Unicode characters using Notepad, you need to choose the File » Save As option and select UTF-8 from the Encoding dropdown. The same Save As story applies to many other self-respecting text editors as well, like EditPlus, UltraEdit and Notepad++.


    In an IDE such as Eclipse you can set the encoding at several places. You need to explore the IDE preferences thoroughly to find and change them. In case of Eclipse, just go to Window » Preferences and enter the filter text encoding. In the filtered preferences (Workspace, JSP files, etcetera) you can select the desired encoding from a dropdown. Important note: the Workspace encoding also covers the output console and thus also the outcome of System.out.println(). If you sysout a Unicode character using the default encoding, it would likely be printed as a plain vanilla question mark!


    In the Windows command console it is practically not possible to get all Unicode characters displayed properly:

    C:\Java>java test.Test
    UTF-8 czech: Český
    UTF-8 japanese: 日本語
    ISO-8859-1 czech: ─?esk├╜
    ISO-8859-1 japanese: µ?ѵ?¼Î¦¬?

    C:\Java>_

    In theory, in the Windows command prompt you have to use a font which supports a broad range of Unicode characters. You can set the font by opening the command console (Start > Run > cmd), then clicking the small cmd icon at the top left, then choosing Properties and finally choosing the Font tab. In a default Windows environment the Lucida Console font has the "best" support for Unicode characters. It unfortunately still lacks a broad range of them though.

    The cmd.exe parameter /U and/or the command chcp 65001 (which changes the code page to UTF-8) don't help much if the font doesn't support the desired characters in the first place. You could hack the registry to add more fonts, but you still have to find a specific command console font which supports all of the desired characters. In the end it's better to use Swing to create a command console like UI instead of using the standard command console. Especially if the application is intended to be distributed (you don't want to require the enduser to hack/change their environment to get your application to work, do you? ;) ).

  2. Java properties files: as stated in its Javadoc the load(InputStream) method of the java.util.Properties API uses ISO 8859-1 as the default encoding. Here's an extract of the class' Javadoc:

    .. the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.

    If you have full control over loading of the properties files, then you should use the Java 1.6 load(Reader) method in combination with an InputStreamReader instead:
    
    Properties properties = new Properties();
    properties.load(new InputStreamReader(classLoader.getResourceAsStream(filename), "UTF-8"));
    
    
    If you don't have full control over loading of the properties files (e.g. it is managed by some framework), then you need the native2ascii tool mentioned in the Javadoc. The native2ascii tool can be found in the /bin folder of the JDK installation directory. When you for example need to maintain properties files with Unicode characters for i18n (internationalization; such files are also known as resource bundles), then it's good practice to have both a UTF-8 properties file and an ISO 8859-1 properties file, plus some batch program which converts the former to the latter. You use the UTF-8 properties file for editing only. After every edit you run the converter to regenerate the ISO 8859-1 properties file, which you otherwise leave untouched. In most (smart) IDEs like Eclipse you cannot use the .properties extension for those UTF-8 properties files; the IDE would complain about unknown characters because it is forced to save properties files in ISO 8859-1 format. Name it .properties.utf8 or something else. Here's an example of a simple Windows batch file which does the conversion task:
    
    cd c:\path\to\properties\files
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_cs.properties.utf8 text_cs.properties
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_ja.properties.utf8 text_ja.properties
    c:\path\to\jdk\bin\native2ascii.exe -encoding UTF-8 text_zh.properties.utf8 text_zh.properties
    REM You can add more properties files here.
    
    
    Save it as utf8.converter.bat (or something similar) and run it once to convert all UTF-8 properties files to standard ISO 8859-1 properties files. If you're using Maven and/or Ant, this can even be automated to take place during the build of the project.

    For JSF there are better ways using ResourceBundle.Control API. Check this blog article: Internationalization in JSF with UTF-8 properties files.

  3. JSP/Servlet request: during request processing an average application server will by default use the ISO 8859-1 character encoding to URL-decode the request parameters. You need to force the character encoding to UTF-8 yourself. First this: "URL encoding" must not be confused with "character encoding". URL encoding is merely a conversion of characters to their numeral representations in the %xx format, so that special characters can be passed through a URL without any problems. The client will URL-encode the characters before sending them to the server. The server should URL-decode the characters using the same character encoding. Also see "percent encoding".

    How to configure this depends on the server used, so it is best to refer to its documentation. In case of for example Tomcat you need to set the URIEncoding attribute of the <Connector> element in Tomcat's /conf/server.xml to set the character encoding of HTTP GET requests, also see this document:
    
    <Connector (...) URIEncoding="UTF-8" />
    
    
    In for example Glassfish you need to set the <parameter-encoding> entry in webapp's /WEB-INF/sun-web.xml (or, since Glassfish 3.1, glassfish-web.xml), see also this document:
    
    <parameter-encoding default-charset="UTF-8" />
    
    
    URL-decoding POST request parameters is a story apart. The webbrowser is namely supposed to send the charset used in the Content-Type request header. However, most webbrowsers don't do that. Those webbrowsers will just use the same character encoding as the page with the form was delivered with, i.e. the same charset as specified in the Content-Type header of the HTTP response or the <meta> tag. Only Microsoft Internet Explorer will send the character encoding in the request header when you specify it in the accept-charset attribute of the HTML form. However, this implementation is broken in certain circumstances, e.g. when IE-win says "ISO-8859-1", it is actually CP-1252! You should really avoid using it. Just let it go and set the encoding yourself.

    You can solve this by setting the same character encoding in the ServletRequest object yourself. An easy solution is to implement a Filter for this which is mapped on an url-pattern of /* and basically contains only the following lines in the doFilter() method:
    
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding("UTF-8");
    }
    chain.doFilter(request, response);
    
    
    Note: URL-decoding POST request parameters the above way is not necessary when you're using Facelets instead of JSP as it defaults to UTF-8 already. It's also not necessary when you're using Glassfish as the <parameter-encoding> also takes care of this.

    Here's a test snippet which demonstrates what exactly happens behind the scenes when it all fails:
    package test;
    
    import java.net.URLDecoder;
    import java.net.URLEncoder;
    
    public class Test {
    
        public static void main(String... args) throws Exception {
            String input = "日本語";
            System.out.println("Original input string from client: " + input);
    
            String encoded = URLEncoder.encode(input, "UTF-8");
            System.out.println("URL-encoded by client with UTF-8: " + encoded);
    
            String incorrectDecoded = URLDecoder.decode(encoded, "ISO-8859-1");
            System.out.println("Then URL-decoded by server with ISO-8859-1: " + incorrectDecoded);
    
            String correctDecoded = URLDecoder.decode(encoded, "UTF-8");
            System.out.println("Server should URL-decode with UTF-8: " + correctDecoded);
        }
    
    }
    
    Original input string from client: 日本語
    URL-encoded by client with UTF-8: %E6%97%A5%E6%9C%AC%E8%AA%9E
    Then URL-decoded by server with ISO-8859-1: æ—¥æœ¬èªž
    Server should URL-decode with UTF-8: 日本語

  4. JSP/Servlet response: during response processing an average application server will by default use ISO 8859-1 to encode the response outputstream. You need to force the response encoding to UTF-8 yourself. If you use JSP as view technology, then adding the following line to the top (yes, as the first line) of your JSP ought to be sufficient:
    
    <%@ page pageEncoding="UTF-8" %>
    
    
    This will set the response outputstream encoding to UTF-8 and set the HTTP response content-type header to text/html;charset=UTF-8. To apply this setting globally so that you don't need to edit every individual JSP, you can also add the following entry to your /WEB-INF/web.xml file:
    
    <jsp-config>
        <jsp-property-group>
            <url-pattern>*.jsp</url-pattern>
            <page-encoding>UTF-8</page-encoding>
        </jsp-property-group>
    </jsp-config>
    
    
    Note: this is not necessary when you're using Facelets instead of JSP as it defaults to UTF-8 already.

    The HTTP content-type header actually does nothing at the server side, but it should instruct the webbrowser at the client side which character encoding to use for display. The webbrowser must use it above any specified HTML meta content-type header as specified by w3 HTML spec chapter 5.2.2. In other words, the HTML meta content-type header is totally ignored when the page is served over HTTP. But when the enduser saves the page locally and views it from the local disk file system, then the meta content-type header will be used. To cover that as well, you should add the following HTML meta content-type header to your JSP anyway:

    
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    
    
    Note: it doesn't matter whether you write lowercase utf-8 or uppercase UTF-8; charset names are case insensitive.

    If you (ab)use a HttpServlet instead of a JSP to generate HTML content using out.write(), out.print() statements and so on, then you need to set the encoding in the ServletResponse object itself inside the servlet method block before you call getWriter() or getOutputStream() on it:
    
    response.setCharacterEncoding("UTF-8");
    
    
    You can do that in the aforementioned Filter, but this can lead to problems if you have servlets in your webapplication which use the response for something else than generating HTML content. After all, there shouldn't be any need to do this. Use JSP to generate HTML content, that's what it is for. When generating other plain text content than HTML, such as XML, CSV, JSON, etcetera, then you need to set the response character encoding the above way.

  5. JSF/Facelets request/response: JSF/Facelets uses by default already UTF-8 for all HTTP requests and responses. You only need to configure the server as well to use the same encoding as described in JSP/Servlet request section.

    Only when you're using a custom filter or a 3rd party component library which calls request.getParameter() or any other method which implicitly needs to parse the request body in order to extract the data, then there's a chance that it's too late for JSF/Facelets to set the UTF-8 character encoding before the request body is parsed for the first time. PrimeFaces 3.2 for example is known to do that. In that case, you'd still need a custom filter as described in the JSP/Servlet request section.

  6. Databases: the database also has to take the character encoding into account. In general you need to specify it during the CREATE and if necessary also during the ALTER statements, and in some cases you also need to specify it in the connection string or the connection parameters. The exact syntax depends on the database used, so it is best to refer to its documentation using the keywords "character set". In for example MySQL you can use the CHARACTER SET clause as pointed out here:
    
    CREATE DATABASE db_name CHARACTER SET utf8;
    CREATE TABLE tbl_name (...) CHARACTER SET utf8;
    
    
    Usually the database's JDBC driver is smart enough to use the database and/or table specified encoding for querying and storing the data. But in worst cases you have to specify the character encoding in the connection string as well. This is true in case of MySQL JDBC driver because it does not use the database-specified encoding, but the client-specified encoding. How to configure it should already be answered in the JDBC driver documentation. In for example MySQL you can read it here:
    
    jdbc:mysql://localhost:3306/db_name?useUnicode=true&characterEncoding=UTF-8
    
    

  7. Text files: when reading/writing a text file with Unicode characters using Reader/Writer, you need java.io.InputStreamReader/java.io.OutputStreamWriter, which let you specify the UTF-8 encoding in one of their constructors:
    
    Reader reader = new InputStreamReader(new FileInputStream("c:/file.txt"), "UTF-8");
    Writer writer = new OutputStreamWriter(new FileOutputStream("c:/file.txt"), "UTF-8");
    
    

    Otherwise the operating system default encoding will be used.


  8. Strings: although Java uses Unicode all the time under the hood, when you convert between String and byte[] using String#getBytes() or String(byte[]), you should rather use the overloaded method/constructor which takes the character encoding:
    
    byte[] bytesInDefaultEncoding = someString.getBytes(); // May generate corrupt bytes.
    byte[] bytesInUTF8 = someString.getBytes("UTF-8"); // Correct.
    String stringUsingDefaultEncoding = new String(bytesInUTF8); // May produce garbled characters.
    String stringUsingUTF8 = new String(bytesInUTF8, "UTF-8"); // Correct.
    
    

    Otherwise the platform default encoding will be used, which can be the one of the underlying operating system or the IDE(!).
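As a small follow-up on the last step of the list above: since Java 7 you can pass the constants of java.nio.charset.StandardCharsets instead of the "UTF-8" string, which also saves you from handling UnsupportedEncodingException. Below is a minimal sketch of that; the class name ExplicitCharset and its methods are made up for illustration and are not part of the original examples.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class showing String <-> byte[] conversion with an explicit charset.
final class ExplicitCharset {

    /** Converts the text to bytes, always using UTF-8 regardless of the platform default. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Converts UTF-8 bytes back to a String, regardless of the platform default. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E"; // The Japanese sample word from the test snippets.
        byte[] bytes = encode(original);

        System.out.println(bytes.length);                    // 9, three bytes per character
        System.out.println(decode(bytes).equals(original));  // true

        // Decoding with a different encoding than the one used for encoding breaks the text.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original));          // false
    }
}
```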

Summarized: everywhere where you have the possibility to specify the character encoding, you should make use of it and set it to UTF-8.
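To illustrate the properties files step of the list above in a form which you can run without any file on disk, here is a sketch which loads the same UTF-8 encoded content once through load(Reader) and once through load(InputStream). The class name Utf8Properties, its methods and the greeting key are assumptions made up for this example only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class comparing both ways of loading properties.
final class Utf8Properties {

    /** Loads the content through load(Reader), decoding the bytes as UTF-8. */
    static Properties loadUtf8(byte[] content) {
        Properties properties = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8)) {
            properties.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Loads the content through load(InputStream), which always decodes as ISO 8859-1. */
    static Properties loadDefault(byte[] content) {
        Properties properties = new Properties();
        try {
            properties.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E";
        byte[] file = ("greeting=" + japanese + "\n").getBytes(StandardCharsets.UTF_8);

        System.out.println(loadUtf8(file).getProperty("greeting").equals(japanese));    // true
        System.out.println(loadDefault(file).getProperty("greeting").equals(japanese)); // false
    }
}
```

With load(InputStream) every single byte is turned into a character of its own, so the three Japanese characters end up as nine garbled ones.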


References


Last but not least, as Java just supports and uses Unicode all the time, also internally in the compiler, it's cool to know that it's possible to have such a class in Java:

\u0070\u0075\u0062\u006C\u0069\u0063\u0020\u0020\u0020\u0063\u006C\u0061\u0073\u0073\u0020\u0020
\u0055\u006E\u0069\u0063\u006F\u0064\u0065\u0020\u007B\u0020\u0070\u0075\u0062\u006C\u0069\u0063
\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063\u0020\u0020\u0076\u006F\u0069\u0064\u0020\u0020
\u006D\u0061\u0069\u006E\u0020\u0028\u0020\u0053\u0074\u0072\u0069\u006E\u0067\u0020\u005B\u005D
\u0061\u0072\u0067\u0073\u0020\u0029\u0020\u007B\u0020\u0053\u0079\u0073\u0074\u0065\u006D\u002E
\u006F\u0075\u0074\u002E\u0070\u0072\u0069\u006E\u0074\u006C\u006E\u0028\u0022\u0049\u0022\u002B
\u0022\u0020\u2665\u0020\u0055\u006E\u0069\u0063\u006F\u0064\u0065\u0022\u0029\u003B\u007D\u007D

Save it unchanged as Unicode.java (without package), compile it and run it ;)
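If you wonder how such a source file could be produced, the sketch below converts every character of a piece of source code to its \uXXXX escape. It is only an assumption of how one might do it, not necessarily the way the class above was made; the class name UnicodeEscaper and the method escape are invented for this example.

```java
// Hypothetical helper which turns plain source text into Unicode escapes.
final class UnicodeEscaper {

    /** Replaces every char of the source by a backslash, a 'u' and its four digit hexcode. */
    static String escape(String source) {
        StringBuilder escaped = new StringBuilder(source.length() * 6);
        for (char character : source.toCharArray()) {
            escaped.append(String.format("\\u%04X", (int) character));
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        // Prints the escaped form of the first two words of the class above.
        System.out.println(escape("public class"));
    }
}
```

The compiler translates such escapes back to the real characters before it does anything else, which is why the escaped class compiles just fine.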


Copyright - None of this article may be taken over without explicit authorisation.

(C) May 2009, BalusC

Saturday, December 1, 2007

WhitespaceFilter

Whitespace

Whitespace is used everywhere. It covers spaces, tabs and newlines. It is used to distinguish lexical tokens from each other and also to keep the source code readable for the developer. But in case of HTML over a network, whitespace costs bandwidth and therefore in some circumstances also money and/or performance. If you care about the bandwidth usage and/or the money and/or performance, then you can consider trimming all whitespace off the HTML response. The only con is that it makes the HTML source code at the client side almost unreadable.

You can trim whitespace right in the HTML files (or JSP or JSF or whatever view you're using, as long as it writes a plain HTML response), but that would make the source code unreadable for yourself. A better way is to use a Filter which trims the whitespace from the response.


Replace response writer

Here is what such a WhitespaceFilter can look like. It is relatively easy: it replaces the writer of the HttpServletResponse with a customized implementation of PrintWriter. This implementation trims whitespace off any strings and character arrays before writing them to the response stream. It also takes care of any <pre> and <textarea> tags and keeps the whitespace of their contents unchanged. However it doesn't care about the CSS white-space: pre; property, because it would involve too much work to check on that (parse HTML, lookup CSS classes, sniff the appropriate style and parse it again, etc). It isn't worth that effort. Just use <pre> tags if you want to use preformatted text ;)

Note that this filter only works on requests which are passed through a servlet which writes the response to the PrintWriter, e.g. JSP and JSF files (parsed by JspServlet and FacesServlet respectively) or custom servlets which use HttpServletResponse#getWriter() to write output. This filter does not work on requests for plain vanilla CSS, Javascript and HTML files, images and other binary files which aren't written through the PrintWriter, but through the OutputStream. If you want to implement the same thing for the OutputStream, then you'll have to check first whether the content type starts with "text" or not, otherwise binary files would be screwed up. Unfortunately in reality (at least, in Tomcat 6.0) the content type is set after the output stream is acquired, thus we cannot determine the content type during acquiring the output stream.
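The content type check mentioned above could look roughly as follows. This is a sketch only: the class name ContentTypes and the method isText are hypothetical, and you would still have to find a moment at which the content type is actually known, which is exactly the problem described above.

```java
import java.util.Locale;

// Hypothetical helper for deciding whether a response body is safe to trim.
final class ContentTypes {

    /** Returns true if the content type is known and starts with "text", e.g. text/html. */
    static boolean isText(String contentType) {
        return contentType != null
            && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text");
    }

    public static void main(String[] args) {
        System.out.println(isText("text/html;charset=UTF-8")); // true
        System.out.println(isText("image/png"));               // false
        System.out.println(isText(null));                      // false, type not set yet
    }
}
```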

The stuff is tested in a Java EE 5.0 environment with Tomcat 6.0 with Servlet 2.5, JSP 2.1, JSTL 1.2 and JSF 1.2_06.

/*
 * net/balusc/webapp/WhitespaceFilter.java
 * 
 * Copyright (C) 2007 BalusC
 * 
 * This program is free software; you can redistribute it and/or modify it under the terms of the
 * GNU General Public License as published by the Free Software Foundation; either version 2 of the
 * License, or (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
 * even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License along with this program; if
 * not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
 * 02110-1301, USA.
 */

package net.balusc.webapp;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.StringReader;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

/**
 * This filter class removes any whitespace from the response. It actually trims all leading and 
 * trailing spaces or tabs and newlines before writing to the response stream. This will greatly
 * save the network bandwidth, but this will make the source of the response harder to read.
 * <p>
 * This filter should be configured in the web.xml as follows:
 * <pre>
 * &lt;filter&gt;
 *     &lt;description&gt;
 *         This filter class removes any whitespace from the response. It actually trims all
 *         leading and trailing spaces or tabs and newlines before writing to the response stream.
 *         This will greatly save the network bandwidth, but this will make the source of the
 *         response harder to read.
 *     &lt;/description&gt;
 *     &lt;filter-name&gt;whitespaceFilter&lt;/filter-name&gt;
 *     &lt;filter-class&gt;net.balusc.webapp.WhitespaceFilter&lt;/filter-class&gt;
 * &lt;/filter&gt;
 * &lt;filter-mapping&gt;
 *     &lt;filter-name&gt;whitespaceFilter&lt;/filter-name&gt;
 *     &lt;url-pattern&gt;/*&lt;/url-pattern&gt;
 * &lt;/filter-mapping&gt;
 * </pre>
 *
 * @author BalusC
 * @see <a href="http://balusc.blogspot.com/2007/12/whitespacefilter.html">WhitespaceFilter</a>
 */
public class WhitespaceFilter implements Filter {

    // Constants ----------------------------------------------------------------------------------

    // Specify here where you'd like to start/stop the trimming.
    // You may want to replace this by init-param and initialize in init() instead.
    static final String[] START_TRIM_AFTER = {"<html", "</textarea", "</pre"};
    static final String[] STOP_TRIM_AFTER = {"</html", "<textarea", "<pre"};

    // Actions ------------------------------------------------------------------------------------

    /**
     * @see Filter#init(FilterConfig)
     */
    public void init(FilterConfig config) throws ServletException {
        //
    }

    /**
     * @see Filter#doFilter(ServletRequest, ServletResponse, FilterChain)
     */
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException
    {
        if (response instanceof HttpServletResponse) {
            HttpServletResponse httpres = (HttpServletResponse) response;
            chain.doFilter(request, wrapResponse(httpres, createTrimWriter(httpres)));
        } else {
            chain.doFilter(request, response);
        }
    }

    /**
     * @see Filter#destroy()
     */
    public void destroy() {
        //
    }

    // Utility (may be refactored to public utility class) ----------------------------------------

    /**
     * Create a new PrintWriter for the given HttpServletResponse which trims all whitespace.
     * @param response The involved HttpServletResponse.
     * @return A PrintWriter which trims all whitespace.
     * @throws IOException If something fails at I/O level.
     */
    private static PrintWriter createTrimWriter(final HttpServletResponse response)
        throws IOException
    {
        return new PrintWriter(new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true) {
            private StringBuilder builder = new StringBuilder();
            private boolean trim = false;

            public void write(int c) {
                builder.append((char) c); // It is actually a char, not an int.
            }

            public void write(char[] chars, int offset, int length) {
                builder.append(chars, offset, length);
                this.flush(); // Preflush it.
            }

            public void write(String string, int offset, int length) {
                builder.append(string, offset, length);
                this.flush(); // Preflush it.
            }

            // Finally override the flush method so that it trims whitespace.
            public void flush() {
                synchronized (builder) {
                    BufferedReader reader = new BufferedReader(new StringReader(builder.toString()));
                    String line = null;

                    try {
                        while ((line = reader.readLine()) != null) {
                            if (startTrim(line)) {
                                trim = true;
                                out.write(line);
                            } else if (trim) {
                                out.write(line.trim());
                                if (stopTrim(line)) {
                                    trim = false;
                                    println();
                                }
                            } else {
                                out.write(line);
                                println();
                            }
                        }
                    } catch (IOException e) {
                        setError();
                        // Log e or do e.printStackTrace() if necessary.
                    }

                    // Reset the local StringBuilder and issue real flush.
                    builder = new StringBuilder();
                    super.flush();
                }
            }

            private boolean startTrim(String line) {
                for (String match : START_TRIM_AFTER) {
                    if (line.contains(match)) {
                        return true;
                    }
                }
                return false;
            }

            private boolean stopTrim(String line) {
                for (String match : STOP_TRIM_AFTER) {
                    if (line.contains(match)) {
                        return true;
                    }
                }
                return false;
            }
        };
    }

    /**
     * Wrap the given HttpServletResponse with the given PrintWriter.
     * @param response The HttpServletResponse in which the given PrintWriter has to be wrapped.
     * @param writer The PrintWriter to be wrapped in the given HttpServletResponse.
     * @return The HttpServletResponse with the PrintWriter wrapped in.
     */
    private static HttpServletResponse wrapResponse(
        final HttpServletResponse response, final PrintWriter writer)
    {
        return new HttpServletResponseWrapper(response) {
            public PrintWriter getWriter() throws IOException {
                return writer;
            }
        };
    }

}

WhitespaceFilter configuration in web.xml:


    <filter>
        <description>
            This filter class removes any whitespace from the response. It actually trims all
            leading and trailing spaces or tabs and newlines before writing to the response stream.
            This will greatly save the network bandwidth, but this will make the source of the
            response harder to read.
        </description>
        <filter-name>whitespaceFilter</filter-name>
        <filter-class>net.balusc.webapp.WhitespaceFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>whitespaceFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>
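If you'd like to play with the trimming rules without deploying anything, the line handling of the flush() method can be tried out on a plain String. The sketch below is a simplified stand-in and not the filter itself: the class name WhitespaceTrimmer is made up, it uses no servlet API at all, and it always writes "\n" where the filter writes the platform line separator.

```java
// Hypothetical standalone version of the trimming rules used by the filter.
final class WhitespaceTrimmer {

    // Same markers as in the WhitespaceFilter.
    static final String[] START_TRIM_AFTER = {"<html", "</textarea", "</pre"};
    static final String[] STOP_TRIM_AFTER = {"</html", "<textarea", "<pre"};

    /** Applies the start/stop trimming rules line by line to the given HTML. */
    static String trim(String html) {
        StringBuilder out = new StringBuilder();
        boolean trim = false;

        for (String line : html.split("\r?\n")) {
            if (containsAny(line, START_TRIM_AFTER)) {
                trim = true;               // Start trimming after this line.
                out.append(line);
            } else if (trim) {
                out.append(line.trim());
                if (containsAny(line, STOP_TRIM_AFTER)) {
                    trim = false;          // Keep the following lines untouched.
                    out.append('\n');
                }
            } else {
                out.append(line).append('\n');
            }
        }

        return out.toString();
    }

    private static boolean containsAny(String line, String[] matches) {
        for (String match : matches) {
            if (line.contains(match)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String html = "<html>\n  <body>\n    <p>Hi</p>\n    <pre>\n  keep\n    </pre>\n  </body>\n</html>\n";
        System.out.print(trim(html));
    }
}
```

For the sample in main() the markup around the <pre> block collapses onto single lines, while the line inside the <pre> block keeps its indentation.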

That's all, folks!


Copyright - GNU General Public License

(C) December 2007, BalusC