My proposal to Servlet and JSP Specification > Our problem

Our problem

Any questions and comments are welcome.

1. Our problem

On the Tomcat developer mailing list (tomcat-dev), we have been discussing how to decode FORM data back to the original string, in other words, how to provide the appropriate parameter strings to the servlet writer.
Many messages well worth reading have been posted by the developers on tomcat-dev. I appreciate their enthusiastic support for Tomcat.

1.1 Server responsibility

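For the FORM parameter strings sent from the WWW client, the current version of the Servlet Specification provides the servlet writer with the parameter-access methods of ServletRequest, such as getParameter(String), getParameterNames() and getParameterValues(String).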

So, according to the Servlet Specification v2.2, the server implementation is responsible for making the parameter values (strings) available to the servlet writer.

To decode the encoded character array (in the form of '%12%34') back to the original string, we have to know the character encoding used on the client side. The character array can first be decoded to a byte array, and, given the appropriate Java character encoding, the original string can then be reconstructed from that byte array.
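The two steps described above (characters to bytes, then bytes to string) can be sketched as follows. This is only a minimal illustration, not code from Tomcat: the class name FormDecodeSketch and its methods are hypothetical, and the sample input assumes a client that sent its FORM data in UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
class FormDecodeSketch {

    // Step 1: turn the URL-encoded characters into the raw bytes the client sent.
    static byte[] toBytes(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("truncated escape");
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad hex digits");
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c == '+') {
                out.write(' '); // FORM encoding represents a space as '+'
            } else {
                out.write(c); // assumed to be a plain ASCII character
            }
        }
        return out.toByteArray();
    }

    // Step 2: rebuild the string from the bytes, using the client's encoding.
    static String decode(String encoded, String charsetName) {
        return new String(toBytes(encoded), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String s = decode("%E3%81%82", "UTF-8");
        // Print the code point rather than the character itself,
        // so the result does not depend on the console encoding.
        System.out.println(Integer.toHexString(s.charAt(0))); // prints 3042
    }
}
```

Step 1 needs no knowledge of the client at all. Only step 2 depends on the charset, which is why the rest of this page concentrates on how to determine it.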

In some cases, a string that was created from the byte array with the wrong encoding can be converted back to the right one in the following way:
   byte[] some=...
   //Create the bad string with the default Java encoding
   String bad=new String(some);

   //Get the original byte array back, again using the default
   //Java encoding
   byte[] other=bad.getBytes();

   //And then build the original string with the appropriate Java
   //encoding (substitute the encoding the client actually used)
   String enc="us-ascii";
   String good=new String(other,enc);
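If the solution above worked in all cases, the servlet writer could retrieve the original parameter strings, and this might be enough. (The server implementation could leave such a task to the servlet writer.)
But for double-byte characters such as Japanese, the code above generally does not work when the default Java encoding is not the appropriate one: any bytes that are invalid in the default encoding are replaced with a substitute character during the first conversion, and the original byte array can no longer be recovered.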
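The following sketch shows when the round trip above survives and when it does not. It makes several assumptions that are not in the original discussion: the class name RoundTripSketch is hypothetical, explicit charsets stand in for "the default Java encoding", and a UTF-8 encoded Japanese character stands in for the bytes sent by the client.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class RoundTripSketch {

    // Decode the bytes with the wrong charset, then encode the resulting
    // string again with that same charset (the "bad" -> "other" steps).
    static byte[] roundTrip(byte[] original, Charset wrong) {
        String bad = new String(original, wrong);
        return bad.getBytes(wrong);
    }

    public static void main(String[] args) {
        // The hiragana letter U+3042, as a client might send it in UTF-8.
        byte[] sent = "\u3042".getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte value to a character, so nothing is lost.
        byte[] viaLatin1 = roundTrip(sent, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(sent, viaLatin1)); // prints true

        // US-ASCII has no mapping for bytes above 0x7F, so they are replaced
        // during decoding and the original bytes are gone.
        byte[] viaAscii = roundTrip(sent, StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(sent, viaAscii)); // prints false
    }
}
```

Whether the servlet writer can repair a badly decoded string therefore depends on the default encoding of the server's Java runtime, which the servlet writer usually does not control.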

So, it is the responsibility of the server implementation to decode the FORM data to the original string and make it available to the servlet writer.

1.2 And what is missing?

As I described above, to get the original parameter string, the server implementation must convert the byte array to a string using the appropriate Java character encoding. Given the original charset of the client side, the server implementation can easily determine the corresponding Java character encoding.
But how can the server implementation tell which charset was used on the client side?

The original charset of the WWW client should be set as the 'charset' attribute of the 'Content-Type' header. But the WWW browsers available at this time do not supply the 'charset' attribute, and this makes decoding difficult.

Thus we, as the developers of the server implementation, find it difficult to tell the charset in which the original FORM string is encoded.
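If a client did supply the 'charset' attribute, reading it would be straightforward. The sketch below is a simplified parser written for this page only: the class name ContentTypeCharsetSketch and the method charsetOf are hypothetical, and the parsing ignores corner cases such as semicolons inside quoted values. It returns null when the attribute is missing, which is the situation described above.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
class ContentTypeCharsetSketch {

    // Returns the value of the 'charset' attribute, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // parts[0] is the media type; the attributes follow after ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2
                        && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // What the server would like to receive:
        System.out.println(charsetOf(
            "application/x-www-form-urlencoded; charset=Shift_JIS")); // prints Shift_JIS
        // What it receives when the attribute is omitted:
        System.out.println(charsetOf(
            "application/x-www-form-urlencoded")); // prints null
    }
}
```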

1.3 Alternative way to determine the charset

As long as we can't rely on the WWW client, we have to find another way to determine the charset of the client side. We can list 3 possible options:

A. Web-master (deployer) supplies the charset:
The one who deploys the web application knows the charset of the FORM data, so the server implementation can depend on the charset specified in the configuration file.
B. Guess the charset based on 'Accept-Language':
The server implementation can guess (to some extent) the charset from the list of languages in the 'Accept-Language' header.
C. Use the charset specified on HttpServletResponse:
In the usual case, the FORM data is sent from an HTML page (which may be generated by a servlet), and that HTML page (or servlet) belongs to the same web application as the servlet which receives the FORM data. This means the FORM data is usually encoded in the same charset as the response of the servlet, so the server implementation can use the charset set on HttpServletResponse.
We are not sure which of these options gives the server implementation the right charset in most cases. The best way to approximate the charset may differ for each web application.
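To make the three options concrete, here is one way a server might combine them. Everything in this sketch is an assumption made for illustration: the class name CharsetGuessSketch, the order in which A, B and C are tried, and above all the language-to-charset table used for option B (a Japanese client might just as well send EUC-JP as Shift_JIS, which is exactly why B is only a guess).

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
class CharsetGuessSketch {

    // Option B: guess from the first language in 'Accept-Language'.
    // The mapping below is an assumed example table, not a standard.
    static String fromAcceptLanguage(String acceptLanguage) {
        if (acceptLanguage == null) {
            return null;
        }
        String[] tags = acceptLanguage.split(",");
        if (tags.length == 0) {
            return null;
        }
        String first = tags[0];
        int semi = first.indexOf(';'); // drop a quality value such as ";q=0.8"
        if (semi >= 0) {
            first = first.substring(0, semi);
        }
        first = first.trim().toLowerCase(Locale.ROOT);
        if (first.startsWith("ja")) {
            return "Shift_JIS";
        }
        if (first.startsWith("en")) {
            return "ISO-8859-1";
        }
        return null;
    }

    // Try A, then B, then C, and finally a fallback supplied by the caller.
    static String resolve(String configured, String acceptLanguage,
                          String responseCharset, String fallback) {
        if (configured != null) {
            return configured;                               // option A
        }
        String guessed = fromAcceptLanguage(acceptLanguage); // option B
        if (guessed != null) {
            return guessed;
        }
        if (responseCharset != null) {
            return responseCharset;                          // option C
        }
        return fallback;
    }

    public static void main(String[] args) {
        // No deployer setting, Japanese browser, response written in UTF-8:
        System.out.println(
            resolve(null, "ja,en;q=0.5", "UTF-8", "ISO-8859-1")); // prints Shift_JIS
    }
}
```

In this example options B and C disagree, and nothing in the request says which of them is right.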

1.4 Still too heavy a responsibility for the server implementation

If the charset determined by one of the options above is the right one, the server implementation can provide the original string to the servlet writer. But we can't ensure that this is always the case. As you can guess, none of A, B or C can always supply the right charset.
As a result, we can say that telling the right charset of the client browser is too heavy a responsibility for the server implementation.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.



ALL CONTENTS COPYRIGHT 2000, Jun Inamori. All rights reserved.
Any questions and comments are welcome to Jun Inamori.