Sunday, April 10, 2011

ï»¿ when Reading a text file in Java

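If you have code that processes text files using InputStreamReader, you may be puzzled to find garbage characters (ï»¿) at the start of the resulting String when you output the contents of the file.

This happens because the Reader decodes your input file with the wrong encoding when the data is actually stored as UTF-8. An InputStreamReader created without an explicit charset uses the platform default, which is often a single-byte Western encoding rather than UTF-8. The three characters are what the UTF-8 byte order mark (the bytes EF BB BF that some editors write at the start of a file) looks like when decoded that way.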
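To see where the three characters come from, here is a minimal sketch. The class name BomDemo and its decode helper are made up for illustration, and it assumes the file begins with the UTF-8 byte order mark. It uses ISO-8859-1 as a stand-in for a Western platform default; windows-1252 maps these three bytes to the same characters.

```java
import java.nio.charset.Charset;

class BomDemo {
    // The UTF-8 byte order mark (EF BB BF) followed by the ASCII text "<html>"
    static final byte[] DATA = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'h', 't', 'm', 'l', '>'
    };

    // Decode the same bytes with whichever charset the caller names
    static String decode(String charsetName) {
        return new String(DATA, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String wrong = decode("ISO-8859-1");
        String right = decode("UTF-8");
        // Print code points rather than glyphs so the console encoding cannot muddy the result
        for (int i = 0; i < 3; i++) {
            System.out.printf("U+%04X ", (int) wrong.charAt(i)); // U+00EF U+00BB U+00BF
        }
        System.out.println();
        System.out.printf("U+%04X%n", (int) right.charAt(0)); // U+FEFF
    }
}
```

Decoded as a single-byte encoding, each of the three bytes becomes its own visible character (ï, », ¿). Decoded as UTF-8, they collapse into the single character U+FEFF.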

While I came across this in Java, the ASCII/UTF-8 solution was actually in the aptly named blog post "Three little characters ï»¿ designed to make your life hell" by Martyn at the Ventrino Blog.

The solution proposed in that post is basically "if your text file is saved as UTF-8, that is the problem, save as ASCII encoding instead". I don't think that is best practice. UTF-8 is better for internationalization, so the developer should change the code to support that format rather than restrict the program to the limited set of US/Western ASCII characters.

Here is an example of code that, when reading a UTF-8 encoded file (say an .html file), will display the ï»¿ characters:

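import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

//stores the HTML text; in my case I needed to pass it in to an SWT Browser.setText() method.
StringBuilder buffer = new StringBuilder();
//read the HTML to display from a resource (Resources is my own helper class)
InputStream htmlInStream = Resources.getResourceAsStream(Resources.HTML_ABOUT_HTML);

//no charset is given here, so the platform default encoding is used
BufferedReader bufInpStream = new BufferedReader(new InputStreamReader(htmlInStream));
String line;

//readLine() returns null at end of input
while ((line = bufInpStream.readLine()) != null) {
    //System.out.println(line);
    //readLine() strips the line terminator, so put one back
    buffer.append(line).append('\n');
}
bufInpStream.close();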


The key is the InputStreamReader that wraps the underlying InputStream for the HTML file. Readers are designed for processing text, while Streams deal with raw binary data, so when converting from a Stream to a Reader we have to say which encoding to use; otherwise InputStreamReader falls back to the platform's default charset. This correction solves the problem:

//if we didn't specify the char encoding as UTF-8, it would show these strange chars: ï»¿
BufferedReader bufInpStream = new BufferedReader(new InputStreamReader(htmlInStream,"UTF-8"));
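One caveat: as far as I know, Java's UTF-8 decoder does not discard the byte order mark. It passes it through as the single character U+FEFF, which is invisible in most displays but still sits at index 0 of the string. If that matters (for string comparisons or parsing, say), a sketch like the following strips it. The names Utf8Text and readAll are hypothetical, not part of any library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class Utf8Text {
    // Reads the whole stream as UTF-8 and drops a leading byte order mark, if present.
    static String readAll(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuilder buffer = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so put one back
                buffer.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        // The decoded BOM is the single char U+FEFF at the very start
        if (buffer.length() > 0 && buffer.charAt(0) == '\uFEFF') {
            buffer.deleteCharAt(0);
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        // BOM bytes followed by "hi"
        byte[] data = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        String text = readAll(new ByteArrayInputStream(data));
        System.out.println(text.length()); // prints 3: 'h', 'i' and the newline added by readAll
    }
}
```

In the snippet above, the equivalent would be to check whether buffer starts with '\uFEFF' after the loop and delete that one character before passing the text on.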

