Characters in the extended ASCII character range output to an XML file.
This article explains how characters in the extended ASCII range are handled when a Net Express program uses XML syntax to write them to an XML file.
A customer was using Net Express to generate an XML document from input data that contained characters in the extended ASCII range (128 thru 255), such as the euro symbol (€, decimal 128, hex 80) and the bullet (•, decimal 149, hex 95).
Initially the XML file was generated with encoding="UTF-8".
When the generated file was opened in an XML editor, it did not display correctly and appeared to be corrupt.
But when it was opened in a text editor (e.g. Wordpad or Notepad), the XML code was displayed as expected (although as a text document, not as an XML document). This suggested the file had been written correctly, because all characters, including the extended ASCII characters, were visible.
Further, when the file was opened in a hex editor, it showed the extended ASCII characters written as single bytes: hex 80 for the euro sign, hex 95 for the bullet, and so on. The user believed these should have been converted to their equivalent HTML names, and that this was the reason they were not being displayed correctly.
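The following minimal Python sketch (not Net Express code) shows one way to check why an XML editor might reject such a file when it is declared as UTF-8. The helper name is_valid_utf8 is hypothetical; the byte values are the ones reported by the hex editor.

```python
# Bytes as the hex editor showed them: euro = hex 80, bullet = hex 95.
raw = b"\x80 \x95"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hex 80 and hex 95 are continuation bytes in UTF-8 and cannot stand alone.
print(is_valid_utf8(raw))  # False

# In real UTF-8 the same two characters each take three bytes.
print("€•".encode("utf-8").hex(" "))  # e2 82 ac e2 80 a2
```

A text editor that assumes the Windows codepage shows the single bytes as € and •, which is why the file looked correct there even though it is not valid UTF-8.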
The program was then recompiled with encoding="ISO-8859-1".
This time, when the file was opened in an XML editor, it was displayed as an XML document (unlike with UTF-8), but the euro and bullet characters were not shown.
However, when the file was opened with a text or hex editor, all characters appeared to be correct.
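A short Python sketch, offered as an illustration and not as the behaviour of any particular XML editor, suggests why the document opened but the two characters were missing. In ISO-8859-1 every byte is valid, but hex 80 and hex 95 map to non-printing control codes instead of the euro and bullet. The helper name describe is hypothetical.

```python
import unicodedata


def describe(data: bytes) -> list[tuple[str, str]]:
    """Decode bytes as ISO-8859-1 and report each code point and its Unicode category."""
    text = data.decode("iso-8859-1")  # never fails: byte N becomes code point N
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# Category "Cc" means a control character, which has no visible glyph.
for code_point, category in describe(b"\x80\x95"):
    print(code_point, category)
```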
The client noticed that there were four characters in the ASCII printable character set (32 thru 127) that had been converted to their equivalent entity names. These four characters are <, >, " and &.
The 'less than' sign (<) was converted to &lt;, the 'greater than' sign (>) to &gt;, the quote (") to &quot; and the ampersand (&) to &amp;.
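Because these characters had been converted, it was assumed that all characters in the 128-255 range that have an equivalent HTML name should also be converted.
But this was an incorrect assumption. XML itself predefines only five entities (&lt;, &gt;, &amp;, &quot; and &apos;); HTML names such as &euro; are not defined in XML unless a DTD declares them.
The <, >, " and & characters are converted for a very simple reason: they are special characters used by XML markup, and if they were not converted they would be confused with the markup itself.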
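The same rule can be illustrated with Python's standard xml.sax.saxutils.escape function. This is only a sketch of the general XML escaping convention, not of how Net Express implements it, and the wrapper name escape_markup is hypothetical.

```python
from xml.sax.saxutils import escape


def escape_markup(text: str) -> str:
    """Escape the four markup characters discussed above (hypothetical wrapper)."""
    # escape() handles &, < and > by default; the quote is added via the entities map.
    return escape(text, {'"': "&quot;"})


print(escape_markup('a < b & "c" > d'))  # a &lt; b &amp; &quot;c&quot; &gt; d

# Characters outside the markup set, such as the euro and bullet, pass through unchanged.
print(escape_markup("€ •") == "€ •")  # True
```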
Once the customer changed the declaration to encoding="Windows-1252", the document worked fine.
All characters are written exactly as read, with the exception of the <, >, " and & characters. What is critical is the codeset used to read the document, which is determined by the encoding declaration.
All characters in the ASCII range 32 thru 127 are standard and will therefore always be displayed correctly by all XML tools. Those in the range 128 thru 255 are not standard, and their meaning varies from one codeset to another: hex 80 and hex 95 are the euro and the bullet in Windows-1252, but not in ISO-8859-1 or UTF-8. How they are displayed is therefore determined purely by how the XML editor interprets them.