Surrogate Characters?!

Some time ago we ran into a problem at work.

The file format that our software uses for saving and loading projects is basically a ZIP archive containing XML. We started noticing some strange error reports that all pointed in the direction of the save method. This is one part of the program that has never really caused any problems before. The exception being thrown said something about a “surrogate character”… More specifically, an invalid high surrogate… This was the first time I’d heard that term, so I quickly googled it and found the answer: something in the string being saved in the XML document was in an invalid range. .NET strings are UTF-16, where the code units from U+D800 to U+DFFF (the surrogates) are only valid as a high/low pair, and one of our strings contained a high surrogate without its partner. How nice…

This is particularly funny because the input data came from Excel 2007 files, which are also ZIP archives containing XML… Which means that the .NET conversion from XML to Unicode string and back to XML was failing…! The solution was to sanitize the strings before putting them in the XML document for output, something that seems like a nasty hack that I wouldn’t really need if the XML output code in .NET encoded the strings properly.
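
To give an idea of what such sanitizing can look like, here is a minimal sketch in Python. It is not the .NET code from the actual fix. The function name sanitize_for_xml and the choice to replace bad characters with U+FFFD (rather than dropping them) are assumptions made for this example. It joins a high surrogate followed by a low surrogate into the character they encode, and replaces any surrogate left on its own.

```python
import re

# High surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF.
_PAIR = re.compile("([\ud800-\udbff])([\udc00-\udfff])")
_LONE = re.compile("[\ud800-\udfff]")


def _join_pair(match):
    """Combine a high/low surrogate pair into the code point it encodes."""
    high = ord(match.group(1)) - 0xD800
    low = ord(match.group(2)) - 0xDC00
    return chr(0x10000 + (high << 10) + low)


def sanitize_for_xml(text, replacement="\ufffd"):
    """Return text with no unpaired surrogates (hypothetical helper).

    Valid pairs are merged first, so only surrogates that have no
    partner are left for the second pass to replace.
    """
    text = _PAIR.sub(_join_pair, text)
    return _LONE.sub(replacement, text)


if __name__ == "__main__":
    broken = "price: 42\ud800 EUR"  # lone high surrogate in the middle
    cleaned = sanitize_for_xml(broken)
    # Encoding the cleaned string no longer raises UnicodeEncodeError.
    print(cleaned.encode("utf-8"))
```

In .NET the same check would walk the string char by char, and char.IsHighSurrogate and char.IsLowSurrogate can be used to tell the two halves apart.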