
Question about RichEditBox handling of Unicode text files

Posted: Fri Nov 07, 2014 11:14 am
by kcarmody
RichEditBox has two methods that handle files, RtfLoadFile and RtfSaveFile. These methods call RichEditBox_StreamIn and RichEditBox_StreamOut in c_richeditbox.c.

These two functions handle RTF and ANSI text files OK, but have some problems with Unicode text files. They seem to ignore the byte order marks (BOM) that are usually necessary for software to recognize text files as Unicode text files.

RichEditBox_StreamIn removes the BOM from a UTF-8 text file (nDataFormat = 1), but it does not remove the BOM from a UTF-16 text file (nDataFormat = 3). This behavior is actually implicit in the Windows code that the function calls.
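For illustration, the missing UTF-16 case could be covered by checking the first bytes of the file before the data is streamed in. This is only a sketch: BomKind and detect_bom are made-up names, not anything that exists in c_richeditbox.c, and it ignores UTF-32 entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of c_richeditbox.c. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE } BomKind;

/* Look at the first bytes of a buffer and report which BOM (if any) it
   starts with. *bom_len receives the number of bytes to skip.
   UTF-32 BOMs are deliberately not handled in this sketch. */
static BomKind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    assert(buf != NULL || len == 0);
    assert(bom_len != NULL);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;               /* UTF-8: EF BB BF */
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;               /* UTF-16 little-endian: FF FE */
        return BOM_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;               /* UTF-16 big-endian: FE FF */
        return BOM_UTF16BE;
    }
    *bom_len = 0;                   /* no BOM recognized */
    return BOM_NONE;
}
```

A caller would skip bom_len bytes and could also use the result to pick nDataFormat instead of trusting the caller's guess.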

RichEditBox_StreamOut does not add any BOMs, either to UTF-8 (nDataFormat = 1 or 2), or to UTF-16 (nDataFormat = 3). Some software can recognize unmarked UTF-8, but no software I have ever seen recognizes unmarked UTF-16.
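A rough sketch of what adding the BOM on output might look like, written to the file before the stream-out data. The function names bom_for_format and write_bom are hypothetical. The nDataFormat values are the ones quoted above, and I am assuming the UTF-16 output is little-endian, which is the usual case on Windows.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: fill `out` with the BOM for a given nDataFormat
   (values as described in this post) and return its length in bytes.
   Returns 0 for formats that should not get a BOM (ANSI text, RTF). */
static size_t bom_for_format(int nDataFormat, unsigned char out[3])
{
    static const unsigned char utf8[3]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16le[2] = {0xFF, 0xFE};

    assert(out != NULL);
    switch (nDataFormat) {
    case 1:
    case 2:                         /* UTF-8 text */
        memcpy(out, utf8, sizeof utf8);
        return sizeof utf8;
    case 3:                         /* UTF-16 text, assumed little-endian */
        memcpy(out, utf16le, sizeof utf16le);
        return sizeof utf16le;
    default:
        return 0;
    }
}

/* Write the BOM at the start of a file opened in binary mode.
   Returns 0 on success (including "no BOM needed"), -1 on write error. */
static int write_bom(FILE *fp, int nDataFormat)
{
    unsigned char bom[3];
    size_t n;

    assert(fp != NULL);
    n = bom_for_format(nDataFormat, bom);
    return (n == 0 || fwrite(bom, 1, n, fp) == n) ? 0 : -1;
}
```

The file has to be opened in binary mode ("wb"), otherwise the C runtime on Windows may translate bytes in the text that follows.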

I think that Windows acts this way because the EM_STREAMIN and EM_STREAMOUT messages are designed for "data streams", which may be internal buffers as well as file contents. Windows seems to assume that the developer will take care of BOMs if the data stream is going to or from a file.

All software that handles Unicode text files recognizes marked text files, so there is never any harm in putting a BOM in, while plenty of harm can come from leaving it out.

Both of these functions include a case (nDataFormat = 5) for UTF-8 RTF, but this is useless, as RTF encodes all Unicode characters as plain text RTF commands. So you never see a UTF-8 RTF file, and if you did, nothing would open it.
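To show what "plain text RTF commands" means here: RTF writes a non-ASCII character as a \uN control word, where N is the UTF-16 code unit as a signed 16-bit decimal number, followed by a fallback character for readers that do not understand \u. A small sketch (rtf_escape_unit is a made-up name, and the '?' fallback assumes the default of one fallback character):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format one UTF-16 code unit as an RTF \uN escape
   with '?' as the fallback character. Characters outside the BMP would
   need one escape per surrogate. Returns what snprintf returns. */
static int rtf_escape_unit(unsigned short unit, char *out, size_t out_size)
{
    /* RTF treats N as a signed 16-bit value, so code units above 32767
       are written as negative numbers. */
    int n = (unit > 32767) ? (int)unit - 65536 : (int)unit;

    assert(out != NULL && out_size > 0);
    return snprintf(out, out_size, "\\u%d?", n);
}
```

For example, the euro sign U+20AC comes out as \u8364? and the file itself stays pure ASCII, which is why the encoding of an RTF file on disk never needs to be UTF-8.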

I came across the BOM problem when I was enhancing the Rich Edit Demo, viewtopic.php?f=9&t=4030. It was important to me to be able to read and write text files, so I added some workarounds to the demo to fix the behavior of RichEditBox_StreamIn/Out. This was a quick fix using Memoread and Memowrite, but it would be better to use fread and fwrite, either in Harbour or in C.

I could add such fixes into h_controlmisc.prg (definition of RtfLoadFile and RtfSaveFile methods) or into c_richeditbox.c (definition of RichEditBox_StreamIn/Out), but that would change the behavior of these methods and functions.

The question is, should these methods and functions be changed so that they handle BOMs? It might break existing code if we do. But I suspect that no one is using this code now, as it does not handle BOMs properly.

Kevin

Re: Question about RichEditBox handling of Unicode text files

Posted: Fri Nov 07, 2014 12:14 pm
by bpd2000
Thank you, Mr. Kevin, for more info on Unicode text files.

Re: Question about RichEditBox handling of Unicode text files

Posted: Fri Nov 07, 2014 12:20 pm
by esgici
bpd2000 wrote: Thank you, Mr. Kevin, for more info on Unicode text files.
+1

Re: Question about RichEditBox handling of Unicode text files

Posted: Fri Nov 07, 2014 4:48 pm
by Javier Tovar
bpd2000 wrote: Thank you, Mr. Kevin, for more info on Unicode text files.
+1

I think the coffee is good over there! :)

Regards