HARBOUR UNICODE

Post by **Roberto Lopez** » Sat Sep 04, 2010 2:10 pm

Hi All,

Viktor, from the Harbour team, posted an interesting message about Unicode support in the Harbour compiler.

Subject: [harbour] Harbour Unicode
To: harbour-devel@googlegroups.com

Hi,

I recently switched my apps to use UTF-8 in all sources and
external files (except databases), which is quite a nice step, but due
to limits in Harbour, apps can still only use legacy 8-bit codepages
internally.
This means that in some places I had to resort to some extra steps
because UTF-8 -> CP != CP -> UTF-8 in a general sense, plus
I have to manually make the conversion from UTF-8, wherever
required. [ Nevertheless I hope it will be useful as finally I can use
non-Windows OS/tools to edit the sources and files, while this was
very difficult to do with old 852 CP. ]
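
To illustrate why the two conversion directions are not symmetric, here is a minimal sketch in Python (not Harbour; the sample text is only an assumption chosen for the demonstration). Any character missing from the 8-bit codepage is lost on the way in, so converting back cannot restore it:

```python
# Python sketch (not Harbour): UTF-8 -> codepage -> UTF-8 can lose data.
text = "ő €"  # U+0151 exists in CP852, the euro sign does not

# Characters that CP852 cannot represent are replaced with b"?".
cp852_bytes = text.encode("cp852", errors="replace")
round_trip = cp852_bytes.decode("cp852")

# A character that CP852 does contain survives the round trip.
lossless = "ő".encode("cp852").decode("cp852")

print(repr(text), "->", repr(round_trip))
print(repr(lossless))
```

Going the other way (codepage to Unicode and back) is safe for characters the codepage defines, which is why the problem only shows up in one direction.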

To take this to the next level, Harbour would need to have native
UNICODE support in core, so that it could handle UNICODE strings
as-is in all RTL functions and HVM operations.

Any app will have to differentiate between UNICODE strings and
raw byte streams (= binary data or strings using legacy CP),
so I was thinking of a system where string markers could be used
to mark up these string types:

u"Hello, this is a UNICODE string, in UTF-8 encoding"
b"This is raw bytestream"

Plain (unprefixed) string markers would denote a raw bytestream by
default to keep Clipper compatibility, and this could be
changed with a Harbour compiler option, so that regular
string markers mean UTF-8 encoded UNICODE strings.
This would offer an easy upgrade path for app developers.
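
The marker rule described above could be sketched roughly as follows. This is a Python illustration only, not Harbour compiler code; `parse_literal` and `default_is_unicode` are hypothetical names standing in for the compiler logic and the proposed compiler option:

```python
def parse_literal(prefix, source_bytes, default_is_unicode=False):
    """Decide what a string literal means from its marker.

    prefix is "u", "b", or "" for a plain (unprefixed) literal.
    default_is_unicode stands in for the proposed compiler option.
    """
    if prefix not in ("u", "b", ""):
        raise ValueError("unknown string marker: %r" % prefix)
    if prefix == "u" or (prefix == "" and default_is_unicode):
        # Unicode string: the source bytes are interpreted as UTF-8.
        return source_bytes.decode("utf-8")
    # Raw byte stream: the bytes are kept exactly as found in the source.
    return bytes(source_bytes)


raw = "ő".encode("utf-8")  # the two bytes 0xC5 0x91 as stored in a UTF-8 source file
legacy_default = parse_literal("", raw)
unicode_default = parse_literal("", raw, default_is_unicode=True)
forced_bytes = parse_literal("b", raw, default_is_unicode=True)
```

With the default setting, existing code keeps seeing bytes; flipping the option changes only the meaning of unprefixed literals, while explicit `u` and `b` markers keep their meaning either way.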

HB_ITEM would have to be extended with a new UNICODE
string type; the current one would continue to mean raw byte
stream.

From this point all internal operations (functions/operators)
can query the string type and act accordingly.

E.g. by default:
ASC( u"ő" ) would return 337, while
ASC( "ő" ) would return 245 (in case the source file was encoded in
8-bit ISO-8859-2 CP).

In the above example, the ASC() implementation would check which
string type has been passed and act accordingly.
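
The type check can be mimicked in Python, which already separates text (`str`) from raw bytes (`bytes`). This is only an analogy for the proposed behaviour, not Harbour code; the `asc` helper below is hypothetical:

```python
def asc(value):
    """Return the code of the first element, depending on the string type."""
    if isinstance(value, str):
        # Unicode string: the first element is a code point.
        return ord(value[0]) if value else 0
    if isinstance(value, (bytes, bytearray)):
        # Raw byte stream: the first element is a single byte (0..255).
        return value[0] if value else 0
    raise TypeError("asc() expects a Unicode string or a byte stream")


unicode_result = asc("ő")                      # U+0151
legacy_result = asc("ő".encode("iso-8859-2"))  # byte 0xF5 in ISO-8859-2

print(unicode_result, legacy_result)  # prints: 337 245
```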

We will have to decide what encoding to use for UNICODE
strings internally. IMO the two meaningful choices here are
UTF-8 and UTF-32, where UTF-8 is slower in any operation
where characters are addressed by index, and UTF-32 is
easy to handle but consumes more memory. [ Probably UTF-8
is still better, everything considered. ]
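
The trade-off can be made concrete with a small Python sketch (an illustration under assumptions, not Harbour internals; both helper functions are hypothetical). In UTF-32 the n-th character sits at a fixed offset, while in UTF-8 the bytes must be scanned to find where the n-th character starts:

```python
def utf32_char_at(data, index):
    """Fixed-width lookup: character n lives at byte offset 4 * n."""
    return data[4 * index:4 * index + 4].decode("utf-32-le")


def utf8_char_at(data, index):
    """Variable-width lookup: scan for code point starts, which is O(n)."""
    # Every byte that is not a continuation byte (10xxxxxx) starts a code point.
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    begin = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(data)
    return data[begin:end].decode("utf-8")


text = "tűrő"
as_utf8 = text.encode("utf-8")       # 1 + 2 + 1 + 2 = 6 bytes
as_utf32 = text.encode("utf-32-le")  # 4 characters * 4 bytes = 16 bytes

print(len(as_utf8), len(as_utf32))
print(utf8_char_at(as_utf8, 3), utf32_char_at(as_utf32, 3))
```

For mostly Latin text the UTF-8 form is far smaller, which is the memory argument in its favour; the cost is the scan in `utf8_char_at`.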

We have to make a per-function decision about how string
parameters are accepted and handled. Some functions may
act differently on UNICODE and bytestream strings, some may
internally need one or the other and make the required
conversion. Another thing to decide is which functions
return which string type.

Current HB_CDPSELECT() will only influence the handling
of 8-bit (legacy) bytestreams.

Probably all current hb_parc[x]() calls will have to be replaced
with a new API where we allow legacy bytestreams to be passed.
Fortunately this is a problem only with a smaller part of the
functions.

Also, most interfaces with 3rd party libs will have to be extended
to use the str API (just like hbwin and hbodbc do now), where the required
string format differs from the Harbour internal format and we're dealing
with strings instead of bytestreams.

Does this look like a path we can start on? Any comments are
welcome.

Viktor

Post by **Rathinagiri** » Sat Sep 04, 2010 2:20 pm

Nice information. I am waiting for an HMG with full Unicode support!
