From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.5-pre1 (2020-06-20) on
	ip-172-31-74-118.ec2.internal
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=3.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=ham autolearn_force=no version=3.4.5-pre1
X-Received: by 2002:a37:59c7:: with SMTP id n190mr11386qkb.146.1624300437900;
        Mon, 21 Jun 2021 11:33:57 -0700 (PDT)
X-Received: by 2002:a25:2405:: with SMTP id k5mr1724858ybk.405.1624300437626;
 Mon, 21 Jun 2021 11:33:57 -0700 (PDT)
Path: eternal-september.org!reader02.eternal-september.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Mon, 21 Jun 2021 11:33:57 -0700 (PDT)
In-Reply-To: <lybl7zgrxy.fsf@pushface.org>
Injection-Info: google-groups.googlegroups.com; posting-host=87.88.29.208; posting-account=6yLzewoAAABoisbSsCJH1SPMc9UrfXBH
NNTP-Posting-Host: 87.88.29.208
References: <f9f32d7f-2265-4dd5-8bcb-c477ca449cf3n@googlegroups.com>
 <lyk0mph7j6.fsf@pushface.org> <b4c0edbd-7567-47cb-ba75-2fa27d75a788n@googlegroups.com>
 <lybl7zgrxy.fsf@pushface.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8d443406-48dc-4d4e-868c-832caabebd1en@googlegroups.com>
Subject: Re: XMLAda & unicode symbols
From: Emmanuel Briot <briot.emmanuel@gmail.com>
Injection-Date: Mon, 21 Jun 2021 18:33:57 +0000
Content-Type: text/plain; charset="UTF-8"
Xref: reader02.eternal-september.org comp.lang.ada:62269
List-Id: <comp.lang.ada>

> A scan through XML/Ada shows that the only uses of Unicode_Char are in 
> the SAX subset. I don't see any way in the DOM subset of XML/Ada of 
> using them - someone please prove me wrong! 

Those two subsets are not independent: the DOM subset is built entirely on top of the SAX one.
So anything that applies to SAX also applies to DOM.

That said, the DOM standard (at the time I built XML/Ada, roughly 20 years ago) likely
did not have standard functions that receive unicode characters, only strings.
DOM implementations are free to use any internal representation they want, and I think they did not
have to accept any random encoding. XML/Ada is not user-friendly; it really is only a fairly low-level
implementation of the DOM standard. Using DOM without high-level things like XPath is a real
pain. At the time, someone else had done an XPath implementation, so I never took the time to
duplicate that effort.

Conversion between various encodings (8bit, utf-8, utf-16 or utf-32) is done via the
`unicode` module of XML/Ada, for instance `unicode-ces-utf8.ads`. The encoding packages all provide a similar API.
In this case you want the `Encode` procedure. It is a procedure rather than a function (so it doesn't return a
Byte_Sequence directly) for efficiency reasons, even if a function would admittedly be more convenient for end-users.
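
For what it's worth, a minimal sketch of wrapping `Encode` in a function could look like the code below. It
assumes the profile is `Encode (Char : Unicode_Char; Output : in out Byte_Sequence; Index : in out Natural)`,
with `Index` being the last position already written on entry and the last byte of the new character on exit;
please check that against the spec you have installed. `To_Utf8` and `Encode_Demo` are hypothetical names,
not part of XML/Ada.

```ada
with Ada.Text_IO;
with Unicode;
with Unicode.CES;
with Unicode.CES.Utf8;

procedure Encode_Demo is

   --  Hypothetical convenience wrapper (not part of XML/Ada):
   --  returns the utf-8 byte sequence for a single code point.
   function To_Utf8
     (Char : Unicode.Unicode_Char) return Unicode.CES.Byte_Sequence
   is
      --  Six bytes is more than enough for one utf-8 encoded character.
      Buffer : Unicode.CES.Byte_Sequence (1 .. 6);
      --  Assumption: Encode writes starting at Index + 1 and then
      --  advances Index to the last byte it wrote.
      Index  : Natural := Buffer'First - 1;
   begin
      Unicode.CES.Utf8.Encode (Char, Buffer, Index);
      return Buffer (Buffer'First .. Index);
   end To_Utf8;

begin
   --  U+03C0 is GREEK SMALL LETTER PI; the bytes are written as-is,
   --  so a utf-8 terminal is needed to display the character.
   Ada.Text_IO.Put_Line (To_Utf8 (16#03C0#));
end Encode_Demo;
```

The wrapper pays for a copy on every call, which is the cost the procedure form avoids when you encode many
characters into one buffer.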

As someone rightly mentioned, though, it doesn't really make sense to use XML/Ada to build a tree in memory just
for the sake of printing it. Ada.Text_IO or streams will be much more efficient. XML/Ada is only useful
to parse XML streams (in which case you generally never have to encode a character to a byte sequence
yourself).
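
To illustrate the point about printing directly, here is a rough sketch that writes a tiny document with
Ada.Text_IO only. The element names are made up for the example, the utf-8 bytes for U+03C0 are spelled
out by hand, and it assumes Text_IO passes the bytes of a String through unchanged (which is what GNAT
does by default, as far as I know).

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Write_Xml is
   --  utf-8 encoding of U+03C0 (GREEK SMALL LETTER PI): 16#CF# 16#80#
   Pi : constant String :=
     Character'Val (16#CF#) & Character'Val (16#80#);
begin
   Put_Line ("<?xml version=""1.0"" encoding=""UTF-8""?>");
   Put_Line ("<symbols>");
   --  No escaping is done here: real data containing '<' or '&'
   --  would need to be escaped before being written.
   Put_Line ("  <symbol name=""pi"">" & Pi & "</symbol>");
   Put_Line ("</symbols>");
end Write_Xml;
```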

> > we need to convert it, then let us do so outside of it.
> That is *exactly* what you have to do (convert outside, not throw any 
> old sequence of octets and 32-bit values somehow mashed together at 
> it

Well said Simon, thanks. Basically, the whole application should be utf-8 if you care at all about international
characters (if you don't, feel free to use latin-1, or any encoding your terminal supports). So conversion should not
occur at the interface to XML/Ada, but only on input and output of your program.
XML/Ada just assumes a string is a sequence of bytes. The actual encoding has to be known by the application,
and be consistent.
If for some reason (Windows?) you prefer utf-16 internally, you can change `sax-encodings.ads` and recompile.
(It would have been neater to use generic traits packages, but I did not know about them until a few years later.)
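
As a sketch of what "convert only on input and output" can look like, the example below turns a latin-1
string into utf-8 once, at the edge of the program, using only the standard Ada 2012 library (not XML/Ada).
`Latin_1_To_Utf8` and `Boundary_Demo` are hypothetical names chosen for the illustration.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Boundary_Demo is

   --  Hypothetical input-boundary helper: a String holds latin-1 in
   --  standard Ada, so widen it first, then encode it as utf-8.
   function Latin_1_To_Utf8
     (Item : String) return Ada.Strings.UTF_Encoding.UTF_8_String is
   begin
      return Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
        (Ada.Characters.Conversions.To_Wide_Wide_String (Item));
   end Latin_1_To_Utf8;

   --  "caf" followed by latin-1 e-acute (16#E9#), as legacy input.
   Legacy   : constant String := "caf" & Character'Val (16#E9#);

   --  From here on the program only deals with utf-8 bytes, and the
   --  same bytes can be handed to XML/Ada without further conversion.
   Internal : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Latin_1_To_Utf8 (Legacy);

begin
   Ada.Text_IO.Put_Line (Internal);
end Boundary_Demo;
```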

It would also have been nicer to use a string type that knows about the encoding. I wrote GNATCOLL.Strings for
that purpose several years later too. XML/Ada was never used extensively, so it was never a priority for AdaCore
to update it to use all these packages, at the risk of either breaking backward compatibility, or duplicating the
whole API to allow for the various string types. Not worth it.

Emmanuel