comp.lang.ada
From: Emmanuel Briot <briot.emmanuel@gmail.com>
Subject: Re: XMLAda & unicode symbols
Date: Mon, 21 Jun 2021 11:33:57 -0700 (PDT)
Message-ID: <8d443406-48dc-4d4e-868c-832caabebd1en@googlegroups.com>
In-Reply-To: <lybl7zgrxy.fsf@pushface.org>

> A scan through XML/Ada shows that the only uses of Unicode_Char are in 
> the SAX subset. I don't see any way in the DOM subset of XML/Ada of 
> using them - someone please prove me wrong! 

Those two subsets are not independent: the DOM subset is built entirely on top of the SAX one.
So anything that applies to SAX also applies to DOM.

That said, the DOM standard (at the time I built XML/Ada, roughly 20 years ago) likely
did not have standard functions that take Unicode characters, only strings.
DOM implementations are free to use any internal representation they want, and I think they did not
have to accept any random encoding. XML/Ada is not user-friendly; it really is only a fairly low-level
implementation of the DOM standard. Using DOM without high-level things like XPath is a real
pain. At the time, someone else had done an XPath implementation, so I never took the time to
duplicate that effort.

Conversion between various encodings (8-bit, UTF-8, UTF-16 or UTF-32) is done via the
`unicode` module of XML/Ada, for instance `unicode-ces-utf8.ads`. They all provide a similar API. In this case
you want the `Encode` procedure. This is not a function (so it doesn't return a Byte_Sequence directly) for efficiency
reasons, even if a function would admittedly be more convenient for end-users.
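
A minimal sketch of calling it, assuming `Encode` takes the character, an output buffer, and an index that
holds the last position already written and is advanced past the new bytes. The `To_Utf8` wrapper and the
`Encode_Demo` name are hypothetical and not part of XML/Ada; check the signature against the spec you have installed.

```ada
with Ada.Text_IO;
with Unicode;          use Unicode;
with Unicode.CES;      use Unicode.CES;
with Unicode.CES.Utf8;

procedure Encode_Demo is

   --  Hypothetical convenience wrapper (not provided by XML/Ada):
   --  turns the Encode procedure into a function returning the bytes.
   function To_Utf8 (Char : Unicode_Char) return Byte_Sequence is
      --  Six bytes is enough room for any single encoded character.
      Buffer : Byte_Sequence (1 .. 6);
      --  Assumed convention: Last is the last index already filled;
      --  Encode writes from Last + 1 and updates Last.
      Last   : Natural := Buffer'First - 1;
   begin
      Unicode.CES.Utf8.Encode (Char, Buffer, Last);
      return Buffer (Buffer'First .. Last);
   end To_Utf8;

   --  U+20AC EURO SIGN, used here only as an example code point.
   Euro : constant Byte_Sequence := To_Utf8 (16#20AC#);

begin
   --  Byte_Sequence is a plain string of bytes, so it can be written as-is;
   --  it displays correctly only on a UTF-8 terminal.
   Ada.Text_IO.Put_Line (Euro);
end Encode_Demo;
```

The wrapper allocates and copies on every call, which is the cost the procedure form avoids when many
characters are appended to one buffer.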

As someone rightly mentioned, it doesn't really make sense to use XML/Ada to build a tree in memory just for the
sake of printing it, though. Ada.Text_IO or streams will be much more efficient. XML/Ada is only useful
to parse XML streams (in which case you generally never have to encode a character to a byte sequence
yourself).

> > we need to convert it, then let us do so outside of it.
> That is *exactly* what you have to do (convert outside, not throw any 
> old sequence of octets and 32-bit values somehow mashed together at 
> it

Well said, Simon, thanks. Basically, the whole application should be utf-8 if you care at all about international
characters (if you don't, feel free to use latin-1, or any encoding your terminal supports). So conversion should not
occur at the interface to XML/Ada, but only on input and output of your program.
XML/Ada just assumes a string is a sequence of bytes. The actual encoding has to be known by the application,
and be consistent.
If for some reason (Windows?) you prefer utf-16 internally, you can change `sax-encodings.ads` and recompile.
(It would have been neater to use generic traits packages, but I did not learn about them until a few years later.)
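
To illustrate converting only at the program boundary, here is a small sketch that uses the standard Ada 2012
package `Ada.Strings.UTF_Encoding.Strings` rather than XML/Ada. The Latin-1 input value and the `Boundary_Demo`
name are made up for the example.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;

procedure Boundary_Demo is

   --  Hypothetical input arriving in Latin-1: "caf" followed by e-acute.
   Latin_1_Input : constant String := "caf" & Character'Val (16#E9#);

   --  Convert once, on input. From here on, everything (including any
   --  string handed to XML/Ada) is UTF-8.
   Internal : constant UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin_1_Input);

begin
   --  The accented letter takes two bytes in UTF-8, so 4 bytes become 5.
   --  (Assertions are only checked when enabled, e.g. -gnata with GNAT.)
   pragma Assert (Internal'Length = 5);

   --  Output boundary: the terminal is assumed to expect UTF-8, so no
   --  further conversion is needed here.
   Ada.Text_IO.Put_Line (Internal);
end Boundary_Demo;
```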

It would also have been nicer to use a string type that knows about the encoding. I wrote GNATCOLL.Strings for
that purpose several years later too. XML/Ada was never used extensively, so it was never a priority for AdaCore
to update it to use all these packages, at the risk of either breaking backward compatibility, or duplicating the
whole API to allow for the various string types. Not worth it.

Emmanuel
