From: Emmanuel Briot
Newsgroups: comp.lang.ada
Subject: Re: XMLAda & unicode symbols
Date: Mon, 21 Jun 2021 11:33:57 -0700 (PDT)
Message-ID: <8d443406-48dc-4d4e-868c-832caabebd1en@googlegroups.com>

> A scan through XML/Ada shows that the only uses of Unicode_Char are in
> the SAX subset. I don't see any way in the DOM subset of XML/Ada of
> using them - someone please prove me wrong!

Those two subsets are not independent; in fact the DOM subset is entirely based on the SAX one. So anything that applies to SAX also applies to DOM.
That said, the DOM standard (at the time I built XML/Ada, which is 20 years ago or thereabouts) likely did not have standard functions that receive unicode characters, only strings. DOM implementations are free to use any internal representation they want, and I think they did not have to accept any random encoding.

XML/Ada is not user-friendly; it really is only a fairly low-level implementation of the DOM standard. Using DOM without high-level things like XPath is a real pain. At the time, someone else had done an XPath implementation, so I never took the time to duplicate that effort.

Conversion between various encodings (8bit, unicode utf-8, utf-16 or utf-32) is done via the `unicode` module of XML/Ada, for instance `unicode-ces-utf8.ads`. The encoding packages all provide a similar API. In this case you want the `Encode` procedure. It is a procedure rather than a function (so it doesn't return a Byte_Sequence directly) for efficiency reasons, even if a function would admittedly be more convenient for end-users.

As someone rightly mentioned, it doesn't really make sense to use XML/Ada to build a tree in memory just for the sake of printing it, though. Ada.Text_IO or streams will be much, much more efficient. XML/Ada is only useful to parse XML streams (in which case you generally never have to encode a character to a byte sequence yourself).

> > we need to convert it, then let us do so outside of it.
> That is *exactly* what you have to do (convert outside, not throw any
> old sequence of octets and 32-bit values somehow mashed together at
> it

Well said Simon, thanks. Basically, the whole application should be utf-8 if you care at all about international characters (if you don't, feel free to use latin-1, or any encoding your terminal supports). So conversion should not happen at the interface to XML/Ada, but only on input and output of your program.

XML/Ada just assumes a string is a sequence of bytes. The actual encoding has to be known by the application, and be consistent.

If for some reason (Windows?) you prefer utf-16 internally, you can change `sax-encodings.ads` and recompile. (It would have been neater to use generic traits packages, but I did not know about them until a few years later.) It would also have been nicer to use a string type that knows about the encoding.
I wrote GNATCOLL.Strings for that purpose several years later too. XML/Ada was never used extensively, so it was never a priority for AdaCore to update it to use all these packages, at the risk of either breaking backward compatibility, or duplicating the whole API to allow for the various string types. Not worth it.

Emmanuel
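
As a rough illustration of the "convert only on input and output" advice in the post, here is a minimal, language-neutral sketch written in Python. It is not XML/Ada code and does not mirror its API: `INTERNAL_ENCODING`, `to_internal`, `from_internal` and `wrap_in_element` are all hypothetical names invented for this example. The assumption is that the application standardises on utf-8 internally and that the core logic, like XML/Ada, treats a string as a plain sequence of bytes.

```python
# Sketch of "convert only at the edges". All names here are hypothetical.
INTERNAL_ENCODING = "utf-8"  # assumption: the application standardises on UTF-8


def to_internal(raw: bytes, source_encoding: str) -> bytes:
    """Input boundary: re-encode incoming bytes into the internal encoding."""
    return raw.decode(source_encoding).encode(INTERNAL_ENCODING)


def from_internal(data: bytes, target_encoding: str) -> bytes:
    """Output boundary: re-encode internal bytes for the outside world."""
    return data.decode(INTERNAL_ENCODING).encode(target_encoding)


def wrap_in_element(tag: bytes, text: bytes) -> bytes:
    """Core logic: handles strings as byte sequences and never converts.

    It is only correct because every caller passes bytes in the same
    (internal) encoding. No XML escaping is done; this is not a serializer.
    """
    return b"<" + tag + b">" + text + b"</" + tag + b">"


# A Latin-1 source sends "café" as the bytes 63 61 66 E9.
latin1_input = b"caf\xe9"
internal = to_internal(latin1_input, "latin-1")    # converted once, on input
element = wrap_in_element(b"name", internal)       # core works on UTF-8 bytes only
latin1_output = from_internal(element, "latin-1")  # converted once, on output
```

If `latin1_input` were passed to `wrap_in_element` directly, the result would mix encodings and would no longer be valid utf-8, which is the kind of inconsistency the post warns about.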