From: "196...@googlemail.com" <1963bib@googlemail.com>
Newsgroups: comp.lang.ada
Subject: Re: XMLAda & unicode symbols
Date: Mon, 21 Jun 2021 13:06:58 -0700 (PDT)
Message-ID: <7da5a442-2ad9-4bfd-9d6c-c8885da02d05n@googlegroups.com>
In-Reply-To: <8d443406-48dc-4d4e-868c-832caabebd1en@googlegroups.com>

On Monday, 21 June 2021 at 19:33:58 UTC+1, briot.e...@gmail.com wrote:
> > A scan through XML/Ada shows that the only uses of Unicode_Char are in
> > the SAX subset. I don't see any way in the DOM subset of XML/Ada of
> > using them - someone please prove me wrong!
> Those two subsets are not independent, in fact the DOM subset is entirely based on the SAX one.
> So anything that applies to SAX also applies to DOM.
>
> That said, the DOM standard (at the time I built XML/Ada, which is 20 years ago whereabouts) likely
> did not have standard functions that receive unicode characters, only strings.
> DOM implementations are free to use any internal representation they want, and I think they did not
> have to accept any random encoding. XML/Ada is not user-friendly, it really is only a fairly low-level
> implementation of the DOM standard. Using DOM without high-level things like XPath is a real
> pain. At the time, someone else had done an XPath implementation, so I never took the time to
> duplicate that effort.
>
> Conversion between various encodings (8bit, unicode utf-8, utf-16 or utf-32) is done via the
> `unicode` module of XML/Ada, namely for instance `unicode-ces-utf8.ads`. They all provide a similar API. In this case
> you want the `Encode` procedure. This is not a function (so doesn't return a Byte_Sequence directly) for efficiency
> reasons, even if it would be convenient for end-users, admittedly.
>
> As someone rightly mentioned, it doesn't really make sense to use XML/Ada to build a tree in memory just for the
> sake of printing it, though. Ada.Text_IO or streams will be much much more efficient. XML/Ada is only useful
> to parse XML streams (in which case you never have to yourself encode a character to a byte sequence in
> general).
>
> > we need to convert it, then let us do so outside of it.
> > That is *exactly* what you have to do (convert outside, not throw any
> > old sequence of octets and 32-bit values somehow mashed together at
> > it
> Well said Simon, thanks. Basically, the whole application should be utf-8 if you at all care about international
> characters (if you don't, feel free to use latin-1, or any encoding your terminal supports). So conversion should not
> occur just at the interface to XML/Ada, but only on input and output of your program.
> XML/Ada just assumes a string is a sequence of bytes.
> The actual encoding has to be known by the application,
> and be consistent.
> If for some reason (Windows?) you prefer utf-16 internally, you can change `sax-encodings.ads` and recompile.
> (would have been neater to use generic traits packages, but I did not realize about them until a few years later).
>
> It would also have been nicer to use a string type that knows about the encoding. I wrote GNATCOLL.Strings for
> that purpose several years later too. XML/Ada was never used extensively, so it was never a priority for AdaCore
> to update it to use all these packages, at the risk of either breaking backward compatibility, or duplicating the
> whole API to allow for the various string types. Not worth it.
>
> Emmanuel

Okay, now I think I am getting somewhere. A push and a prod is always welcome.
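
For anyone following along, the `Encode` advice quoted above might look roughly like the sketch below. It is not taken from the thread and has not been checked against a particular XML/Ada release. It assumes that `Unicode.CES.Utf8.Encode` takes the character, an `in out Byte_Sequence` buffer, and an `in out` index that it advances to the last byte written, and that `Byte_Sequence` is a subtype of `String`; check `unicode-ces-utf8.ads` for the exact profile. `Encode_Demo` and `To_Utf8` are hypothetical names introduced only for illustration.

```ada
with Ada.Text_IO;
with Unicode;          use Unicode;
with Unicode.CES;      use Unicode.CES;
with Unicode.CES.Utf8;

procedure Encode_Demo is

   --  Hypothetical convenience wrapper (not part of XML/Ada): turns the
   --  Encode procedure into a function returning the encoded bytes.
   function To_Utf8 (Char : Unicode_Char) return Byte_Sequence is
      --  Assumption: six bytes is enough room for any single character.
      Buffer : Byte_Sequence (1 .. 6) := (others => ' ');
      --  Assumption: Encode writes starting at Last + 1 and leaves Last
      --  on the final byte it wrote.
      Last   : Natural := Buffer'First - 1;
   begin
      Unicode.CES.Utf8.Encode (Char, Buffer, Last);
      return Buffer (Buffer'First .. Last);
   end To_Utf8;

begin
   --  U+20AC EURO SIGN. The bytes are written as-is, so what you see
   --  depends on whether the terminal interprets them as UTF-8.
   Ada.Text_IO.Put_Line (To_Utf8 (16#20AC#));
end Encode_Demo;
```

In line with the quoted advice, a helper like this would only be used where the program produces output; strings handed to XML/Ada itself would already be UTF-8 byte sequences.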