From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Mon, 19 Apr 2021 14:52:43 +0200
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org>

On 2021-04-19 13:56, Luke A. Guest wrote:
> On 19/04/2021 10:08, Stephen Leake wrote:
>>> What's the way to manage Unicode correctly ?
>>
>> There are two issues: Unicode in source code, that the compiler must
>> understand, and Unicode in strings, that your program must understand.
>
> And this is there the Ada standard gets it wrong, in the encodings
> package re utf-8.
>
> Unicode is a superset of 7-bit ASCII not Latin 1. The high bit in the
> leading octet indicates whether there are trailing octets. See
> https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data
> layout. The first 128 "characters" in Unicode match that of 7-bit ASCII,
> not 8-bit ASCII, and certainly not Latin 1. Therefore this:
>
> package Ada.Strings.UTF_Encoding
>    ...
>    subtype UTF_8_String is String;
>    ...
> end Ada.Strings.UTF_Encoding;
>
> Was absolutely and totally wrong.

It is a practical solution.
The Ada type system cannot express differently represented or constrained string/array/vector subtypes. Ignoring Latin-1 and using String as if it were an array of octets is the best available solution.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de