From: "Luke A. Guest"
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Mon, 19 Apr 2021 12:56:34 +0100

On 19/04/2021 10:08, Stephen Leake wrote:
>> What's the way to manage Unicode correctly ?
>
> There are two issues: Unicode in source code, that the compiler must
> understand, and Unicode in strings, that your program must understand.

And this is where the Ada standard gets it wrong, in the encodings package re UTF-8.

In its UTF-8 encoding, Unicode is a superset of 7-bit ASCII, not Latin 1. The high bit in the leading octet indicates whether there are trailing octets. See https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data layout. The first 128 "characters" in Unicode match those of 7-bit ASCII, not 8-bit ASCII, and certainly not Latin 1. Therefore this:

package Ada.Strings.UTF_Encoding
    ...
    subtype UTF_8_String is String;
    ...
end Ada.Strings.UTF_Encoding;

was absolutely and totally wrong.
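
To make the point about the leading octet concrete, here is a minimal sketch, assuming an Ada 2012 compiler such as GNAT. The procedure name UTF8_Octets, the helper Dump, and the two sample code points are chosen only for illustration. It uses the standard Encode function from Ada.Strings.UTF_Encoding.Wide_Wide_Strings and prints each resulting octet, noting when the high bit is set.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF8_Octets is
   use Ada.Text_IO;
   use Ada.Strings.UTF_Encoding;

   --  Hypothetical helper: encode Item as UTF-8 and print every octet,
   --  flagging those with the high bit set (value >= 128).
   procedure Dump (Label : String; Item : Wide_Wide_String) is
      --  UTF_8_String is a subtype of String, so each octet is a Character.
      Encoded : constant UTF_8_String := Wide_Wide_Strings.Encode (Item);
   begin
      Put_Line (Label & ":" & Integer'Image (Encoded'Length) & " octet(s)");
      for C of Encoded loop
         Put (Integer'Image (Character'Pos (C)));
         if Character'Pos (C) >= 128 then
            Put ("  (high bit set)");
         end if;
         New_Line;
      end loop;
   end Dump;
begin
   --  U+0041 lies in the 7-bit ASCII range: one octet, high bit clear.
   Dump ("U+0041", (1 => Wide_Wide_Character'Val (16#41#)));
   --  U+00E9 lies above the ASCII range: a leading and a trailing octet.
   Dump ("U+00E9", (1 => Wide_Wide_Character'Val (16#E9#)));
end UTF8_Octets;
```

U+0041 should come out as the single octet 65, while U+00E9 should come out as the two octets 195 and 169, both with the high bit set. Latin 1 stores that same character as the single byte 233, so the two representations only agree below 128. Because UTF_8_String is just String, each of those octets is typed as a Latin 1 Character even though it is only a fragment of an encoded code point.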