From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Fri, 8 Apr 2022 23:05:38 -0500

"Dmitry A. Kazakov" wrote in message news:t2q3cb$bbt$1@gioia.aioe.org...
> On 2022-04-08 21:19, Simon Wright wrote:
>> "Dmitry A. Kazakov" writes:
>>
>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>> "Randy Brukardt" writes:
>>>>
>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>> internally, you then would have a lot of old and mostly useless
>>>>> operations supported for array types (since things like slices are
>>>>> mainly useful for string operations).
>>>>
>>>> Just off the top of my head, wouldn't it be better to use
>>>> UTF32-encoded Wide_Wide_Character internally?
>>>
>>> Yep, that is exactly the problem, a confusion between interface
>>> and implementation.
>>
>> Don't understand. My point was that *when you are implementing this* it
>> might be easier to deal with 32-bit characters/code points/whatever the
>> proper jargon is than with UTF-8.
>
> I think it would be more difficult, because you would have to convert
> from and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto
> interface standard and the I/O standard, and those cover 60-70% of the
> cases where you need a string. Most string operations like search,
> comparison, and slicing are isomorphic between code points and octets,
> so you would win nothing by keeping strings internally as arrays of
> code points.

I basically agree with Dmitry here. The internal representation is an
implementation detail, but it seems likely that you would want to store
UTF-8 strings directly; they're almost always going to be half the size
(even for languages with their own alphabets, like Greek), and for most
of us they'll be just a bit more than a quarter of the size. The number
of bytes you copy around matters; the number of operations where code
points are needed is fairly small.

The main problem with UTF-8 is representing code point positions in a
way that (a) can't be abused and (b) doesn't cost too much to calculate.
Using character indexes is too expensive for UTF-8 and UTF-16
representations, and using octet indexes is unsafe (since splitting a
character's representation is a possibility). I'd probably use an
abstract character position type that was implemented with an octet
index under the covers. I think that would work OK, since doing math on
positions is suspicious with a UTF representation anyway.
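To make that concrete, here is a sketch of what such a position type
might look like (names invented for illustration, spec and body shown
together for brevity; this isn't from any existing library):

   with Ada.Strings.Unbounded;

   package UTF8_Strings is
      type UTF8_String is private;

      type Position is private;
      --  Deliberately no "+"/"-" on Position: arithmetic on octet
      --  indexes is exactly the abuse we want to prevent.

      function First_Position (S : UTF8_String) return Position;
      procedure Next (S : UTF8_String; P : in out Position);
   private
      type UTF8_String is record
         Octets : Ada.Strings.Unbounded.Unbounded_String;
      end record;
      type Position is record
         Octet_Index : Positive := 1;  --  octet index under the covers
      end record;
   end UTF8_Strings;

   package body UTF8_Strings is
      use Ada.Strings.Unbounded;

      function First_Position (S : UTF8_String) return Position is
         pragma Unreferenced (S);
      begin
         return (Octet_Index => 1);
      end First_Position;

      procedure Next (S : UTF8_String; P : in out Position) is
      begin
         P.Octet_Index := P.Octet_Index + 1;
         --  Skip UTF-8 continuation octets (2#10xxxxxx#) so a
         --  Position always lands on the first octet of a code point.
         while P.Octet_Index <= Length (S.Octets)
           and then Element (S.Octets, P.Octet_Index) in
                    Character'Val (16#80#) .. Character'Val (16#BF#)
         loop
            P.Octet_Index := P.Octet_Index + 1;
         end loop;
      end Next;
   end UTF8_Strings;

Clients iterate with First_Position/Next and never see the octet index,
so they can't split a character by accident, and Next costs at most a
few octet comparisons per character.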
We're spoiled from using Latin-1 representations, of course, but
generally one is interested in 5 characters, not 5 octets, and the
number of octets in 5 characters depends on the string. Consider the
sort of test I tend to write (this one is from some code I was fixing
earlier today):

   if Font'Length > 6 and then Font (2 .. 6) = "Arial" then

This would be a bad idea if one is using any sort of universal
representation: you don't know how many octets are in the string
literal, so you can't assume a count in the test. The slice is
dangerous (even though in this particular case it would be OK, since
the test string is all ASCII characters -- but I wouldn't want users to
get in the habit of assuming such things).

[BTW, the above was a bad idea anyway, because it turns out that the
function in the Ada library returns bounds that don't start at 1, so
the slice was usually out of range -- which is why I was looking at the
code. Another thing that we could do without. Slices are evil: they
*seem* to be the right solution, yet rarely are in practice without a
lot of hoops.]

> The situation is comparable to Unbounded_Strings. The implementation
> is relatively simple, but the user must carry the burden of calling
> To_String and To_Unbounded_String all over the application, and the
> processor must suffer the overhead of copying arrays here and there.

Yes, but that happens because Ada doesn't really have a string
abstraction, so when you try to build one, you can't fully do the job.
One presumes that a new language with a universal UTF-8 string wouldn't
have that problem. (As previously noted, I don't see much point in
trying to patch up Ada with a bunch of UTF-8 string packages; you would
need an entire new set of Ada.Strings libraries and I/O libraries, and
then you'd have all of the old stuff messing up resolution, taking the
best names, and confusing everything. A cleaner slate is needed.)

Randy.
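P.S. For the record, here's how that test could have been written so it
doesn't care where the bounds start (a sketch with invented names, and
still an octet-based test, so the caveat above about universal
representations applies):

   with Ada.Text_IO;

   procedure Font_Check is

      --  A prefix test starting at the second character that neither
      --  assumes S'First = 1 nor hard-codes the length of the prefix.
      function Match_At_Second (S : String; Prefix : String)
        return Boolean is
         Lo : constant Integer := S'First + 1;
      begin
         return S'Length > Prefix'Length + 1
           and then S (Lo .. Lo + Prefix'Length - 1) = Prefix;
      end Match_At_Second;

      --  Bounds deliberately don't start at 1, like the string the
      --  library function actually returned.
      Font : constant String (11 .. 21) := "XArialYYYYY";

   begin
      Ada.Text_IO.Put_Line
        (Boolean'Image (Match_At_Second (Font, "Arial")));  --  TRUE
   end Font_Check;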