From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Fri, 8 Apr 2022 23:05:38 -0500

"Dmitry A. Kazakov" wrote in message news:t2q3cb$bbt$1@gioia.aioe.org...
> On 2022-04-08 21:19, Simon Wright wrote:
>> "Dmitry A. Kazakov" writes:
>>
>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>> "Randy Brukardt" writes:
>>>>
>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>> internally, you then would have a lot of old and mostly useless
>>>>> operations supported for array types (since things like slices are
>>>>> mainly useful for string operations).
>>>>
>>>> Just off the top of my head, wouldn't it be better to use
>>>> UTF32-encoded Wide_Wide_Character internally?
>>>
>>> Yep, that is exactly the problem, a confusion between interface
>>> and implementation.
>>
>> Don't understand. My point was that *when you are implementing this* it
>> might be easier to deal with 32-bit characters/code points/whatever the
>> proper jargon is than with UTF-8.
>
> I think it would be more difficult, because you would have to convert
> from and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto
> interface standard and the I/O standard, and those cover 60-70% of the
> cases where you need a string. Most string operations like search,
> comparison, and slicing are isomorphic between code points and octets,
> so you would win nothing by keeping strings internally as arrays of
> code points.

I basically agree with Dmitry here. The internal representation is an
implementation detail, but it seems likely that you would want to store
UTF-8 strings directly; they're almost always going to be half the size
(even for languages with their own alphabets, like Greek), and for most
of us they'll be just a bit more than a quarter of the size. The number
of bytes you copy around matters; the number of operations where code
points are needed is fairly small.

The main problem with UTF-8 is representing code point positions in a
way that (a) can't be abused and (b) doesn't cost too much to calculate.
Using character indexes is too expensive for UTF-8 and UTF-16
representations, and using octet indexes is unsafe (since splitting a
character's representation is a possibility). I'd probably use an
abstract character position type that was implemented with an octet
index under the covers. I think that would work OK, since doing math on
positions is suspicious with a UTF representation anyway.
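To make that concrete, here is a sketch of what such a position type
might look like (names invented for illustration, spec and body shown
together for brevity; this isn't from any existing library):

   with Ada.Strings.Unbounded;

   package UTF8_Strings is
      type UTF8_String is private;

      type Position is private;
      --  Deliberately no "+"/"-" on Position: arithmetic on octet
      --  indexes is exactly the abuse we want to prevent.

      function First_Position (S : UTF8_String) return Position;
      procedure Next (S : UTF8_String; P : in out Position);
   private
      type UTF8_String is record
         Octets : Ada.Strings.Unbounded.Unbounded_String;
      end record;
      type Position is record
         Octet_Index : Positive := 1;  --  octet index under the covers
      end record;
   end UTF8_Strings;

   package body UTF8_Strings is
      use Ada.Strings.Unbounded;

      function First_Position (S : UTF8_String) return Position is
         pragma Unreferenced (S);
      begin
         return (Octet_Index => 1);
      end First_Position;

      procedure Next (S : UTF8_String; P : in out Position) is
      begin
         P.Octet_Index := P.Octet_Index + 1;
         --  Skip UTF-8 continuation octets (2#10xxxxxx#) so a
         --  Position always lands on the first octet of a code point.
         while P.Octet_Index <= Length (S.Octets)
           and then Element (S.Octets, P.Octet_Index) in
                    Character'Val (16#80#) .. Character'Val (16#BF#)
         loop
            P.Octet_Index := P.Octet_Index + 1;
         end loop;
      end Next;
   end UTF8_Strings;

Clients iterate with First_Position/Next and never see the octet index,
so they can't split a character by accident, and Next costs at most a
few octet comparisons per character.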
We're spoiled from using Latin-1 representations, of course, but
generally one is interested in 5 characters, not 5 octets, and the
number of octets in 5 characters depends on the string. Consider the
sort of test I tend to write (this one is from some code I was fixing
earlier today):

   if Font'Length > 6 and then Font (2 .. 6) = "Arial" then

This would be a bad idea if one is using any sort of universal
representation: you don't know how many octets are in the string
literal, so you can't assume a count in the test. The slice is
dangerous (even though in this particular case it would be OK, since
the test string is all ASCII characters -- but I wouldn't want users to
get in the habit of assuming such things).

[BTW, the above was a bad idea anyway, because it turns out that the
function in the Ada library returns bounds that don't start at 1, so
the slice was usually out of range -- which is why I was looking at the
code. Another thing that we could do without. Slices are evil: they
*seem* to be the right solution, yet rarely are in practice without a
lot of hoops.]

> The situation is comparable to Unbounded_Strings. The implementation
> is relatively simple, but the user must carry the burden of calling
> To_String and To_Unbounded_String all over the application, and the
> processor must suffer the overhead of copying arrays here and there.

Yes, but that happens because Ada doesn't really have a string
abstraction, so when you try to build one, you can't fully do the job.
One presumes that a new language with a universal UTF-8 string wouldn't
have that problem. (As previously noted, I don't see much point in
trying to patch up Ada with a bunch of UTF-8 string packages; you would
need an entire new set of Ada.Strings libraries and I/O libraries, and
then you'd have all of the old stuff messing up resolution, taking the
best names, and confusing everything. A cleaner slate is needed.)

Randy.
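P.S. For the record, here's how that test could have been written so it
doesn't care where the bounds start (a sketch with invented names, and
still an octet-based test, so the caveat above about universal
representations applies):

   with Ada.Text_IO;

   procedure Font_Check is

      --  A prefix test starting at the second character that neither
      --  assumes S'First = 1 nor hard-codes the length of the prefix.
      function Match_At_Second (S : String; Prefix : String)
        return Boolean is
         Lo : constant Integer := S'First + 1;
      begin
         return S'Length > Prefix'Length + 1
           and then S (Lo .. Lo + Prefix'Length - 1) = Prefix;
      end Match_At_Second;

      --  Bounds deliberately don't start at 1, like the string the
      --  library function actually returned.
      Font : constant String (11 .. 21) := "XArialYYYYY";

   begin
      Ada.Text_IO.Put_Line
        (Boolean'Image (Match_At_Second (Font, "Arial")));  --  TRUE
   end Font_Check;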