From: Simon Wright
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
Date: Sat, 09 Apr 2022 08:43:34 +0100

"Randy Brukardt" writes:

> "Dmitry A. Kazakov" wrote in message
> news:t2q3cb$bbt$1@gioia.aioe.org...
>> On 2022-04-08 21:19, Simon Wright wrote:
>>> "Dmitry A. Kazakov" writes:
>>>
>>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>>> "Randy Brukardt" writes:
>>>>>
>>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>>> internally, you then would have a lot of old and mostly useless
>>>>>> operations supported for array types (since things like slices are
>>>>>> mainly useful for string operations).
>>>>>
>>>>> Just off the top of my head, wouldn't it be better to use
>>>>> UTF32-encoded Wide_Wide_Character internally?
>>>>
>>>> Yep, that is exactly the problem, a confusion between interface
>>>> and implementation.
>>>
>>> Don't understand. My point was that *when you are implementing this* it
>>> might be easier to deal with 32-bit characters/code points/whatever the
>>> proper jargon is than with UTF8.
>>
>> I think it would be more difficult, because you will have to convert from
>> and to UTF-8 under the hood or explicitly. UTF-8 is de-facto interface
>> standard and I/O standard. That would be 60-70% of all cases you need a
>> string. Most string operations like search, comparison, slicing are
>> isomorphic between code points and octets. So you would win nothing from
>> keeping strings internally as arrays of code points.
>
> I basically agree with Dmitry here. The internal representation is an
> implementation detail, but it seems likely that you would want to store
> UTF-8 strings directly; they're almost always going to be half the size
> (even for languages using their own characters like Greek) and for most of
> us, they'll be just a bit more than a quarter the size. The amount of bytes
> you copy around matters; the number of operations where code points are
> needed is fairly small.

Well, I don't have any skin in this game, so I'll shut up at this point.
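
For anyone who wants rough numbers on the size and search points quoted above, here is a minimal sketch. It is in Python rather than Ada purely for brevity, and it is not from the thread: the sample strings and the helper names encoded_sizes and find_code_point_index are made up for illustration. It compares UTF-8 and UTF-32 storage for an ASCII and a Greek string, and shows a substring search done on the UTF-8 octets with the byte offset mapped back to a code point index afterwards.

```python
# Illustrative sketch only; sample strings and helper names are assumptions.

def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


def find_code_point_index(haystack, needle):
    """Search on UTF-8 octets, then map the byte offset to a code point index.

    Because UTF-8 is self-synchronizing, a match of one valid UTF-8 string
    inside another always starts on a code point boundary, so decoding the
    prefix up to the match is safe.
    """
    octets = haystack.encode("utf-8")
    offset = octets.find(needle.encode("utf-8"))
    if offset < 0:
        return -1
    return len(octets[:offset].decode("utf-8"))


if __name__ == "__main__":
    for label, text in (("ASCII", "Ada and Unicode"), ("Greek", "αλφα βητα")):
        utf8_len, utf32_len = encoded_sizes(text)
        print(f"{label}: UTF-8 {utf8_len} bytes, UTF-32 {utf32_len} bytes, "
              f"ratio {utf8_len / utf32_len:.2f}")
    print(find_code_point_index("αλφα βητα", "βητα"))
```

For these two samples the UTF-8 form comes out at a quarter and just under half of the UTF-32 size respectively; other texts (CJK, emoji) would give different ratios.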