Date: Sat, 9 Apr 2022 12:27:04 +0200
From: DrPi <314@drpi.fr>
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org>
Message-ID: <62515f7a$0$25324$426a74cc@news.free.fr>

On 09/04/2022 06:05, Randy Brukardt wrote:
> "Dmitry A. Kazakov" wrote in message
> news:t2q3cb$bbt$1@gioia.aioe.org...
>> On 2022-04-08 21:19, Simon Wright wrote:
>>> "Dmitry A. Kazakov" writes:
>>>
>>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>>> "Randy Brukardt" writes:
>>>>>
>>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>>> internally, you would then have a lot of old and mostly useless
>>>>>> operations supported for array types (since things like slices are
>>>>>> mainly useful for string operations).
>>>>>
>>>>> Just off the top of my head, wouldn't it be better to use
>>>>> UTF-32-encoded Wide_Wide_Character internally?
>>>>
>>>> Yep, that is exactly the problem, a confusion between interface
>>>> and implementation.
>>>
>>> Don't understand. My point was that *when you are implementing this* it
>>> might be easier to deal with 32-bit characters/code points/whatever the
>>> proper jargon is than with UTF-8.
>>
>> I think it would be more difficult, because you would have to convert
>> from and to UTF-8 either under the hood or explicitly. UTF-8 is the
>> de-facto interface standard and I/O standard. That covers 60-70% of all
>> cases where you need a string. Most string operations like search,
>> comparison and slicing are isomorphic between code points and octets.
>> So you would gain nothing from keeping strings internally as arrays of
>> code points.
>
> I basically agree with Dmitry here. The internal representation is an
> implementation detail, but it seems likely that you would want to store
> UTF-8 strings directly; they're almost always going to be half the size
> (even for languages using their own characters, like Greek) and for most
> of us they'll be just a bit more than a quarter of the size. The number
> of bytes you copy around matters; the number of operations where code
> points are needed is fairly small.
>
> The main problem with UTF-8 is representing the code point positions in
> a way that they (a) aren't abused and (b) don't cost too much to
> calculate. Just using character indexes is too expensive for UTF-8 and
> UTF-16 representations, and using octet indexes is unsafe (since
> splitting a character representation is a possibility). I'd probably use
> an abstract character position type that was implemented with an octet
> index under the covers.
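[A quick interjection: if I read this right, a rough, spec-only Ada
sketch of such an opaque position type -- all names invented here, not
taken from any existing library -- could look like this:

   --  Sketch only: opaque cursor over UTF-8 storage, octet index
   --  hidden in the private part.  All identifiers are made up.
   with Ada.Strings.Unbounded;

   package UTF8_Strings is

      type UTF8_String is private;    --  octets, kept valid UTF-8
      type Position    is private;    --  opaque: no arithmetic exposed

      function First   (S : UTF8_String) return Position;
      function Next    (S : UTF8_String; P : Position) return Position;
      function Element (S : UTF8_String; P : Position)
        return Wide_Wide_Character;   --  decodes one code point
      function Slice   (S : UTF8_String; From, To : Position)
        return UTF8_String;           --  cannot split a character

   private

      type UTF8_String is record
         Octets : Ada.Strings.Unbounded.Unbounded_String;
      end record;

      type Position is new Positive;  --  an octet offset, hidden

   end UTF8_Strings;

Since positions can only come from First/Next, user code cannot do
arithmetic on them or land in the middle of a character by accident.]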
>
> I think that would work OK, as doing math on those is suspicious with a
> UTF representation. We're spoiled from using Latin-1 representations, of
> course, but generally one is interested in 5 characters, not 5 octets.
> And the number of octets in 5 characters depends on the string. So most
> of the sorts of operations that I tend to do look like this (for
> instance, from some code I was fixing earlier today):
>
>    if Font'Length > 6 and then
>         Font(2..6) = "Arial" then
>
> This would be a bad idea if one is using any sort of universal
> representation -- you don't know how many octets are in the string
> literal, so you can't assume a number in the test string. So the slice
> is dangerous (even though in this particular case it would be OK, since
> the test string is all ASCII characters -- but I wouldn't want users to
> get into the habit of assuming such things).
>
> [BTW, the above was a bad idea anyway, because it turns out that the
> function in the Ada library returned bounds that don't start at 1. So
> the slice was usually out of range -- which is why I was looking at the
> code. Another thing that we could do without. Slices are evil, since
> they *seem* to be the right solution, yet rarely are in practice without
> a lot of hoops.]
>
>> The situation is comparable to Unbounded_Strings. The implementation is
>> relatively simple, but the user must carry the burden of calling
>> To_String and To_Unbounded_String all over the application, and the
>> processor must suffer the overhead of copying arrays here and there.
>
> Yes, but that happens because Ada doesn't really have a string
> abstraction, so when you try to build one, you can't fully do the job.
> One presumes that a new language with a universal UTF-8 string wouldn't
> have that problem. (As previously noted, I don't see much point in
> trying to patch up Ada with a bunch of UTF-8 string packages; you would
> need an entire new set of Ada.Strings libraries and I/O libraries, and
> then you'd have all of the old stuff messing up resolution, using the
> best names, and confusing everything. A cleaner slate is needed.)
>
> Randy.
>

In Python 2, there is the same kind of problem. A string is a byte
array. It is the programmer's responsibility to encode/decode to/from
UTF-8/Latin-1/... and to manage everything correctly. Literal strings
can be considered encoded or decoded depending on the notation ("" or
u"").

In Python 3, a string is a character (code point) array. The internal
representation is hidden from the programmer. UTF-8/Latin-1/... encoded
"strings" are of type bytes (a byte array). Writing/reading to/from a
file is done with the bytes type. When writing/reading to/from a file
in text mode, you have to specify the encoding to use; the
encoding/decoding is then managed internally. As a general rule, all
"external communications" are done with bytes (byte arrays). It is the
programmer's responsibility to encode/decode where needed to convert
from/to strings.

Source files (.py) are considered UTF-8 encoded by default, but one can
declare the actual encoding at the top of the file in a special comment
tag. When a badly encoded character is found, an exception is raised at
parsing time. So literal strings are real strings, not bytes.

I think the Python 3 way of doing things is much more understandable and
really usable. On the Ada side, I've still not understood how to
correctly deal with all this stuff.

Note: in Python 3, the bytes type is not reserved for encoded "strings".
It is a versatile type for exactly what its name says: a byte array.
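Coming back to the Ada side, the closest standard piece I know of for
the bytes.decode()/str.encode() round trip is Ada.Strings.UTF_Encoding
(Ada 2012). A small, untested sketch follows; the procedure name and
the test data are invented, only the package and its Encode/Decode
functions come from the standard library:

   with Ada.Strings.UTF_Encoding;
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   with Ada.Wide_Wide_Text_IO;

   procedure Encoding_Demo is
      package UTF8 renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

      --  Raw octets as they would arrive from a file or a socket:
      --  the UTF-8 encoding of "h", U+00E9, "l", "l", "o".
      Raw : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
        'h' & Character'Val (16#C3#) & Character'Val (16#A9#) & "llo";

      --  Decode the octets into code points (the Python 3 "str" side).
      Text : constant Wide_Wide_String := UTF8.Decode (Raw);
   begin
      --  5 code points but 6 octets.
      Ada.Wide_Wide_Text_IO.Put_Line
        ("Code points:" & Natural'Wide_Wide_Image (Text'Length));
      Ada.Wide_Wide_Text_IO.Put_Line
        ("Octets     :" & Natural'Wide_Wide_Image (Raw'Length));

      --  Encode back to octets before handing them to the outside
      --  world (the Python 3 "bytes" side).
      pragma Assert (UTF8.Encode (Text) = Raw);
   end Encoding_Demo;

I/O would still be done with plain String octets; the Decode/Encode
calls mark the boundary, much as Python 3's text mode does internally.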