Date: Sat, 9 Apr 2022 12:27:04 +0200
From: DrPi <314@drpi.fr>
Newsgroups: comp.lang.ada
Subject: Re: Ada and Unicode
References: <607b5b20$0$27442$426a74cc@news.free.fr> <86mttuk5f0.fsf@stephe-leake.org>
Message-ID: <62515f7a$0$25324$426a74cc@news.free.fr>

On 09/04/2022 06:05, Randy Brukardt wrote:
> "Dmitry A. Kazakov" wrote in message
> news:t2q3cb$bbt$1@gioia.aioe.org...
>> On 2022-04-08 21:19, Simon Wright wrote:
>>> "Dmitry A. Kazakov" writes:
>>>
>>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>>> "Randy Brukardt" writes:
>>>>>
>>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>>> internally, you would then have a lot of old and mostly useless
>>>>>> operations supported for array types (since things like slices are
>>>>>> mainly useful for string operations).
>>>>>
>>>>> Just off the top of my head, wouldn't it be better to use
>>>>> UTF-32-encoded Wide_Wide_Character internally?
>>>>
>>>> Yep, that is exactly the problem, a confusion between interface
>>>> and implementation.
>>>
>>> Don't understand. My point was that *when you are implementing this* it
>>> might be easier to deal with 32-bit characters/code points/whatever the
>>> proper jargon is than with UTF-8.
>>
>> I think it would be more difficult, because you would have to convert
>> from and to UTF-8 either under the hood or explicitly. UTF-8 is the
>> de-facto interface standard and I/O standard. That covers 60-70% of all
>> cases where you need a string. Most string operations like search,
>> comparison and slicing are isomorphic between code points and octets.
>> So you would gain nothing from keeping strings internally as arrays of
>> code points.
>
> I basically agree with Dmitry here. The internal representation is an
> implementation detail, but it seems likely that you would want to store
> UTF-8 strings directly; they're almost always going to be half the size
> (even for languages using their own characters, like Greek) and for most
> of us they'll be just a bit more than a quarter of the size. The number
> of bytes you copy around matters; the number of operations where code
> points are needed is fairly small.
>
> The main problem with UTF-8 is representing the code point positions in
> a way that they (a) aren't abused and (b) don't cost too much to
> calculate. Just using character indexes is too expensive for UTF-8 and
> UTF-16 representations, and using octet indexes is unsafe (since
> splitting a character representation is a possibility). I'd probably use
> an abstract character position type that was implemented with an octet
> index under the covers.
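[A quick interjection: if I read this right, a rough, spec-only Ada
sketch of such an opaque position type -- all names invented here, not
taken from any existing library -- could look like this:

   --  Sketch only: opaque cursor over UTF-8 storage, octet index
   --  hidden in the private part.  All identifiers are made up.
   with Ada.Strings.Unbounded;

   package UTF8_Strings is

      type UTF8_String is private;    --  octets, kept valid UTF-8
      type Position    is private;    --  opaque: no arithmetic exposed

      function First   (S : UTF8_String) return Position;
      function Next    (S : UTF8_String; P : Position) return Position;
      function Element (S : UTF8_String; P : Position)
        return Wide_Wide_Character;   --  decodes one code point
      function Slice   (S : UTF8_String; From, To : Position)
        return UTF8_String;           --  cannot split a character

   private

      type UTF8_String is record
         Octets : Ada.Strings.Unbounded.Unbounded_String;
      end record;

      type Position is new Positive;  --  an octet offset, hidden

   end UTF8_Strings;

Since positions can only come from First/Next, user code cannot do
arithmetic on them or land in the middle of a character by accident.]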
>
> I think that would work OK, as doing math on those is suspicious with a
> UTF representation. We're spoiled from using Latin-1 representations, of
> course, but generally one is interested in 5 characters, not 5 octets.
> And the number of octets in 5 characters depends on the string. So most
> of the sorts of operations that I tend to do look like this (for
> instance, from some code I was fixing earlier today):
>
>    if Font'Length > 6 and then
>         Font(2..6) = "Arial" then
>
> This would be a bad idea if one is using any sort of universal
> representation -- you don't know how many octets are in the string
> literal, so you can't assume a number in the test string. So the slice
> is dangerous (even though in this particular case it would be OK, since
> the test string is all ASCII characters -- but I wouldn't want users to
> get into the habit of assuming such things).
>
> [BTW, the above was a bad idea anyway, because it turns out that the
> function in the Ada library returned bounds that don't start at 1. So
> the slice was usually out of range -- which is why I was looking at the
> code. Another thing that we could do without. Slices are evil, since
> they *seem* to be the right solution, yet rarely are in practice without
> a lot of hoops.]
>
>> The situation is comparable to Unbounded_Strings. The implementation is
>> relatively simple, but the user must carry the burden of calling
>> To_String and To_Unbounded_String all over the application, and the
>> processor must suffer the overhead of copying arrays here and there.
>
> Yes, but that happens because Ada doesn't really have a string
> abstraction, so when you try to build one, you can't fully do the job.
> One presumes that a new language with a universal UTF-8 string wouldn't
> have that problem. (As previously noted, I don't see much point in
> trying to patch up Ada with a bunch of UTF-8 string packages; you would
> need an entire new set of Ada.Strings libraries and I/O libraries, and
> then you'd have all of the old stuff messing up resolution, using the
> best names, and confusing everything. A cleaner slate is needed.)
>
> Randy.
>

In Python 2, there is the same kind of problem. A string is a byte
array. It is the programmer's responsibility to encode/decode to/from
UTF-8/Latin-1/... and to manage everything correctly. Literal strings
can be considered encoded or decoded depending on the notation ("" or
u"").

In Python 3, a string is a character (code point) array. The internal
representation is hidden from the programmer. UTF-8/Latin-1/... encoded
"strings" are of type bytes (a byte array). Writing/reading to/from a
file is done with the bytes type. When writing/reading to/from a file
in text mode, you have to specify the encoding to use; the
encoding/decoding is then managed internally. As a general rule, all
"external communications" are done with bytes (byte arrays). It is the
programmer's responsibility to encode/decode where needed to convert
from/to strings.

Source files (.py) are considered UTF-8 encoded by default, but one can
declare the actual encoding at the top of the file in a special comment
tag. When a badly encoded character is found, an exception is raised at
parsing time. So literal strings are real strings, not bytes.

I think the Python 3 way of doing things is much more understandable and
really usable. On the Ada side, I've still not understood how to
correctly deal with all this stuff.

Note: in Python 3, the bytes type is not reserved for encoded "strings".
It is a versatile type for exactly what its name says: a byte array.
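Coming back to the Ada side, the closest standard piece I know of for
the bytes.decode()/str.encode() round trip is Ada.Strings.UTF_Encoding
(Ada 2012). A small, untested sketch follows; the procedure name and
the test data are invented, only the package and its Encode/Decode
functions come from the standard library:

   with Ada.Strings.UTF_Encoding;
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   with Ada.Wide_Wide_Text_IO;

   procedure Encoding_Demo is
      package UTF8 renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

      --  Raw octets as they would arrive from a file or a socket:
      --  the UTF-8 encoding of "h", U+00E9, "l", "l", "o".
      Raw : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
        'h' & Character'Val (16#C3#) & Character'Val (16#A9#) & "llo";

      --  Decode the octets into code points (the Python 3 "str" side).
      Text : constant Wide_Wide_String := UTF8.Decode (Raw);
   begin
      --  5 code points but 6 octets.
      Ada.Wide_Wide_Text_IO.Put_Line
        ("Code points:" & Natural'Wide_Wide_Image (Text'Length));
      Ada.Wide_Wide_Text_IO.Put_Line
        ("Octets     :" & Natural'Wide_Wide_Image (Raw'Length));

      --  Encode back to octets before handing them to the outside
      --  world (the Python 3 "bytes" side).
      pragma Assert (UTF8.Encode (Text) = Raw);
   end Encoding_Demo;

I/O would still be done with plain String octets; the Decode/Encode
calls mark the boundary, much as Python 3's text mode does internally.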