comp.lang.ada
* Ada and Unicode
@ 2021-04-17 22:03 DrPi
  2021-04-18  0:02 ` Luke A. Guest
                   ` (4 more replies)
  0 siblings, 5 replies; 63+ messages in thread
From: DrPi @ 2021-04-17 22:03 UTC (permalink / raw)


Hi,

I have a good knowledge of Unicode: code points, encodings...
What I don't understand is how to manage Unicode strings with Ada. I've
read part of the ARM and did some tests, without success.

I managed to be partly successful with source code encoded in Latin-1.
Any other encoding failed.
Any way to use source code encoded in UTF-8?
In some languages, it is possible to set a tag at the beginning of the
source file to tell the compiler which encoding to use.
I wasn't successful using the -gnatW8 switch, but maybe I made too many
tests and my brain was scrambled.

Even with source code encoded in Latin-1, I've not been able to manage
Unicode strings correctly.

What's the way to manage Unicode correctly?

Regards,
Nicolas


* Re: Ada and Unicode
  2021-04-17 22:03 Ada and Unicode DrPi
@ 2021-04-18  0:02 ` Luke A. Guest
  2021-04-19  9:09   ` DrPi
  2021-04-19  8:29 ` Maxim Reznik
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 63+ messages in thread
From: Luke A. Guest @ 2021-04-18  0:02 UTC (permalink / raw)



On 17/04/2021 23:03, DrPi wrote:
> Hi,
> 
> I have a good knowledge of Unicode : code points, encoding...
> What I don't understand is how to manage Unicode strings with Ada. I've 
> read part of ARM and did some tests without success.

It's a mess, IMO. I've complained about it before. The official stance
is that the standard requires a compiler to accept the ISO equivalent
of Unicode, and to implement a flawed system, especially the UTF-8
types:
http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-A-4-11.html

Unicode is a bit painful; I've messed about with it to some degree
here: https://github.com/Lucretia/uca.

There are other attempts:

1. http://www.dmitry-kazakov.de/ada/strings_edit.htm
2. https://github.com/reznikmm/matreshka (very heavy, many layers)
3. https://github.com/Blady-Com/UXStrings

I remember getting an exception converting from my Unicode_String to a
Wide_Wide_String for some reason ages ago.


* Re: Ada and Unicode
  2021-04-17 22:03 Ada and Unicode DrPi
  2021-04-18  0:02 ` Luke A. Guest
@ 2021-04-19  8:29 ` Maxim Reznik
  2021-04-19  9:28   ` DrPi
  2021-04-19 11:15   ` Simon Wright
  2021-04-19  9:08 ` Stephen Leake
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 63+ messages in thread
From: Maxim Reznik @ 2021-04-19  8:29 UTC (permalink / raw)


On Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi wrote:
> 
> Any way to use source code encoded in UTF-8 ? 

Yes, with GNAT just use the "-gnatW8" compiler flag (on the command line or in your project file):

--  main.adb:
with Ada.Wide_Wide_Text_IO;

procedure Main is
   Привет : constant Wide_Wide_String := "Привет";
begin
   Ada.Wide_Wide_Text_IO.Put_Line (Привет);
end Main;

$ gprbuild -gnatW8 main.adb
$ ./main 
Привет


> In some languages, it is possible to set a tag at the beginning of the 
> source file to direct the compiler which encoding to use. 

You can do this by putting the Wide_Character_Encoding pragma (a GNAT-specific pragma) at the top of the file. Take a look:

--  main.adb:
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO;

procedure Main is
   Привет : constant Wide_Wide_String := "Привет";
begin
   Ada.Wide_Wide_Text_IO.Put_Line (Привет);
end Main;

$ gprbuild main.adb
$ ./main 
Привет



> What's the way to manage Unicode correctly ? 
> 

You can use the Wide_Wide_String and Unbounded_Wide_Wide_String types to process Unicode strings. But this is not very handy. I use the Matreshka library for Unicode strings. It has a lot of features (regexps, string vectors, XML, JSON, databases, Web Servlets, a template engine, etc.). URL: https://forge.ada-ru.org/matreshka
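
A minimal sketch of that processing with only the standard library
(Ada 2012's Ada.Strings.UTF_Encoding.Wide_Wide_Strings; the literal
still needs -gnatW8):

--  demo.adb:
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Wide_Wide_Text_IO;

procedure Demo is
   use Ada.Strings.UTF_Encoding;

   --  Encode a Wide_Wide_String literal into UTF-8 octets...
   Bytes : constant UTF_8_String :=
     Wide_Wide_Strings.Encode ("Привет");

   --  ...and decode the octets back, one code point per element.
   Text : constant Wide_Wide_String := Wide_Wide_Strings.Decode (Bytes);
begin
   --  Text'Length counts code points (6); Bytes'Length counts octets (12).
   Ada.Wide_Wide_Text_IO.Put_Line (Text);
end Demo;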

> Regards, 
> Nicolas


* Re: Ada and Unicode
  2021-04-17 22:03 Ada and Unicode DrPi
  2021-04-18  0:02 ` Luke A. Guest
  2021-04-19  8:29 ` Maxim Reznik
@ 2021-04-19  9:08 ` Stephen Leake
  2021-04-19  9:34   ` Dmitry A. Kazakov
                     ` (3 more replies)
  2021-04-19 13:18 ` Vadim Godunko
  2021-04-19 22:40 ` Shark8
  4 siblings, 4 replies; 63+ messages in thread
From: Stephen Leake @ 2021-04-19  9:08 UTC (permalink / raw)


DrPi <314@drpi.fr> writes:

> Any way to use source code encoded in UTF-8 ?

      for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");

From the GNAT User's Guide, 4.3.1 Alphabetical List of All Switches:

-gnatic
     Identifier character set ('c' = 1/2/3/4/8/9/p/f/n/w).  For details
     of the possible selections for 'c', see "Character Set Control".

This applies to identifiers in the source code.

-gnatWe
     Wide character encoding method ('e' = n/h/u/s/e/8).

This applies to string and character literals.
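
Such per-file switches go in the Compiler package of a GNAT project
file; a minimal sketch, with a hypothetical project name:

--  example.gpr:
project Example is
   package Compiler is
      --  Only non_ascii.ads is compiled with the wide-identifier and
      --  UTF-8 literal switches; other sources keep the defaults.
      for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
   end Compiler;
end Example;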

> What's the way to manage Unicode correctly ?

There are two issues: Unicode in source code, which the compiler must
understand, and Unicode in strings, which your program must understand.

(I've never written a program that dealt with UTF strings other than
file names.)

-gnati8 tells the compiler that the source code uses UTF-8 encoding.

-gnatW8 tells the compiler that string literals use UTF-8 encoding.

package Ada.Strings.UTF_Encoding provides some facilities for dealing
with UTF. It does _not_ provide walking a string by code point, which
would seem necessary.
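
A hand-rolled sketch of such walking (this is not a library API; it
assumes well-formed UTF-8 and does not validate continuation octets):

--  walk.adb:
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Text_IO;

procedure Walk (Item : UTF_8_String) is
   I : Positive := Item'First;
begin
   while I <= Item'Last loop
      declare
         B   : constant Natural := Character'Pos (Item (I));
         --  The leading octet determines the sequence length.
         Len : constant Positive :=
           (if    B < 16#80# then 1    -- 0xxxxxxx
            elsif B < 16#E0# then 2    -- 110xxxxx
            elsif B < 16#F0# then 3    -- 1110xxxx
            else                  4);  -- 11110xxx
         CP  : Natural :=
           (case Len is
               when 1      => B,
               when 2      => B mod 16#20#,
               when 3      => B mod 16#10#,
               when others => B mod 16#08#);
      begin
         --  Each continuation octet (10xxxxxx) contributes six bits.
         for J in I + 1 .. I + Len - 1 loop
            CP := CP * 64 + Character'Pos (Item (J)) mod 64;
         end loop;
         Ada.Text_IO.Put_Line (Natural'Image (CP));  --  one code point
         I := I + Len;
      end;
   end loop;
end Walk;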

We could be more helpful if you show what you are trying to do, what
you've tried, and what errors you got.

-- 
-- Stephe


* Re: Ada and Unicode
  2021-04-18  0:02 ` Luke A. Guest
@ 2021-04-19  9:09   ` DrPi
  0 siblings, 0 replies; 63+ messages in thread
From: DrPi @ 2021-04-19  9:09 UTC (permalink / raw)


On 18/04/2021 02:02, Luke A. Guest wrote:
> 
> On 17/04/2021 23:03, DrPi wrote:
>> Hi,
>>
>> I have a good knowledge of Unicode : code points, encoding...
>> What I don't understand is how to manage Unicode strings with Ada. 
>> I've read part of ARM and did some tests without success.
> 
> It's a mess imo. I've complained about it before. The official stance is 
> that the standard defines that a compiler should accept the ISO 
> equivalent of Unicode and that a compiler should implement a flawed 
> system, especially UTF-8 types, 
> http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-A-4-11.html
> 
> Unicode is a bit painful, I've messed about with it to some degree here 
> https://github.com/Lucretia/uca.
> 
> There are other attempts:
> 
> 1. http://www.dmitry-kazakov.de/ada/strings_edit.htm
> 2. https://github.com/reznikmm/matreshka (very heavy, many layers)
> 3. https://github.com/Blady-Com/UXStrings
> 
> I remember getting an exception converting from my unicode_string to a 
> wide_wide string for some reason ages ago.
Thanks


* Re: Ada and Unicode
  2021-04-19  8:29 ` Maxim Reznik
@ 2021-04-19  9:28   ` DrPi
  2021-04-19 13:50     ` Maxim Reznik
  2021-04-19 11:15   ` Simon Wright
  1 sibling, 1 reply; 63+ messages in thread
From: DrPi @ 2021-04-19  9:28 UTC (permalink / raw)


On 19/04/2021 10:29, Maxim Reznik wrote:
> On Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi wrote:
>>
>> Any way to use source code encoded in UTF-8 ?
> 
> Yes, with GNAT just use "-gnatW8" for compiler flag (in command line or your project file):
> 
> --  main.adb:
> with Ada.Wide_Wide_Text_IO;
> 
> procedure Main is
>     Привет : constant Wide_Wide_String := "Привет";
> begin
>     Ada.Wide_Wide_Text_IO.Put_Line (Привет);
> end Main;
> 
> $ gprbuild -gnatW8 main.adb
> $ ./main
> Привет
> 
> 
>> In some languages, it is possible to set a tag at the beginning of the
>> source file to direct the compiler which encoding to use.
> 
> You can do this with putting the Wide_Character_Encoding pragma (This is a GNAT specific pragma) at the top of the file. Take a look:
> 
> --  main.adb:
> pragma Wide_Character_Encoding (UTF8);
> 
> with Ada.Wide_Wide_Text_IO;
> 
> procedure Main is
>     Привет : constant Wide_Wide_String := "Привет";
> begin
>     Ada.Wide_Wide_Text_IO.Put_Line (Привет);
> end Main;
> 
> $ gprbuild main.adb
> $ ./main
> Привет
> 
Wide and Wide_Wide characters and UTF-8 are two distinct things.
Wide and Wide_Wide characters are supposed to contain Unicode code 
points (Unicode characters).
UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters.
What's the purpose of "pragma Wide_Character_Encoding (UTF8);"?

> 
> 
>> What's the way to manage Unicode correctly ?
>>
> 
> You can use Wide_Wide_String and Unbounded_Wide_Wide_String type to process Unicode strings. But this is not very handy. I use the Matreshka library for Unicode strings. It has a lot of features (regexp, string vectors, XML, JSON, databases, Web Servlets, template engine, etc.). URL: https://forge.ada-ru.org/matreshka

Thanks
> 
>> Regards,
>> Nicolas


* Re: Ada and Unicode
  2021-04-19  9:08 ` Stephen Leake
@ 2021-04-19  9:34   ` Dmitry A. Kazakov
  2021-04-19 11:56   ` Luke A. Guest
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Dmitry A. Kazakov @ 2021-04-19  9:34 UTC (permalink / raw)


On 2021-04-19 11:08, Stephen Leake wrote:

> (I've never written a program that dealt with utf strings other than
> file names).
>   
> -gnati8 tells the compiler that the source code uses utf-8 encoding.
> 
> -gnatW8 tells the compiler that string literals use utf-8 encoding.

Both are recipes for disaster, especially the second. IMO the source
must be strictly 7-bit ASCII. It is less dangerous to have UTF-8 or
Latin-1 identifiers; they can at least be checked, except when used
for external names. But string literals would be a ticking bomb.

If you need a wider set than ASCII, use named constants and integer 
literals. E.g.

    --  UTF-8 encoding of the degree sign U+00B0 is the two octets C2 B0
    Celsius : constant String := Character'Val (16#C2#) &
                                 Character'Val (16#B0#) & 'C';
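
The same constant can also be built with the Ada 2012 standard encoder
instead of hand-written octets; a minimal sketch, with a hypothetical
package name:

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package Degree is
   --  Encode U+00B0 (degree sign) to its UTF-8 octets, then append 'C'.
   Celsius : constant String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
       ((1 => Wide_Wide_Character'Val (16#B0#))) & 'C';
end Degree;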

> We could be more helpful if you show what you are trying to do, you've
> tried, and what errors you got.

True

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Ada and Unicode
  2021-04-19  8:29 ` Maxim Reznik
  2021-04-19  9:28   ` DrPi
@ 2021-04-19 11:15   ` Simon Wright
  2021-04-19 11:50     ` Luke A. Guest
                       ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Simon Wright @ 2021-04-19 11:15 UTC (permalink / raw)


Maxim Reznik <reznikmm@gmail.com> writes:

> On Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi wrote:
>> 
>> Any way to use source code encoded in UTF-8 ? 
>
> Yes, with GNAT just use "-gnatW8" for compiler flag (in command line
> or your project file):

But don't use unit names containing international characters, at any
rate if you're (interested in compiling on) Windows or macOS:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114


* Re: Ada and Unicode
  2021-04-19 11:15   ` Simon Wright
@ 2021-04-19 11:50     ` Luke A. Guest
  2021-04-19 15:53     ` DrPi
  2022-04-03 19:20     ` Thomas
  2 siblings, 0 replies; 63+ messages in thread
From: Luke A. Guest @ 2021-04-19 11:50 UTC (permalink / raw)


On 19/04/2021 12:15, Simon Wright wrote:
> Maxim Reznik <reznikmm@gmail.com> writes:
> 
>> On Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi wrote:
>>>
>>> Any way to use source code encoded in UTF-8 ?
>>
>> Yes, with GNAT just use "-gnatW8" for compiler flag (in command line
>> or your project file):
> 
> But don't use unit names containing international characters, at any
> rate if you're (interested in compiling on) Windows or macOS:

There's no such thing as "character" any more, and we need to move away
from that. Unicode has the concept of a code point, which is a 32-bit
value, and any "character" as we know it, or glyph, can consist of
multiple code points.

In my lib, nowhere near ready (whether it ever will be, I don't know),
I define octets, Unicode_String (a UTF-8 string), which is an array of
octets, and Code_Points, which an iterator produces as it iterates over
those strings. I was intending to have an iterator for grapheme
clusters and other units.



* Re: Ada and Unicode
  2021-04-19  9:08 ` Stephen Leake
  2021-04-19  9:34   ` Dmitry A. Kazakov
@ 2021-04-19 11:56   ` Luke A. Guest
  2021-04-19 12:13     ` Luke A. Guest
  2021-04-19 12:52     ` Dmitry A. Kazakov
  2021-04-19 16:14   ` DrPi
  2022-04-16  2:32   ` Thomas
  3 siblings, 2 replies; 63+ messages in thread
From: Luke A. Guest @ 2021-04-19 11:56 UTC (permalink / raw)


On 19/04/2021 10:08, Stephen Leake wrote:
>> What's the way to manage Unicode correctly ?
> 
> There are two issues: Unicode in source code, that the compiler must
> understand, and Unicode in strings, that your program must understand.

And this is where the Ada standard gets it wrong, in the encodings
package re UTF-8.

Unicode is a superset of 7-bit ASCII, not Latin-1. The high bit in the
leading octet indicates whether there are trailing octets. See
https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data
layout. The first 128 "characters" in Unicode match those of 7-bit
ASCII, not 8-bit ASCII, and certainly not Latin-1. Therefore this:

package Ada.Strings.UTF_Encoding
    ...
    subtype UTF_8_String is String;
    ...
end Ada.Strings.UTF_Encoding;

was absolutely and totally wrong.
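
A sketch of the resulting pitfall: because UTF_8_String is a mere
subtype of String, a Latin-1 encoded String flows into it without any
check at all:

with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Pitfall is
   --  The degree sign in Latin-1 is the single octet 16#B0#.
   Latin_1 : constant String := (1 => Character'Val (16#B0#));

   --  Compiles and runs: nothing checks that Bytes holds well-formed
   --  UTF-8 (it doesn't).
   Bytes : constant UTF_8_String := Latin_1;
begin
   null;  --  A later Wide_Wide_Strings.Decode (Bytes) would raise
          --  Encoding_Error at run time instead of failing earlier.
end Pitfall;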


* Re: Ada and Unicode
  2021-04-19 11:56   ` Luke A. Guest
@ 2021-04-19 12:13     ` Luke A. Guest
  2021-04-19 15:48       ` DrPi
  2021-04-19 12:52     ` Dmitry A. Kazakov
  1 sibling, 1 reply; 63+ messages in thread
From: Luke A. Guest @ 2021-04-19 12:13 UTC (permalink / raw)



On 19/04/2021 12:56, Luke A. Guest wrote:

> 
> package Ada.Strings.UTF_Encoding
>    ...
>    subtype UTF_8_String is String;
>    ...
> end Ada.Strings.UTF_Encoding;
> 
> Was absolutely and totally wrong.

...and, before someone comes back with "but all the upper half of
Latin-1 is represented and has the same values": yes, they do, in code
points, which are 32-bit numbers. In UTF-8 they are encoded as 2 octets!
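
A sketch of that arithmetic with the standard encoder: U+00E9 is the
single octet 16#E9# in Latin-1, but two octets in UTF-8:

with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Two_Octets is
   use Ada.Strings.UTF_Encoding;

   E_Acute : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#00E9#));

   --  U+00E9 encodes to the two octets 16#C3# 16#A9#.
   Bytes : constant UTF_8_String := Wide_Wide_Strings.Encode (E_Acute);
begin
   Ada.Text_IO.Put_Line (Natural'Image (Bytes'Length));  --  prints 2
end Two_Octets;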


* Re: Ada and Unicode
  2021-04-19 11:56   ` Luke A. Guest
  2021-04-19 12:13     ` Luke A. Guest
@ 2021-04-19 12:52     ` Dmitry A. Kazakov
  2021-04-19 13:00       ` Luke A. Guest
  1 sibling, 1 reply; 63+ messages in thread
From: Dmitry A. Kazakov @ 2021-04-19 12:52 UTC (permalink / raw)


On 2021-04-19 13:56, Luke A. Guest wrote:
> On 19/04/2021 10:08, Stephen Leake wrote:
>>> What's the way to manage Unicode correctly ?
>>
>> There are two issues: Unicode in source code, that the compiler must
>> understand, and Unicode in strings, that your program must understand.
> 
> And this is there the Ada standard gets it wrong, in the encodings 
> package re utf-8.
> 
> Unicode is a superset of 7-bit ASCII not Latin 1. The high bit in the 
> leading octet indicates whether there are trailing octets. See 
> https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data 
> layout. The first 128 "characters" in Unicode match that of 7-bit ASCII, 
> not 8-bit ASCII, and certainly not Latin 1. Therefore this:
> 
> package Ada.Strings.UTF_Encoding
>    ...
>    subtype UTF_8_String is String;
>    ...
> end Ada.Strings.UTF_Encoding;
> 
> Was absolutely and totally wrong.

It is a practical solution. The Ada type system cannot express
differently represented/constrained string/array/vector subtypes.
Ignoring Latin-1 and using String as if it were an array of octets is
the best available solution.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Ada and Unicode
  2021-04-19 12:52     ` Dmitry A. Kazakov
@ 2021-04-19 13:00       ` Luke A. Guest
  2021-04-19 13:10         ` Dmitry A. Kazakov
                           ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Luke A. Guest @ 2021-04-19 13:00 UTC (permalink / raw)




On 19/04/2021 13:52, Dmitry A. Kazakov wrote:

 > It is practical solution. Ada type system cannot express differently
 > represented/constrained string/array/vector subtypes. Ignoring Latin-1
 > and using String as if it were an array of octets is the best available
 > solution.
 >

They're different types and should be incompatible, because, well, they
are. What does Ada have that allows for this that other languages
don't? Oh yeah! Types!


* Re: Ada and Unicode
  2021-04-19 13:00       ` Luke A. Guest
@ 2021-04-19 13:10         ` Dmitry A. Kazakov
  2021-04-19 13:15           ` Luke A. Guest
  2021-04-19 13:24         ` J-P. Rosen
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 63+ messages in thread
From: Dmitry A. Kazakov @ 2021-04-19 13:10 UTC (permalink / raw)


On 2021-04-19 14:55, Luke A. Guest wrote:
 >
 > On 19/04/2021 13:52, Dmitry A. Kazakov wrote:
 >
 >> It is practical solution. Ada type system cannot express differently
 >> represented/constrained string/array/vector subtypes. Ignoring Latin-1
 >> and using String as if it were an array of octets is the best available
 >> solution.
 >>
 >
 > They're different types and should be incompatible, because, well,
 > they are. What does Ada have that allows for this that other languages
 > doesn't? Oh yeah! Types!

They are subtypes, differently constrained, like Positive and Integer.
The operations are the same; the values are differently constrained. It
does not make sense to consider ASCII 'a', Latin-1 'a', and UTF-8 'a'
different. It is the same glyph, differently encoded. Encoding is a
representation aspect, ergo out of the interface!

BTW, a subtype is a type.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Ada and Unicode
  2021-04-19 13:10         ` Dmitry A. Kazakov
@ 2021-04-19 13:15           ` Luke A. Guest
  2021-04-19 13:31             ` Dmitry A. Kazakov
  0 siblings, 1 reply; 63+ messages in thread
From: Luke A. Guest @ 2021-04-19 13:15 UTC (permalink / raw)


On 19/04/2021 14:10, Dmitry A. Kazakov wrote:

>> They're different types and should be incompatible, because, well,
>> they are. What does Ada have that allows for this that other languages
>> doesn't? Oh yeah! Types!
> 
> They are subtypes, differently constrained, like Positive and Integer. 

No, they're not. They're only subtypes and therefore compatible. The
UTF string isn't constrained in any other way.

> The operations are the same; the values are differently constrained.
> It does not make sense to consider ASCII 'a', Latin-1 'a', and UTF-8
> 'a' different. It is the same glyph, differently encoded. Encoding is
> a representation aspect, ergo out of the interface!

As I already said, the glyph is not part of Unicode. The single code
point character concept doesn't exist anymore.

> 
> BTW, subtype is a type.
> 

A subtype is a compatible type.


* Re: Ada and Unicode
  2021-04-17 22:03 Ada and Unicode DrPi
                   ` (2 preceding siblings ...)
  2021-04-19  9:08 ` Stephen Leake
@ 2021-04-19 13:18 ` Vadim Godunko
  2022-04-03 16:51   ` Thomas
  2021-04-19 22:40 ` Shark8
  4 siblings, 1 reply; 63+ messages in thread
From: Vadim Godunko @ 2021-04-19 13:18 UTC (permalink / raw)


On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:
> 
> I have a good knowledge of Unicode : code points, encoding... 
> What I don't understand is how to manage Unicode strings with Ada. I've 
> read part of ARM and did some tests without success. 
> 
> I managed to be partly successful with source code encoded in Latin-1. 
> Any other encoding failed. 
> Any way to use source code encoded in UTF-8 ? 
> In some languages, it is possible to set a tag at the beginning of the 
> source file to direct the compiler which encoding to use. 
> I wasn't successful using -gnatW8 switch. But maybe I made to many tests 
> and my brain was scrambled. 
> 
> Even with source code encoded in Latin-1, I've not been able to manage 
> Unicode strings correctly. 
> 
> What's the way to manage Unicode correctly ? 
> 

Ada doesn't have good Unicode support. :( So you need to find a suitable set of "workarounds".

There are a few different aspects of Unicode support that need to be considered:

1. Representation of string literals. If you want to use non-ASCII characters in source code, you need to use the -gnatW8 switch, and it will require use of Wide_Wide_String everywhere.
2. Internal representation during application execution. You are forced to use Wide_Wide_String by the previous step, so it will be UCS-4/UTF-32.
3. Text encoding/decoding on input/output operations. GNAT allows UTF-8 to be used by providing a magic string for the Form parameter of Text_IO (see the sketch after this list).
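
A minimal sketch of point 3, assuming a hypothetical input.txt;
"WCEM=8" is GNAT's implementation-defined form string that selects
UTF-8 as the wide character encoding method:

with Ada.Wide_Wide_Text_IO; use Ada.Wide_Wide_Text_IO;

procedure Read_UTF_8 is
   F : File_Type;
begin
   --  The file's bytes are decoded from UTF-8 on input.
   Open (F, In_File, "input.txt", Form => "WCEM=8");
   while not End_Of_File (F) loop
      Put_Line (Get_Line (F));  --  Get_Line yields decoded code points
   end loop;
   Close (F);
end Read_UTF_8;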

It is hard to say that this is a reasonable set of features for the modern world. To fix some of the drawbacks of the current situation, we are developing a new text processing library, known as VSS.

https://github.com/AdaCore/VSS

At the current stage it provides an encoding-independent API for text manipulation, encoder and decoder APIs for I/O, and a JSON reader/writer; regexp support should come soon.

An encoding-independent API means that the application always uses Unicode characters to process text, independently of the real encoding used to store the information in memory (UTF-8 is used for now; UTF-16 will be added later for interoperability with the Windows API and WASM). Encoders and decoders translate from/to different encodings when the application exchanges information with the world.


* Re: Ada and Unicode
  2021-04-19 13:00       ` Luke A. Guest
  2021-04-19 13:10         ` Dmitry A. Kazakov
@ 2021-04-19 13:24         ` J-P. Rosen
  2021-04-20 19:13           ` Randy Brukardt
  2022-04-03 18:04           ` Thomas
  2021-04-19 16:07         ` DrPi
  2021-04-20 19:06         ` Randy Brukardt
  3 siblings, 2 replies; 63+ messages in thread
From: J-P. Rosen @ 2021-04-19 13:24 UTC (permalink / raw)


On 19/04/2021 15:00, Luke A. Guest wrote:
> They're different types and should be incompatible, because, well, they 
> are. What does Ada have that allows for this that other languages 
> doesn't? Oh yeah! Types!

They are not so different. For example, you may read the first line of a 
file in a string, then discover that it starts with a BOM, and thus 
decide it is UTF-8.
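
For what it's worth, the standard library can do that sniffing:
Ada.Strings.UTF_Encoding.Encoding inspects a leading BOM. A minimal
sketch:

with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Text_IO;

procedure Sniff is
   --  Stand-in for a first line read from a file, with a UTF-8 BOM.
   First_Line : constant String := BOM_8 & "with Ada.Text_IO;";
begin
   --  With a non-UTF-8 default, a UTF_8 result proves a BOM was seen.
   if Encoding (First_Line, Default => UTF_16BE) = UTF_8 then
      Ada.Text_IO.Put_Line ("UTF-8 source");
   end if;
end Sniff;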

BTW, the very first version of this AI had different types, but the ARG 
felt that it would just complicate the interface for the sake of abusive 
"purity".

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52
https://www.adalog.fr


* Re: Ada and Unicode
  2021-04-19 13:15           ` Luke A. Guest
@ 2021-04-19 13:31             ` Dmitry A. Kazakov
  2022-04-03 17:24               ` Thomas
  0 siblings, 1 reply; 63+ messages in thread
From: Dmitry A. Kazakov @ 2021-04-19 13:31 UTC (permalink / raw)


On 2021-04-19 15:15, Luke A. Guest wrote:
> On 19/04/2021 14:10, Dmitry A. Kazakov wrote:
> 
>>> They're different types and should be incompatible, because, well,
>>> they are. What does Ada have that allows for this that other languages
>>> doesn't? Oh yeah! Types!
>>
>> They are subtypes, differently constrained, like Positive and Integer. 
> 
> No they're not. They're subtypes only and therefore compatible. The UTF 
> string isn't constrained in any other ways.

Of course it is. There could be string encodings that have no Unicode
counterparts and are thus missing in UTF-8/16.

>> The operations are the same; the values are differently constrained.
>> It does not make sense to consider ASCII 'a', Latin-1 'a', and UTF-8
>> 'a' different. It is the same glyph, differently encoded. Encoding is
>> a representation aspect, ergo out of the interface!
> 
> As I already said in Unicode the glyph is not part part of Unicode. The 
> single code point character concept doesn't exist anymore.

It does not matter from a practical point of view. Some of Unicode's
idiosyncrasies are better ignored.

>> BTW, subtype is a type.

> subtype is a compatible type.

An Ada subtype is both a sub- and a supertype, i.e. substitutable [or
so the compiler thinks] in both directions. A derived tagged type is
substitutable in only one direction.

Neither is fully "compatible", because otherwise there would be no
reason to have an exactly identical thing.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Ada and Unicode
  2021-04-19  9:28   ` DrPi
@ 2021-04-19 13:50     ` Maxim Reznik
  2021-04-19 15:51       ` DrPi
  0 siblings, 1 reply; 63+ messages in thread
From: Maxim Reznik @ 2021-04-19 13:50 UTC (permalink / raw)


On Monday, April 19, 2021 at 12:28:39 UTC+3, DrPi wrote:
> On 19/04/2021 10:29, Maxim Reznik wrote:
> > On Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi wrote:
> >> In some languages, it is possible to set a tag at the beginning of the 
> >> source file to direct the compiler which encoding to use. 
> > 
> > You can do this with putting the Wide_Character_Encoding pragma (This is a GNAT specific pragma) at the top of the file.
> >
> Wide and Wide_Wide characters and UTF-8 are two distinct things. 
> Wide and Wide_Wide characters are supposed to contain Unicode code 
> points (Unicode characters). 
> UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters.

Yes, it is.

> What's the purpose of "pragma Wide_Character_Encoding (UTF8);" ?

This pragma specifies the character encoding to be used in program source text...

https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/implementation_defined_pragmas.html#pragma-wide-character-encoding

I would also suggest reading this article:

https://two-wrongs.com/unicode-strings-in-ada-2012

Best regards,


* Re: Ada and Unicode
  2021-04-19 12:13     ` Luke A. Guest
@ 2021-04-19 15:48       ` DrPi
  0 siblings, 0 replies; 63+ messages in thread
From: DrPi @ 2021-04-19 15:48 UTC (permalink / raw)


On 19/04/2021 14:13, Luke A. Guest wrote:
> 
> On 19/04/2021 12:56, Luke A. Guest wrote:
> 
>>
>> package Ada.Strings.UTF_Encoding
>>    ...
>>    subtype UTF_8_String is String;
>>    ...
>> end Ada.Strings.UTF_Encoding;
>>
>> Was absolutely and totally wrong.
> 
> ...and, before someone comes back with "but all the upper half of
> Latin-1 is represented and has the same values": yes, they do, in code
> points, which are 32-bit numbers. In UTF-8 they are encoded as 2 octets!
A code point has no size. Like universal integers in Ada.


* Re: Ada and Unicode
  2021-04-19 13:50     ` Maxim Reznik
@ 2021-04-19 15:51       ` DrPi
  0 siblings, 0 replies; 63+ messages in thread
From: DrPi @ 2021-04-19 15:51 UTC (permalink / raw)


On 19/04/2021 15:50, Maxim Reznik wrote:
> On Monday, April 19, 2021 at 12:28:39 UTC+3, DrPi wrote:
>> On 19/04/2021 10:29, Maxim Reznik wrote:
>>> On Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi wrote:
>>>> In some languages, it is possible to set a tag at the beginning of the
>>>> source file to direct the compiler which encoding to use.
>>>
>>> You can do this with putting the Wide_Character_Encoding pragma (This is a GNAT specific pragma) at the top of the file.
>>>
>> Wide and Wide_Wide characters and UTF-8 are two distinct things.
>> Wide and Wide_Wide characters are supposed to contain Unicode code
>> points (Unicode characters).
>> UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters.
> 
> Yes, it is.
> 
>> What's the purpose of "pragma Wide_Character_Encoding (UTF8);" ?
> 
> This pragma specifies the character encoding to be used in program source text...
> 
> https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/implementation_defined_pragmas.html#pragma-wide-character-encoding

Good to know.

> 
> I would suggest also this article to read:
> 
> https://two-wrongs.com/unicode-strings-in-ada-2012
> 
I think I've already read it. But will do again.

> Best regards,
> 
Thanks


* Re: Ada and Unicode
  2021-04-19 11:15   ` Simon Wright
  2021-04-19 11:50     ` Luke A. Guest
@ 2021-04-19 15:53     ` DrPi
  2022-04-03 19:20     ` Thomas
  2 siblings, 0 replies; 63+ messages in thread
From: DrPi @ 2021-04-19 15:53 UTC (permalink / raw)


On 19/04/2021 13:15, Simon Wright wrote:
> Maxim Reznik <reznikmm@gmail.com> writes:
> 
>> On Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi wrote:
>>>
>>> Any way to use source code encoded in UTF-8 ?
>>
>> Yes, with GNAT just use "-gnatW8" for compiler flag (in command line
>> or your project file):
> 
> But don't use unit names containing international characters, at any
> rate if you're (interested in compiling on) Windows or macOS:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
> 
Good to know.
Thanks


* Re: Ada and Unicode
  2021-04-19 13:00       ` Luke A. Guest
  2021-04-19 13:10         ` Dmitry A. Kazakov
  2021-04-19 13:24         ` J-P. Rosen
@ 2021-04-19 16:07         ` DrPi
  2021-04-20 19:06         ` Randy Brukardt
  3 siblings, 0 replies; 63+ messages in thread
From: DrPi @ 2021-04-19 16:07 UTC (permalink / raw)


On 19/04/2021 15:00, Luke A. Guest wrote:
> 
> 
> On 19/04/2021 13:52, Dmitry A. Kazakov wrote:
> 
>> It is practical solution. Ada type system cannot express differently
>> represented/constrained string/array/vector subtypes. Ignoring Latin-1
>> and using String as if it were an array of octets is the best available
>> solution.
>>
> 
> They're different types and should be incompatible, because, well, they 
> are. What does Ada have that allows for this that other languages 
> doesn't? Oh yeah! Types!
I agree.

In Python 2, encoded and "decoded" strings are of the same type, "str".
Bad design.

In Python 3, "decoded" strings are of type "str" and encoded strings
are of type "bytes" (a byte array). They are different things and can't
be assigned one to the other. Much clearer for the programmer.
It should be the same in Ada. Different types.


* Re: Ada and Unicode
  2021-04-19  9:08 ` Stephen Leake
  2021-04-19  9:34   ` Dmitry A. Kazakov
  2021-04-19 11:56   ` Luke A. Guest
@ 2021-04-19 16:14   ` DrPi
  2021-04-19 17:12     ` Björn Lundin
  2022-04-16  2:32   ` Thomas
  3 siblings, 1 reply; 63+ messages in thread
From: DrPi @ 2021-04-19 16:14 UTC (permalink / raw)


On 19/04/2021 11:08, Stephen Leake wrote:
> DrPi <314@drpi.fr> writes:
> 
>> Any way to use source code encoded in UTF-8 ?
> 
>        for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
> 
That's interesting.
Using these switches at the project level is not OK. Project source
files do not always use the same encoding, especially when using
libraries.
Using these switches at the source level is better. A little bit
complicated to use, but better.

> From the GNAT User's Guide, 4.3.1 Alphabetical List of All Switches:
> 
> -gnatic
>       Identifier character set ('c' = 1/2/3/4/8/9/p/f/n/w).  For details
>       of the possible selections for 'c', see "Character Set Control".
> 
> This applies to identifiers in the source code.
> 
> -gnatWe
>       Wide character encoding method ('e' = n/h/u/s/e/8).
> 
> This applies to string and character literals.
> 
>> What's the way to manage Unicode correctly ?
> 
> There are two issues: Unicode in source code, that the compiler must
> understand, and Unicode in strings, that your program must understand.
> 
> (I've never written a program that dealt with utf strings other than
> file names).
>   
> -gnati8 tells the compiler that the source code uses utf-8 encoding.
> 
> -gnatW8 tells the compiler that string literals use utf-8 encoding.
> 
> package Ada.Strings.UTF_Encoding provides some facilities for dealing
> with utf. It does _not_ provide walking a string by code point, which
> would seem necessary.
> 
> We could be more helpful if you show what you are trying to do, you've
> tried, and what errors you got.
> 


* Re: Ada and Unicode
  2021-04-19 16:14   ` DrPi
@ 2021-04-19 17:12     ` Björn Lundin
  2021-04-19 19:44       ` DrPi
  0 siblings, 1 reply; 63+ messages in thread
From: Björn Lundin @ 2021-04-19 17:12 UTC (permalink / raw)


On 2021-04-19 18:14, DrPi wrote:
>>        for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
>>
> That's interesting.
> Using these switches at project level is not OK. Project source files 
> not always use the same encoding. Especially when using libraries.
> Using these switches at source level is better. A little bit complicated 
> to use but better.

You did understand that the above setting only applies to the file 
called 'non_ascii.ads' - and not to the rest of the files?



-- 
Björn


* Re: Ada and Unicode
  2021-04-19 17:12     ` Björn Lundin
@ 2021-04-19 19:44       ` DrPi
  0 siblings, 0 replies; 63+ messages in thread
From: DrPi @ 2021-04-19 19:44 UTC (permalink / raw)


On 19/04/2021 19:12, Björn Lundin wrote:
> On 2021-04-19 18:14, DrPi wrote:
>>>        for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
>>>
>> That's interesting.
>> Using these switches at project level is not OK. Project source files 
>> not always use the same encoding. Especially when using libraries.
>> Using these switches at source level is better. A little bit 
>> complicated to use but better.
> 
> You did understand that the above setting only applies to the file 
> called 'non_ascii.ads' - and not to the rest of the files?
> 
> 
> 
Yes, that's what I've understood.


* Re: Ada and Unicode
  2021-04-17 22:03 Ada and Unicode DrPi
                   ` (3 preceding siblings ...)
  2021-04-19 13:18 ` Vadim Godunko
@ 2021-04-19 22:40 ` Shark8
  2021-04-20 15:05   ` Simon Wright
  4 siblings, 1 reply; 63+ messages in thread
From: Shark8 @ 2021-04-19 22:40 UTC (permalink / raw)


On Saturday, April 17, 2021 at 4:03:14 PM UTC-6, DrPi wrote:
> Hi, 
> 
> I have a good knowledge of Unicode : code points, encoding... 
> What I don't understand is how to manage Unicode strings with Ada. I've 
> read part of ARM and did some tests without success. 
> 
> I managed to be partly successful with source code encoded in Latin-1. 
Ah.
Yes, this is an issue in GNAT, and possibly other compilers.
The easiest method for me is to right-click the text buffer for the file in GPS, click Properties in the menu that pops up, then in the dialog select "Unicode UTF-#" from the Character Set drop-down.
> Any other encoding failed. 
> Any way to use source code encoded in UTF-8 ? 
There's the above method with GPS.
IIRC there's also a Pragma and a compiler-flag for GNAT.

It's actually a non-issue for Byron, because the file-reader does a BOM-check [IIRC defaulting to ASCII in the absence of a BOM] and outputs to the lexer the Wide_Wide_Character equivalent of the input-encoding.
See: https://github.com/OneWingedShark/Byron/blob/master/src/reader/readington.adb

> In some languages, it is possible to set a tag at the beginning of the 
> source file to direct the compiler which encoding to use. 
> I wasn't successful using -gnatW8 switch. But maybe I made to many tests 
> and my brain was scrambled.
IIRC the -gnatW8 flag sets it to UTF-8, so if your editor is saving in something else, like UTF-16 BE, the compiler [probably] won't read it correctly.

> Even with source code encoded in Latin-1, I've not been able to manage 
> Unicode strings correctly. 
> 
> What's the way to manage Unicode correctly ? 
I typically use the GPS file/properties method above, and then I might also use the pragma.


* Re: Ada and Unicode
  2021-04-19 22:40 ` Shark8
@ 2021-04-20 15:05   ` Simon Wright
  2021-04-20 19:17     ` Randy Brukardt
  0 siblings, 1 reply; 63+ messages in thread
From: Simon Wright @ 2021-04-20 15:05 UTC (permalink / raw)


Shark8 <onewingedshark@gmail.com> writes:

> It's actually a non-issue for Byron, because the file-reader does a
> BOM-check [IIRC defaulting to ASCII in the absence of a BOM]

GNAT does a BOM-check also. gnatchop does one better, carrying the BOM
from the top of the input file through to each output file.


* Re: Ada and Unicode
  2021-04-19 13:00       ` Luke A. Guest
                           ` (2 preceding siblings ...)
  2021-04-19 16:07         ` DrPi
@ 2021-04-20 19:06         ` Randy Brukardt
  2022-04-03 18:37           ` Thomas
  3 siblings, 1 reply; 63+ messages in thread
From: Randy Brukardt @ 2021-04-20 19:06 UTC (permalink / raw)


"Luke A. Guest" <laguest@archeia.com> wrote in message 
news:s5jute$1s08$1@gioia.aioe.org...
>
>
> On 19/04/2021 13:52, Dmitry A. Kazakov wrote:
>
> > It is practical solution. Ada type system cannot express differently
> > represented/constrained string/array/vector subtypes. Ignoring Latin-1 and
> > using String as if it were an array of octets is the best available
> > solution.
> >
>
> They're different types and should be incompatible, because, well, they 
> are. What does Ada have that allows for this that other languages doesn't? 
> Oh yeah! Types!

If they're incompatible, you need an automatic way to convert between 
representations, since these are all views of the same thing (an abstract 
string type). You really don't want 35 versions of Open each taking a 
different string type.

It's the fact that Ada can't do this that makes Unbounded_Strings unusable 
(well, barely usable). Ada 202x fixes the literal problem at least, but we'd 
have to completely abandon Unbounded_Strings and use a different library 
design in order for it to allow literals. And if you're going to do that, 
you might as well do something about UTF-8 as well -- but now you're going 
to need even more conversions. Yuck.

I think the only true solution here would be based on a proper abstract 
Root_String type. But that wouldn't work in Ada, since it would be 
incompatible with all of the existing code out there. Probably would have to 
wait for a follow-on language.

                 Randy.



* Re: Ada and Unicode
  2021-04-19 13:24         ` J-P. Rosen
@ 2021-04-20 19:13           ` Randy Brukardt
  2022-04-03 18:04           ` Thomas
  1 sibling, 0 replies; 63+ messages in thread
From: Randy Brukardt @ 2021-04-20 19:13 UTC (permalink / raw)


"J-P. Rosen" <rosen@adalog.fr> wrote in message 
news:s5k0ai$bb5$1@dont-email.me...
> On 19/04/2021 15:00, Luke A. Guest wrote:
>> They're different types and should be incompatible, because, well, they 
>> are. What does Ada have that allows for this that other languages 
>> doesn't? Oh yeah! Types!
>
> They are not so different. For example, you may read the first line of a 
> file in a string, then discover that it starts with a BOM, and thus decide 
> it is UTF-8.
>
> BTW, the very first version of this AI had different types, but the ARG 
> felt that it would just complicate the interface for the sake of abusive 
> "purity".

Unfortunately, that was the first instance that showed the beginning of the 
end for Ada. If I remember correctly (and I may not ;-), that came from some 
people who were wedded to the Linux model where nothing is checked (or IMHO, 
typed). For them, a String is simply a bucket of octets. That prevented 
putting an encoding of any sort of any type on file names ("it should just 
work on Linux, that's what people expect"). The rest follows from that.

Those of us who care about strong typing were disgusted, the result 
essentially does not work on Windows or MacOS (which do check the content of 
file names - as you can see in GNAT compiling units with non-Latin-1 
characters in their names), and I don't really expect any recovery from 
that.

                              Randy.



* Re: Ada and Unicode
  2021-04-20 15:05   ` Simon Wright
@ 2021-04-20 19:17     ` Randy Brukardt
  2021-04-20 20:04       ` Simon Wright
  0 siblings, 1 reply; 63+ messages in thread
From: Randy Brukardt @ 2021-04-20 19:17 UTC (permalink / raw)


"Simon Wright" <simon@pushface.org> wrote in message 
news:lybla9574t.fsf@pushface.org...
> Shark8 <onewingedshark@gmail.com> writes:
>
>> It's actually a non-issue for Byron, because the file-reader does a
>> BOM-check [IIRC defaulting to ASCII in the absence of a BOM]
>
> GNAT does a BOM-check also. gnatchop does one better, carrying the BOM
> from the top of the input file through to each output file.

That's what the documentation says, but it didn't work on ACATS source files 
(the few which use Unicode start with a BOM). I had to write a bunch of 
extra code in the script generator to stick the options on the Unicode files 
(that worked). Perhaps that's been fixed since, but I wouldn't trust it 
(burned once, twice shy).

                                   Randy.



* Re: Ada and Unicode
  2021-04-20 19:17     ` Randy Brukardt
@ 2021-04-20 20:04       ` Simon Wright
  0 siblings, 0 replies; 63+ messages in thread
From: Simon Wright @ 2021-04-20 20:04 UTC (permalink / raw)


"Randy Brukardt" <randy@rrsoftware.com> writes:

> "Simon Wright" <simon@pushface.org> wrote in message 
> news:lybla9574t.fsf@pushface.org...
>> Shark8 <onewingedshark@gmail.com> writes:
>>
>>> It's actually a non-issue for Byron, because the file-reader does a
>>> BOM-check [IIRC defaulting to ASCII in the absence of a BOM]
>>
>> GNAT does a BOM-check also. gnatchop does one better, carrying the BOM
>> from the top of the input file through to each output file.
>
> That's what the documentation says, but it didn't work on ACATS source files 
> (the few which use Unicode start with a BOM). I had to write a bunch of 
> extra code in the script generator to stick the options on the Unicode files 
> (that worked). Perhaps that's been fixed since, but I wouldn't trust it 
> (burned once, twice shy).

It does now: just checked again with c250001, c250002.


* Re: Ada and Unicode
  2021-04-19 13:18 ` Vadim Godunko
@ 2022-04-03 16:51   ` Thomas
  2023-04-04  0:02     ` Thomas
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas @ 2022-04-03 16:51 UTC (permalink / raw)


In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>,
 Vadim Godunko <vgodunko@gmail.com> wrote:

> On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:

> > What's the way to manage Unicode correctly ? 
> > 
> 
> Ada doesn't have good Unicode support. :( So, you need to find suitable set 
> of "workarounds".
> 
> There are few different aspects of Unicode support need to be considered:
> 
> 1. Representation of string literals. If you want to use non-ASCII characters 
> in source code, you need to use -gnatW8 switch and it will require use of 
> Wide_Wide_String everywhere.
> 2. Internal representation during application execution. You are forced to 
> use Wide_Wide_String at previous step, so it will be UCS4/UTF32.

> It is hard to say that it is reasonable set of features for modern world.

I don't think Ada is lacking that much for good UTF-8 support.

The cardinal point is to be able to fill an
Ada.Strings.UTF_Encoding.UTF_8_String with a literal.
(Once you have that, trying to fill a Standard.String with a
non-Latin-1 character will give an error, which I think is fine :-) )

Does Ada 202x allow it?

If not, it would probably be easier if it were
    type UTF_8_String is new String;
instead of
    subtype UTF_8_String is String;


For all subprograms it's quite easy:
we just have to duplicate them with the new type, and mark the old
ones as Obsolescent.

But now that "subtype UTF_8_String" exists, I don't know what we can do
for types.
(Is the only way to choose a new name?)


> To 
> fix some of drawbacks of current situation we are developing new text 
> processing library, know as VSS. 
> 
> https://github.com/AdaCore/VSS

(Are you working at AdaCore?)

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/


* Re: Ada and Unicode
  2021-04-19 13:31             ` Dmitry A. Kazakov
@ 2022-04-03 17:24               ` Thomas
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas @ 2022-04-03 17:24 UTC (permalink / raw)


In article <s5k0ne$opv$1@gioia.aioe.org>,
 "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote:

> On 2021-04-19 15:15, Luke A. Guest wrote:
> > On 19/04/2021 14:10, Dmitry A. Kazakov wrote:
> > 
> >>> They're different types and should be incompatible, because, well,
> >>> they are. What does Ada have that allows for this that other languages
> >>> doesn't? Oh yeah! Types!
> >>
> >> They are subtypes, differently constrained, like Positive and Integer. 
> > 
> > No they're not. They're subtypes only and therefore compatible. The UTF 
> > string isn't constrained in any other ways.
> 
> Of course it is. There could be string encodings that have no Unicode 
> counterparts and thus missing in UTF-8/16.

1. There is a missing validity function to tell whether a given
UTF_8_String is valid or not, plus a Dynamic_Predicate on the subtype
UTF_8_String connected to that function (see the sketch after this
list).

2. More important: (when non-ASCII,) a valid UTF_8_String *does not*
represent the same thing as itself converted to String.
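
A sketch of the missing check from point 1 (the names are made up, and
the well-formedness test is simplified: it does not reject every
overlong or surrogate form):

with Ada.Strings.UTF_Encoding;

package UTF_8_Checks is
   use Ada.Strings.UTF_Encoding;

   function Is_Valid_UTF_8 (Item : UTF_8_String) return Boolean;

   subtype Checked_UTF_8_String is UTF_8_String
     with Dynamic_Predicate => Is_Valid_UTF_8 (Checked_UTF_8_String);
end UTF_8_Checks;

package body UTF_8_Checks is
   function Is_Valid_UTF_8 (Item : UTF_8_String) return Boolean is
      I : Positive := Item'First;
   begin
      while I <= Item'Last loop
         declare
            B : constant Natural := Character'Pos (Item (I));
            N : Natural;  --  continuation octets expected
         begin
            case B is
               when 16#00# .. 16#7F# => N := 0;
               when 16#C2# .. 16#DF# => N := 1;
               when 16#E0# .. 16#EF# => N := 2;
               when 16#F0# .. 16#F4# => N := 3;
               when others           => return False;
            end case;
            if I + N > Item'Last then
               return False;  --  truncated sequence
            end if;
            for J in I + 1 .. I + N loop
               if Character'Pos (Item (J)) not in 16#80# .. 16#BF# then
                  return False;  --  not a continuation octet
               end if;
            end loop;
            I := I + N + 1;
         end;
      end loop;
      return True;
   end Is_Valid_UTF_8;
end UTF_8_Checks;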

> 
> >> The operations are the same; the values are differently constrained.
> >> It does not make sense to consider ASCII 'a', Latin-1 'a', and UTF-8
> >> 'a' different. It is the same glyph, differently encoded. Encoding is
> >> a representation aspect, ergo out of the interface!

It works because 'a' is ASCII.
If you try it with a non-ASCII character, it all goes wrong.

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/


* Re: Ada and Unicode
  2021-04-19 13:24         ` J-P. Rosen
  2021-04-20 19:13           ` Randy Brukardt
@ 2022-04-03 18:04           ` Thomas
  2022-04-06 18:57             ` J-P. Rosen
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas @ 2022-04-03 18:04 UTC (permalink / raw)


In article <s5k0ai$bb5$1@dont-email.me>, "J-P. Rosen" <rosen@adalog.fr> 
wrote:

> On 19/04/2021 15:00, Luke A. Guest wrote:
> > They're different types and should be incompatible, because, well, they 
> > are. What does Ada have that allows for this that other languages 
> > doesn't? Oh yeah! Types!
> 
> They are not so different. For example, you may read the first line of a 
> file in a string, then discover that it starts with a BOM, and thus 
> decide it is UTF-8.

Could you give me an example of something that you can do now that you
could not do if UTF_8_String were private, please?
(To discover that it starts with a BOM, you must look at it.)


> 
> BTW, the very first version of this AI had different types, but the ARG 
> felt that it would just complicate the interface for the sake of abusive 
> "purity".

Could you explain "abusive purity", please?

I guess it is because of ASCII.
I guess a lot of developers use only ASCII in a lot of situations, and
they would find it annoying to need Ada.Strings.UTF_Encoding.Strings
every time.

But I think a simple explicit conversion is acceptable for a not fully
compatible type which requires some attention.


The best would be to be required to use ASCII_String as an
intermediate, but I don't know how it could be designed at the language
level:

UTF_8_Var := UTF_8_String (ASCII_String (Latin_1_Var));
Latin_1_Var := String (ASCII_String (UTF_8_Var));

and this would be forbidden:
UTF_8_Var := UTF_8_String (Latin_1_Var);

This would ensure that Constraint_Error is raised when there are some
non-ASCII characters.

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/


* Re: Ada and Unicode
  2021-04-20 19:06         ` Randy Brukardt
@ 2022-04-03 18:37           ` Thomas
  2022-04-04 23:52             ` Randy Brukardt
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas @ 2022-04-03 18:37 UTC (permalink / raw)


In article <s5n8nj$cec$1@franka.jacob-sparre.dk>,
 "Randy Brukardt" <randy@rrsoftware.com> wrote:

> "Luke A. Guest" <laguest@archeia.com> wrote in message 
> news:s5jute$1s08$1@gioia.aioe.org...
> >
> >
> > On 19/04/2021 13:52, Dmitry A. Kazakov wrote:
> >
> > > It is practical solution. Ada type system cannot express differently
> > > represented/constrained string/array/vector subtypes. Ignoring Latin-1 and
> > > using String as if it were an array of octets is the best available
> > > solution.
> > >
> >
> > They're different types and should be incompatible, because, well, they 
> > are. What does Ada have that allows for this that other languages doesn't? 
> > Oh yeah! Types!
> 
> If they're incompatible, you need an automatic way to convert between 
> representations, since these are all views of the same thing (an abstract 
> string type). You really don't want 35 versions of Open each taking a 
> different string type.

I don't need 35 versions of Open.
I need one version of Open with a Unicode string type (not Latin-1;
preferably UTF-8), which will use Ada.Strings.UTF_Encoding.Conversions
as far as needed, depending on the underlying API.


> 
> It's the fact that Ada can't do this that makes Unbounded_Strings unusable 
> (well, barely usable).

Knowing Ada, I find it acceptable.
I don't say the same about Ada.Strings.UTF_Encoding.UTF_8_String.

> Ada 202x fixes the literal problem at least, but we'd 
> have to completely abandon Unbounded_Strings and use a different library 
> design in order for for it to allow literals. And if you're going to do 
> that, you might as well do something about UTF-8 as well -- but now you're 
> going to need even more conversions. Yuck.

As I said to Vadim Godunko, I need to fill a string type with a UTF-8
literal, but I don't think this string type has to manage various
conversions.

From my point of view, each library has to accept one kind of string
type (preferably UTF-8 everywhere), and then this library has to make
the needed conversions depending on the underlying API. Not the user.


> 
> I think the only true solution here would be based on a proper abstract 
> Root_String type. But that wouldn't work in Ada, since it would be 
> incompatible with all of the existing code out there. Probably would have to 
> wait for a follow-on language.

Of course, it would be very nice to have a thicker language with a
garbage collector, only one String type which allows all that we need,
etc.

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/


* Re: Ada and Unicode
  2021-04-19 11:15   ` Simon Wright
  2021-04-19 11:50     ` Luke A. Guest
  2021-04-19 15:53     ` DrPi
@ 2022-04-03 19:20     ` Thomas
  2022-04-04  6:10       ` Vadim Godunko
  2022-04-04 14:33       ` Simon Wright
  2 siblings, 2 replies; 63+ messages in thread
From: Thomas @ 2022-04-03 19:20 UTC (permalink / raw)


In article <lyfszm5xv2.fsf@pushface.org>,
 Simon Wright <simon@pushface.org> wrote:

> But don't use unit names containing international characters, at any
> rate if you're (interested in compiling on) Windows or macOS:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

If I understand, Eric Botcazou is a GNU admin who decided to reject your bug?
I find him very "low portability thinking"!

It is the responsibility of compilers and other underlying tools to manage the various underlying OSes and filesystems,
not of the user to avoid those that the compiler devs find too bad!
(Or to use the right encoding. I heard that Windows uses UTF-16; do you know about it?)


Clearly, To_Lower takes Latin-1,
and this kind of problem would be easier to avoid if string types were stronger ...


After:

package Ada.Strings.UTF_Encoding
    ...
    type UTF_8_String is new String;
    ...
end Ada.Strings.UTF_Encoding;

I would also have made:

package Ada.Directories
    ...
    type File_Name_String is new Ada.Strings.UTF_Encoding.UTF_8_String;
    ...
end Ada.Directories;

with probably a validity check and a Dynamic_Predicate which allows "".

Then I would use File_Name_String in all of Ada.Directories and Ada.*_IO.

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/


* Re: Ada and Unicode
  2022-04-03 19:20     ` Thomas
@ 2022-04-04  6:10       ` Vadim Godunko
  2022-04-04 14:19         ` Simon Wright
  2023-03-30 23:35         ` Thomas
  2022-04-04 14:33       ` Simon Wright
  1 sibling, 2 replies; 63+ messages in thread
From: Vadim Godunko @ 2022-04-04  6:10 UTC (permalink / raw)


On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:
> 
> > But don't use unit names containing international characters, at any 
> > rate if you're (interested in compiling on) Windows or macOS: 
> > 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
> 
> and this kind of problems would be easier to avoid if string types were stronger ... 
> 

Your suggestion is unable to resolve this issue on Mac OS X. As with case sensitivity, a binary compare of two strings can't compare strings in different normalization forms. The right solution is to use the right type to represent any path, and even that doesn't resolve some issues, like relative paths and the change of rules at mount points.
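
A minimal sketch of the normalization problem: the precomposed and
decomposed spellings of "é" render identically but compare unequal:

with Ada.Text_IO;

procedure NFC_Vs_NFD is
   --  U+00E9: precomposed "é" (NFC).
   Precomposed : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#00E9#));

   --  U+0065 U+0301: "e" plus combining acute accent (NFD).
   Decomposed : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#0065#),
      Wide_Wide_Character'Val (16#0301#));
begin
   --  Prints FALSE: equal to a reader, unequal to a binary compare.
   Ada.Text_IO.Put_Line (Boolean'Image (Precomposed = Decomposed));
end NFC_Vs_NFD;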


* Re: Ada and Unicode
  2022-04-04  6:10       ` Vadim Godunko
@ 2022-04-04 14:19         ` Simon Wright
  2022-04-04 15:11           ` Simon Wright
  2022-04-05  7:59           ` Vadim Godunko
  2023-03-30 23:35         ` Thomas
  1 sibling, 2 replies; 63+ messages in thread
From: Simon Wright @ 2022-04-04 14:19 UTC (permalink / raw)


Vadim Godunko <vgodunko@gmail.com> writes:

> On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:
>> 
>> > But don't use unit names containing international characters, at
>> > any rate if you're (interested in compiling on) Windows or macOS:
>> > 
>> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
>> 
>> and this kind of problems would be easier to avoid if string types
>> were stronger ...
>> 
>
> Your suggestion is unable to resolve this issue on Mac OS X. Like case
> sensitivity, binary compare of two strings can't compare strings in
> different normalization forms. Right solution is to use right type to
> represent any paths, and even it doesn't resolve some issues, like
> relative paths and change of rules at mounting points.

I think that's a macOS problem that Apple aren't going to resolve* any
time soon! While banging my head against PR81114 recently, I found
(can't remember where) that (lower case a acute) and (lower case a,
combining acute) represent the same concept and it's up to
tools/operating systems etc to recognise that.

Emacs, too, has a problem: it doesn't recognise the 'combining' part of
(lower case a, combining acute), so what you see on your screen is "a'".

* I don't know how/whether clang addresses this.


* Re: Ada and Unicode
  2022-04-03 19:20     ` Thomas
  2022-04-04  6:10       ` Vadim Godunko
@ 2022-04-04 14:33       ` Simon Wright
  1 sibling, 0 replies; 63+ messages in thread
From: Simon Wright @ 2022-04-04 14:33 UTC (permalink / raw)


Thomas <fantome.forums.tDeContes@free.fr.invalid> writes:

> In article <lyfszm5xv2.fsf@pushface.org>,
>  Simon Wright <simon@pushface.org> wrote:
>
>> But don't use unit names containing international characters, at any
>> rate if you're (interested in compiling on) Windows or macOS:
>> 
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
>
> if i understand correctly, Eric Botcazou is a GNU admin who decided to
> reject your bug?  i find him very much in a "low portability" mindset!

To be fair, he only suspended it - you can tell I didn't want to press
very far.

We could remove the part where the filename is smashed to lower-case as
if it were ASCII[1][2][3] (OK, perhaps Latin-1?) if the machine is
Windows or (Apple if not on aarch64!!!), but that still leaves the
filesystem name issue. Windows might be OK (code pages???)

[1] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/adaint.c#L620
[2] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L812
[3] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L1490

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-04 14:19         ` Simon Wright
@ 2022-04-04 15:11           ` Simon Wright
  2022-04-05  7:59           ` Vadim Godunko
  1 sibling, 0 replies; 63+ messages in thread
From: Simon Wright @ 2022-04-04 15:11 UTC (permalink / raw)


Simon Wright <simon@pushface.org> writes:

> I think that's a macOS problem that Apple aren't going to resolve* any
> time soon! While banging my head against PR81114 recently, I found
> (can't remember where) that (lower case a acute) and (lower case a,
> combining acute) represent the same concept and it's up to
> tools/operating systems etc to recognise that.
[...]
> * I don't know how/whether clang addresses this.

It doesn't, so far as I can tell; has the exact same problem.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-03 18:37           ` Thomas
@ 2022-04-04 23:52             ` Randy Brukardt
  2023-03-31  3:06               ` Thomas
  0 siblings, 1 reply; 63+ messages in thread
From: Randy Brukardt @ 2022-04-04 23:52 UTC (permalink / raw)



"Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message 
news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
...
> as i said to Vadim Godunko, i need to fill a string type with a UTF-8
> literal. but i don't think this string type has to manage various 
> conversions.
>
> from my point of view, each library has to accept 1 kind of string type
> (preferably UTF-8 everywhere),
> and then, this library has to make needed conversions regarding the
> underlying API. not the user.

This certainly is a fine ivory tower solution, but it completely ignores two 
practicalities in the case of Ada:

(1) You need to replace almost all of the existing Ada language defined 
packages to make this work. Things that are deeply embedded in both 
implementations and programs (like Ada.Exceptions and Ada.Text_IO) would 
have to change substantially. The result would essentially be a different 
language, since the resulting libraries would not work with most existing 
programs. They'd have to have different names (since if you used the same 
names, you change the failures from compile-time to runtime -- or even 
undetected -- which would be completely against the spirit of Ada), which 
means that one would have to essentially start over learning and using the 
resulting language. Calling it Ada would be rather silly, since it would be 
practically incompatible (and it would make sense to use this point to 
eliminate a lot of the cruft from the Ada design).

(2) One needs to be able to read and write data given whatever encoding the 
project requires (that's often decided by outside forces, such as other 
hardware or software that the project needs to interoperate with). That 
means that completely hiding the encoding (or using a universal encoding) 
doesn't fully solve the problems faced by Ada programmers. At a minimum, you 
have to have a way to specify the encoding of files, streams, and hardware 
interfaces (this sort of thing is not provided by any common target OS, so 
it's not in any target API). That will greatly complicate the interface and 
implementation of the libraries.
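
For instance, GNAT today lets one specify the encoding of a text file
through the implementation-defined Form parameter; a minimal sketch
(the file name and text are illustrative, and the source must be
compiled with -gnatW8 for the literal):

--  write_utf_8_file.adb: "WCEM=8" selects UTF-8 as the on-disk
--  encoding while the program itself works in Wide_Wide_String.
with Ada.Wide_Wide_Text_IO;

procedure Write_UTF_8_File is
   use Ada.Wide_Wide_Text_IO;
   F : File_Type;
begin
   Create (F, Out_File, "out.txt", Form => "WCEM=8");
   Put_Line (F, "Grüße");  --  stored on disk as UTF-8 octets
   Close (F);
end Write_UTF_8_File;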

> ... of course, it would be very nice to have a thicker language with 
> a garbage collector ...

I doubt that you will ever see that in the Ada family, as analysis and 
therefore determinism is a very important property for the language. Ada has 
lots of mechanisms for managing storage without directly doing it yourself 
(by calling Unchecked_Deallocation), yet none of them use any garbage 
collection in a traditional sense. I could see more such mechanisms (an 
ownership option on the line of Rust could easily manage storage at the same 
time, since any object that could be orphaned could never be used again and 
thus should be reclaimed), but standard garbage collection is too 
non-deterministic for many of the uses Ada is put to.

                                              Randy.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-04 14:19         ` Simon Wright
  2022-04-04 15:11           ` Simon Wright
@ 2022-04-05  7:59           ` Vadim Godunko
  2022-04-08  9:01             ` Simon Wright
  1 sibling, 1 reply; 63+ messages in thread
From: Vadim Godunko @ 2022-04-05  7:59 UTC (permalink / raw)


On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote:
> I think that's a macOS problem that Apple aren't going to resolve* any 
> time soon! While banging my head against PR81114 recently, I found 
> (can't remember where) that (lower case a acute) and (lower case a, 
> combining acute) represent the same concept and it's up to 
> tools/operating systems etc to recognise that. 
> 
And will not. It is the application's responsibility to convert file names to NFD before passing them to the OS. Also, an application must compare paths only after conversion to NFD; that is important to handle more complicated cases where canonical reordering is applied.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-03 18:04           ` Thomas
@ 2022-04-06 18:57             ` J-P. Rosen
  2022-04-07  1:30               ` Randy Brukardt
  0 siblings, 1 reply; 63+ messages in thread
From: J-P. Rosen @ 2022-04-06 18:57 UTC (permalink / raw)


On 03/04/2022 at 21:04, Thomas wrote:
>> They are not so different. For example, you may read the first line of a
>> file in a string, then discover that it starts with a BOM, and thus
>> decide it is UTF-8.
> 
> could you give me an example of something that you can do now, and could
> not do if UTF_8_String was private, please?
> (to discover that it starts with a BOM, you must look at it.)
Just what I said above: since a BOM is not valid UTF-8 (otherwise, it 
could not be recognized).

>>
>> BTW, the very first version of this AI had different types, but the ARG
>> felt that it would just complicate the interface for the sake of abusive
>> "purity".
> 
> could you explain "abusive purity" please?
> 
It was felt that in practice, being too strict in separating the types 
would make things more difficult, without any practical gain. This has 
been discussed - you may not agree with the outcome, but it was not made 
out of pure laziness.

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52
https://www.adalog.fr

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-06 18:57             ` J-P. Rosen
@ 2022-04-07  1:30               ` Randy Brukardt
  2022-04-08  8:56                 ` Simon Wright
  0 siblings, 1 reply; 63+ messages in thread
From: Randy Brukardt @ 2022-04-07  1:30 UTC (permalink / raw)


"J-P. Rosen" <rosen@adalog.fr> wrote in message 
news:t2knpr$s26$1@dont-email.me...
...
> It was felt that in practice, being too strict in separating the types 
> would make things more difficult, without any practical gain. This has 
> been discussed - you may not agree with the outcome, but it was not made 
> out of pure lazyness

The problem with that, of course, is that it sends the wrong message 
vis-a-vis strong typing and interfaces. If we abandon it at the first sign 
of trouble, then we are saying that it isn't really that important.

In this particular case, the reason really came down to practicality: if you 
want to do anything string-like with a UTF-8 string, making it a separate 
type becomes painful. It wouldn't work with anything in Ada.Strings, 
Ada.Text_IO, or Ada.Directories, even though most of the operations are 
fine. And there was no political will to replace all of those things with 
versions to use with proper universal strings.

Moreover, if you really want to do that, you have to hide much of the array 
behavior of the Universal string. For instance, you can't allow willy-nilly 
slicing or replacement: cutting a character representation in half or 
setting an illegal representation has to be prohibited (operations that 
would turn a valid string into an invalid string should always raise an 
exception). That means you can't (directly) use built-in indexing and 
slicing -- those have to go through some sort of functions. So you do pretty 
much have to use a private type for universal strings (similar to 
Ada.Strings.Bounded would be best, I think).
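
A hypothetical spec sketch of that shape (all names here are
illustrative, not from any existing library); indexing goes through an
opaque position type, so a client can never split a character
representation:

with Ada.Strings.UTF_Encoding;
with Ada.Strings.Unbounded;

package Universal_Strings is

   type Universal_String is private;
   type Position is private;  --  an octet index under the covers

   function To_Universal
     (Item : Ada.Strings.UTF_Encoding.UTF_8_String)
      return Universal_String;
   --  would validate the encoding on the way in

   function First (Source : Universal_String) return Position;
   procedure Next (Source : Universal_String; Pos : in out Position);
   function Element
     (Source : Universal_String; Pos : Position)
      return Wide_Wide_Character;
   function Slice
     (Source : Universal_String; Low, High : Position)
      return Universal_String;
   --  cuts only at character boundaries, so validity is preserved

private

   type Universal_String is record
      Data : Ada.Strings.Unbounded.Unbounded_String;
      --  the UTF-8 octets; never exposed directly
   end record;

   type Position is record
      Octet : Positive := 1;
   end record;

end Universal_Strings;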

If you had an Ada-like language that used a universal UTF-8 string 
internally, you then would have a lot of old and mostly useless operations 
supported for array types (since things like slices are mainly useful for 
string operations). So such a language should simplify the core 
substantially by dropping many of those obsolete features (especially as 
little of the library would be directly compatible anyway). So one should 
end up with a new language that draws from Ada rather than something in Ada 
itself. (It would be great if that language could make strings with 
different capacities interoperable - a major annoyance with Ada. And 
modernizing access types, generalizing resolution, and the like also would 
be good improvements IMHO.)

                                               Randy.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-07  1:30               ` Randy Brukardt
@ 2022-04-08  8:56                 ` Simon Wright
  2022-04-08  9:26                   ` Dmitry A. Kazakov
  0 siblings, 1 reply; 63+ messages in thread
From: Simon Wright @ 2022-04-08  8:56 UTC (permalink / raw)


"Randy Brukardt" <randy@rrsoftware.com> writes:

> If you had an Ada-like language that used a universal UTF-8 string
> internally, you then would have a lot of old and mostly useless
> operations supported for array types (since things like slices are
> mainly useful for string operations).

Just off the top of my head, wouldn't it be better to use UTF32-encoded
Wide_Wide_Character internally? (you would still have trouble with
e.g. national flag emojis :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-05  7:59           ` Vadim Godunko
@ 2022-04-08  9:01             ` Simon Wright
  0 siblings, 0 replies; 63+ messages in thread
From: Simon Wright @ 2022-04-08  9:01 UTC (permalink / raw)


Vadim Godunko <vgodunko@gmail.com> writes:

> On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote:
>> I think that's a macOS problem that Apple aren't going to resolve* any 
>> time soon! While banging my head against PR81114 recently, I found 
>> (can't remember where) that (lower case a acute) and (lower case a, 
>> combining acute) represent the same concept and it's up to 
>> tools/operating systems etc to recognise that. 
>> 
> And will not. It is the application's responsibility to convert file
> names to NFD before passing them to the OS. Also, an application must
> compare paths only after conversion to NFD; that is important to handle
> more complicated cases where canonical reordering is applied.

Isn't the compiler a tool? gnatmake? gprbuild? (gnatmake handles ACATS
c250002 provided you tell the compiler that the fs is case-sensitive,
gprbuild doesn't even manage that)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-08  8:56                 ` Simon Wright
@ 2022-04-08  9:26                   ` Dmitry A. Kazakov
  2022-04-08 19:19                     ` Simon Wright
  0 siblings, 1 reply; 63+ messages in thread
From: Dmitry A. Kazakov @ 2022-04-08  9:26 UTC (permalink / raw)


On 2022-04-08 10:56, Simon Wright wrote:
> "Randy Brukardt" <randy@rrsoftware.com> writes:
> 
>> If you had an Ada-like language that used a universal UTF-8 string
>> internally, you then would have a lot of old and mostly useless
>> operations supported for array types (since things like slices are
>> mainly useful for string operations).
> 
> Just off the top of my head, wouldn't it be better to use UTF32-encoded
> Wide_Wide_Character internally?

Yep, that is exactly the problem, a confusion between interface and 
implementation.

Encoding /= interface, e.g. an interface of a string viewed as an array 
of characters. That interface is just the same for ASCII, Latin-1, EBCDIC, 
RADIX50, UTF-8, etc. strings. Why do you care what is inside?

Ada type system's inability to implement this interface is another 
issue. Usefulness of this interface is yet another. For immutable 
strings it is quite useful. For mutable strings it might appear too 
constrained, e.g. for packed encodings like UTF-8 and UTF-16.

Also this interface should have nothing to do with the interface of an 
UTF-8 string as an array of octets or the interface of an UTF-16LE 
string as an array of little endian words.

Since Ada cannot separate these interfaces, for practical purposes, 
Strings are arrays of octets considered as UTF-8 encoding. The rest goes 
into coding guidelines under the title "never ever do this."

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-08  9:26                   ` Dmitry A. Kazakov
@ 2022-04-08 19:19                     ` Simon Wright
  2022-04-08 19:45                       ` Dmitry A. Kazakov
  0 siblings, 1 reply; 63+ messages in thread
From: Simon Wright @ 2022-04-08 19:19 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On 2022-04-08 10:56, Simon Wright wrote:
>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>> 
>>> If you had an Ada-like language that used a universal UTF-8 string
>>> internally, you then would have a lot of old and mostly useless
>>> operations supported for array types (since things like slices are
>>> mainly useful for string operations).
>>
>> Just off the top of my head, wouldn't it be better to use
>> UTF32-encoded Wide_Wide_Character internally?
>
> Yep, that is exactly the problem, a confusion between interface
> and implementation.

Don't understand. My point was that *when you are implementing this* it
might be easier to deal with 32-bit characters/code points/whatever the
proper jargon is than with UTF-8.

> Encoding /= interface, e.g. an interface of a string viewed as an
> array of characters. That interface is just the same for ASCII, Latin-1,
> EBCDIC, RADIX50, UTF-8, etc. strings. Why do you care what is inside?

With a user's hat on, I don't. Implementers might have a different point
of view.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-08 19:19                     ` Simon Wright
@ 2022-04-08 19:45                       ` Dmitry A. Kazakov
  2022-04-09  4:05                         ` Randy Brukardt
  0 siblings, 1 reply; 63+ messages in thread
From: Dmitry A. Kazakov @ 2022-04-08 19:45 UTC (permalink / raw)


On 2022-04-08 21:19, Simon Wright wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
> 
>> On 2022-04-08 10:56, Simon Wright wrote:
>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>
>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>> internally, you then would have a lot of old and mostly useless
>>>> operations supported for array types (since things like slices are
>>>> mainly useful for string operations).
>>>
>>> Just off the top of my head, wouldn't it be better to use
>>> UTF32-encoded Wide_Wide_Character internally?
>>
>> Yep, that is exactly the problem, a confusion between interface
>> and implementation.
> 
> Don't understand. My point was that *when you are implementing this* it
> might be easier to deal with 32-bit characters/code points/whatever the
> proper jargon is than with UTF-8.

I think it would be more difficult, because you will have to convert 
from and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto 
interface and I/O standard. That covers 60-70% of the cases where you 
need a string. Most string operations, like search, comparison, and 
slicing, are isomorphic between code points and octets. So you would 
win nothing from keeping strings internally as arrays of code points.

The situation is comparable to Unbounded_Strings. The implementation is 
relatively simple, but the user must carry the burden of calling 
To_String and To_Unbounded_String all over the application and the 
processor must suffer the overhead of copying arrays here and there.

>> Encoding /= interface, e.g. an interface of a string viewed as an
>> array of characters. That interface just same for ASCII, Latin-1,
>> EBCDIC, RADIX50, UTF-8 etc strings. Why do you care what is inside?
> 
> With a user's hat on, I don't. Implementers might have a different point
> of view.

Sure, but in the Ada philosophy their opinion should carry less weight 
than, say, in C.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-08 19:45                       ` Dmitry A. Kazakov
@ 2022-04-09  4:05                         ` Randy Brukardt
  2022-04-09  7:43                           ` Simon Wright
  2022-04-09 10:27                           ` DrPi
  0 siblings, 2 replies; 63+ messages in thread
From: Randy Brukardt @ 2022-04-09  4:05 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:t2q3cb$bbt$1@gioia.aioe.org...
> On 2022-04-08 21:19, Simon Wright wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>
>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>>
>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>> internally, you then would have a lot of old and mostly useless
>>>>> operations supported for array types (since things like slices are
>>>>> mainly useful for string operations).
>>>>
>>>> Just off the top of my head, wouldn't it be better to use
>>>> UTF32-encoded Wide_Wide_Character internally?
>>>
>>> Yep, that is exactly the problem, a confusion between interface
>>> and implementation.
>>
>> Don't understand. My point was that *when you are implementing this* it
>> might be easier to deal with 32-bit characters/code points/whatever the
>> proper jargon is than with UTF-8.
>
> I think it would be more difficult, because you will have to convert from 
> and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto 
> interface and I/O standard. That covers 60-70% of the cases where you 
> need a string. Most string operations, like search, comparison, and 
> slicing, are isomorphic between code points and octets. So you would 
> win nothing from keeping strings internally as arrays of code points.

I basically agree with Dmitry here. The internal representation is an 
implementation detail, but it seems likely that you would want to store 
UTF-8 strings directly; they're almost always going to be half the size 
(even for languages using their own characters like Greek) and for most of 
us, they'll be just a bit more than a quarter the size. The number of bytes 
you copy around matters; the number of operations where code points are 
needed is fairly small.

The main problem with UTF-8 is representing the code point positions in a 
way that they (a) aren't abused and (b) don't cost too much to calculate. 
Just using character indexes is too expensive for UTF-8 and UTF-16 
representations, and using octet indexes is unsafe (since splitting a 
character representation is a possibility). I'd probably use an abstract 
character position type that was implemented with an octet index under the 
covers.
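
A short illustration of that unsafety (the string is "hé" as UTF-8
octets):

procedure Split_Character is
   S   : constant String :=
     'h' & Character'Val (16#C3#) & Character'Val (16#A9#);  --  "hé"
   Bad : constant String := S (1 .. 2);
   --  "h" plus a lone lead octet C3: no longer valid UTF-8, and
   --  nothing in the type system detects it.
begin
   pragma Assert (Bad'Length = 2);
end Split_Character;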

I think that would work OK as doing math on those is suspicious with a UTF 
representation. We're spoiled from using Latin-1 representations, of course, 
but generally one is interested in 5 characters, not 5 octets. And the 
number of octets in 5 characters depends on the string. So most of the sorts 
of operations that I tend to do (for instance from some code I was fixing 
earlier today):

     if Font'Length > 6 and then
        Font(2..6) = "Arial" then

This would be a bad idea if one is using any sort of universal 
representation -- you don't know how many octets are in the string literal so 
you can't assume a number in the test string. So the slice is dangerous 
(even though in this particular case it would be OK since the test string is 
all Ascii characters -- but I wouldn't want users to get in the habit of 
assuming such things).

[BTW, the above was a bad idea anyway, because it turns out that the 
function in the Ada library returned bounds that don't start at 1. So the 
slice was usually out of range -- which is why I was looking at the code. 
Another thing that we could do without. Slices are evil, since they *seem* 
to be the right solution, yet rarely are in practice without a lot of 
hoops.]
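
A self-contained sketch of that bounds trap (the function and names are
made up for illustration):

with Ada.Text_IO;

procedure Slice_Trap is

   function Font_Name return String is
      Buffer : constant String (11 .. 20) := "Arial Bold";
   begin
      return Buffer;  --  the result keeps the bounds 11 .. 20
   end Font_Name;

   Font : constant String := Font_Name;

begin
   --  Font (2 .. 6) would raise Constraint_Error, since Font'First = 11;
   --  indexing relative to 'First avoids the trap:
   if Font'Length > 6
     and then Font (Font'First .. Font'First + 4) = "Arial"
   then
      Ada.Text_IO.Put_Line ("matched");
   end if;
end Slice_Trap;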

> The situation is comparable to Unbounded_Strings. The implementation is 
> relatively simple, but the user must carry the burden of calling To_String 
> and To_Unbounded_String all over the application and the processor must 
> suffer the overhead of copying arrays here and there.

Yes, but that happens because Ada doesn't really have a string abstraction, 
so when you try to build one, you can't fully do the job. One presumes that 
a new language with a universal UTF-8 string wouldn't have that problem. (As 
previously noted, I don't see much point in trying to patch up Ada with a 
bunch of UTF-8 string packages; you would need an entire new set of 
Ada.Strings libraries and I/O libraries, and then you'd have all of the old 
stuff messing up resolution, using the best names, and confusing everything. 
A cleaner slate is needed.)

                                   Randy.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-09  4:05                         ` Randy Brukardt
@ 2022-04-09  7:43                           ` Simon Wright
  2022-04-09 10:27                           ` DrPi
  1 sibling, 0 replies; 63+ messages in thread
From: Simon Wright @ 2022-04-09  7:43 UTC (permalink / raw)


"Randy Brukardt" <randy@rrsoftware.com> writes:

> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
> news:t2q3cb$bbt$1@gioia.aioe.org...
>> On 2022-04-08 21:19, Simon Wright wrote:
>>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>>
>>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>>>
>>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>>> internally, you then would have a lot of old and mostly useless
>>>>>> operations supported for array types (since things like slices are
>>>>>> mainly useful for string operations).
>>>>>
>>>>> Just off the top of my head, wouldn't it be better to use
>>>>> UTF32-encoded Wide_Wide_Character internally?
>>>>
>>>> Yep, that is exactly the problem, a confusion between interface
>>>> and implementation.
>>>
>>> Don't understand. My point was that *when you are implementing this* it
>>> might be easier to deal with 32-bit characters/code points/whatever the
>>> proper jargon is than with UTF-8.
>>
>> I think it would be more difficult, because you will have to convert from 
>> and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto 
>> interface and I/O standard. That covers 60-70% of the cases where you 
>> need a string. Most string operations, like search, comparison, and 
>> slicing, are isomorphic between code points and octets. So you would 
>> win nothing from keeping strings internally as arrays of code points.
>
> I basically agree with Dmitry here. The internal representation is an 
> implementation detail, but it seems likely that you would want to store 
> UTF-8 strings directly; they're almost always going to be half the size 
> (even for languages using their own characters like Greek) and for most of 
> us, they'll be just a bit more than a quarter the size. The number of bytes 
> you copy around matters; the number of operations where code points are 
> needed is fairly small.

Well, I don't have any skin in this game, so I'll shut up at this point.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-09  4:05                         ` Randy Brukardt
  2022-04-09  7:43                           ` Simon Wright
@ 2022-04-09 10:27                           ` DrPi
  2022-04-09 16:46                             ` Dennis Lee Bieber
  2022-04-10  5:58                             ` Vadim Godunko
  1 sibling, 2 replies; 63+ messages in thread
From: DrPi @ 2022-04-09 10:27 UTC (permalink / raw)



On 09/04/2022 at 06:05, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
> news:t2q3cb$bbt$1@gioia.aioe.org...
>> On 2022-04-08 21:19, Simon Wright wrote:
>>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>>
>>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>>>
>>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>>> internally, you then would have a lot of old and mostly useless
>>>>>> operations supported for array types (since things like slices are
>>>>>> mainly useful for string operations).
>>>>>
>>>>> Just off the top of my head, wouldn't it be better to use
>>>>> UTF32-encoded Wide_Wide_Character internally?
>>>>
>>>> Yep, that is exactly the problem, a confusion between interface
>>>> and implementation.
>>>
>>> Don't understand. My point was that *when you are implementing this* it
>>> might be easier to deal with 32-bit characters/code points/whatever the
>>> proper jargon is than with UTF-8.
>>
>> I think it would be more difficult, because you will have to convert from
>> and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto
>> interface and I/O standard. That covers 60-70% of the cases where you
>> need a string. Most string operations, like search, comparison, and
>> slicing, are isomorphic between code points and octets. So you would
>> win nothing from keeping strings internally as arrays of code points.
> 
> I basically agree with Dmitry here. The internal representation is an
> implementation detail, but it seems likely that you would want to store
> UTF-8 strings directly; they're almost always going to be half the size
> (even for languages using their own characters like Greek) and for most of
> us, they'll be just a bit more than a quarter the size. The number of bytes
> you copy around matters; the number of operations where code points are
> needed is fairly small.
> 
> The main problem with UTF-8 is representing the code point positions in a
> way that they (a) aren't abused and (b) don't cost too much to calculate.
> Just using character indexes is too expensive for UTF-8 and UTF-16
> representations, and using octet indexes is unsafe (since splitting a
> character representation is a possibility). I'd probably use an abstract
> character position type that was implemented with an octet index under the
> covers.
> 
> I think that would work OK as doing math on those is suspicious with a UTF
> representation. We're spoiled from using Latin-1 representations, of course,
> but generally one is interested in 5 characters, not 5 octets. And the
> number of octets in 5 characters depends on the string. So most of the sorts
> of operations that I tend to do (for instance from some code I was fixing
> earlier today):
> 
>       if Font'Length > 6 and then
>          Font(2..6) = "Arial" then
> 
> This would be a bad idea if one is using any sort of universal
> representation -- you don't know how many octets are in the string literal so
> you can't assume a number in the test string. So the slice is dangerous
> (even though in this particular case it would be OK since the test string is
> all Ascii characters -- but I wouldn't want users to get in the habit of
> assuming such things).
> 
> [BTW, the above was a bad idea anyway, because it turns out that the
> function in the Ada library returned bounds that don't start at 1. So the
> slice was usually out of range -- which is why I was looking at the code.
> Another thing that we could do without. Slices are evil, since they *seem*
> to be the right solution, yet rarely are in practice without a lot of
> hoops.]
> 
>> The situation is comparable to Unbounded_Strings. The implementation is
>> relatively simple, but the user must carry the burden of calling To_String
>> and To_Unbounded_String all over the application and the processor must
>> suffer the overhead of copying arrays here and there.
> 
> Yes, but that happens because Ada doesn't really have a string abstraction,
> so when you try to build one, you can't fully do the job. One presumes that
> a new language with a universal UTF-8 string wouldn't have that problem. (As
> previously noted, I don't see much point in trying to patch up Ada with a
> bunch of UTF-8 string packages; you would need an entire new set of
> Ada.Strings libraries and I/O libraries, and then you'd have all of the old
> stuff messing up resolution, using the best names, and confusing everything.
> A cleaner slate is needed.)
> 
>                                     Randy.
> 
> 

In Python 2, there is the same kind of problem. A string is a byte 
array. It is the programmer's responsibility to encode/decode to/from 
UTF-8/Latin-1/... and to manage everything correctly. Literal strings can 
be considered as encoded or decoded depending on the notation ("" or u"").

In Python 3, a string is a character (glyph?) array. The internal 
representation is hidden from the programmer.
UTF-8/Latin-1/... encoded "strings" are of type bytes (a byte array).
Writing/reading raw data to/from a file is done with the bytes type.
When writing/reading to/from a file in text mode, you have to specify 
the encoding to use. The encoding/decoding is then managed internally.
As a general rule, all "external communications" are done with bytes 
(byte arrays). It is the programmer's responsibility to encode/decode 
where needed to convert from/to strings.
Source files (.py) are considered to be UTF-8 encoded by default, but 
one can declare the actual encoding at the top of the file in a special 
comment tag. When a badly encoded character is found, an exception is 
raised at parsing time. So literal strings are real strings, not bytes.

I think the Python 3 way of doing things is much more understandable and 
really usable.

On the Ada side, I've still not understood how to correctly deal with 
all this stuff.


Note: in Python 3, the bytes type is not reserved for encoded "strings". 
It is a versatile type for what its name says: a byte array.
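
PS: the closest analogue I can see in standard Ada is Wide_Wide_String
as the "decoded" type and Ada.Strings.UTF_Encoding.UTF_8_String as the
"encoded" one; a sketch (compile with -gnatW8 for the literal), though
I'm not sure it's the intended way:

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use  Ada.Strings.UTF_Encoding;

procedure Encode_Decode is
   Decoded : constant Wide_Wide_String := "Привет";  --  like Python's str
   Encoded : constant UTF_8_String :=
     Wide_Wide_Strings.Encode (Decoded);             --  like Python's bytes
   Back    : constant Wide_Wide_String :=
     Wide_Wide_Strings.Decode (Encoded);
begin
   pragma Assert (Back = Decoded);  --  a lossless round trip
end Encode_Decode;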

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-09 10:27                           ` DrPi
@ 2022-04-09 16:46                             ` Dennis Lee Bieber
  2022-04-09 18:59                               ` DrPi
  2022-04-10  5:58                             ` Vadim Godunko
  1 sibling, 1 reply; 63+ messages in thread
From: Dennis Lee Bieber @ 2022-04-09 16:46 UTC (permalink / raw)


On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the
following:

>
>In Python 3, a string is a character (glyph?) array. The internal 
>representation is hidden from the programmer.

	<SNIP>
>
>On the Ada side, I've still not understood how to correctly deal with 
>all this stuff.

	One thing to take into account is that Python strings are immutable.
Changing the contents of a string requires constructing a new string from
parts that incorporate the change.

	That allows for the second aspect -- even if not visible to a
programmer, Python (3) strings are not a fixed representation: If all
characters in the string fit in the 8-bit UTF range, that string is stored
using one byte per character. If any character uses a 16-bit UTF
representation, the entire string is stored as 16-bit characters (and
similar for 32-bit UTF points). Thus, indexing into the string is still
fast -- just needing to scale the index by the character width of the
entire string.

	


-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
	wlfraed@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-09 16:46                             ` Dennis Lee Bieber
@ 2022-04-09 18:59                               ` DrPi
  0 siblings, 0 replies; 63+ messages in thread
From: DrPi @ 2022-04-09 18:59 UTC (permalink / raw)


On 09/04/2022 at 18:46, Dennis Lee Bieber wrote:
> On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the
> following:
> 
>>
>> In Python 3, a string is a character (glyph?) array. The internal
>> representation is hidden from the programmer.
> 
> 	<SNIP>
>>
>> On the Ada side, I've still not understood how to correctly deal with
>> all this stuff.
> 
> 	One thing to take into account is that Python strings are immutable.
> Changing the contents of a string requires constructing a new string from
> parts that incorporate the change.
> 

Right. I forgot to mention it.

> 	That allows for the second aspect -- even if not visible to a
> programmer, Python (3) strings are not a fixed representation: If all
> characters in the string fit in the 8-bit UTF range, that string is stored
> using one byte per character. If any character uses a 16-bit UTF
> representation, the entire string is stored as 16-bit characters (and
> similar for 32-bit UTF points). Thus, indexing into the string is still
> fast -- just needing to scale the index by the character width of the
> entire string.
> 

Thanks for clarifying.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-09 10:27                           ` DrPi
  2022-04-09 16:46                             ` Dennis Lee Bieber
@ 2022-04-10  5:58                             ` Vadim Godunko
  2022-04-10 18:59                               ` DrPi
  2022-04-12  6:13                               ` Randy Brukardt
  1 sibling, 2 replies; 63+ messages in thread
From: Vadim Godunko @ 2022-04-10  5:58 UTC (permalink / raw)


On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:
> 
> On the Ada side, I've still not understood how to correctly deal with 
> all this stuff. 
> 
Take a look at https://github.com/AdaCore/VSS 

The ideas behind this library are close to the type-separation ideas of Python 3. A string is a Virtual_String, a byte sequence is a Stream_Element_Vector. Need to convert a byte stream to a string or back? Use Virtual_String_Encoder/Virtual_String_Decoder.

I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; cleaner types and APIs are a requirement now. The only case where the old character/string types really add value is low-resource embedded systems; in other cases their use generates a lot of hidden issues, which are very hard to detect.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-10  5:58                             ` Vadim Godunko
@ 2022-04-10 18:59                               ` DrPi
  2022-04-12  6:13                               ` Randy Brukardt
  1 sibling, 0 replies; 63+ messages in thread
From: DrPi @ 2022-04-10 18:59 UTC (permalink / raw)


On 10/04/2022 at 07:58, Vadim Godunko wrote:
> On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:
>>
>> On the Ada side, I've still not understood how to correctly deal with
>> all this stuff.
>>
> Take a look at https://github.com/AdaCore/VSS
> 
> The ideas behind this library are close to the type-separation ideas of Python 3. A string is a Virtual_String, a byte sequence is a Stream_Element_Vector. Need to convert a byte stream to a string or back? Use Virtual_String_Encoder/Virtual_String_Decoder.
> 
> I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; cleaner types and APIs are a requirement now. The only case where the old character/string types really add value is low-resource embedded systems; in other cases their use generates a lot of hidden issues, which are very hard to detect.

That's an interesting solution.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-10  5:58                             ` Vadim Godunko
  2022-04-10 18:59                               ` DrPi
@ 2022-04-12  6:13                               ` Randy Brukardt
  1 sibling, 0 replies; 63+ messages in thread
From: Randy Brukardt @ 2022-04-12  6:13 UTC (permalink / raw)


"Vadim Godunko" <vgodunko@gmail.com> wrote in message 
news:3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com...
...
>I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and
>programming languages; cleaner types and APIs are a requirement now.

...which essentially means Ada is obsolete in your view, as String in 
particular is way too embedded in the definition and the language-defined 
units to use anything else. You'd end up with a mass of conversions to get 
anything done (the main problem with Ada.Strings.Unbounded).

Or I suppose you could replace pretty much the entire library with a new 
one. But now you have two of everything to confuse newcomers and you still 
have a mass of old nonsense weighing down the language and complicating 
implementations.

>The only case where the old character/string types really add value is 
>low-resource embedded systems; ...

...which of course is at least 50% of the use of Ada, and probably closer to 
90% of the money. Any solution for Ada has to continue to meet the needs of 
embedded programmers. For instance, it would need to support fixed, bounded, 
and unbounded versions (solely having unbounded strings would not work for 
many applications, and indeed not just embedded systems need to restrict 
those -- any long-running server has to control dynamic allocation).

>...in other cases their use generates a lot of hidden issues, which are very 
>hard to detect.

At least some of which occur because a string is not an array, and the 
forcible mapping to them never worked very well. The Z-80 Pascals that we 
used to implement the very earliest versions of Ada had more functional 
strings than Ada does (by being bounded and using a library for most 
operations) - they would have been way easier to extend (as the Python ones 
were, as an example).

                        Randy.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2021-04-19  9:08 ` Stephen Leake
                     ` (2 preceding siblings ...)
  2021-04-19 16:14   ` DrPi
@ 2022-04-16  2:32   ` Thomas
  3 siblings, 0 replies; 63+ messages in thread
From: Thomas @ 2022-04-16  2:32 UTC (permalink / raw)


In article <86mttuk5f0.fsf@stephe-leake.org>,
 Stephen Leake <stephen_leake@stephe-leake.org> wrote:

> DrPi <314@drpi.fr> writes:
> 
> > Any way to use source code encoded in UTF-8 ?


> from the gnat user guide, 4.3.1 Alphabetical List of All Switches:
> 
> `-gnati`c''
>      Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w).  For details
>      of the possible selections for `c', see *note Character Set
>      Control: 4e.
> 
> This applies to identifiers in the source code
> 
> `-gnatW`e''
>      Wide character encoding method (`e'=n/h/u/s/e/8).
> 
> This applies to string and character literals.


afaik, -gnati is deactivated when -gnatW is not n or h (from memory)

so you can't ask both to check that identifiers are in ASCII and to have 
literals in UTF-8.


(if it's resolved in newer versions, that's good news :-) )

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-04  6:10       ` Vadim Godunko
  2022-04-04 14:19         ` Simon Wright
@ 2023-03-30 23:35         ` Thomas
  1 sibling, 0 replies; 63+ messages in thread
From: Thomas @ 2023-03-30 23:35 UTC (permalink / raw)


sorry for the delay.


In article <48309745-aa2a-47bd-a4f9-6daa843e0771n@googlegroups.com>,
 Vadim Godunko <vgodunko@gmail.com> wrote:

> On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:
> > 
> > > But don't use unit names containing international characters, at any 
> > > rate if you're (interested in compiling on) Windows or macOS: 
> > > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
> > 
> > and this kind of problems would be easier to avoid if string types were 
> > stronger ... 
> > 
> 
> Your suggestion is unable to resolve this issue on Mac OS X.

i said "easier" not "easy".

don't forget that Unicode has 2 levels:
- octets <-> code points
- code points <-> characters/glyphs

and you can't expect the upper to work if the lower doesn't.


> As with case 
> sensitivity, a binary compare of two strings can't match strings that are 
> in different normalization forms. The right solution is to use the right 
> type to represent any path,

what would be the "right type", according to you?


In fact, here the first question to ask is:
what's the expected encoding for Ada.Text_IO.Open.Name?
- is it Latin-1 because the type is String not UTF_8_String?
- is it undefined because it depends on the underlying FS?

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-04 23:52             ` Randy Brukardt
@ 2023-03-31  3:06               ` Thomas
  2023-04-01 10:18                 ` Randy Brukardt
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas @ 2023-03-31  3:06 UTC (permalink / raw)


In article <t2g0c1$eou$1@dont-email.me>,
 "Randy Brukardt" <randy@rrsoftware.com> wrote:

> "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message 
> news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
> ...
> > as i said to Vadim Godunko, i need to fill a string type with a UTF-8
> > literal. but i don't think this string type has to manage various 
> > conversions.
> >
> > from my point of view, each library has to accept 1 kind of string type
> > (preferably UTF-8 everywhere),
> > and then, this library has to make needed conversions regarding the
> > underlying API. not the user.
> 
> This certainly is a fine ivory tower solution,

I like to think from an ivory tower, 
and then look at the reality to see what's possible to do or not. :-)



> but it completely ignores two 
> practicalities in the case of Ada:
> 
> (1) You need to replace almost all of the existing Ada language defined 
> packages to make this work. Things that are deeply embedded in both 
> implementations and programs (like Ada.Exceptions and Ada.Text_IO) would 
> have to change substantially. The result would essentially be a different 
> language, since the resulting libraries would not work with most existing 
> programs.

- in Ada, of course we can't delete what's existing, and there are many 
packages which are already in 3 versions (S/WS/WWS).
imho, it would be consistent to make a 4th version of them for a new 
UTF_8_String type.

- in a new language close to Ada, it would not necessarily be a good 
idea to remove some of them, depending on industrial needs, to keep them 
with us.

> They'd have to have different names (since if you used the same 
> names, you change the failures from compile-time to runtime -- or even 
> undetected -- which would be completely against the spirit of Ada), which 
> means that one would have to essentially start over learning and using the 
> resulting language.

i think i don't understand.

> (and it would make sense to use this point to 
> eliminate a lot of the cruft from the Ada design).

could you give an example of cruft from the Ada design, please? :-)


> 
> (2) One needs to be able to read and write data given whatever encoding the 
> project requires (that's often decided by outside forces, such as other 
> hardware or software that the project needs to interoperate with).

> At a minimum, you 
> have to have a way to specify the encoding of files, streams, and hardware 
> interfaces

> That will greatly complicate the interface and 
> implementation of the libraries.

i don't think so.
it's a matter of interfacing libraries, for the purpose of communicating 
with the outside (neither of internal libraries nor of the choice of the 
internal type for the implementation).

Ada.Text_IO.Open.Form already allows (a part of?) this (on the content 
of the files, not on their name), see ARM A.10.2 (6-8).
(did i write the reference to ARM correctly?)



> 
> > ... of course, it would be very nice to have a thicker language with 
> > a garbage collector ...
> 
> I doubt that you will ever see that in the Ada family,

> as analysis and 
> therefore determinism is a very important property for the language.

I completely agree :-)

> Ada has 
> lots of mechanisms for managing storage without directly doing it yourself 
> (by calling Unchecked_Deallocation), yet none of them use any garbage 
> collection in a traditional sense.

sorry, i meant "garbage collector" in a generic sense, not in a 
traditional sense.
that is, as Ada users we could program with pointers and pools, without 
memory leaks or calling Unchecked_Deallocation.

for example Ada.Containers.Indefinite_Holders.

i already wrote one for constrained limited types.
do you know if it's possible to do it for unconstrained limited types, 
like the class of a limited tagged type?

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2023-03-31  3:06               ` Thomas
@ 2023-04-01 10:18                 ` Randy Brukardt
  0 siblings, 0 replies; 63+ messages in thread
From: Randy Brukardt @ 2023-04-01 10:18 UTC (permalink / raw)


I'm not going to answer this point-by-point, as it would take very much too 
long, and there is a similar thread going on the ARG's Github (which needs 
my attention more than comp.lang.ada).

But my opinion is that Ada got strings completely wrong, and the best thing 
to do with them is to completely nuke them and start over. But one cannot do 
that in the context of Ada, one would have to at least leave a way to use the 
old mechanisms for compatibility with older code. That would leave a 
hodge-podge of mechanisms that would make Ada very much harder (rather than 
easier) to use.

As far as the cruft goes, I wrote up a 20+ page document on that during the 
pandemic, but I could never interest anyone knowledgeable to review it, and 
I don't plan to make it available without that. Most of the things are 
caused by interactions -- mostly because of too much generality. And of 
course there are features that Ada would be better off without (like 
anonymous access types).

                          Randy.

"Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message 
news:64264e2f$0$25952$426a74cc@news.free.fr...
> In article <t2g0c1$eou$1@dont-email.me>,
> "Randy Brukardt" <randy@rrsoftware.com> wrote:
>
>> "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message
>> news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
>> ...
>> > as i said to Vadim Godunko, i need to fill a string type with a UTF-8
>> > literal. but i don't think this string type has to manage various
>> > conversions.
>> >
>> > from my point of view, each library has to accept 1 kind of string type
>> > (preferably UTF-8 everywhere),
>> > and then, this library has to make needed conversions regarding the
>> > underlying API. not the user.
>>
>> This certainly is a fine ivory tower solution,
>
> I like to think from an ivory tower,
> and then look at the reality to see what's possible to do or not. :-)
>
>
>
>> but it completely ignores two
>> practicalities in the case of Ada:
>>
>> (1) You need to replace almost all of the existing Ada language defined
>> packages to make this work. Things that are deeply embedded in both
>> implementations and programs (like Ada.Exceptions and Ada.Text_IO) would
>> have to change substantially. The result would essentially be a different
>> language, since the resulting libraries would not work with most existing
>> programs.
>
> - in Ada, of course we can't delete what's existing, and there are many
> packages which are already in 3 versions (S/WS/WWS).
> imho, it would be consistent to make a 4th version of them for a new
> UTF_8_String type.
>
> - in a new language close to Ada, it would not necessarily be a good
> idea to remove some of them, depending on industrial needs, to keep them
> with us.
>
>> They'd have to have different names (since if you used the same
>> names, you change the failures from compile-time to runtime -- or even
>> undetected -- which would be completely against the spirit of Ada), which
>> means that one would have to essentially start over learning and using 
>> the
>> resulting language.
>
> i think i don't understand.
>
>> (and it would make sense to use this point to
>> eliminate a lot of the cruft from the Ada design).
>
> could you give an example of cruft from the Ada design, please? :-)
>
>
>>
>> (2) One needs to be able to read and write data given whatever encoding 
>> the
>> project requires (that's often decided by outside forces, such as other
>> hardware or software that the project needs to interoperate with).
>
>> At a minimum, you
>> have to have a way to specify the encoding of files, streams, and 
>> hardware
>> interfaces
>
>> That will greatly complicate the interface and
>> implementation of the libraries.
>
> i don't think so.
> it's a matter of interfacing libraries, for the purpose of communicating
> with the outside (neither of internal libraries nor of the choice of the
> internal type for the implementation).
>
> Ada.Text_IO.Open.Form already allows (a part of?) this (on the content
> of the files, not on their name), see ARM A.10.2 (6-8).
> (did i write the reference to ARM correctly?)
>
>
>
>>
>> > ... of course, it would be very nice to have a thicker language 
>> > with
>> > a garbage collector ...
>>
>> I doubt that you will ever see that in the Ada family,
>
>> as analysis and
>> therefore determinism is a very important property for the language.
>
> I completely agree :-)
>
>> Ada has
>> lots of mechanisms for managing storage without directly doing it 
>> yourself
>> (by calling Unchecked_Deallocation), yet none of them use any garbage
>> collection in a traditional sense.
>
> sorry, i meant "garbage collector" in a generic sense, not in a
> traditional sense.
> that is, as Ada users we could program with pointers and pools, without
> memory leaks or calling Unchecked_Deallocation.
>
> for example Ada.Containers.Indefinite_Holders.
>
> i already wrote one for constrained limited types.
> do you know if it's possible to do it for unconstrained limited types,
> like the class of a limited tagged type?
>
> -- 
> RAPID maintainer
> http://savannah.nongnu.org/projects/rapid/ 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Ada and Unicode
  2022-04-03 16:51   ` Thomas
@ 2023-04-04  0:02     ` Thomas
  0 siblings, 0 replies; 63+ messages in thread
From: Thomas @ 2023-04-04  0:02 UTC (permalink / raw)


In article 
<fantome.forums.tDeContes-079FD6.18515603042022@news.free.fr>,
 Thomas <fantome.forums.tDeContes@free.fr.invalid> wrote:

> In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>,
>  Vadim Godunko <vgodunko@gmail.com> wrote:
> 
> > On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:
> 
> > > What's the way to manage Unicode correctly ? 


> > Ada doesn't have good Unicode support. :( So, you need to find suitable set 
> > of "workarounds".
> > 
> > There are few different aspects of Unicode support need to be considered:
> > 
> > 1. Representation of string literals. If you want to use non-ASCII 
> > characters 
> > in source code, you need to use -gnatW8 switch and it will require use of 
> > Wide_Wide_String everywhere.
> > 2. Internal representation during application execution. You are forced to 
> > use Wide_Wide_String at previous step, so it will be UCS4/UTF32.
> 
> > It is hard to say that it is reasonable set of features for modern world.
> 
> I don't think Ada would be lacking that much, for having good UTF-8 
> support.
> 
> the cardinal point is to be able to fill a 
> Ada.Strings.UTF_Encoding.UTF_8_String with a literal.
> (once you got it, when you'll try to fill a Standard.String with a 
> non-Latin-1 character, it'll make an error, i think it's fine :-) )
> 
> does Ada 202x allow it ?


hi !

I think I found a quite nice solution!
(reading <t3lj44$fh5$1@dont-email.me> again)
(not tested yet)


it's not perfect according to the rules of the art,
but it is:

- Ada 2012 compatible
- better than writing UTF-8 Ada code and then telling gnat it is Latin-1
  (in this way it would take UTF_8_String for what it is:
  an array of octets, but it would not detect an invalid UTF-8 string,
  and if someone says it's really UTF-8, all goes wrong)
- better than being limited to ASCII in string literals
- never need to explicitly declare Wide_Wide_String:
  it's always implicit, for a very short time,
  and AFAIK eligible for optimization



with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package UTF_Encoding is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  a renaming of Encode would be illegal here (Encode has a second,
   --  defaulted Output_BOM parameter), so wrap it instead:
   function "+" (A : in Wide_Wide_String) return UTF_8_String
   is (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (A));

end UTF_Encoding;


then we can do:


with UTF_Encoding;

package User is

   use UTF_Encoding;

   My_String : UTF_8_String := + "Greek characters + smileys";

end User;


if you want to avoid "use UTF_Encoding;",
i think "use type UTF_Encoding.UTF_8_String;" doesn't work,
but this should work:


with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package UTF_Encoding is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   type Literals_For_UTF_8_String is new Wide_Wide_String;

   --  again wrapped, with a conversion back to Wide_Wide_String:
   function "+" (A : in Literals_For_UTF_8_String) return UTF_8_String
   is (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
         (Wide_Wide_String (A)));

end UTF_Encoding;


with UTF_Encoding;

package User is

   use type UTF_Encoding.Literals_For_UTF_8_String;

   My_String : UTF_Encoding.UTF_8_String
               := + "Greek characters + smileys";

end User;



what do you think about that? good idea or not? :-)

-- 
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread

Thread overview: 63+ messages
2021-04-17 22:03 Ada and Unicode DrPi
2021-04-18  0:02 ` Luke A. Guest
2021-04-19  9:09   ` DrPi
2021-04-19  8:29 ` Maxim Reznik
2021-04-19  9:28   ` DrPi
2021-04-19 13:50     ` Maxim Reznik
2021-04-19 15:51       ` DrPi
2021-04-19 11:15   ` Simon Wright
2021-04-19 11:50     ` Luke A. Guest
2021-04-19 15:53     ` DrPi
2022-04-03 19:20     ` Thomas
2022-04-04  6:10       ` Vadim Godunko
2022-04-04 14:19         ` Simon Wright
2022-04-04 15:11           ` Simon Wright
2022-04-05  7:59           ` Vadim Godunko
2022-04-08  9:01             ` Simon Wright
2023-03-30 23:35         ` Thomas
2022-04-04 14:33       ` Simon Wright
2021-04-19  9:08 ` Stephen Leake
2021-04-19  9:34   ` Dmitry A. Kazakov
2021-04-19 11:56   ` Luke A. Guest
2021-04-19 12:13     ` Luke A. Guest
2021-04-19 15:48       ` DrPi
2021-04-19 12:52     ` Dmitry A. Kazakov
2021-04-19 13:00       ` Luke A. Guest
2021-04-19 13:10         ` Dmitry A. Kazakov
2021-04-19 13:15           ` Luke A. Guest
2021-04-19 13:31             ` Dmitry A. Kazakov
2022-04-03 17:24               ` Thomas
2021-04-19 13:24         ` J-P. Rosen
2021-04-20 19:13           ` Randy Brukardt
2022-04-03 18:04           ` Thomas
2022-04-06 18:57             ` J-P. Rosen
2022-04-07  1:30               ` Randy Brukardt
2022-04-08  8:56                 ` Simon Wright
2022-04-08  9:26                   ` Dmitry A. Kazakov
2022-04-08 19:19                     ` Simon Wright
2022-04-08 19:45                       ` Dmitry A. Kazakov
2022-04-09  4:05                         ` Randy Brukardt
2022-04-09  7:43                           ` Simon Wright
2022-04-09 10:27                           ` DrPi
2022-04-09 16:46                             ` Dennis Lee Bieber
2022-04-09 18:59                               ` DrPi
2022-04-10  5:58                             ` Vadim Godunko
2022-04-10 18:59                               ` DrPi
2022-04-12  6:13                               ` Randy Brukardt
2021-04-19 16:07         ` DrPi
2021-04-20 19:06         ` Randy Brukardt
2022-04-03 18:37           ` Thomas
2022-04-04 23:52             ` Randy Brukardt
2023-03-31  3:06               ` Thomas
2023-04-01 10:18                 ` Randy Brukardt
2021-04-19 16:14   ` DrPi
2021-04-19 17:12     ` Björn Lundin
2021-04-19 19:44       ` DrPi
2022-04-16  2:32   ` Thomas
2021-04-19 13:18 ` Vadim Godunko
2022-04-03 16:51   ` Thomas
2023-04-04  0:02     ` Thomas
2021-04-19 22:40 ` Shark8
2021-04-20 15:05   ` Simon Wright
2021-04-20 19:17     ` Randy Brukardt
2021-04-20 20:04       ` Simon Wright
