comp.lang.ada
* Strange crash on custom iterator
@ 2018-06-30 10:48 Lucretia
  2018-06-30 11:32 ` Simon Wright
  0 siblings, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-06-30 10:48 UTC (permalink / raw)


I finally got around to getting back to my iterator and on a first test implementation, i.e. to just iterate over each element of the array, the thing crashes in the Element function when accessing the array through the cursor.

The source is https://bpaste.net/show/6c5fca4c0ffd and the gdb session is https://bpaste.net/show/5b0cf9d2be79

Any idea how it could be doing this? I'm wondering if this is because I'm not using a tagged type.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-06-30 10:48 Strange crash on custom iterator Lucretia
@ 2018-06-30 11:32 ` Simon Wright
  2018-06-30 12:02   ` Lucretia
  0 siblings, 1 reply; 73+ messages in thread
From: Simon Wright @ 2018-06-30 11:32 UTC (permalink / raw)


Lucretia <laguest9000@googlemail.com> writes:

> I finally got around to getting back to my iterator and on a first
> test implementation, i.e. to just iterate over each element of the
> array, the thing crashes in the Element function when accessing the
> array through the cursor.
>
> The source is https://bpaste.net/show/6c5fca4c0ffd and the gdb session
> is https://bpaste.net/show/5b0cf9d2be79

UCA.Encoding missing?



* Re: Strange crash on custom iterator
  2018-06-30 11:32 ` Simon Wright
@ 2018-06-30 12:02   ` Lucretia
  2018-06-30 14:25     ` Simon Wright
  0 siblings, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-06-30 12:02 UTC (permalink / raw)


On Saturday, 30 June 2018 12:32:14 UTC+1, Simon Wright  wrote:
> Lucretia <> writes:
> 
> > I finally got around to getting back to my iterator and on a first
> > test implementation, i.e. to just iterate over each element of the
> > array, the thing crashes in the Element function when accessing the
> > array through the cursor.
> >
> > The source is https://bpaste.net/show/6c5fca4c0ffd and the gdb session
> > is https://bpaste.net/show/5b0cf9d2be79
> 
> UCA.Encoding missing?

Balls! https://bpaste.net/show/a0d108820ce6


* Re: Strange crash on custom iterator
  2018-06-30 12:02   ` Lucretia
@ 2018-06-30 14:25     ` Simon Wright
  2018-06-30 14:33       ` Lucretia
  2018-06-30 14:34       ` Lucretia
  0 siblings, 2 replies; 73+ messages in thread
From: Simon Wright @ 2018-06-30 14:25 UTC (permalink / raw)


Lucretia <laguest9000@googlemail.com> writes:

> On Saturday, 30 June 2018 12:32:14 UTC+1, Simon Wright  wrote:
>> Lucretia <> writes:
>>
>> > I finally got around to getting back to my iterator and on a first
>> > test implementation, i.e. to just iterate over each element of the
>> > array, the thing crashes in the Element function when accessing the
>> > array through the cursor.
>> >
>> > The source is https://bpaste.net/show/6c5fca4c0ffd and the gdb session
>> > is https://bpaste.net/show/5b0cf9d2be79
>>
>> UCA.Encoding missing?
>
> Balls! https://bpaste.net/show/a0d108820ce6

First, I think Has_Element should probably be

   function Has_Element (Position : in Cursor) return Boolean is
   begin
      return Position.Index in Position.Data'Range;
   end Has_Element;

Second, there's something odd about the Address_To_Access_Conversions:
in Iterate, the address of the passed Container (which is on the
stack!!!)  appears in I.Data, but I.Data's length is 0.

I got it to work (at first glance) with

   type Cursor is
      record
         Data  : Unicode_String_Access := null;
         Index : Positive              := Positive'Last;
      end record;

   type Code_Point_Iterator is new Limited_Controlled and Code_Point_Iterators.Forward_Iterator with
      record
         Data  : Unicode_String_Access := null;
      end record;

and in Iterate

      return I : Code_Point_Iterator :=
         (Limited_Controlled with Data => new Unicode_String'(Container)) do

but of course you probably don't want the copy.



* Re: Strange crash on custom iterator
  2018-06-30 14:25     ` Simon Wright
@ 2018-06-30 14:33       ` Lucretia
  2018-06-30 19:25         ` Simon Wright
  2018-06-30 14:34       ` Lucretia
  1 sibling, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-06-30 14:33 UTC (permalink / raw)


On Saturday, 30 June 2018 15:25:39 UTC+1, Simon Wright  wrote:

> but of course you probably don't want the copy.

Exactly! :(



* Re: Strange crash on custom iterator
  2018-06-30 14:25     ` Simon Wright
  2018-06-30 14:33       ` Lucretia
@ 2018-06-30 14:34       ` Lucretia
  1 sibling, 0 replies; 73+ messages in thread
From: Lucretia @ 2018-06-30 14:34 UTC (permalink / raw)


On Saturday, 30 June 2018 15:25:39 UTC+1, Simon Wright  wrote:

> First, I think Has_Element should probably be

Thanks, BTW :)


* Re: Strange crash on custom iterator
  2018-06-30 14:33       ` Lucretia
@ 2018-06-30 19:25         ` Simon Wright
  2018-06-30 19:36           ` Luke A. Guest
  0 siblings, 1 reply; 73+ messages in thread
From: Simon Wright @ 2018-06-30 19:25 UTC (permalink / raw)


Lucretia <laguest9000@googlemail.com> writes:

> On Saturday, 30 June 2018 15:25:39 UTC+1, Simon Wright  wrote:
>
>> but of course you probably don't want the copy.
>
> Exactly! :(

I suspect that Unicode_String would need to be by-reference.


* Re: Strange crash on custom iterator
  2018-06-30 19:25         ` Simon Wright
@ 2018-06-30 19:36           ` Luke A. Guest
  2018-07-01 18:06             ` Jacob Sparre Andersen
  0 siblings, 1 reply; 73+ messages in thread
From: Luke A. Guest @ 2018-06-30 19:36 UTC (permalink / raw)


Simon Wright <simon@pushface.org> wrote:
> Lucretia <
> 
>> On Saturday, 30 June 2018 15:25:39 UTC+1, Simon Wright  wrote:
>> 
>>> but of course you probably don't want the copy.
>> 
>> Exactly! :(
> 
> I suspect that Unicode_String would need to be by-reference.
> 

Yeah I think I’m going to have to make it tagged.



* Re: Strange crash on custom iterator
  2018-06-30 19:36           ` Luke A. Guest
@ 2018-07-01 18:06             ` Jacob Sparre Andersen
  2018-07-01 19:59               ` Simon Wright
  2018-07-02  8:31               ` Lucretia
  0 siblings, 2 replies; 73+ messages in thread
From: Jacob Sparre Andersen @ 2018-07-01 18:06 UTC (permalink / raw)


Luke A. Guest wrote:
> Simon Wright <simon@pushface.org> wrote:

>> I suspect that Unicode_String would need to be by-reference.
>
> Yeah I think I’m going to have to make it tagged.

You don't need to make it tagged, to pass it by reference.  It is enough
to make the formal parameter aliased.

Greetings,

Jacob
-- 
What the Iron Maiden was to stupid tyrants, the committee was to
Lord Vetinari; it was only slightly more expensive, far less messy,
considerably more efficient and, best of all, you had to *force*
people to climb inside the Iron Maiden.



* Re: Strange crash on custom iterator
  2018-07-01 18:06             ` Jacob Sparre Andersen
@ 2018-07-01 19:59               ` Simon Wright
  2018-07-02 17:43                 ` Luke A. Guest
  2018-07-02  8:31               ` Lucretia
  1 sibling, 1 reply; 73+ messages in thread
From: Simon Wright @ 2018-07-01 19:59 UTC (permalink / raw)


Jacob Sparre Andersen <jacob@jacob-sparre.dk> writes:

> Luke A. Guest wrote:
>> Simon Wright <simon@pushface.org> wrote:
>
>>> I suspect that Unicode_String would need to be by-reference.
>>
>> Yeah I think I’m going to have to make it tagged.
>
> You don't need to make it tagged, to pass it by reference.  It is enough
> to make the formal parameter aliased.

Yes, that works (except you have to make the container you're iterating
over aliased too).
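
A minimal sketch of the arrangement described above, assuming hypothetical names (Aliased_Demo, Byte_String, Sum, Data) that are not taken from the pasted source. An explicitly aliased formal is passed by reference, so 'Access can be applied to it, and the actual passed in has to be declared aliased as well.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Aliased_Demo is
   type Octets is mod 2 ** 8;
   type Byte_String is array (Positive range <>) of Octets;

   --  Hypothetical helper. Container is explicitly aliased, so it is
   --  passed by reference and 'Access designates the caller's object.
   function Sum (Container : aliased in Byte_String) return Natural is
      View  : access constant Byte_String := Container'Access;
      Total : Natural := 0;
   begin
      for E of View.all loop
         Total := Total + Natural (E);
      end loop;
      return Total;
   end Sum;

   --  The actual must be aliased too; its bounds come from the
   --  initial value rather than from an index constraint.
   Data : aliased Byte_String := (72, 105, 33);
begin
   Put_Line (Natural'Image (Sum (Data)));
end Aliased_Demo;
```

Data is declared without an explicit index constraint so that its nominal subtype matches the unconstrained subtype of the formal, which is the same way A and B are declared in the test program attached later in this thread.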


* Re: Strange crash on custom iterator
  2018-07-01 18:06             ` Jacob Sparre Andersen
  2018-07-01 19:59               ` Simon Wright
@ 2018-07-02  8:31               ` Lucretia
  1 sibling, 0 replies; 73+ messages in thread
From: Lucretia @ 2018-07-02  8:31 UTC (permalink / raw)


On Sunday, 1 July 2018 19:06:43 UTC+1, Jacob Sparre Andersen  wrote:
> Luke A. Guest wrote:
> > Simon Wright <> wrote:
> 
> >> I suspect that Unicode_String would need to be by-reference.
> >
> > Yeah I think I’m going to have to make it tagged.
> 
> You don't need to make it tagged, to pass it by reference.  It is enough
> to make the formal parameter aliased.

Same crash.



* Re: Strange crash on custom iterator
  2018-07-01 19:59               ` Simon Wright
@ 2018-07-02 17:43                 ` Luke A. Guest
  2018-07-02 19:42                   ` Simon Wright
  0 siblings, 1 reply; 73+ messages in thread
From: Luke A. Guest @ 2018-07-02 17:43 UTC (permalink / raw)


Simon Wright <> wrote:

>> You don't need to make it tagged, to pass it by reference.  It is enough
>> to make the formal parameter aliased.
> 
> Yes, that works (except you have to make the container you're iterating
> over aliased too).

I had to make the Iterate function take “aliased in out” and make the
array aliased, but it still dies in the same place.





* Re: Strange crash on custom iterator
  2018-07-02 17:43                 ` Luke A. Guest
@ 2018-07-02 19:42                   ` Simon Wright
  2018-07-03 14:08                     ` Lucretia
  0 siblings, 1 reply; 73+ messages in thread
From: Simon Wright @ 2018-07-02 19:42 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 450 bytes --]

Luke A. Guest <laguest@archeia.com> writes:

> Simon Wright <> wrote:
>
>>> You don't need to make it tagged, to pass it by reference.  It is enough
>>> to make the formal parameter aliased.
>> 
>> Yes, that works (except you have to make the container you're iterating
>> over aliased too).
>
> I had to make the Iterate function take “aliased in out” and make the
> array aliased, but it still dies in the same place.

This worked for me ..


[-- Attachment #2: gnatchop-me --]
[-- Type: text/plain, Size: 9880 bytes --]

--  Copyright 2018, Luke A. Guest
--  License TBD.

with Ada.Characters.Latin_1;
with Ada.Text_IO; use Ada.Text_IO;
with UCA.Encoding;
with UCA.Iterators;

procedure Test is
   package L1 renames Ada.Characters.Latin_1;

   package Octet_IO is new Ada.Text_IO.Modular_IO (UCA.Octets);
   use Octet_IO;

   --  D  : UCA.Octets         := Character'Pos ('Q');
   --  A  : UCA.Unicode_String := UCA.To_Array (D);
   --  A2 : UCA.Unicode_String := UCA.Unicode_String'(1, 0, 0, 0, 0, 0, 1, 0);
   --  D2 : UCA.Octets         := UCA.To_Octet (A2);

   --  package OA_IO is new Ada.Text_IO.Integer_IO (Num => UCA.Bits);

   use UCA.Encoding;
   A : aliased UCA.Unicode_String :=
     +("ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ" & L1.LF &
         "Hello, world" & L1.LF &
         "Sîne klâwen durh die wolken sint geslagen," & L1.LF &
         "Τη γλώσσα μου έδωσαν ελληνική" & L1.LF &
         "मैं काँच खा सकता हूँ और मुझे उससे कोई चोट नहीं पहुंचती." & L1.LF &
         "میں کانچ کھا سکتا ہوں اور مجھے تکلیف نہیں ہوتی");
   B : aliased UCA.Unicode_String :=
     (225, 154, 160, 225, 155, 135, 225, 154, 187, 225, 155, 171, 225, 155, 146, 225, 155, 166, 225,
      154, 166, 225, 155, 171, 225, 154, 160, 225, 154, 177, 225, 154, 169, 225, 154, 160, 225, 154,
      162, 225, 154, 177, 225, 155, 171, 225, 154, 160, 225, 155, 129, 225, 154, 177, 225, 154, 170,
      225, 155, 171, 225, 154, 183, 225, 155, 150, 225, 154, 187, 225, 154, 185, 225, 155, 166, 225,
      155, 154, 225, 154, 179, 225, 154, 162, 225, 155, 151,
      10,
      72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100,
      10,
      83, 195, 174, 110, 101, 32, 107, 108, 195, 162, 119, 101, 110, 32, 100, 117, 114, 104, 32, 100,
      105, 101, 32, 119, 111, 108, 107, 101, 110, 32, 115, 105, 110, 116, 32, 103, 101, 115, 108, 97,
      103, 101, 110, 44,
      10,
      206, 164, 206, 183, 32, 206, 179, 206, 187, 207, 142, 207, 131, 207, 131, 206, 177, 32, 206, 188,
      206, 191, 207, 133, 32, 206, 173, 206, 180, 207, 137, 207, 131, 206, 177, 206, 189, 32, 206, 181,
      206, 187, 206, 187, 206, 183, 206, 189, 206, 185, 206, 186, 206, 174,
      10,
      224, 164, 174, 224, 165, 136, 224, 164, 130, 32, 224, 164, 149, 224, 164, 190, 224, 164, 129, 224,
      164, 154, 32, 224, 164, 150, 224, 164, 190, 32, 224, 164, 184, 224, 164, 149, 224, 164, 164, 224,
      164, 190, 32, 224, 164, 185, 224, 165, 130, 224, 164, 129, 32, 224, 164, 148, 224, 164, 176, 32,
      224, 164, 174, 224, 165, 129, 224, 164, 157, 224, 165, 135, 32, 224, 164, 137, 224, 164, 184, 224,
      164, 184, 224, 165, 135, 32, 224, 164, 149, 224, 165, 139, 224, 164, 136, 32, 224, 164, 154, 224,
      165, 139, 224, 164, 159, 32, 224, 164, 168, 224, 164, 185, 224, 165, 128, 224, 164, 130, 32, 224,
      164, 170, 224, 164, 185, 224, 165, 129, 224, 164, 130, 224, 164, 154, 224, 164, 164, 224, 165, 128, 46,
      10,
      217, 133, 219, 140, 218, 186, 32, 218, 169, 216, 167, 217, 134, 218, 134, 32, 218, 169, 218, 190,
      216, 167, 32, 216, 179, 218, 169, 216, 170, 216, 167, 32, 219, 129, 217, 136, 218, 186, 32, 216,
      167, 217, 136, 216, 177, 32, 217, 133, 216, 172, 218, 190, 219, 146, 32, 216, 170, 218, 169, 217,
      132, 219, 140, 217, 129, 32, 217, 134, 219, 129, 219, 140, 218, 186, 32, 219, 129, 217, 136, 216,
      170, 219, 140);
begin
--   Put_Line ("A => " & To_UTF_8_String (A));
   Put_Line ("A => " & L1.LF & String (+A));

   Put_Line ("A => ");
   Put ('(');

   for E of A loop
      Put (Item => E, Base => 2);
      Put (", ");
   end loop;

   Put (')');
   New_Line;

   Put_Line ("B => " & L1.LF & String (+B));

   Put_Line ("A (Iterated) => ");

   for I in UCA.Iterators.Iterate (A) loop
      Put (UCA.Iterators.Element (I));       --  ERROR! Dies in Element, Data has nothing gdb => p position - $1 = (data => (), index => 1)
   end loop;

   New_Line;
end Test;

with Ada.Strings.UTF_Encoding;
with Ada.Unchecked_Conversion;

package UCA is
   use Ada.Strings.UTF_Encoding;

   type Octets is mod 2 ** 8 with
     Size => 8;

   type Unicode_String is array (Positive range <>) of Octets with
     Pack => True;

   type Unicode_String_Access is access all Unicode_String;

   --  This should match Wide_Wide_Character in size.
   type Code_Points is mod 2 ** 32 with
     Static_Predicate => Code_Points in 0 .. 16#0000_D7FF# or Code_Points in 16#0000_E000# .. 16#0010_FFFF#,
     Size             => 32;

private
   type Bits is range 0 .. 1 with
     Size => 1;

   type Bit_Range is range 0 .. Octets'Size - 1;
end UCA;

with Ada.Finalization;
with Ada.Iterator_Interfaces;
private with System.Address_To_Access_Conversions;

package UCA.Iterators is
   ---------------------------------------------------------------------------------------------------------------------
   --  Iteration over code points.
   ---------------------------------------------------------------------------------------------------------------------
   type Cursor is private;
   pragma Preelaborable_Initialization (Cursor);

   function Has_Element (Position : in Cursor) return Boolean;

   function Element (Position : in Cursor) return Octets;

   package Code_Point_Iterators is new Ada.Iterator_Interfaces (Cursor, Has_Element);

   function Iterate (Container : aliased in Unicode_String) return Code_Point_Iterators.Forward_Iterator'Class;
   function Iterate (Container : aliased in Unicode_String; Start : in Cursor) return
     Code_Point_Iterators.Forward_Iterator'Class;

   ---------------------------------------------------------------------------------------------------------------------
   --  Iteration over grapheme clusters.
   ---------------------------------------------------------------------------------------------------------------------
private
   use Ada.Finalization;

   package Convert is new System.Address_To_Access_Conversions (Unicode_String);

   type Cursor is
      record
         Data  : Convert.Object_Pointer := null;
         Index : Positive               := Positive'Last;
      end record;

   type Code_Point_Iterator is new Limited_Controlled and Code_Point_Iterators.Forward_Iterator with
      record
         Data  : Convert.Object_Pointer := null;
      end record;

   overriding
   function First (Object : in Code_Point_Iterator) return Cursor;

   overriding
   function Next  (Object : in Code_Point_Iterator; Position : Cursor) return Cursor;

end UCA.Iterators;

with Ada.Text_IO; use Ada.Text_IO;

package body UCA.Iterators is
   package Octet_IO is new Ada.Text_IO.Modular_IO (UCA.Octets);
   use Octet_IO;

   use type Convert.Object_Pointer;

   function Has_Element (Position : in Cursor) return Boolean is
   begin
      return Position.Index in Position.Data'Range;
   end Has_Element;

   function Element (Position : in Cursor) return Octets is
   begin
      if Position.Data = null then
         raise Constraint_Error with "Fuck!";
      end if;
      Put ("<< Element - " & Positive'Image (Position.Index) & " - ");
      Put (Position.Data (Position.Index));
      Put_Line (" >>");

      return Position.Data (Position.Index);
   end Element;

   function Iterate (Container : aliased in Unicode_String) return Code_Point_Iterators.Forward_Iterator'Class is
   begin
      Put_Line ("<< iterate >>");
      return I : Code_Point_Iterator := (Limited_Controlled with
        Data => Convert.To_Pointer (Container'Address)) do
         if I.Data = null then
            Put_Line ("Data => null");
         else
            Put_Line ("Data => not null - Length: " & Positive'Image (I.Data'Length));
         end if;
         null;
      end return;
   end Iterate;

   function Iterate (Container : aliased in Unicode_String; Start : in Cursor) return
     Code_Point_Iterators.Forward_Iterator'Class is
   begin
      Put_Line ("<< iterate >>");
      return I : Code_Point_Iterator := (Limited_Controlled with
        Data => Convert.To_Pointer (Container'Address)) do
         if I.Data = null then
            Put_Line ("Data => null");
         else
            Put_Line ("Data => not null");
         end if;
         null;
      end return;
   end Iterate;

   ---------------------------------------------------------------------------------------------------------------------
   --  Code point iterator primitives (First and Next).
   ---------------------------------------------------------------------------------------------------------------------
   overriding
   function First (Object : in Code_Point_Iterator) return Cursor is
   begin
      return (Data => Object.Data, Index => Positive'First);
   end First;

   overriding
   function Next  (Object : in Code_Point_Iterator; Position : Cursor) return Cursor is
   begin
      return (Data => Object.Data, Index => Position.Index + 1);
   end Next;
end UCA.Iterators;

--  Copyright © 2018, Luke A. Guest
with Ada.Unchecked_Conversion;

package body UCA.Encoding is
   function To_Unicode_String (Str : in String) return Unicode_String is
      Result : Unicode_String (1 .. Str'Length) with
        Address => Str'Address;
   begin
      return Result;
   end To_Unicode_String;

   function To_String (Str : in Unicode_String) return String is
      Result : String (1 .. Str'Length) with
        Address => Str'Address;
   begin
      return Result;
   end To_String;
end UCA.Encoding;

package UCA.Encoding is
   use Ada.Strings.UTF_Encoding;

   function To_Unicode_String (Str : in String) return Unicode_String;
   function To_String (Str : in Unicode_String) return String;

   function "+" (Str : in String) return Unicode_String renames To_Unicode_String;
   function "+" (Str : in Unicode_String) return String renames To_String;
end UCA.Encoding;


* Re: Strange crash on custom iterator
  2018-07-02 19:42                   ` Simon Wright
@ 2018-07-03 14:08                     ` Lucretia
  2018-07-03 14:17                       ` J-P. Rosen
  0 siblings, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-07-03 14:08 UTC (permalink / raw)


On Monday, 2 July 2018 20:42:59 UTC+1, Simon Wright  wrote:

> This worked for me ..

Thanks, needed the extra in Has_Element as well.

But there are other issues as well:

1) Cannot pass an array which has been declared and initialised to an aliased parameter:

procedure Mem is
   type Unicode_String is array (Positive range <>) of Integer;

   procedure Inner (B : aliased in out Unicode_String) is null;

   S : aliased Unicode_String(1..10) := (others => Integer'First);
--   S : aliased Unicode_String := (1..10 => Integer'First);
begin
   Inner (S);
end Mem;

2) raised STORAGE_ERROR : stack overflow or erroneous memory access, when using "'Access" instead of "package Convert is new System.Address_To_Access_Conversions (Unicode_String);" and "'Address"


* Re: Strange crash on custom iterator
  2018-07-03 14:08                     ` Lucretia
@ 2018-07-03 14:17                       ` J-P. Rosen
  2018-07-03 15:06                         ` Lucretia
  0 siblings, 1 reply; 73+ messages in thread
From: J-P. Rosen @ 2018-07-03 14:17 UTC (permalink / raw)


Le 03/07/2018 à 16:08, Lucretia a écrit :
>    type Unicode_String is array (Positive range <>) of Integer;
Array of Integer???? For a Unicode_String...

Btw, do you know the package Ada.Strings.UTF_Encoding ?

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: Strange crash on custom iterator
  2018-07-03 14:17                       ` J-P. Rosen
@ 2018-07-03 15:06                         ` Lucretia
  2018-07-03 15:45                           ` J-P. Rosen
  0 siblings, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-07-03 15:06 UTC (permalink / raw)


On Tuesday, 3 July 2018 15:17:14 UTC+1, J-P. Rosen  wrote:
> Le 03/07/2018 à 16:08, Lucretia a écrit :
> >    type Unicode_String is array (Positive range <>) of Integer;
> Array of Integer???? For a Unicode_String...

Firstly, read the rest of this thread. Secondly, I should've renamed that in that simple test, because IT'S A TEST to show an error in the compiler. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86391

> Btw, do you know the package Ada.Strings.UTF_Encoding ?

Yes, I'm well aware of this completely useless type.

1) It's a subtype of String, which is incorrect as UTF-8 is not a superset of Latin 1, this should never have been allowed.

2) Ada needs a decent Unicode library not this half-arsed crap we have now.
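
To make point 1 concrete, here is a small sketch that uses only Ada.Strings.UTF_Encoding.Strings.Encode from the standard library; the procedure and object names are made up for illustration. The Latin-1 character 16#E9# (e with acute) is a single octet, but its UTF-8 encoding is the two octets 16#C3# 16#A9#, and both values are still typed as String.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

procedure Latin_1_Vs_UTF_8 is
   --  LATIN SMALL LETTER E WITH ACUTE: one Character in Latin-1.
   Latin_1 : constant String := (1 => Character'Val (16#E9#));

   --  The same character as UTF-8. UTF_8_String is a subtype of
   --  String, so nothing in the type distinguishes it from Latin_1.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin_1);
begin
   Put_Line ("Latin-1 length:" & Natural'Image (Latin_1'Length));
   Put_Line ("UTF-8 length:  " & Natural'Image (Encoded'Length));

   --  Print the individual octets of the encoded form.
   for C of Encoded loop
      Put (Natural'Image (Character'Pos (C)));
   end loop;
   New_Line;
end Latin_1_Vs_UTF_8;
```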


* Re: Strange crash on custom iterator
  2018-07-03 15:06                         ` Lucretia
@ 2018-07-03 15:45                           ` J-P. Rosen
  2018-07-03 15:55                             ` Lucretia
  2018-07-03 15:57                             ` Dmitry A. Kazakov
  0 siblings, 2 replies; 73+ messages in thread
From: J-P. Rosen @ 2018-07-03 15:45 UTC (permalink / raw)


Le 03/07/2018 à 17:06, Lucretia a écrit :
> 1) It's a subtype of String, which is incorrect as UTF-8 is not a 
> superset of Latin 1, this should never have been allowed.
In the first version of the AI, it was a different type. This has been
discussed, and found much more user-friendly to have it as a subtype of
String. Please read the discussions.

> 2) Ada needs a decent Unicode library not this half-arsed crap we 
> have now.
This package is about encoding only. What would you expect from a
Unicode library?

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr



* Re: Strange crash on custom iterator
  2018-07-03 15:45                           ` J-P. Rosen
@ 2018-07-03 15:55                             ` Lucretia
  2018-07-03 17:00                               ` J-P. Rosen
  2018-07-03 15:57                             ` Dmitry A. Kazakov
  1 sibling, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-07-03 15:55 UTC (permalink / raw)


On Tuesday, 3 July 2018 16:46:00 UTC+1, J-P. Rosen  wrote:
> Le 03/07/2018 à 17:06, Lucretia a écrit :
> > 1) It's a subtype of String, which is incorrect as UTF-8 is not a 
> > superset of Latin 1, this should never have been allowed.
> In the first version of the AI, it was a different type. This has been
> discussed, and found much more user-friendly to have it as a subtype of
> String. Please read the discussions.
> 
> > 2) Ada needs a decent Unicode library not this half-arsed crap we 
> > have now.
> This package is about encoding only. What would you expect from a
> Unicode library?

Iterators over basic elements, the octets, Iterators over code points, Iterators over grapheme clusters, BIDI Iterators , etc. Access to the UCD. Unicode Regexps, streams, to start.
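
For comparison, one way to approximate code point iteration with what the standard library already offers is to decode the whole string into a Wide_Wide_String first and loop over that. This is only a sketch under that assumption, with made-up names, and it copies the data rather than iterating in place, which is not the kind of iterator being asked for here.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Code_Point_Walk is
   --  'S' followed by the UTF-8 octets 16#C3# 16#AE# for U+00EE.
   Text : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     'S' & Character'Val (16#C3#) & Character'Val (16#AE#);

   --  Decode builds a new string; each element is one code point.
   Points : constant Wide_Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Text);
begin
   --  Print the numeric value of each decoded code point.
   for P of Points loop
      Put_Line (Natural'Image (Wide_Wide_Character'Pos (P)));
   end loop;
end Code_Point_Walk;
```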


* Re: Strange crash on custom iterator
  2018-07-03 15:45                           ` J-P. Rosen
  2018-07-03 15:55                             ` Lucretia
@ 2018-07-03 15:57                             ` Dmitry A. Kazakov
  2018-07-03 16:07                               ` Lucretia
  1 sibling, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-03 15:57 UTC (permalink / raw)


On 2018-07-03 17:45, J-P. Rosen wrote:
> Le 03/07/2018 à 17:06, Lucretia a écrit :
>> 1) It's a subtype of String, which is incorrect as UTF-8 is not a
>> superset of Latin 1, this should never have been allowed.
> In the first version of the AI, it was a different type. This has been
> discussed, and found much more user-friendly to have it as a subtype of
> String. Please read the discussions.

It must be both a different type with a distinct representation and 
constraints and a subtype (in non-Ada sense, the way Integer is a 
subtype of Universal_Integer).

>> 2) Ada needs a decent Unicode library not this half-arsed crap we
>> have now.
> This package is about encoding only. What would you expect from a
> Unicode library?

Proper typing, for a start?

P.S. It is clear that no decent library may exist without fixing Ada 
type system.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Strange crash on custom iterator
  2018-07-03 15:57                             ` Dmitry A. Kazakov
@ 2018-07-03 16:07                               ` Lucretia
  2018-07-03 16:36                                 ` Dmitry A. Kazakov
  0 siblings, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-07-03 16:07 UTC (permalink / raw)


On Tuesday, 3 July 2018 16:57:07 UTC+1, Dmitry A. Kazakov  wrote:
> On 2018-07-03 17:45, J-P. Rosen wrote:

> >> 2) Ada needs a decent Unicode library not this half-arsed crap we
> >> have now.
> > This package is about encoding only. What would you expect from a
> > Unicode library?
> 
> Proper typing, for a start?
> 
> P.S. It is clear that no decent library may exist without fixing Ada 
> type system.

In what way is it broken?


* Re: Strange crash on custom iterator
  2018-07-03 16:07                               ` Lucretia
@ 2018-07-03 16:36                                 ` Dmitry A. Kazakov
  2018-07-03 16:42                                   ` Lucretia
                                                     ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-03 16:36 UTC (permalink / raw)


On 2018-07-03 18:07, Lucretia wrote:
> On Tuesday, 3 July 2018 16:57:07 UTC+1, Dmitry A. Kazakov  wrote:
>> On 2018-07-03 17:45, J-P. Rosen wrote:
> 
>>>> 2) Ada needs a decent Unicode library not this half-arsed crap we
>>>> have now.
>>> This package is about encoding only. What would you expect from a
>>> Unicode library?
>>
>> Proper typing, for a start?
>>
>> P.S. It is clear that no decent library may exist without fixing Ada
>> type system.
> 
> In what way is it broken?

It is not broken, it misses key features like interface inheritance. 
E.g. UTF8_String and String must share interfaces but have different 
representations. Strings [and characters] is a network of related 
mutually and implicitly convertible types. There is no way to design 
that in Ada without magic.

Without having types related, we get a geometric explosion of packages: 
character type x encoding method x fixed/bounded/unbounded. Clearly 
nobody would ever add UTF-8 into this mess, because this will double the 
number of packages where strings are used. Static polymorphism 
(generics/overloading) does not work here.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Strange crash on custom iterator
  2018-07-03 16:36                                 ` Dmitry A. Kazakov
@ 2018-07-03 16:42                                   ` Lucretia
  2018-07-03 16:45                                     ` Lucretia
  2018-07-03 20:18                                     ` Dmitry A. Kazakov
  2018-07-03 18:54                                   ` Dan'l Miller
  2018-07-04  7:33                                   ` J-P. Rosen
  2 siblings, 2 replies; 73+ messages in thread
From: Lucretia @ 2018-07-03 16:42 UTC (permalink / raw)


On Tuesday, 3 July 2018 17:36:12 UTC+1, Dmitry A. Kazakov  wrote:

> Without having types related, we get a geometric explosion of packages: 
> character type x encoding method x fixed/bounded/unbounded. Clearly 
> nobody would ever add UTF-8 into this mess, because this will double the 
> number of packages where strings are used. Static polymorphism 
> (generics/overloading) does not work here.

Well, they kind of already did that by subtyping UTF_String from String, of which it's not a subtype, it's just they are both arrays of 8-bit entities.

Am I wrong? Should I just implement what I need on top of the standard lib and just use the UTF* types in my code? What about unbounded_utf_strings? Just use the normal unbounded_string? It's not like it's going to be checking for it to be correct UTF-8, is it? But I can't write an iterator for that from outside the RTS though.


* Re: Strange crash on custom iterator
  2018-07-03 16:42                                   ` Lucretia
@ 2018-07-03 16:45                                     ` Lucretia
  2018-07-03 20:18                                     ` Dmitry A. Kazakov
  1 sibling, 0 replies; 73+ messages in thread
From: Lucretia @ 2018-07-03 16:45 UTC (permalink / raw)


On Tuesday, 3 July 2018 17:42:53 UTC+1, Lucretia  wrote:


>the normal unbounded_string? It's not like it's going to be checking for it to be correct utf8 is it, but I can't write an iterator for that from outside the rts though.

Well, that's not quite correct, I could.



* Re: Strange crash on custom iterator
  2018-07-03 15:55                             ` Lucretia
@ 2018-07-03 17:00                               ` J-P. Rosen
  0 siblings, 0 replies; 73+ messages in thread
From: J-P. Rosen @ 2018-07-03 17:00 UTC (permalink / raw)


Le 03/07/2018 à 17:55, Lucretia a écrit :
>> This package is about encoding only. What would you expect from a 
>> Unicode library?
> Iterators over basic elements, the octets, Iterators over code
> points, Iterators over grapheme clusters, BIDI Iterators , etc.
> Access to the UCD. Unicode Regexps, streams, to start.

Fine, you are welcome to propose a specification (not for the next
version of the standard, it's too late), and if it is useful it might
interest compiler vendors anyway, or you may provide your own
implementation as free software.

Regarding the above package, it serves its purpose, no more, no less.

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: Strange crash on custom iterator
  2018-07-03 16:36                                 ` Dmitry A. Kazakov
  2018-07-03 16:42                                   ` Lucretia
@ 2018-07-03 18:54                                   ` Dan'l Miller
  2018-07-03 20:22                                     ` Dmitry A. Kazakov
  2018-07-04  7:33                                   ` J-P. Rosen
  2 siblings, 1 reply; 73+ messages in thread
From: Dan'l Miller @ 2018-07-03 18:54 UTC (permalink / raw)


On Tuesday, July 3, 2018 at 11:36:12 AM UTC-5, Dmitry A. Kazakov wrote:
> On 2018-07-03 18:07, Lucretia wrote:
> > On Tuesday, 3 July 2018 16:57:07 UTC+1, Dmitry A. Kazakov  wrote:
> >> On 2018-07-03 17:45, J-P. Rosen wrote:
> > 
> >>>> 2) Ada needs a decent Unicode library not this half-arsed crap we
> >>>> have now.
> >>> This package is about encoding only. What would you expect from a
> >>> Unicode library?
> >>
> >> Proper typing, for a start?
> >>
> >> P.S. It is clear that no decent library may exist without fixing Ada
> >> type system.
> > 
> > In what way is it broken?
> 
> It is not broken, it misses key features like interface inheritance. 

Wait, what?  No extension of interfaces in Ada, eh?

https://www.adacore.com/gems/gem-48
type Animal is interface;
type Animal_Extension_1 is interface and Animal;

Or are we now claiming that Ada's •extension• is not •inheritance•?  (But ironically that Ada83's subtyping is inheritance, as per a prior recent thread's debate.)



* Re: Strange crash on custom iterator
  2018-07-03 16:42                                   ` Lucretia
  2018-07-03 16:45                                     ` Lucretia
@ 2018-07-03 20:18                                     ` Dmitry A. Kazakov
  2018-07-03 21:04                                       ` Lucretia
  1 sibling, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-03 20:18 UTC (permalink / raw)


On 2018-07-03 18:42, Lucretia wrote:
> On Tuesday, 3 July 2018 17:36:12 UTC+1, Dmitry A. Kazakov  wrote:
> 
>> Without having types related, we get a geometric explosion of packages:
>> character type x encoding method x fixed/bounded/unbounded. Clearly
>> nobody would ever add UTF-8 into this mess, because this will double the
>> number of packages where strings are used. Static polymorphism
>> (generics/overloading) does not work here.
> 
> Well, they kind of already did that by subtyping UTF_String from String, of which it's not a subtype, it's just they are both arrays of 8-bit entities.

No. Both are arrays of code points and arrays of octets. The ranges of 
code points are different. The correspondence between code points and 
octets is different. Thus the subtyping is broken.

> Am i wrong, should I just implement what I need on top of the standard lib and just use the UTF* types in my code? What about unbounded_utf_strings? Just use the normal unbounded_string? It's not like it's going to be checking for it to be correct utf8 is it, but I can't write an iterator for that from outside the rts though.

There is no way to do it right in Ada for now.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Strange crash on custom iterator
  2018-07-03 18:54                                   ` Dan'l Miller
@ 2018-07-03 20:22                                     ` Dmitry A. Kazakov
  0 siblings, 0 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-03 20:22 UTC (permalink / raw)


On 2018-07-03 20:54, Dan'l Miller wrote:
> On Tuesday, July 3, 2018 at 11:36:12 AM UTC-5, Dmitry A. Kazakov wrote:
>> On 2018-07-03 18:07, Lucretia wrote:
>>> On Tuesday, 3 July 2018 16:57:07 UTC+1, Dmitry A. Kazakov  wrote:
>>>> On 2018-07-03 17:45, J-P. Rosen wrote:
>>>
>>>>>> 2) Ada needs a decent Unicode library not this half-arsed crap we
>>>>>> have now.
>>>>> This package is about encoding only. What would you expect from a
>>>>> Unicode library?
>>>>
>>>> Proper typing, for a start?
>>>>
>>>> P.S. It is clear that no decent library may exist without fixing Ada
>>>> type system.
>>>
>>> In what way is it broken?
>>
>> It is not broken, it misses key features like interface inheritance.
> 
> Wait, what?  No extension of interfaces in Ada, eh?

Tagged record extensions are unsuitable for the purpose.

If you don't believe me, try yourself.

> Or are we now claiming that Ada's •extension• is not •inheritance•?

Tagged extension is inheritance. The reverse is false. Not all 
inheritance is by tagged extension.

> (But ironically that Ada83's subtyping is inheritance, as per a prior recent thread's debate.)

It is, but it won't do this job either.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Strange crash on custom iterator
  2018-07-03 20:18                                     ` Dmitry A. Kazakov
@ 2018-07-03 21:04                                       ` Lucretia
  2018-07-04  1:26                                         ` Dan'l Miller
  2018-07-04  7:21                                         ` Dmitry A. Kazakov
  0 siblings, 2 replies; 73+ messages in thread
From: Lucretia @ 2018-07-03 21:04 UTC (permalink / raw)


On Tuesday, 3 July 2018 21:18:28 UTC+1, Dmitry A. Kazakov  wrote:

> > Well, they kind of already did that by subtyping UTF_String from String, of which it's not a subtype, it's just they are both arrays of 8-bit entities.
> 
> No. Both are arrays of code points and arrays of octets. The ranges of 
> code points are different. The correspondence between code points and 
> octets are different. Thus the subtyping is broken.

I know the difference between code points and octets and their arrays. I was saying that UTF_String is not a valid subtype of String because String is Latin 1 and UTF_String is a superset of 7-bit ASCII, not 8-bit Latin 1.

> > Am i wrong, should I just implement what I need on top of the standard lib and just use the UTF* types in my code? What about unbounded_utf_strings? Just use the normal unbounded_string? It's not like it's going to be checking for it to be correct utf8 is it, but I can't write an iterator for that from outside the rts though.
> 
> There is no way to do it right in Ada for now.

What do you mean exactly????


* Re: Strange crash on custom iterator
  2018-07-03 21:04                                       ` Lucretia
@ 2018-07-04  1:26                                         ` Dan'l Miller
  2018-07-04  1:59                                           ` Lucretia
  2018-07-04  7:21                                         ` Dmitry A. Kazakov
  1 sibling, 1 reply; 73+ messages in thread
From: Dan'l Miller @ 2018-07-04  1:26 UTC (permalink / raw)


On Tuesday, July 3, 2018 at 4:04:54 PM UTC-5, Lucretia wrote:
> On Tuesday, 3 July 2018 21:18:28 UTC+1, Dmitry A. Kazakov  wrote:
> 
> > > Well, they kind of already did that by subtyping UTF_String from String, of which it's not a subtype, it's just they are both arrays of 8-bit entities.
> > 
> > No. Both are arrays of code points and arrays of octets. The ranges of 
> > code points are different. The correspondence between code points and 
> > octets are different. Thus the subtyping is broken.
> 
> I know the difference between code points and octets and their arrays. I was saying that UTF_String is
> not a valid subtype of String because String is Latin 1 and UTF_String is a superset of 7-bit ASCII, not
> 8-bit Latin 1.

Well, there are 2 ways of looking at UTF-8: before versus after parsing.

is not a superset:
One is whether each 8-bit value in Latin-1 has the same value in the UTF-8 octet-by-octet representation •prior• to parsing.  Using this analysis, all of the upper 128 values have a different meaning than in Latin-1.

is a superset:
But the other way of looking at UTF-8 is what character is represented by the multi-byte encoding •after• parsing.  In this view, the lowest 256 values of Unicode/ISO10646 conform to Latin-1 (with some quibbling over whether the mark-parity control codes from 16#80 to 16#9F have precisely the same meaning versus reserved/unencoded at various editions of various standards).
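
A minimal sketch of the two views, assuming an Ada 2012 compiler and the standard Ada.Strings.UTF_Encoding.Strings package. The procedure name and the sample character are made up for illustration, and the assertions are only checked when assertions are enabled (e.g. -gnata with GNAT):

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

procedure Latin_1_Versus_UTF_8 is
   package UTF renames Ada.Strings.UTF_Encoding.Strings;

   --  One Latin-1 character: u with diaeresis, code point 16#FC#.
   Latin_1 : constant String := (1 => Character'Val (16#FC#));

   --  The same character encoded as UTF-8: octets 16#C3# 16#BC#.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     UTF.Encode (Latin_1);
begin
   --  Before parsing: the octet sequences differ.
   pragma Assert (Latin_1'Length = 1);
   pragma Assert (Encoded'Length = 2);
   pragma Assert (Character'Pos (Encoded (Encoded'First)) = 16#C3#);
   pragma Assert (Character'Pos (Encoded (Encoded'Last)) = 16#BC#);

   --  After parsing: decoding gives back the Latin-1 code point.
   pragma Assert (UTF.Decode (Encoded) = Latin_1);

   Ada.Text_IO.Put_Line
     ("UTF-8 octets:" & Integer'Image (Encoded'Length));
end Latin_1_Versus_UTF_8;
```

Read octet by octet, the upper half of Latin-1 does not survive; read after decoding, it does.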

> > > Am i wrong, should I just implement what I need on top of the standard lib and just use the UTF* types in my code? What about unbounded_utf_strings? Just use the normal unbounded_string? It's not like it's going to be checking for it to be correct utf8 is it, but I can't write an iterator for that from outside the rts though.
> > 
> > There is no way to do it right in Ada for now.
> 
> What do you mean exactly????

He means that it needs his extrapolation-of-Steelman-3-3F idea for compile-time tagged types that are not tagged records.



* Re: Strange crash on custom iterator
  2018-07-04  1:26                                         ` Dan'l Miller
@ 2018-07-04  1:59                                           ` Lucretia
  2018-07-04  7:37                                             ` Dmitry A. Kazakov
                                                               ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Lucretia @ 2018-07-04  1:59 UTC (permalink / raw)


On Wednesday, 4 July 2018 02:26:52 UTC+1, Dan'l Miller  wrote:

> > I know the difference between code points and octets and their arrays. I was saying that UTF_String is
> > not a valid subtype of String because String is Latin 1 and UTF_String is a superset of 7-bit ASCII, not
> > 8-bit Latin 1.
> 
> Well, there are 2 ways of looking at UTF-8: before versus after parsing.
> 
> is not a superset:
> One is whether each 8-bit value in Latin-1 has the same value in the UTF-8 octet-by-octet representation •prior• to parsing.  Using this analysis, all of the upper 128 values have a different meaning than in Latin-1.

You're answering a question that wasn't asked.
 
> is a superset:
> But the other way of looking at UTF-8 is what character is represented by the multi-byte encoding •after• parsing.  In this view, the lowest 256 values of Unicode/ISO10646 conform to Latin-1 (with some quibbling over whether the mark-parity control codes from 16#80 to 16#9F have precisely the same meaning versus reserved/unencoded at various editions of various standards).

And again, wasn't asked.
 
> > > > Am i wrong, should I just implement what I need on top of the standard lib and just use the UTF* types in my code? What about unbounded_utf_strings? Just use the normal unbounded_string? It's not like it's going to be checking for it to be correct utf8 is it, but I can't write an iterator for that from outside the rts though.
> > > 
> > > There is no way to do it right in Ada for now.
> > 
> > What do you mean exactly????
> 
> He means that it needs his extrapolation-of-Steelman-3-3F idea for compile-time tagged types that are not tagged records.

I've just read it. Yeah, I agree that Ada should be able to extend records with data, not functions/procedures, but I don't see how the lack of that is a hindrance to creating a decent unicode lib. The fact that he refuses to answer such a simple question, i.e "WTF are you on about?" explains a lot.


* Re: Strange crash on custom iterator
  2018-07-03 21:04                                       ` Lucretia
  2018-07-04  1:26                                         ` Dan'l Miller
@ 2018-07-04  7:21                                         ` Dmitry A. Kazakov
  1 sibling, 0 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04  7:21 UTC (permalink / raw)


On 2018-07-03 23:04, Lucretia wrote:
> On Tuesday, 3 July 2018 21:18:28 UTC+1, Dmitry A. Kazakov  wrote:
> 
>>> Well, they kind of already did that by subtyping UTF_String from String, of which it's not a subtype, it's just they are both arrays of 8-bit entities.
>>
>> No. Both are arrays of code points and arrays of octets. The ranges of
>> code points are different. The correspondence between code points and
>> octets are different. Thus the subtyping is broken.
> 
> I know the difference between code points and octets and their arrays. I was saying that UTF_String is not a valid subtype of String because String is Latin 1 and UTF_String is a superset of 7-bit ASCII, not 8-bit Latin 1.

No, that does not break subtyping if Constraint_Error is in the 
contract. Subtyping is broken when the array of Latin-1 code points 
(String) corresponds to the array of representation units (octets of 
UTF8_String).

Array of Latin-1 code points corresponds to the array of Unicode code 
points. It has nothing to do with the underlying encoding, whatever it 
might be.

Each string implements two unrelated array interfaces:

1. Array of encoding units, e.g. array of octets
2. Array of code points

#1 and #2 are historically confused because one resembles the other for a 
certain class of encodings like ASCII, UCS-2, UCS-4. They are absolutely 
different for UTF-8 and UTF-16.
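
As a rough illustration of the two views with today's library (a sketch only, assuming Ada 2012 and Ada.Strings.UTF_Encoding.Wide_Wide_Strings; the procedure and object names are invented here), the second view is only reachable through an explicit decoding step:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Two_Views is
   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  "a" followed by u with diaeresis, as UTF-8 octets:
   --  16#61#, 16#C3#, 16#BC#.
   Octets : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     "a" & Character'Val (16#C3#) & Character'Val (16#BC#);

   --  View #2 has to be produced by an explicit conversion.
   Code_Points : constant Wide_Wide_String := UTF.Decode (Octets);
begin
   --  View #1: array of encoding units (three octets).
   for Unit of Octets loop
      Ada.Text_IO.Put_Line
        ("octet     " & Integer'Image (Character'Pos (Unit)));
   end loop;

   --  View #2: array of code points (two characters).
   for Point of Code_Points loop
      Ada.Text_IO.Put_Line
        ("code point" & Integer'Image (Wide_Wide_Character'Pos (Point)));
   end loop;

   pragma Assert (Octets'Length = 3);
   pragma Assert (Code_Points'Length = 2);
end Two_Views;
```

The two loops run over two different objects of two different types, which is exactly the separation being criticised here.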

>>> Am i wrong, should I just implement what I need on top of the standard lib and just use the UTF* types in my code? What about unbounded_utf_strings? Just use the normal unbounded_string? It's not like it's going to be checking for it to be correct utf8 is it, but I can't write an iterator for that from outside the rts though.
>>
>> There is no way to do it right in Ada for now.
> 
> What do you mean exactly????

For simplicity start with designing character types: Character, 
Wide_Character and Wide_Wide_Character as related types.

    X : Character; -- Character'Size = 8
    Y : Wide_Character := X; -- This must be legal

Already this is impossible in Ada.
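
For comparison, a small sketch of what current Ada does accept (the procedure and object names are illustrative only). Both routes are explicit conversions, which is the very thing being objected to:

```ada
with Ada.Text_IO;
with Ada.Characters.Conversions;

procedure Character_Widening is
   X : constant Character := 'A';

   --  Y : Wide_Character := X;  --  rejected: the types are unrelated

   --  Explicit route 1: through the position number.
   Y1 : constant Wide_Character :=
     Wide_Character'Val (Character'Pos (X));

   --  Explicit route 2: through the standard conversion function.
   Y2 : constant Wide_Character :=
     Ada.Characters.Conversions.To_Wide_Character (X);
begin
   pragma Assert (Y1 = Y2);
   pragma Assert (Wide_Character'Pos (Y1) = Character'Pos (X));

   Ada.Text_IO.Put_Line
     ("position:" & Integer'Image (Wide_Character'Pos (Y1)));
end Character_Widening;
```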

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Strange crash on custom iterator
  2018-07-03 16:36                                 ` Dmitry A. Kazakov
  2018-07-03 16:42                                   ` Lucretia
  2018-07-03 18:54                                   ` Dan'l Miller
@ 2018-07-04  7:33                                   ` J-P. Rosen
  2018-07-04  7:53                                     ` Dmitry A. Kazakov
  2 siblings, 1 reply; 73+ messages in thread
From: J-P. Rosen @ 2018-07-04  7:33 UTC (permalink / raw)


Le 03/07/2018 à 18:36, Dmitry A. Kazakov a écrit :
> E.g. UTF8_String and String must share interfaces but have different
> representations.
No. UTF_8 is useful only for I/O; as soon as you want to use a UTF
string, you need to convert it to a Wide_String.

Why? Because even the simplest operations (Length, Indexing) are O(N) and
are mostly equivalent to decoding the whole string.
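
To illustrate the cost, a minimal sketch of a character count over UTF-8 octets. It assumes Ada 2012 and well-formed input; Code_Point_Count is a made-up helper, not a library function:

```ada
with Ada.Text_IO;

procedure UTF_8_Length is

   --  Counts code points by skipping continuation octets, which
   --  have the bit pattern 2#10xx_xxxx# (16#80# .. 16#BF#).
   --  Every octet has to be inspected, hence O(N).
   function Code_Point_Count (Item : String) return Natural is
      Count : Natural := 0;
   begin
      for Octet of Item loop
         if Character'Pos (Octet) not in 16#80# .. 16#BF# then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end Code_Point_Count;

   --  "a" followed by u with diaeresis (16#C3# 16#BC#).
   Sample : constant String :=
     "a" & Character'Val (16#C3#) & Character'Val (16#BC#);
begin
   pragma Assert (Sample'Length = 3);              --  octets, O(1)
   pragma Assert (Code_Point_Count (Sample) = 2);  --  characters, O(N)

   Ada.Text_IO.Put_Line
     ("code points:" & Natural'Image (Code_Point_Count (Sample)));
end UTF_8_Length;
```

Indexing the N-th character needs the same kind of scan, whereas Sample'Length on the octets is a constant-time attribute.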

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: Strange crash on custom iterator
  2018-07-04  1:59                                           ` Lucretia
@ 2018-07-04  7:37                                             ` Dmitry A. Kazakov
  2018-07-04 12:46                                             ` Dan'l Miller
  2018-07-04 13:37                                             ` Dennis Lee Bieber
  2 siblings, 0 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04  7:37 UTC (permalink / raw)


On 2018-07-04 03:59, Lucretia wrote:

> I've just read it. Yeah, I agree that Ada should be able to extend records with data, not functions/procedures, but I don't see how the lack of that is a hindrance to creating a decent unicode lib. The fact that he refuses to answer such a simple question, i.e "WTF are you on about?" explains a lot.

I never refuse answering. Let me state the requirements of a sane 
implementation:

1. All types related. You can pass String where UTF8_String is expected 
and conversely keeping the *semantics*. That means that when one string 
contains a-umlaut it stays a-umlaut in another string.

2. All strings are arrays of Unicode code points. You can iterate 
characters even in a UTF-8 string or a DEC RADIX-50 string.

3. All strings are arrays of the corresponding representation units. You 
can iterate representation units.

4. All string representations stripped of the bounds have machine 
representations in the stated encoding. You can pass a flat UTF-8 string 
down to a C library with no fuss.

5. For any string operation one can provide either a type-specific 
implementation or inherit a body from another string type (see #1). You 
can write a specific Put_Line for a Latin-1 string or use (inherit) 
Put_Line for a UTF-8 string.
--------------

The point is that this is impossible in Ada. If you think otherwise, you 
are welcome to outline a way.

What fixes are required to be able to implement it is a subject for a 
serious discussion about the Ada type system, which seemingly nobody is 
interested in at this time. Though I am ready when you are.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Strange crash on custom iterator
  2018-07-04  7:33                                   ` J-P. Rosen
@ 2018-07-04  7:53                                     ` Dmitry A. Kazakov
  2018-07-04  9:55                                       ` J-P. Rosen
  2018-07-04 19:02                                       ` G. B.
  0 siblings, 2 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04  7:53 UTC (permalink / raw)


On 2018-07-04 09:33, J-P. Rosen wrote:
> Le 03/07/2018 à 18:36, Dmitry A. Kazakov a écrit :
>> E.g. UTF8_String and String must share interfaces but have different
>> representations.
> No. UTF_8 is useful only for IOs, as soon as you want to use a UTF
> string, you need to convert it to a Wide_String.

I cannot. Wide_String is UCS-2 which is not full Unicode.

Anyway, whatever conversion of representations needed it must be 
transparent to the user.

> Why? Because even the simplest operation (Length, Indexing) are O(N) and
> are mostly equivalent to decoding the whole string.

Premature optimization, huh? And you still need a UTF-8 string type even 
if you are going to convert it to something else. Back to square one: 
how to design a UTF-8 string type?

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Strange crash on custom iterator
  2018-07-04  7:53                                     ` Dmitry A. Kazakov
@ 2018-07-04  9:55                                       ` J-P. Rosen
  2018-07-04 10:01                                         ` Dmitry A. Kazakov
  2018-07-04 19:02                                       ` G. B.
  1 sibling, 1 reply; 73+ messages in thread
From: J-P. Rosen @ 2018-07-04  9:55 UTC (permalink / raw)


Le 04/07/2018 à 09:53, Dmitry A. Kazakov a écrit :
> On 2018-07-04 09:33, J-P. Rosen wrote:
>> Le 03/07/2018 à 18:36, Dmitry A. Kazakov a écrit :
>>> E.g. UTF8_String and String must share interfaces but have
>>> different representations.
>> No. UTF_8 is useful only for IOs, as soon as you want to use a UTF 
>> string, you need to convert it to a Wide_String.
> 
> I cannot. Wide_String is UCS-2 which is not full Unicode.
For most purposes, Wide_String is sufficient, unless you really need to
support emojis or ancient Chinese. In those cases, decode to
Wide_Wide_String, no problem.

> Anyway, whatever conversion of representations needed it must be 
> transparent to the user.
> 
>> Why? Because even the simplest operation (Length, Indexing) are
>> O(N) and are mostly equivalent to decoding the whole string.
> 
> Premature optimization, huh? And you still need UTF-8 string type
> even if you are going to convert it to something else. Back to the
> square one, how to design an UTF-8 string type?
> 
Choosing a representation that allows a more efficient algorithm is
proper design, not premature optimization.

And the point is that when you receive a string, you don't know before
looking at the BOM (or other recognition techniques) whether the octets
you received are pure Latin-1 or UTF_8 encoded. So you need to store it
in a plain String.
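
A small sketch of the BOM check, assuming Ada 2012 and the BOM_8 constant and Encoding function of Ada.Strings.UTF_Encoding. Starts_With_BOM_8 and the sample objects are invented for the example:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Detect_Encoding is
   --  Received octets, stored in plain Strings as described above.
   With_BOM    : constant String := BOM_8 & "abc";
   Without_BOM : constant String := "abc";

   --  Hand-written check for the three UTF-8 BOM octets.
   function Starts_With_BOM_8 (Item : String) return Boolean is
     (Item'Length >= BOM_8'Length
      and then
      Item (Item'First .. Item'First + BOM_8'Length - 1) = BOM_8);
begin
   pragma Assert (Starts_With_BOM_8 (With_BOM));
   pragma Assert (not Starts_With_BOM_8 (Without_BOM));

   --  Encoding reports the scheme named by a leading BOM and
   --  falls back to Default when there is none.
   pragma Assert (Encoding (With_BOM) = UTF_8);
   pragma Assert
     (Encoding (Without_BOM, Default => UTF_16BE) = UTF_16BE);

   Ada.Text_IO.Put_Line
     ("BOM present: " & Boolean'Image (Starts_With_BOM_8 (With_BOM)));
end Detect_Encoding;
```

Without a BOM neither check can tell Latin-1 from UTF-8; that is where the "other recognition techniques" come in.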

We discussed that point, and the agreement was that making a different
type would force the user to many conversions that would bring nothing
but trouble, and make Ada once again look impractical out of excessive
purism.

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: Strange crash on custom iterator
  2018-07-04  9:55                                       ` J-P. Rosen
@ 2018-07-04 10:01                                         ` Dmitry A. Kazakov
  2018-07-04 11:30                                           ` J-P. Rosen
  0 siblings, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04 10:01 UTC (permalink / raw)


On 2018-07-04 11:55, J-P. Rosen wrote:
> Le 04/07/2018 à 09:53, Dmitry A. Kazakov a écrit :
>> On 2018-07-04 09:33, J-P. Rosen wrote:

>> Premature optimization, huh? And you still need UTF-8 string type
>> even if you are going to convert it to something else. Back to the
>> square one, how to design an UTF-8 string type?
>>
> Choosing a representation that allows a more efficient algorithm is
> proper design, not premature optimization.

But UTF-8 is actually more efficient in most cases than 
Wide_Wide_String. Random string indexing is practically never used.

> And the point is that when you receive a string, you don't know before
> looking at the BOM (or other recognition techniques) whether the octets
> you received are pure Latin-1 or UTF_8 encoded. So you need to store it
> in a plain String.

That is not a string at all, it is a stream array or an array of octets.

> We discussed that point, and the agreement was that making a different
> type would force the user to many conversions that would bring nothing
> but trouble, and make Ada once again look impractical out of excessive
> purism.

Exactly my point. Explicit conversions are necessary because Ada's type 
system is unable to model strings in a type-safe way.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Strange crash on custom iterator
  2018-07-04 10:01                                         ` Dmitry A. Kazakov
@ 2018-07-04 11:30                                           ` J-P. Rosen
  2018-07-04 13:27                                             ` Dmitry A. Kazakov
  2018-07-04 17:51                                             ` Jacob Sparre Andersen
  0 siblings, 2 replies; 73+ messages in thread
From: J-P. Rosen @ 2018-07-04 11:30 UTC (permalink / raw)


Le 04/07/2018 à 12:01, Dmitry A. Kazakov a écrit :
> But UTF-8 is actually more efficient in most cases than
> Wide_Wide_String. Random string indexing is practically never used.
!!!! I, and many others, often need to search substrings within a
string; actually, I would have a hard time finding an example of string
manipulation without indexing...

>> We discussed that point, and the agreement was that making a different
>> type would force the user to many conversions that would bring nothing
>> but trouble, and make Ada once again look impractical out of excessive
>> purism.
> 
> Exactly my point. Explicit conversion are necessary because Ada's type
> system is unable to model strings in a type-safe way.
So, you want different types, plus a typing system that would allow to
mix the types and make them compatible... You might as well put
everything in the same type!

Anyway, the ARG has to deal with Ada as it is, not as Dmitry dreams it
should be...

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: Strange crash on custom iterator
  2018-07-04  1:59                                           ` Lucretia
  2018-07-04  7:37                                             ` Dmitry A. Kazakov
@ 2018-07-04 12:46                                             ` Dan'l Miller
  2018-07-04 13:37                                             ` Dennis Lee Bieber
  2 siblings, 0 replies; 73+ messages in thread
From: Dan'l Miller @ 2018-07-04 12:46 UTC (permalink / raw)


On Tuesday, July 3, 2018 at 8:59:59 PM UTC-5, Lucretia wrote:
> On Wednesday, 4 July 2018 02:26:52 UTC+1, Dan'l Miller  wrote:
> 
> > > I know the difference between code points and octets and their arrays. I was saying that UTF_String is
> > > not a valid subtype of String because String is Latin 1 and UTF_String is a superset of 7-bit ASCII, not
> > > 8-bit Latin 1.
> > 
> > Well, there are 2 ways of looking at UTF-8: before versus after parsing.
> > 
> > is not a superset:
> > One is whether each 8-bit value in Latin-1 has the same value in the UTF-8 octet-by-octet representation •prior• to parsing.  Using this analysis, all of the upper 128 values have a different meaning than in Latin-1.
> 
> You're answering a question that wasn't asked.
>  
> > is a superset:
> > But the other way of looking at UTF-8 is what character is represented by the multi-byte encoding •after• parsing.  In this view, the lowest 256 values of Unicode/ISO10646 conform to Latin-1 (with some quibbling over whether the mark-parity control codes from 16#80 to 16#9F have precisely the same meaning versus reserved/unencoded at various editions of various standards).
> 
> And again, wasn't asked.

It is quite on-topic though.  This difference of looking at it from the pre-parsed versus post-parsed perspectives is at the heart of the difference of opinion of Luke/Dmitry (String) versus J-P. Rosen (Wide_String and Wide_Wide_String) arising throughout this thread.



* Re: Strange crash on custom iterator
  2018-07-04 11:30                                           ` J-P. Rosen
@ 2018-07-04 13:27                                             ` Dmitry A. Kazakov
  2018-07-04 14:37                                               ` Dan'l Miller
  2018-07-04 17:51                                             ` Jacob Sparre Andersen
  1 sibling, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04 13:27 UTC (permalink / raw)


On 2018-07-04 13:30, J-P. Rosen wrote:
> Le 04/07/2018 à 12:01, Dmitry A. Kazakov a écrit :
>> But UTF-8 is actually more efficient in most cases than
>> Wide_Wide_String. Random string indexing is practically never used.
> !!!! I, and many others, often need to search substrings within a
> string; actually, I would have a hard time finding an example of string
> manipulation without indexing...
> 
>>> We discussed that point, and the agreement was that making a different
>>> type would force the user to many conversions that would bring nothing
>>> but trouble, and make Ada once again look impractical out of excessive
>>> purism.
>>
>> Exactly my point. Explicit conversion are necessary because Ada's type
>> system is unable to model strings in a type-safe way.
> So, you want different types, plus a typing system that would allow to
> mix the types and make them compatible.

Yes, because they are semantically the same: arrays of code points.

> .. You might as well put
> everything in the same type!

No, because they must have different representations.

> Anyway, the ARG has to deal with Ada as it is, not as Dmitry dreams it
> should be...

It requires someone more influential, wise and knowledgeable than me to 
make and then push such a proposal. I would be satisfied if more people 
saw the roots of problems with strings etc.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Strange crash on custom iterator
  2018-07-04  1:59                                           ` Lucretia
  2018-07-04  7:37                                             ` Dmitry A. Kazakov
  2018-07-04 12:46                                             ` Dan'l Miller
@ 2018-07-04 13:37                                             ` Dennis Lee Bieber
  2 siblings, 0 replies; 73+ messages in thread
From: Dennis Lee Bieber @ 2018-07-04 13:37 UTC (permalink / raw)


On Tue, 3 Jul 2018 18:59:57 -0700 (PDT), Lucretia
<laguest9000@googlemail.com> declaimed the following:

>
>I've just read it. Yeah, I agree that Ada should be able to extend records with data, not functions/procedures, but I don't see how the lack of that is a hindrance to creating a decent unicode lib. The fact that he refuses to answer such a simple question, i.e "WTF are you on about?" explains a lot.

	Ah, but what IS a "decent unicode lib(rary)"?

	There is a regular over in comp.lang.python who tends to rant that
Python 3.3 "broke" unicode handling because some of his
non-real-world benchmarks run slower.

	Current Python3 internally uses 1, 2, or 4 bytes per character in a
string based upon the widest individual character. If everything fits in
8-bits, it uses 1-byte/char strings. If even one character requires
16-bits, the entire string will use 2-byte/char. Of course, since strings
are immutable in Python, there is no concern about replacing one char in a
1-byte/char string with a 2-byte char -- one has to create a whole new
string, which operation detects the presence of a 2-byte char and allocates
all characters as 2-byte wide.

	The scheme allows for direct indexing of characters -- no confusion of
indexing a prefix byte, or misinterpreting a suffix byte.

	I don't think such a string type would go far in Ada: one loses
mutation in place, and also can not define memory usage limits ahead of
time (unless one provides for worst case 4-byte/char -- in which case one
might just use that all the way through and retain in place mutation).
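
	For what it is worth, a rough Ada 2012 sketch of such a width-selected string, using a discriminated record. All names (Width, Flex_String, Narrowest, To_Flex) are invented for the example and this is not how any existing library does it:

```ada
with Ada.Text_IO;
with Ada.Characters.Conversions;

procedure Flexible_Width is
   type Width is (One_Byte, Two_Bytes, Four_Bytes);

   --  The discriminants fix both the element width and the length,
   --  so an object cannot be widened in place once created.
   type Flex_String (W : Width; Length : Natural) is record
      case W is
         when One_Byte   => Narrow : String (1 .. Length);
         when Two_Bytes  => Wide   : Wide_String (1 .. Length);
         when Four_Bytes => Full   : Wide_Wide_String (1 .. Length);
      end case;
   end record;

   --  Narrowest width able to hold every character of Item.
   function Narrowest (Item : Wide_Wide_String) return Width is
      Result : Width := One_Byte;
   begin
      for C of Item loop
         if Wide_Wide_Character'Pos (C) > 16#FFFF# then
            return Four_Bytes;
         elsif Wide_Wide_Character'Pos (C) > 16#FF# then
            Result := Two_Bytes;
         end if;
      end loop;
      return Result;
   end Narrowest;

   function To_Flex (Item : Wide_Wide_String) return Flex_String is
      use Ada.Characters.Conversions;
   begin
      case Narrowest (Item) is
         when One_Byte =>
            return (W => One_Byte, Length => Item'Length,
                    Narrow => To_String (Item));
         when Two_Bytes =>
            return (W => Two_Bytes, Length => Item'Length,
                    Wide => To_Wide_String (Item));
         when Four_Bytes =>
            return (W => Four_Bytes, Length => Item'Length,
                    Full => Item);
      end case;
   end To_Flex;

   Plain : constant Flex_String := To_Flex ("abc");
   Mixed : constant Flex_String :=
     To_Flex ("a" & Wide_Wide_Character'Val (16#1F600#));
begin
   pragma Assert (Plain.W = One_Byte and Plain.Length = 3);
   pragma Assert (Mixed.W = Four_Bytes and Mixed.Length = 2);

   Ada.Text_IO.Put_Line ("Plain uses " & Width'Image (Plain.W));
   Ada.Text_IO.Put_Line ("Mixed uses " & Width'Image (Mixed.W));
end Flexible_Width;
```

	Changing one character of Plain to something outside Latin-1 would mean building a whole new object with different discriminants, which is the loss of mutation in place mentioned above.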


-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
	wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/ 



* Re: Strange crash on custom iterator
  2018-07-04 13:27                                             ` Dmitry A. Kazakov
@ 2018-07-04 14:37                                               ` Dan'l Miller
  2018-07-04 14:43                                                 ` Dan'l Miller
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Dan'l Miller @ 2018-07-04 14:37 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 8:27:53 AM UTC-5, Dmitry A. Kazakov wrote:
> On 2018-07-04 13:30, J-P. Rosen wrote:
> > Le 04/07/2018 à 12:01, Dmitry A. Kazakov a écrit :
> >> But UTF-8 is actually more efficient in most cases than
> >> Wide_Wide_String. Random string indexing is practically never used.
> > !!!! I, and many others, often need to search substrings within a
> > string; actually, I would have a hard time finding an example of string
> > manipulation without indexing...
> > 
> >>> We discussed that point, and the agreement was that making a different
> >>> type would force the user to many conversions that would bring nothing
> >>> but trouble, and make Ada once again look impractical out of excessive
> >>> purism.
> >>
> >> Exactly my point. Explicit conversion are necessary because Ada's type
> >> system is unable to model strings in a type-safe way.
> > So, you want different types, plus a typing system that would allow to
> > mix the types and make them compatible.
> 
> Yes, because they are semantically same: arrays of code points.
> 
> > .. You might as well put
> > everything in the same type!
> 
> No, because they must have different representations.
> 
> > Anyway, the ARG has to deal with Ada as it is, not as Dmitry dreams it
> > should be...
> 
> It requires someone more influential, wise and knowledgeable than me to 
> make and then push such a proposal. I would be satisfied if more people 
> saw the roots of problems with strings etc.

I think that perhaps /all/ readers of this see at least one •problem• with UTF-8 (and perhaps Unicode/ISO10646 in general in Ada, regardless of choice of encoding) in Ada's String (and perhaps Wide_String and Wide_Wide_String too).

The difficulty is that •no one• has the single •solution• for this problem or these concomitant problems.  Not even J-P. Rosen is a possessor of a complete solution in his Wide_Wide_String recommendation, because his replies seem to factually-incorrectly imply that there exists a fully-normalized single-codepoint character in Unicode/ISO10646 for each grapheme/letter.  The following article provides 7 examples in 4 languages (2 of which are European languages, no less!) where a single grapheme's most-compact representation in Unicode/ISO10646 is a multi-codepoint sequence.

The absolutely most infamous of these 7 examples is the Lithuanian one.  Because through flukes of sociopolitical history, Vietnamese, French, German, and so forth all had pre-1992 ISO standards or IBM-Microsoft-Apple code-pages for their letters with diacritics, their languages' letters with diacritics got standardized in Unicode/ISO10646 as single codepoints, e.g., ü as U+00FC instead of u U+0075 followed by the combining diaeresis U+0308.  Poor old Lithuania was under Soviet occupation from 1944 to 1991, during which the Soviets tried to suppress the Lithuanian language.  Due to this suppression, the Soviet character-encoding standards never standardized encodings for Lithuanian letters with all the Lithuanian-specific diacritical marks, such as the 2 example letters given in that article.  Because the timespan was so short from the end of the Soviet occupation of Lithuania in 1991 to the 1992 cut-off for pre-existing character-encoding standards whose characters Unicode/ISO10646 had to encode as single codepoints, poor old Lithuanian characters are 2nd-class citizens in Unicode/ISO10646, whereas all the Western European languages (and their former colonies) with diacritical marks are first-class citizens in Unicode/ISO10646.  This is a cause of somewhat of a protracted slow-motion multidecade trench warfare between Lithuania and Unicode/ISO10646 over this issue, made worse every time someone elsewhere on the planet whips up a brand-new character-with-single-codepoint that has never ever existed in the history of humankind and then standardizes this brand-new contrived grapheme-with-single-codepoint in Unicode/ISO10646.
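
To make the single-codepoint versus multi-codepoint distinction concrete, here is a small sketch using Python's standard `unicodedata` module. The second sequence (a with ogonek and acute) is an assumption picked for illustration and is not necessarily one of the article's examples; the helper name `code_points` is likewise invented here.

```python
import unicodedata


def code_points(text: str) -> list:
    """Render each code point of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# u followed by a combining diaeresis composes to one precomposed code point.
u_diaeresis = unicodedata.normalize("NFC", "u\u0308")

# a + combining ogonek + combining acute: NFC composes the first two into
# U+0105, but no precomposed code point covers all three, so two remain.
a_ogonek_acute = unicodedata.normalize("NFC", "a\u0328\u0301")

print(code_points(u_diaeresis))
print(code_points(a_ogonek_acute))
```

Even in a 32-bit-per-element string such as Wide_Wide_String, the second grapheme would therefore occupy two elements.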

Oh, but Japan and Silicon Valley can devise emojis galore in recent years and not be restricted by strict enforcement of this no-preexisting-character-encoding rule.  Why?  I guess because emojis are cool, but Lithuanian characters are booooorrrrrrrring.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 14:37                                               ` Dan'l Miller
@ 2018-07-04 14:43                                                 ` Dan'l Miller
  2018-07-04 14:57                                                 ` J-P. Rosen
  2018-07-04 15:41                                                 ` Lucretia
  2 siblings, 0 replies; 73+ messages in thread
From: Dan'l Miller @ 2018-07-04 14:43 UTC (permalink / raw)


Oh, it would help if I would press the paste key:
http://unicode.org/standard/where


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 14:37                                               ` Dan'l Miller
  2018-07-04 14:43                                                 ` Dan'l Miller
@ 2018-07-04 14:57                                                 ` J-P. Rosen
  2018-07-04 15:41                                                 ` Lucretia
  2 siblings, 0 replies; 73+ messages in thread
From: J-P. Rosen @ 2018-07-04 14:57 UTC (permalink / raw)


Le 04/07/2018 à 16:37, Dan'l Miller a écrit :
> The difficulty is that •no one• has the single •solution• for this
> problem or these concomitant problems.  Not even J-P. Rosen is a
> possessor of complete solution in his Wide_Wide_String
> recommendation, because his replies seem to factually-incorrectly
> imply that there exists a fully-normalized single-codepoint character
> in Unicode/ISO10646 for each grapheme/letter.

You are right that characters not in normalized form (not only
Lithuanian ones!) may have a representation as several code points, which
implies O(N) for some operations... But if the string is encoded in UTF-8,
you need an extra O(N) operation to first decode the code points. The
difference between the two is still there.
-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 14:37                                               ` Dan'l Miller
  2018-07-04 14:43                                                 ` Dan'l Miller
  2018-07-04 14:57                                                 ` J-P. Rosen
@ 2018-07-04 15:41                                                 ` Lucretia
  2018-07-04 16:55                                                   ` Dan'l Miller
  2 siblings, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-07-04 15:41 UTC (permalink / raw)


On Wednesday, 4 July 2018 15:37:40 UTC+1, Dan'l Miller  wrote:

> The difficulty is that •no one• has the single •solution• for this problem or these concomitant problems.  Not even J-P. Rosen is a possessor of complete solution in his Wide_Wide_String recommendation, because his replies seem to factually-incorrectly imply that there exists a fully-normalized single-codepoint character in Unicode/ISO10646 for each grapheme/letter.

JP Rosen told me to go read the AI on the matter, which I did. He states they talked about it, but there's not much talking in the AI at all! Bob Dewar states they shouldn't really abuse the *String types by subtyping, and then does exactly that by introducing a package he wrote to handle UTF using those subtypes. The rest of the AI is about how to fit that into the standard.

Back then, they should've chosen the Unicode standard over ISO10646, as it's freely available. Yes, the encodings are interchangeable, but that's not really the point.

They should've decided to obsolete the current mess, the same way they did with ASCII, and made String and Unbounded_String UTF-8 encoded. They could still have the old Latin-based strings as compatibility types. They should've made all source be encoded the same way, which they did anyway for the ISO spec.

Then defined a bunch of iterators for the types based on code points, grapheme clusters, word/line boundaries, bidi, etc.

Then taken out all references to characters as that concept isn't really applicable to Unicode as a "character" can be one or more code points.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 15:41                                                 ` Lucretia
@ 2018-07-04 16:55                                                   ` Dan'l Miller
  2018-07-04 18:01                                                     ` Shark8
  0 siblings, 1 reply; 73+ messages in thread
From: Dan'l Miller @ 2018-07-04 16:55 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 10:41:49 AM UTC-5, Lucretia wrote:
> On Wednesday, 4 July 2018 15:37:40 UTC+1, Dan'l Miller  wrote:
> 
> > The difficulty is that •no one• has the single •solution• for this problem or these concomitant
> > problems.  Not even J-P. Rosen is a possessor of complete solution in his Wide_Wide_String
> > recommendation, because his replies seem to factually-incorrectly imply that there exists a fully
> > normalized single-codepoint character in Unicode/ISO10646 for each grapheme/letter.
> 
> JP Rosen told me to go read the AI on the matter, which I did. He states they talked about it, there's not
> much talking in the AI at all! Bob Dewar states they shouldn't really abuse the *String types by subtyping
> and does exactly that by introducing a package he wrote to handle UTF using those subtypes. The rest
> of the AI is about how to fit that into the standard.
> 
> Back then, they should've chosen the Unicode standard over the ISO10646 as it's freely available, yes
> the encodings are interchangeable, but that's not really the point. 

1) As a fellow ISO standard (ISO8652), Ada is compelled by ISO rules to comply with ISO standards (instead of other standards bodies) when an ISO standard exists for that topic.

2) In the end, what difference to Ada would actually occur by the ARG considering Unicode the normative reference instead of ISO10646?  The Unicode-specific extensions are higher in the food chain (e.g., bidirectional algorithms) than Ada's libraries (or language) have ever bitten off to chew.

> They should've decided to obsolete the current mess, the same way they did with ASCII and made String
> and Unbounded_String UTF-8 encoded. They could still have the old latin based strings as compatibility
> types. They should've made all source be encoded the same way, which they did anyway for the iso
> spec.
> 
> Then defined a bunch of iterators for the types based on code points, grapheme clusters, word/line
> boundaries, bidi, etc.

Yes, parsing/decoding iterators over UTF-8 and UTF-16 would be awesome.  Where-is-the-next-fully-formed-grapheme iterators would be awesome for UTF-32 and UCS4, to ease processing of combining characters (both in never-single-codepoint graphemes and in not-normalized-but-could-have-been multi-codepoint sequences).  But then again, why bother waiting a decade or two for the standard library?  Ada could have a Boost-esque library outside of the ISO8652 standard, where, say, Luke & Dmitry contribute such a better solution.
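
For illustration only, a decoding iterator over UTF-8 of the kind discussed here might look like the following Python sketch. The function name `iter_code_points` is invented for this example, and the decoder is deliberately minimal: it checks lead and continuation octets but does not reject overlong encodings or surrogates, and a grapheme-level iterator would need the segmentation rules on top of it.

```python
from typing import Iterator, Tuple


def iter_code_points(data: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (octet_offset, code_point) pairs from UTF-8 encoded bytes."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                 # 0xxxxxxx: one octet (ASCII)
            length, cp = 1, lead
        elif lead >> 5 == 0b110:        # 110xxxxx: two octets
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:       # 1110xxxx: three octets
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:      # 11110xxx: four octets
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead octet at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for octet in data[i + 1:i + length]:
            if (octet & 0xC0) != 0x80:  # continuation octets are 10xxxxxx
                raise ValueError(f"invalid continuation octet after offset {i}")
            cp = (cp << 6) | (octet & 0x3F)
        yield i, cp
        i += length


sample = "a\u00fc\u20ac\U0001F600".encode("utf-8")
decoded = list(iter_code_points(sample))
```

The cursor here is simply the octet offset, so the caller can resume iteration from any offset that a previous step yielded.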


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 11:30                                           ` J-P. Rosen
  2018-07-04 13:27                                             ` Dmitry A. Kazakov
@ 2018-07-04 17:51                                             ` Jacob Sparre Andersen
  2018-07-04 18:06                                               ` Shark8
  2018-07-05 18:06                                               ` Randy Brukardt
  1 sibling, 2 replies; 73+ messages in thread
From: Jacob Sparre Andersen @ 2018-07-04 17:51 UTC (permalink / raw)


J-P. Rosen <rosen@adalog.fr> writes:

> !!!! I, and many others, often need to search substrings within a
> string; actually, I would have a hard time finding an example of
> string manipulation without indexing...

When you search for a substring within a string, you're typically
treating it in a very sequential manner.  Maintaining a "cursor"
pointing at the octet position in the UTF-8 encoded string would be just
as practical in most (all?) of the string processing I can remember
doing.

Counting the number of code points(?) in a string takes longer, but
if you want the actual number of graphemes in the string, a
Wide_Wide_String is practically just as slow as a UTF-8 encoded
string.
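
A minimal sketch of the octet-cursor idea, in Python and with invented
function names: because lead octets and continuation octets never
overlap in valid UTF-8, a plain octet-wise search for a complete encoded
substring cannot match in the middle of a character, while counting code
points needs a pass over the whole string.

```python
def find_octet_offset(haystack: bytes, needle: bytes, start: int = 0) -> int:
    """Return the octet offset of needle in haystack, or -1 if absent.

    Both arguments are assumed to be valid, complete UTF-8 sequences.
    """
    return haystack.find(needle, start)


def count_code_points(data: bytes) -> int:
    """Count code points by skipping continuation octets (10xxxxxx)."""
    return sum(1 for octet in data if (octet & 0xC0) != 0x80)


text = "na\u00efve caf\u00e9".encode("utf-8")
cursor = find_octet_offset(text, "caf\u00e9".encode("utf-8"))
total_octets = len(text)
total_code_points = count_code_points(text)
```

The cursor is an octet offset (7 here), not a code point index (6), which
is fine as long as it is only used to slice the same encoded string.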

> So, you want different types, plus a typing system that would allow to
> mix the types and make them compatible... You might as well put
> everything in the same type!

It would be nice if the encoding and character set of a string were
"implementation details".  I'm not sure how to do it, but I think it is
worth trying to find a solution for Ada.  (I think I was introduced to
how the KDE library does it once, but IIRC only encoding was abstracted
away.)

Greetings,

Jacob
-- 
»Saving keystrokes is the job of the text editor, not the
 programming language.«                    -- Preben Randhol

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 16:55                                                   ` Dan'l Miller
@ 2018-07-04 18:01                                                     ` Shark8
  2018-07-04 18:57                                                       ` Dmitry A. Kazakov
  0 siblings, 1 reply; 73+ messages in thread
From: Shark8 @ 2018-07-04 18:01 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 10:55:08 AM UTC-6, Dan'l Miller wrote:
> 
> 1) As a fellow ISO standard (ISO8652), Ada is compelled by ISO rules to comply with ISO standards (instead of other standards bodies) when an ISO standard exists for that topic.

Except that, if you'll allow me to be blunt and politically incorrect, Unicode is a terrible [non-]solution to the problem. Mark my words: Building/standardizing on Unicode will only bring pain and suffering.

"The purpose of standardization is to aid the creative craftsman, *not* to enforce the common mediocrity."
— Author unknown; found on a blackboard at Eglin Air Force Base

Unicode is fatally flawed because it does enforce the common mediocrity. Much like, e.g., strings for representing paths is the common way to do things, so too is Unicode overly tied to the wrongheaded manner of doing things: discarding actual structure in favor of ad hoc calculation, eliminating semantically useful [and needed] information and hoping to be able to recover it with later processing.

As an example, the sentence "The Hebrew word for 'man' is 'אדם' (Adam)."  is *NOT* merely a sequence of graphemes, codepoints, and/or bytes. It is a semantically meaningful text consisting of multiple languages... and *this* is what Unicode discards.

A much better way to handle something like this would be a sort of multi-lingual 'string'/'sequence' type, where the above would be in a Lisp-ish structure: ((English-string "The Hebrew word for " (quotation "man") " is "), (quotation (Hebrew-string "אדם")), (English-string (parenthetical "Adam"))).

But Unicode discards all that information, instead opting for ('T', 'h', 'e', ' ', 'H', 'e', 'b', ...) and offloading the structure-recovery to whatever text-processing / display-method API there is.
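
One hypothetical way to picture the language-tagged structure described above, sketched in Python: the `Run` type, the language tags and `flatten` are all invented for this illustration and are much simpler than the nested Lisp-ish form. Flattening the runs gives the plain code point sequence, at which point the language tags are gone.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """A stretch of text together with the language it is written in."""
    language: str
    text: str


def flatten(runs: List[Run]) -> str:
    """Concatenate the runs into a plain string, dropping the language tags."""
    return "".join(run.text for run in runs)


sentence = [
    Run("en", "The Hebrew word for 'man' is '"),
    Run("he", "\u05d0\u05d3\u05dd"),
    Run("en", "' (Adam)."),
]

languages = [run.language for run in sentence]
plain = flatten(sentence)
```

Going from `sentence` to `plain` is trivial; recovering `languages` from `plain` alone requires guessing from the scripts of the code points.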

But this is all par for the course within computer-science and "the industry" -- Welcome to the wonderful world of unix/C "small tools" and "pipes" where text processing is mandatory and at every step of the problem you discard all type-information, forcing everything downstream to re-parse the text -- all over again; "enforce the common mediocrity" thy name is Unix/C.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 17:51                                             ` Jacob Sparre Andersen
@ 2018-07-04 18:06                                               ` Shark8
  2018-07-04 18:59                                                 ` Dan'l Miller
                                                                   ` (2 more replies)
  2018-07-05 18:06                                               ` Randy Brukardt
  1 sibling, 3 replies; 73+ messages in thread
From: Shark8 @ 2018-07-04 18:06 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 11:51:20 AM UTC-6, Jacob Sparre Andersen wrote:
> 
> It would be nice if the encoding and character set of a string were
> "implementation details".  I'm not sure how to do it, but I think it is
> worth trying to find a solution for Ada.  (I think I was introduced to
> how the KDE library does it once, but IIRC only encoding was abstracted
> away.)

Indeed so!
This is the way we /should/ have strings; where [[Wide_]Wide_]String are all generic with things like 'character-set' and 'search' and 'encoding' as formal parameters.

Sadly this will likely never happen because it would break backwards compatibility.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 18:01                                                     ` Shark8
@ 2018-07-04 18:57                                                       ` Dmitry A. Kazakov
  2018-07-04 19:53                                                         ` Shark8
  0 siblings, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04 18:57 UTC (permalink / raw)


On 2018-07-04 20:01, Shark8 wrote:
> On Wednesday, July 4, 2018 at 10:55:08 AM UTC-6, Dan'l Miller wrote:

> As an example, the sentence "The Hebrew word for 'man' is 'אדם' (Adam)."  is *NOT* merely a sequence of graphemes, codepoints, and/or bytes. It is a semantically meaningful text consisting of multiple languages... and *this* is what Unicode discards.

And rightly so. Like 91093835.6 is just a number rather than the
"meaningful" mass of a stationary electron.

One fundamental principle of software design is abstraction in the sense 
of throwing away unnecessary information. A printer may know nothing 
about Hebrew.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 18:06                                               ` Shark8
@ 2018-07-04 18:59                                                 ` Dan'l Miller
  2018-07-04 19:01                                                 ` Dmitry A. Kazakov
  2018-07-04 21:00                                                 ` Jacob Sparre Andersen
  2 siblings, 0 replies; 73+ messages in thread
From: Dan'l Miller @ 2018-07-04 18:59 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 1:06:17 PM UTC-5, Shark8 wrote:
> On Wednesday, July 4, 2018 at 11:51:20 AM UTC-6, Jacob Sparre Andersen wrote:
> > 
> > It would be nice if the encoding and character set of a string were
> > "implementation details".  I'm not sure how to do it, but I think it is
> > worth trying to find a solution for Ada.  (I think I was introduced to
> > how the KDE library does it once, but IIRC only encoding was abstracted
> > away.)
> 
> Indeed so!
> This is the way we /should/ have strings; where [[Wide_]Wide_]String are all generic with things like 'character-set' and 'search' and 'encoding' as formal parameters.
> 
> Sadly this will likely never happen because it would break backwards compatibility.

Then do it outside of the standardization process in a Boost-esque library on GitHub/GitLab/SourceForge to launch a de facto standard that establishes ISO's vaunted ‘established industry practice’.  If C++ can do it, then so can Ada.

That being said, I believe that a far better model than Boost's exists for the cream rising to the top.  Instead of battle-of-the-emails-establishes-king-of-the-hill dominance hierarchies (with all due respect to the esteemed Jordan Peterson), I would recommend multiple concurrently-competing library designs, then a rigorous (repeated? annual?) bake-off among the competitors, evaluating multiple criteria:  runtime performance, engineering-time design flexibility/tunability/ease-of-use, maintainability over time.  Oh, call them the Yellow, Blue, Red, and Green libraries.  Who did •that• before for language definition?  …  but for potential standard library content instead this time.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 18:06                                               ` Shark8
  2018-07-04 18:59                                                 ` Dan'l Miller
@ 2018-07-04 19:01                                                 ` Dmitry A. Kazakov
  2018-07-05 18:08                                                   ` Randy Brukardt
  2018-07-04 21:00                                                 ` Jacob Sparre Andersen
  2 siblings, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04 19:01 UTC (permalink / raw)


On 2018-07-04 20:06, Shark8 wrote:
> On Wednesday, July 4, 2018 at 11:51:20 AM UTC-6, Jacob Sparre Andersen wrote:
>>
>> It would be nice if the encoding and character set of a string were
>> "implementation details".  I'm not sure how to do it, but I think it is
>> worth trying to find a solution for Ada.  (I think I was introduced to
>> how the KDE library does it once, but IIRC only encoding was abstracted
>> away.)
> 
> Indeed so!
> This is the way we /should/ have strings; where [[Wide_]Wide_]String are all generic with things like 'character-set' and 'search' and 'encoding' as formal parameters.
> 
> Sadly this will likely never happen because it would break backwards compatibility.

It would break nothing. Old packages will become renamings of new 
instances. Well, except for dire deforestation should a new RM ever be 
printed...

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04  7:53                                     ` Dmitry A. Kazakov
  2018-07-04  9:55                                       ` J-P. Rosen
@ 2018-07-04 19:02                                       ` G. B.
  2018-07-04 19:16                                         ` Dmitry A. Kazakov
  1 sibling, 1 reply; 73+ messages in thread
From: G. B. @ 2018-07-04 19:02 UTC (permalink / raw)


Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:

> Back to the square 
> one, how to design an UTF-8 string type?

Never.

What is the proper representation of 3?

Which role does a UTF play, other than during I/O operations? So, that’s a
type that stands for certain I/O operations of certain objects...
Practically, that’s  properly typed proper procedures, no?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 19:02                                       ` G. B.
@ 2018-07-04 19:16                                         ` Dmitry A. Kazakov
  2018-07-04 20:40                                           ` G. B.
  0 siblings, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04 19:16 UTC (permalink / raw)


On 2018-07-04 21:02, G. B. wrote:
> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
> 
>> Back to the square
>> one, how to design an UTF-8 string type?
> 
> Never.
> 
> What is the proper representation of 3?

What is 3 here?

> Which role does a UTF play, other than during I/O operations?

UTF-8 is a preferable encoding for most text processing purposes. Should 
string types never be fixed, a quick and dirty solution would be 
throwing wide string types away and declaring String with all its 
bastards (Unbounded_String etc) UTF-8.

> So, that’s a
> type that stands for certain I/O operations of certain objects...

No, it is a type that stands for string.

> Practically, that’s  properly typed proper procedures, no?

You lost me here again.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 18:57                                                       ` Dmitry A. Kazakov
@ 2018-07-04 19:53                                                         ` Shark8
  2018-07-04 20:05                                                           ` Lucretia
  2018-07-04 20:43                                                           ` Dmitry A. Kazakov
  0 siblings, 2 replies; 73+ messages in thread
From: Shark8 @ 2018-07-04 19:53 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 12:57:40 PM UTC-6, Dmitry A. Kazakov wrote:
> On 2018-07-04 20:01, Shark8 wrote:
> > On Wednesday, July 4, 2018 at 10:55:08 AM UTC-6, Dan'l Miller wrote:
> 
> > As an example, the sentence "The Hebrew word for 'man' is 'אדם' (Adam)."  is *NOT* merely a sequence of graphemes, codepoints, and/or bytes. It is a semantically meaningful text consisting of multiple languages... and *this* is what Unicode discards.
> 
> And rightly so. Like 91093835.6 is just a number instead "meaningful": 
> the mass of a stationary electron.
> 
> One fundamental principle of software design is abstraction in the sense 
> of throwing away unnecessary information. A printer may know nothing 
> about Hebrew.

Interesting how you're ready, willing and able to conflate all portions of data-storage/-management into a single operation: printing.

But let's take a step backward; what about displaying the text? One certainly could argue that Unicode is a good solution in this arena, after all having the ability to encode all of human language is its stated design-goal, so surely it must be well-suited to that, right?

Not really.
The mere fact of "combining characters" makes Unicode no more suited to textual display than a sort of hypothetical Forth/PostScript where each word/token/character is processed by the display driver and rendered appropriately. (The aforementioned Lisp-like structure being executed is the procedure: "painting/displaying the English text, switch-to-Hebrew, print/display Hebrew text, switch-to-English, print/display English text", which of course can be further decomposed to "print 'T' [horizontal-stroke, vertical stroke] print 'h' [vertical stroke, curved stroke, vertical stroke] print 'e' [horizontal stroke, curved stroke] ....")

This is the essential idea behind PostScript printers; and it works well. (The same/analogous procedure must be executed in SW and transmitted to the printer in non-PostScript printers; usually using some proprietary printer-control-language, which is essentially what printer-drivers *ARE*.)

So, even working backward from your example of printing, your claim that "knowledge of Hebrew is unneeded" is... well, dubious. It's certainly needed somewhere along the line for this example. My contention is that "sequence of codepoints + font" is flatly stupid for a multi-language system.

Arguably it's stupid for a single-language system, too. As an example we could use paths: "root\projects\x\source" is flatly moronic*, and we can see this by how it pops up in multi-platform development: should there be a terminal '\'? Do those '\' characters need to be escaped? Do they need to be replaced with '/'? What we have is a sequence (root, projects, x, source) which corresponds to a path down a tree, but the common "industry practice" is to think of this as a string of characters and "parse/reparse/regex/reparse/whatever" textual manipulations to read what the structure is rather than sensibly save the structural information.

* forced upon us by stupid, thin APIs to the OS.
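
As a small illustration of the path-as-sequence point, Python's standard `pathlib` keeps the components as a tuple and only produces a separator when a string is requested. This is a sketch of the idea, not a claim about how any Ada library does it.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Parse the backslash-separated form once; afterwards work with the components.
components = PureWindowsPath(r"root\projects\x\source").parts

# The same sequence can be rendered with whichever separator a platform wants.
as_posix = str(PurePosixPath(*components))
as_windows = str(PureWindowsPath(*components))

print(components, as_posix, as_windows)
```

Questions such as trailing separators or escaping only arise at the point where the sequence is turned back into text.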


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Strange crash on custom iterator
  2018-07-04 19:53                                                         ` Shark8
@ 2018-07-04 20:05                                                           ` Lucretia
  2018-07-04 22:04                                                             ` Shark8
  2018-07-04 20:43                                                           ` Dmitry A. Kazakov
  1 sibling, 1 reply; 73+ messages in thread
From: Lucretia @ 2018-07-04 20:05 UTC (permalink / raw)


On Wednesday, 4 July 2018 20:53:21 UTC+1, Shark8  wrote:

> But let's take a step backward; what about displaying the text? One certainly could argue that Unicode is a good solution in this arena, after all having the ability to encode all of human language is its stated design-goal, so surely it must be well-suited to that, right?

You're wrong. Unicode is not about displaying text, it even says that in the spec, it's about representation. Stop trying to force Unicode into Lisp or Forth or whatever to try to add meaning to text.


* Re: Strange crash on custom iterator
  2018-07-04 19:16                                         ` Dmitry A. Kazakov
@ 2018-07-04 20:40                                           ` G. B.
  2018-07-04 20:55                                             ` Dmitry A. Kazakov
  0 siblings, 1 reply; 73+ messages in thread
From: G. B. @ 2018-07-04 20:40 UTC (permalink / raw)


Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
> On 2018-07-04 21:02, G. B. wrote:
>> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>> 
>>> Back to the square
>>> one, how to design an UTF-8 string type?
>> 
>> Never.
>> 
>> What is the proper representation of 3?
> 
> What is 3 here?

It names a value of some type.

>> Which role does a UTF play, other than during I/O operations?
> 
> UTF-8 is a preferable encoding for most text processing purposes. 

Like finding the number of characters that some Ada string has?


> Should string types never be fixed, a quick and dirty solution would be 
> throwing wide string types away 

Maybe. Sort of works, in Java.

> and declaring String with all its 
> bastards (Unbounded_String etc) UTF-8.

I’d not want encoding here.

>> Practically, that’s  properly typed proper procedures, no?
> 
> You lost me here again.

A string to be output somewhere may need an encoding. (‘H’, ‘e’, ‘l’, ‘l’,
‘o’) does not need one to be useful, but output is performed by a value of
type File * String * Encoding -> Void: a properly typed procedure.




* Re: Strange crash on custom iterator
  2018-07-04 19:53                                                         ` Shark8
  2018-07-04 20:05                                                           ` Lucretia
@ 2018-07-04 20:43                                                           ` Dmitry A. Kazakov
  1 sibling, 0 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04 20:43 UTC (permalink / raw)


On 2018-07-04 21:53, Shark8 wrote:

> But let's take a step backward; what about displaying the text? One certainly could argue that Unicode is a good solution in this arena, after all having the ability to encode all of human language is its stated design-goal, so surely it must be well-suited to that, right?

Some written languages, some human, some formal.

> Not really.
> The mere fact of "combining characters" makes unicode no more suited to textual display than a sort of hypothetical Forth/PostScript where each word/token/character is processed by the display driver and rendered appropriately. (The aforementioned Lisp-like structure being executed is the procedure: "painting/displaying the english text, switch-to-hebrew, print/display hebrew text, switch-to-english, print/display english text", which of course can be further decomposed to "print 'T' [horizontal-stroke, vertical stroke] print 'h' [vertical stroke, curved stroke, vertical stroke] print 'e' [horizontal stroke, curved stroke] ....")

Written languages evolve in order to adapt to the methods of writing. 
Many old methods do not fit well into Unicode. And, honestly, Unicode 
tried way too hard to embrace things that would have been better dropped.

I miss ASCII times, really. It forced English (and nobody cared about 
correct spelling of the word naïve (:-)) [*]

> Arguably it's stupid for a single-language system, too. As an example we could use paths: "root\projects\x\source" is flatly moronic*, and we can see this by how it pops up in multi-platform development: should there be a terminal '\'? do those '\' characters need escaped? do they need to be replaced with '/'? What we have is a sequence (root, projects, x, source) which corresponds to a path down a tree, but the common "industry practice" is to think of this as a string of characters and "parse/reparse/regex/reparse/whatever" textual manipulations to read what the structure is rather than sensibly save the structural information.

Textual representation is ambiguous and this has nothing to do with the 
text, but with its meaning. Path is ambiguous and its meaning (the 
target file) is even more ambiguous. If you want it less ambiguous use 
sector and block numbers and color stickers to mark hard drives. What 
about 001 vs 1 vs 3/2 ... infinite list follows.

Meaning of a text is not the text itself. It is a fallacy. The meaning 
of a numeric literal is not the literal itself. The meaning of a Unicode 
string has nothing to do with Unicode. And, fundamentally, there is an 
infinite hierarchy of object-meta languages that never ends in some 
ultimate, final "Om", expressing everything and nothing.

-----------------
* Do Chinese still write top-to-bottom?

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Strange crash on custom iterator
  2018-07-04 20:40                                           ` G. B.
@ 2018-07-04 20:55                                             ` Dmitry A. Kazakov
  2018-07-04 21:21                                               ` G.B.
  0 siblings, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-04 20:55 UTC (permalink / raw)


On 2018-07-04 22:40, G. B. wrote:
> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>> On 2018-07-04 21:02, G. B. wrote:
>>> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>>>
>>>> Back to the square
>>>> one, how to design an UTF-8 string type?
>>>
>>> Never.
>>>
>>> What is the proper representation of 3?
>>
>> What is 3 here?
> 
> It names a value of some type.

Which type? You name the type, I name the representation.

>>> Which role does a UTF play, other than during I/O operations?
>>
>> UTF-8 is a preferable encoding for most text processing purposes.
> 
> Like finding the number of characters that some Ada string has?

This operation is practically never required in text processing. The 
strength (and a design goal) of UTF-8 is that almost all useful 
operations defined in terms of characters are directly mapped into 
operations defined on octets. The rest may have whatever complexity, 
nobody cares.
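
The octet-level claim can be illustrated with a minimal sketch. It assumes Ada 2012 (for Ada.Strings.UTF_Encoding) and builds the UTF-8 octets by hand so that it does not depend on the source file encoding; the sample text and the procedure name are made up for illustration. Searching for a UTF-8 pattern inside UTF-8 text is just the ordinary octet-wise Index, with no decoding step:

```ada
with Ada.Strings.Fixed;
with Ada.Strings.UTF_Encoding;
with Ada.Text_IO;

procedure UTF8_Search is
   use Ada.Strings.UTF_Encoding;

   --  "naive cafe" with i-diaeresis (octets C3 AF) and e-acute (octets C3 A9),
   --  spelled out as UTF-8 octets.
   Text : constant UTF_8_String :=
     "na" & Character'Val (16#C3#) & Character'Val (16#AF#) & "ve caf"
     & Character'Val (16#C3#) & Character'Val (16#A9#);

   --  The pattern "cafe" with e-acute, encoded the same way.
   Pattern : constant UTF_8_String :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);

   --  Plain octet-wise search on String; nothing here knows about UTF-8.
   Position : constant Natural := Ada.Strings.Fixed.Index (Text, Pattern);
begin
   --  The match is reported as an octet index, not a character index:
   --  the two-octet letter at positions 3 .. 4 shifts everything after it.
   pragma Assert (Position = 8);
   Ada.Text_IO.Put_Line ("Match starts at octet" & Natural'Image (Position));
end UTF8_Search;
```

The result is an octet position (8 here, although the match starts at the seventh character), which is enough for slicing, replacing, and concatenating. Only an operation such as "count the characters" needs a decoding pass.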

>> Should string types never be fixed, a quick and dirty solution would be
>> throwing wide string types away
> 
> Maybe. Sort of works, in Java.
> 
>> and declaring String with all its
>> bastards (Unbounded_String etc) UTF-8.
> 
> I’d not want encoding here.

There is always some. UTF-8 is a choice with the best balance of 
advantages vs disadvantages.

>>> Practically, that’s  properly typed proper procedures, no?
>>
>> You lost me here again.
> 
> A string to be output somewhere may need an encoding. (‘H’, ‘e’, ‘l’, ‘l’,
> ‘o’) does not need one to be useful, but output is performed by a value of
> type File * String * Encoding -> Void: a properly typed procedure.

It is almost never decomposed this way. Encoding is a part of string 
type representation. File I/O is usually untyped or weakly typed. At 
best the encoding is a parameter of file open. Ada text I/O packages are 
designed to deal with a single type of strings with encoding taken from 
the string type. But I still have no idea what you want to say by that.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Strange crash on custom iterator
  2018-07-04 18:06                                               ` Shark8
  2018-07-04 18:59                                                 ` Dan'l Miller
  2018-07-04 19:01                                                 ` Dmitry A. Kazakov
@ 2018-07-04 21:00                                                 ` Jacob Sparre Andersen
  2 siblings, 0 replies; 73+ messages in thread
From: Jacob Sparre Andersen @ 2018-07-04 21:00 UTC (permalink / raw)


Shark8 <onewingedshark@gmail.com> writes:

> This is the way we /should/ have strings; where [[Wide_]Wide_]String
> are all generic with things like 'character-set' and 'search' and
> 'encoding' as formal parameters.

If you made them generic, they would be different types.  That breaks
with one of the wishes Dmitry listed.
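
A small sketch of why generic instances do not give one interchangeable string type. The generic Generic_Text, its formal parameter Encoding, and the enumeration Encoding_Kind are all hypothetical names invented for this illustration, not a proposal from the thread:

```ada
procedure Generic_Strings is

   --  Hypothetical stand-in for an 'encoding' formal parameter.
   type Encoding_Kind is (Latin_1, UTF_8);

   generic
      Encoding : Encoding_Kind;
   package Generic_Text is
      type Text is new String;
      function Encoding_Of (Item : Text) return Encoding_Kind;
   end Generic_Text;

   package body Generic_Text is
      function Encoding_Of (Item : Text) return Encoding_Kind is
      begin
         return Encoding;
      end Encoding_Of;
   end Generic_Text;

   --  Each instantiation declares its own, distinct type Text.
   package Latin_Text is new Generic_Text (Latin_1);
   package UTF8_Text  is new Generic_Text (UTF_8);

   A : constant Latin_Text.Text := "abc";
   B : UTF8_Text.Text (1 .. 3);
begin
   --  B := A;  --  rejected at compile time: two different types
   B := UTF8_Text.Text (A);  --  only an explicit conversion is accepted
   pragma Assert (UTF8_Text.Encoding_Of (B) = UTF_8);
end Generic_Strings;
```

Note that the explicit conversion compiles but merely relabels the octets; it does not transcode anything, which is part of why distinct instance types are an awkward fit here.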

Greetings,

Jacob
-- 
"Magnetohydrodynamics combines the intuitive nature of Maxwell's
 equations with the easy solvability of the Navier-Stokes equations."



* Re: Strange crash on custom iterator
  2018-07-04 20:55                                             ` Dmitry A. Kazakov
@ 2018-07-04 21:21                                               ` G.B.
  2018-07-05  7:55                                                 ` Dmitry A. Kazakov
  0 siblings, 1 reply; 73+ messages in thread
From: G.B. @ 2018-07-04 21:21 UTC (permalink / raw)


On 04.07.18 22:55, Dmitry A. Kazakov wrote:
> On 2018-07-04 22:40, G. B. wrote:
>> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>>> On 2018-07-04 21:02, G. B. wrote:
>>>> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>>>>
>>>>> Back to the square
>>>>> one, how to design an UTF-8 string type?
>>>>
>>>> Never.
>>>>
>>>> What is the proper representation of 3?
>>>
>>> What is 3 here?
>>
>> It names a value of some type.
> 
> Which type? You name the type,

Any type whose objects' values include the value 3
and which does not specify a representation in source,
like Standard.Integer.


>> I’d not want encoding here.
> 
> There is always some.

Not in source, where design is fixed explicitly.

>  But I still have no idea what you want to say by that.

A properly typed procedure object handles the use case of
encoding I/O in a type safe way. The type is not that of
string-with-something composites. It is the type which covers
the use case procedurally.
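
One possible reading of "File * String * Encoding -> Void" in Ada terms, as a sketch only. It assumes Ada 2012 (Ada.Strings.UTF_Encoding.Wide_Wide_Strings) and uses Wide_Wide_String as the "unencoded" text type, which of course still has an in-memory representation; the procedure names are made up:

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Typed_Output is
   use Ada.Strings.UTF_Encoding;

   --  File * String * Encoding -> Void: the text carries no encoding of
   --  its own; one is chosen only at the point of output.
   procedure Put
     (File   : Ada.Text_IO.File_Type;
      Item   : Wide_Wide_String;
      Scheme : Encoding_Scheme)
   is
      --  Encode is the standard function from RM A.4.11.
      Octets : constant UTF_String :=
        Wide_Wide_Strings.Encode (Item, Scheme);
   begin
      Ada.Text_IO.Put (File, Octets);
   end Put;

   Hello : constant Wide_Wide_String := "Hello";
begin
   Put (Ada.Text_IO.Standard_Output, Hello, UTF_8);
   Ada.Text_IO.New_Line;
end Typed_Output;
```

Here the encoding is a parameter of the output operation rather than a property of the string type, which seems to be the decomposition being argued for.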


* Re: Strange crash on custom iterator
  2018-07-04 20:05                                                           ` Lucretia
@ 2018-07-04 22:04                                                             ` Shark8
  2018-07-05  0:12                                                               ` Dan'l Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Shark8 @ 2018-07-04 22:04 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 2:05:17 PM UTC-6, Lucretia wrote:
> On Wednesday, 4 July 2018 20:53:21 UTC+1, Shark8  wrote:
> 
> > But let's take a step backward; what about displaying the text? One certainly could argue that Unicode is a good solution in this arena, after all having the ability to encode all of human language is its stated design-goal, so surely it must be well-suited to that, right?
> 
> You're wrong. Unicode is not about displaying text, it even says that in the spec, it's about representation. Stop trying to force Unicode into Lisp or Forth or whatever to try to add meaning to text.

I didn't say it *was*, I used display as an example.
But you bring up a good point: it's a terrible representation, for all that I've said, and more.



* Re: Strange crash on custom iterator
  2018-07-04 22:04                                                             ` Shark8
@ 2018-07-05  0:12                                                               ` Dan'l Miller
  2018-07-05  1:46                                                                 ` Shark8
  0 siblings, 1 reply; 73+ messages in thread
From: Dan'l Miller @ 2018-07-05  0:12 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 5:04:13 PM UTC-5, Shark8 wrote:
> On Wednesday, July 4, 2018 at 2:05:17 PM UTC-6, Lucretia wrote:
> > On Wednesday, 4 July 2018 20:53:21 UTC+1, Shark8  wrote:
> > 
> > > But let's take a step backward; what about displaying the text? One certainly could argue that Unicode is a good solution in this arena, after all having the ability to encode all of human language is its stated design-goal, so surely it must be well-suited to that, right?
> > 
> > You're wrong. Unicode is not about displaying text, it even says that in the spec, it's about representation. Stop trying to force Unicode into Lisp or Forth or whatever to try to add meaning to text.
> 
> I didn't say it *was*, I used display as an example.
> But you bring up a good point: it's a terrible representation, for all that I've said, and more.

Shark8, it seems that your criticisms were that instead of representing the Hebrew letters, we ought to represent the whole Hebrew word.  Isn't that an entirely different problem-space higher in the food chain?

My qualm with Unicode is that it gets into far more topics than character encoding and then for some odd reason refuses to standardize single-codepoint representation of some languages' letters (and then for some even odder reason standardizes offbeat emojis far beyond the original Japanese single-codepoint representations of old 1980s emoticons).  I guess the million-odd codepoints beyond the BMP are reserved for all the extra-terrestrial space-alien languages, not for us mere mortals on planet Earth.  Poor old Lithuanian needs to not only stand in line behind all the Western European nations (and their former colonies) but also poor old Lithuanian needs to stand in line behind E.T.

Shark8, what would be the better solution for character-encoding itself?  (not whole words)



* Re: Strange crash on custom iterator
  2018-07-05  0:12                                                               ` Dan'l Miller
@ 2018-07-05  1:46                                                                 ` Shark8
  2018-07-05  2:07                                                                   ` Luke A. Guest
  0 siblings, 1 reply; 73+ messages in thread
From: Shark8 @ 2018-07-05  1:46 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 6:12:06 PM UTC-6, Dan'l Miller wrote:
> On Wednesday, July 4, 2018 at 5:04:13 PM UTC-5, Shark8 wrote:
> > On Wednesday, July 4, 2018 at 2:05:17 PM UTC-6, Lucretia wrote:
> > > On Wednesday, 4 July 2018 20:53:21 UTC+1, Shark8  wrote:
> > > 
> > > > But let's take a step backward; what about displaying the text? One certainly could argue that Unicode is a good solution in this arena, after all having the ability to encode all of human language is its stated design-goal, so surely it must be well-suited to that, right?
> > > 
> > > You're wrong. Unicode is not about displaying text, it even says that in the spec, it's about representation. Stop trying to force Unicode into Lisp or Forth or whatever to try to add meaning to text.
> > 
> > I didn't say it *was*, I used display as an example.
> > But you bring up a good point: it's a terrible representation, for all that I've said, and more.
> 
> Shark8, it seems that your criticisms were that instead of representing the Hebrew letters, we ought to represent the whole Hebrew word.  Isn't that an entirely different problem-space higher in the food chain?
> 
> My qualm with Unicode is that it gets into far more topics than character encoding and then for some odd reason refuses to standardize single-codepoint representation of some languages' letters (and then for some even odder reason standardizes offbeat emojis far beyond the original Japanese single-codepoint representations of old 1980s emoticons).  I guess the million-odd codepoints beyond the BMP are reserved for all the extra-terrestrial space-alien languages, not for us mere mortals on planet Earth.  Poor old Lithuanian needs to not only stand in line behind all the Western European nations (and their former colonies) but also poor old Lithuanian needs to stand in line behind E.T.
> 
> Shark8, what would be the better solution for character-encoding itself?  (not whole words)

Whole-word isn't a terrible idea, per se. But the thrust I was getting at is the delineation between languages: with Unicode it's a sequence of codepoints, independent of the actual item (word, sentence, etc) other than [perhaps] how it is graphically presented. That the example is (Eng,Eng,Eng...Eng, Heb,Heb,Heb,Heb, Eng,Eng,Eng...) codepoints is not the problem, though related, because it discards all information in favor of (num, num, num, num, ...) rather than actually considering alternate languages: IMO, ("The Hebrew word for man" (quote ADAM) (quote "Adam") ".") is much better as 'text' because we're preserving structure: [ENGLISH [THIS SECTION HEBREW] ENGLISH].



* Re: Strange crash on custom iterator
  2018-07-05  1:46                                                                 ` Shark8
@ 2018-07-05  2:07                                                                   ` Luke A. Guest
  2018-07-05 16:47                                                                     ` Shark8
  0 siblings, 1 reply; 73+ messages in thread
From: Luke A. Guest @ 2018-07-05  2:07 UTC (permalink / raw)


Shark8 <onewingedshark@gmail.com> wrote:

>> Shark8, what would be the better solution for character-encoding itself?
>>  (not whole words)
> 
> Whole-word isn't a terrible idea, per se. But the thrust I was getting at
> is the delineation between languages: with Unicode it's a sequence of
> codepoints, independent of the actual item (word, sentence, etc) other
> than [perhaps] graphic-presented. That the example is (Eng,Eng,Eng...Eng,
> Heb,Heb,Heb,Heb, Eng,Eng,Eng...) codepoints is not the problem, though
> related, because it discards all information in favor of (num, num, num,
> num, ...) rather than actually considering alternate languages: IMO,
> ("The Hebrew word for man" (quote ADAM) (quote "Adam") ".") is much
> better as 'text' because we're preserving structure: [ENGLISH [THIS
> SECTION HEBREW] ENGLISH].
> 

I don’t understand why you think Unicode should carry linguistic
information when all it has ever been designed to do is encode symbols
across all languages and their direction.


* Re: Strange crash on custom iterator
  2018-07-04 21:21                                               ` G.B.
@ 2018-07-05  7:55                                                 ` Dmitry A. Kazakov
  2018-07-06  8:28                                                   ` G.B.
  0 siblings, 1 reply; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-05  7:55 UTC (permalink / raw)


On 2018-07-04 23:21, G.B. wrote:
> On 04.07.18 22:55, Dmitry A. Kazakov wrote:
>> On 2018-07-04 22:40, G. B. wrote:
>>> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>>>> On 2018-07-04 21:02, G. B. wrote:
>>>>> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>>>>>
>>>>>> Back to the square
>>>>>> one, how to design an UTF-8 string type?
>>>>>
>>>>> Never.
>>>>>
>>>>> What is the proper representation of 3?
>>>>
>>>> What is 3 here?
>>>
>>> It names a value of some type.
>>
>> Which type? You name the type,
> 
> Any type whose objects' values include the value 3
> and which does not specify a representation in source,
> like Standard.Integer.

Any type from a set of types? You mean a class-wide object then. The 
representation of a class-wide object is (Tag, Value). So, name the 
specific type and you get the representation. You cannot skip that step. 
There are no values or representations without types.

>>> I’d not want encoding here.
>>
>> There is always some.
> 
> Not in source, where design is fixed explicitly.
> 
>>  But I still have no idea what you want to say by that.
> 
> A properly typed procedure object handles the use case of
> encoding I/O in a type safe way. The type is not that of
> string-with-something composites. It is the type which covers
> the use case procedurally.

I do not quite understand this either, but it sounds more right than 
wrong. So?

P.S. It would be much easier, if you first stated a proposition and then 
illustrated it with an example, rather than throwing an example without 
any hints as to what class of circumstances this example is supposed to 
represent.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Strange crash on custom iterator
  2018-07-05  2:07                                                                   ` Luke A. Guest
@ 2018-07-05 16:47                                                                     ` Shark8
  2018-07-05 17:19                                                                       ` Dan'l Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Shark8 @ 2018-07-05 16:47 UTC (permalink / raw)


On Wednesday, July 4, 2018 at 8:07:56 PM UTC-6, Luke A. Guest wrote:
> Shark8 wrote:
> 
> >> Shark8, what would be the better solution for character-encoding itself?
> >>  (not whole words)
> > 
> > Whole-word isn't a terrible idea, per se. But the thrust I was getting at
> > is the delineation between languages: with Unicode it's a sequence of
> > codepoints, independent of the actual item (word, sentence, etc) other
> > than [perhaps] graphic-presented. That the example is (Eng,Eng,Eng...Eng,
> > Heb,Heb,Heb,Heb, Eng,Eng,Eng...) codepoints is not the problem, though
> > related, because it discards all information in favor of (num, num, num,
> > num, ...) rather than actually considering alternate languages: IMO,
> > ("The Hebrew word for man" (quote ADAM) (quote "Adam") ".") is much
> > better as 'text' because we're preserving structure: [ENGLISH [THIS
> > SECTION HEBREW] ENGLISH].
> > 
> 
> I don’t understand why you think Unicode should carry linguistic
> information when all it has ever been designed to do is encode symbols
> across all languages and their direction.

I'm not saying that "Unicode should" do *anything* -- I'm saying Unicode solves *the wrong problem*.

"Encoding symbols" ties everything to a stupidly primitive level, forcing everything to the lowest common denominator so as to apply "the unix way" processing to text: discard all structural information, all semantic information, and have "some tool" regenerate it later... just like "the unix way" discards type-information in favor of forcing ad-hoc parsing on unstructured-text at every step between its "small tools" connected together with 'pipes'.



* Re: Strange crash on custom iterator
  2018-07-05 16:47                                                                     ` Shark8
@ 2018-07-05 17:19                                                                       ` Dan'l Miller
  2018-07-05 19:14                                                                         ` Shark8
  0 siblings, 1 reply; 73+ messages in thread
From: Dan'l Miller @ 2018-07-05 17:19 UTC (permalink / raw)


On Thursday, July 5, 2018 at 11:47:33 AM UTC-5, Shark8 wrote:
> On Wednesday, July 4, 2018 at 8:07:56 PM UTC-6, Luke A. Guest wrote:
> > Shark8 wrote:
> > 
> > >> Shark8, what would be the better solution for character-encoding itself?
> > >>  (not whole words)
> > > 
> > > Whole-word isn't a terrible idea, per se. But the thrust I was getting at
> > > is the delineation between languages: with Unicode it's a sequence of
> > > codepoints, independent of the actual item (word, sentence, etc) other
> > > than [perhaps] graphic-presented. That the example is (Eng,Eng,Eng...Eng,
> > > Heb,Heb,Heb,Heb, Eng,Eng,Eng...) codepoints is not the problem, though
> > > related, because it discards all information in favor of (num, num, num,
> > > num, ...) rather than actually considering alternate languages: IMO,
> > > ("The Hebrew word for man" (quote ADAM) (quote "Adam") ".") is much
> > > better as 'text' because we're preserving structure: [ENGLISH [THIS
> > > SECTION HEBREW] ENGLISH].
> > > 
> > 
> > I don’t understand why you think Unicode should carry linguistic
> > information when all it has ever been designed to do is encode symbols
> > across all languages and their direction.
> 
> I'm not saying that "Unicode should" do *anything* -- I'm saying Unicode solves *the wrong problem*.
> 
> "Encoding symbols" ties everything to a stupidly primitive level, forcing everything to such lowest
> common denominator so as to apply "the unix way" processing to text: discard all structural information,
> all semantic information, and have "some tool" regenerate it later... just like "the unix way" discards
> type-information in favor of forcing ad-hoc parsing on unstructured-text at every step between its
> "small tools" connected together with 'pipes'.

At some level I could conceivably agree with you in principle that a strictly-linear sequence of unadorned symbols is too low-level in some designs to be useful.  For example, there was a time in the 1970s through early 1980s when Texas Instruments microprocessors excessively modeled a Turing machine's tapes (dual-tape model).  No one nowadays would think that a processor should be strictly & intentionally designed to overtly model a Turing machine directly right down to the linear streams/tapes of symbols.

Unicode/ISO10646 is asinine in its insistence on a sequence of •multiple• codepoints being the ••shortest possible•• representation of some individual letter in some natural language.  Programmers want one-letter-one-codepoint representation in all languages—not some Turing-machine tape to process sequentially statefully, as Unicode demands even in its 32-bit UCS4 or UTF-32 representations.  Programmers don't want any “well, yeah but …” situations at all when they just finished executing the fully-normalize-all-the-codepoints-in-this-string subprogram (but that “well yeah but …” is the world we suffer in with Unicode/ISO10646 as currently defined).
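
A concrete instance of the complaint, as a sketch: Lithuanian "e with dot above and tilde" is listed by Unicode only as a named sequence (U+0117 followed by U+0303), so even the most compact form of that one letter occupies two elements of a Wide_Wide_String. The procedure name is made up; code points are given numerically to avoid depending on the source file encoding:

```ada
with Ada.Text_IO;

procedure Combining_Demo is
   --  U+0117 LATIN SMALL LETTER E WITH DOT ABOVE,
   --  followed by U+0303 COMBINING TILDE.
   Letter : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#0117#),
      Wide_Wide_Character'Val (16#0303#));
begin
   --  One letter for the reader, two code points for the program,
   --  even in a 32-bit-per-element string type.
   pragma Assert (Letter'Length = 2);
   Ada.Text_IO.Put_Line
     ("Code points used:" & Integer'Image (Letter'Length));
end Combining_Demo;
```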

But, Shark8, you seem to be criticizing something a little different than that.  In some alternate universe where Unicode or ISO10646 transpired entirely differently, what would Unicode-done-right* look like, especially w.r.t. Ada strings?  It seems that you are alluding to some sort of multiple-strand string or something like that (not merely allocating the million-odd non-BMP codepoints better so that we would have a one-letter-one-codepoint axiom). 

* Yeah, I know, in Unicode done right, there wouldn't be any Unicode or ISO10646 at all, but what would there be instead and what would the strawman look like at all in Ada?


* Re: Strange crash on custom iterator
  2018-07-04 17:51                                             ` Jacob Sparre Andersen
  2018-07-04 18:06                                               ` Shark8
@ 2018-07-05 18:06                                               ` Randy Brukardt
  1 sibling, 0 replies; 73+ messages in thread
From: Randy Brukardt @ 2018-07-05 18:06 UTC (permalink / raw)


"Jacob Sparre Andersen" <jacob@jacob-sparre.dk> wrote in message 
news:87efginb3c.fsf@adaheads.home...
> J-P. Rosen <rosen@adalog.fr> writes:
...
>> So, you want different types, plus a typing system that would allow to
>> mix the types and make them compatible... You might as well put
>> everything in the same type!
>
> It would be nice if the encoding and character set of a string were
> "implementation details".  I'm not sure how to do it, but I think it is
> worth trying to find a solution for Ada.  (I think I was introduced to
> how the KDE library does it once, but IIRC only encoding was abstracted
> away.)

It's relatively easy to do (see the first version of AI12-0021-1 for one 
way), but it is pervasive (if useful) and difficult to make efficient. And 
you have to throw away essentially everything that currently takes a 
String -- that's a bridge too far for almost everyone.

A bit of additional language support (around conversions) would make it more 
possible as a library, but the "throw everything away" aspect makes it 
unlikely to get wide use.

My personal opinion about this is that the ARG (as a whole) really does not 
care about these issues; the "solution" for Ada 2020 is a few more 
Wide_Wide_ madness packages. My view is that this is really more about 
checking off a box (we were asked to do *something* and we did *something*, 
now go away) than about any attempt to fix the issues. (Admittedly, it's too 
late to do anything else for Ada 2020 -- large new proposals are 
out-of-bounds now, they have to wait another cycle. But another set of junky 
patches doesn't really help anything other than to reduce the obvious 
pressure for a real solution.)

                                                        Randy.



* Re: Strange crash on custom iterator
  2018-07-04 19:01                                                 ` Dmitry A. Kazakov
@ 2018-07-05 18:08                                                   ` Randy Brukardt
  2018-07-05 19:41                                                     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 73+ messages in thread
From: Randy Brukardt @ 2018-07-05 18:08 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:phj5j9$bju$1@gioia.aioe.org...
> On 2018-07-04 20:06, Shark8 wrote:
>> On Wednesday, July 4, 2018 at 11:51:20 AM UTC-6, Jacob Sparre Andersen 
>> wrote:
>>>
>>> It would be nice if the encoding and character set of a string were
>>> "implementation details".  I'm not sure how to do it, but I think it is
>>> worth trying to find a solution for Ada.  (I think I was introduced to
>>> how the KDE library does it once, but IIRC only encoding was abstracted
>>> away.)
>>
>> Indeed so!
>> This is the way we /should/ have strings; where [[Wide_]Wide_]String are 
>> all generic with things like 'character-set' and 'search' and 'encoding' 
>> as formal parameters.
>>
>> Sadly this will likely never happen because it would break backwards 
>> compatibility.
>
> It would break nothing. Old package will become renamings of new 
> instances. Well, except for dire deforestation should new RM be ever 
> printed...

That's not possible. As you like to say, String /= String'Class. The new 
libraries would almost all take String'Class (or whatever stand-in there 
is).

                                             Randy.




* Re: Strange crash on custom iterator
  2018-07-05 17:19                                                                       ` Dan'l Miller
@ 2018-07-05 19:14                                                                         ` Shark8
  0 siblings, 0 replies; 73+ messages in thread
From: Shark8 @ 2018-07-05 19:14 UTC (permalink / raw)


On Thursday, July 5, 2018 at 11:20:00 AM UTC-6, Dan'l Miller wrote:
> 
> But, Shark8, you seem to be criticizing something a little different than that.  In some alternate universe where Unicode or ISO10646 transpired entirely differently, what would Unicode-done-right* look like, especially w.r.t. Ada strings?  It seems that you are alluding to some sort of multiple-strand string or something like that (not merely allocating the million-odd non-BMP codepoints better so that we would have a one-letter-one-codepoint axiom). 

Well, Ada does like 'disassembling' things [concepts, etc] into usable component pieces, traditionally-speaking. So, I'd expect the multilingual problem-space would likely be decomposed into some usable/useful sets of types/subprograms.

To borrow from other ISO stuff, perhaps something like:

-- ISO 639-1
PACKAGE LANGUAGES IS
  Type Code is ( ab, aa, [...], za, zu );
  -- other stuff.
END LANGUAGES;

PACKAGE LANGUAGES.CONSTRUCTS IS
   -- A Text is a full sequence of linguistically meaningful data, a sequence of contexts.
   Type Text is private;
   -- subprograms...
   
   -- Essentially a "string" w/ a language context.
   Type Context( Language : Code; Length : Natural ) is private;
   -- subprograms
PRIVATE
  --...
END LANGUAGES.CONSTRUCTS;

Or something; the point is to preserve the structure/context of the sequence-of-symbols\words\graphemes\whatever, so as to provide a solid multilingual foundation. The alternative is throwing away all context, shoving everything in the Unicode-blender and having to deal with a string-of-hexadecimal-sludge (codepoints), which in turn forces reconstruction of the lost structures and contexts... maybe involving the [ab]use of RegEx. That always seems to be an answer when dealing with textually-represented data, which is why so many of our peers seem to think that RegEx is suitable for parsing/processing HTML....

Yes, it bucks the "everything is a string" mentality of C/Unix-influenced OS APIs; applied to the OS as well, the analog of a path would be an actual vector of names [eg ("root", "projects", "source", "file.adb")] rather than a plain text-string [eg "root\projects\source\file.adb"].
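
As a rough illustration of the path-as-vector idea, here is a minimal sketch in Ada 2012. The names Path_Demo and Name_Vectors are made up for the example and nothing here is an existing OS binding; it only shows that the components stay separate values instead of being glued together with a separator character.

```ada
with Ada.Containers.Indefinite_Vectors;
with Ada.Text_IO;

procedure Path_Demo is

   -- Hypothetical: a path is a vector of names, not one delimited string.
   package Name_Vectors is new Ada.Containers.Indefinite_Vectors
     (Index_Type => Positive, Element_Type => String);

   Path : Name_Vectors.Vector;

begin
   Path.Append ("root");
   Path.Append ("projects");
   Path.Append ("source");
   Path.Append ("file.adb");

   -- Each component is already a separate value; no splitting or
   -- separator escaping is needed to walk the path.
   for Name of Path loop
      Ada.Text_IO.Put_Line (Name);
   end loop;

   -- "Parent directory" is a structural operation, not string surgery.
   Path.Delete_Last;
   Ada.Text_IO.Put_Line
     ("Components in parent:" & Ada.Containers.Count_Type'Image (Path.Length));
end Path_Demo;
```

With this shape, a name containing a backslash or slash is just another element, which is the kind of lost-context problem the flat string form runs into.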

The whole purpose is, as stated up-thread, "to aid the creative craftsman, not enforce mediocrity".



* Re: Strange crash on custom iterator
  2018-07-05 18:08                                                   ` Randy Brukardt
@ 2018-07-05 19:41                                                     ` Dmitry A. Kazakov
  0 siblings, 0 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-05 19:41 UTC (permalink / raw)


On 2018-07-05 20:08, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
> news:phj5j9$bju$1@gioia.aioe.org...
>> On 2018-07-04 20:06, Shark8 wrote:
>>> On Wednesday, July 4, 2018 at 11:51:20 AM UTC-6, Jacob Sparre Andersen
>>> wrote:
>>>>
>>>> It would be nice if the encoding and character set of a string were
>>>> "implementation details".  I'm not sure how to do it, but I think it is
>>>> worth trying to find a solution for Ada.  (I think I was introduced to
>>>> how the KDE library does it once, but IIRC only encoding was abstracted
>>>> away.)
>>>
>>> Indeed so!
>>> This is the way we /should/ have strings; where [[Wide_]Wide_]String are
>>> all generic with things like 'character-set' and 'search' and 'encoding'
>>> as formal parameters.
>>>
>>> Sadly this will likely never happen because it would break backwards
>>> compatibility.
>>
>> It would break nothing. Old package will become renamings of new
>> instances. Well, except for dire deforestation should new RM be ever
>> printed...
> 
> That's not possible. As you like to say, String /= String'Class. The new
> libraries would almost all take String'Class (or whatever stand-in there
> is).

Possible, but as useless as the existing implementation. What I meant
to say is that there is no difference between overloading string types
and overloading string types from generic instances. If Ada.Text_IO
became a renaming of an instance of Ada.Generic_Text_IO (...), that
would change nothing.
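
A minimal sketch of what "old packages as renamings of new instances" could look like, under loud assumptions: Generic_Strings, Count, Narrow_Strings, Wide_Strings and Legacy_Strings are hypothetical names invented for the example, and nothing like Ada.Generic_Text_IO exists in the RM. It shows only the mechanics of instantiating one generic per string type and renaming an instance; it does not address the String'Class objection.

```ada
with Ada.Text_IO;

procedure Generic_String_Demo is

   -- Hypothetical generic parameterised by character and string type.
   generic
      type Character_Type is (<>);
      type String_Type is array (Positive range <>) of Character_Type;
   package Generic_Strings is
      function Count
        (Source : String_Type; Item : Character_Type) return Natural;
   end Generic_Strings;

   package body Generic_Strings is
      function Count
        (Source : String_Type; Item : Character_Type) return Natural
      is
         Result : Natural := 0;
      begin
         for C of Source loop
            if C = Item then
               Result := Result + 1;
            end if;
         end loop;
         return Result;
      end Count;
   end Generic_Strings;

   -- One instance per existing string type.
   package Narrow_Strings is new Generic_Strings (Character, String);
   package Wide_Strings is
     new Generic_Strings (Wide_Character, Wide_String);

   -- Stand-in for "the old package becomes a renaming of a new instance".
   package Legacy_Strings renames Narrow_Strings;

begin
   Ada.Text_IO.Put_Line
     (Natural'Image (Legacy_Strings.Count ("banana", 'a')));
   Ada.Text_IO.Put_Line
     (Natural'Image (Wide_Strings.Count ("banana", 'n')));
end Generic_String_Demo;
```

Client code written against Legacy_Strings compiles unchanged, and each instance still works on its own distinct string type, exactly as the hand-written overloaded packages do today.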

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Strange crash on custom iterator
  2018-07-05  7:55                                                 ` Dmitry A. Kazakov
@ 2018-07-06  8:28                                                   ` G.B.
  2018-07-06  8:57                                                     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 73+ messages in thread
From: G.B. @ 2018-07-06  8:28 UTC (permalink / raw)


On 05.07.18 09:55, Dmitry A. Kazakov wrote:

>>>>>>> Back to the square
>>>>>>> one, how to design an UTF-8 string type?
>>>>>>
>>>>>> Never.
>>>>>>
>>>>>> What is the proper representation of 3?
>>>>>
>>>>> What is 3 here?
>>>>
>>>> It names a value of some type.
>>>
>>> Which type? You name the type,
>>
>> Any type whose objects' values include the value 3
>> and which does not specify a representation in source,
>> like Standard.Integer.
> 
>  The representation of a class-wide object is (Tag, Value). 

Obviously, 3 is not, given integers in Ada. Also, your use of
"representation" seems to exclude the Ada representation of
the Value part of the pair introduced above. How's that?

So, is your "representation" an enthymeme that stipulates
some definitions affecting 3 in Ada source texts?

> So, name the specific type and you get the representation.

What is the representation declared by Standard.Integer?

> There are no values and representations without types.

A red herring.

>>>> I’d not want encoding here.
>>>
>>> There is always some.
>>
>> Not in source, where design is fixed explicitly.

Anything on this one?

>>>  But I still have no idea what you want to say by that.
>>
>> A properly typed procedure object handles the use case of
>> encoding I/O in a type safe way. The type is not that of
>> string-with-something composites. It is the type which covers
>> the use case procedurally.
> 
> I do not quite understand this either, but it sounds more right than wrong. So?

So stop considering types for just data objects, consider types for
operation objects instead, and then the need to perpetually entangle
string objects with encoding objects is gone.

P.S.: A question is neither a proposition nor an example, but
it has been helpful in the past.



* Re: Strange crash on custom iterator
  2018-07-06  8:28                                                   ` G.B.
@ 2018-07-06  8:57                                                     ` Dmitry A. Kazakov
  0 siblings, 0 replies; 73+ messages in thread
From: Dmitry A. Kazakov @ 2018-07-06  8:57 UTC (permalink / raw)


On 2018-07-06 10:28, G.B. wrote:
> On 05.07.18 09:55, Dmitry A. Kazakov wrote:
> 
>>>>>>>> Back to the square
>>>>>>>> one, how to design an UTF-8 string type?
>>>>>>>
>>>>>>> Never.
>>>>>>>
>>>>>>> What is the proper representation of 3?
>>>>>>
>>>>>> What is 3 here?
>>>>>
>>>>> It names a value of some type.
>>>>
>>>> Which type? You name the type,
>>>
>>> Any type whose objects' values include the value 3
>>> and which does not specify a representation in source,
>>> like Standard.Integer.
>>
>>  The representation of a class-wide object is (Tag, Value). 
> 
> Obviously, 3 is not, given integers in Ada.

Of course it is. The representation of Integer 3 is the representation 
of Integer 3 and is not the representation of Integer'Class 3.

> Also, your use of
> "representation" seems to exclude the Ada representation of
> the Value part of the pair introduced above. How's that?

Not at all. If Integer'Class existed then representation of

    X : Integer'Class := 3;

would be exactly

    Integer'Tag
    Integer'(3)

whereas the representation of

    Y : Integer := 3;

is, as always:

    Integer'(3)

Types Integer'Class and Integer are different and have different 
representations. Each type has a representation of its own, no?
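
Since Integer'Class does not exist, the nearest thing that compiles today is a tagged wrapper. The sketch below assumes a hypothetical tagged type Number standing in for Integer. One caveat: in current Ada an object of a specific tagged type normally carries its tag too, so this only shows the tag being part of what the class-wide view exposes; it does not show the tag-free specific representation described above.

```ada
with Ada.Tags;
with Ada.Text_IO;

procedure Class_Wide_Demo is

   package Numbers is
      -- Hypothetical tagged stand-in for Integer.
      type Number is tagged record
         Value : Integer;
      end record;
   end Numbers;
   use Numbers;
   use type Ada.Tags.Tag;

   Y : constant Number       := (Value => 3);  -- specific type
   X : constant Number'Class := Y;             -- class-wide view

begin
   -- The class-wide object can be asked for its tag at run time.
   Ada.Text_IO.Put_Line ("Tag of X: " & Ada.Tags.Expanded_Name (X'Tag));

   if X'Tag = Number'Tag then
      Ada.Text_IO.Put_Line ("Value of X:" & Integer'Image (X.Value));
   end if;

   Ada.Text_IO.Put_Line ("Value of Y:" & Integer'Image (Y.Value));
end Class_Wide_Demo;
```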

>> So, name the specific type and you get the representation.
> 
> What is the representation declared by Standard.Integer?

It is not declared, it is implied, usually the machine representation of 
signed integer of the machine word length.

>> There are no values and representations without types.
> 
> A red herring.

But true, regardless. Questions like what is the representation of 3 are 
meaningless. The answer is "any".

>>>>> I’d not want encoding here.
>>>>
>>>> There is always some.
>>>
>>> Not in source, where design is fixed explicitly.
> 
> Anything on this one?

If you formulate the question so that I could understand it then ...

>>>>  But I still have no idea what you want to say by that.
>>>
>>> A properly typed procedure object handles the use case of
>>> encoding I/O in a type safe way. The type is not that of
>>> string-with-something composites. It is the type which covers
>>> the use case procedurally.
>>
>> I do not quite understand this either, but it sounds more right than 
>> wrong. So?
> 
> So stop considering types for just data objects, consider types for
> operation objects instead, and then the need to perpetually entangle
> string objects with encoding objects is gone.

Where do I consider types as data objects, and how is that relevant?

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


Thread overview: 73+ messages
2018-06-30 10:48 Strange crash on custom iterator Lucretia
2018-06-30 11:32 ` Simon Wright
2018-06-30 12:02   ` Lucretia
2018-06-30 14:25     ` Simon Wright
2018-06-30 14:33       ` Lucretia
2018-06-30 19:25         ` Simon Wright
2018-06-30 19:36           ` Luke A. Guest
2018-07-01 18:06             ` Jacob Sparre Andersen
2018-07-01 19:59               ` Simon Wright
2018-07-02 17:43                 ` Luke A. Guest
2018-07-02 19:42                   ` Simon Wright
2018-07-03 14:08                     ` Lucretia
2018-07-03 14:17                       ` J-P. Rosen
2018-07-03 15:06                         ` Lucretia
2018-07-03 15:45                           ` J-P. Rosen
2018-07-03 15:55                             ` Lucretia
2018-07-03 17:00                               ` J-P. Rosen
2018-07-03 15:57                             ` Dmitry A. Kazakov
2018-07-03 16:07                               ` Lucretia
2018-07-03 16:36                                 ` Dmitry A. Kazakov
2018-07-03 16:42                                   ` Lucretia
2018-07-03 16:45                                     ` Lucretia
2018-07-03 20:18                                     ` Dmitry A. Kazakov
2018-07-03 21:04                                       ` Lucretia
2018-07-04  1:26                                         ` Dan'l Miller
2018-07-04  1:59                                           ` Lucretia
2018-07-04  7:37                                             ` Dmitry A. Kazakov
2018-07-04 12:46                                             ` Dan'l Miller
2018-07-04 13:37                                             ` Dennis Lee Bieber
2018-07-04  7:21                                         ` Dmitry A. Kazakov
2018-07-03 18:54                                   ` Dan'l Miller
2018-07-03 20:22                                     ` Dmitry A. Kazakov
2018-07-04  7:33                                   ` J-P. Rosen
2018-07-04  7:53                                     ` Dmitry A. Kazakov
2018-07-04  9:55                                       ` J-P. Rosen
2018-07-04 10:01                                         ` Dmitry A. Kazakov
2018-07-04 11:30                                           ` J-P. Rosen
2018-07-04 13:27                                             ` Dmitry A. Kazakov
2018-07-04 14:37                                               ` Dan'l Miller
2018-07-04 14:43                                                 ` Dan'l Miller
2018-07-04 14:57                                                 ` J-P. Rosen
2018-07-04 15:41                                                 ` Lucretia
2018-07-04 16:55                                                   ` Dan'l Miller
2018-07-04 18:01                                                     ` Shark8
2018-07-04 18:57                                                       ` Dmitry A. Kazakov
2018-07-04 19:53                                                         ` Shark8
2018-07-04 20:05                                                           ` Lucretia
2018-07-04 22:04                                                             ` Shark8
2018-07-05  0:12                                                               ` Dan'l Miller
2018-07-05  1:46                                                                 ` Shark8
2018-07-05  2:07                                                                   ` Luke A. Guest
2018-07-05 16:47                                                                     ` Shark8
2018-07-05 17:19                                                                       ` Dan'l Miller
2018-07-05 19:14                                                                         ` Shark8
2018-07-04 20:43                                                           ` Dmitry A. Kazakov
2018-07-04 17:51                                             ` Jacob Sparre Andersen
2018-07-04 18:06                                               ` Shark8
2018-07-04 18:59                                                 ` Dan'l Miller
2018-07-04 19:01                                                 ` Dmitry A. Kazakov
2018-07-05 18:08                                                   ` Randy Brukardt
2018-07-05 19:41                                                     ` Dmitry A. Kazakov
2018-07-04 21:00                                                 ` Jacob Sparre Andersen
2018-07-05 18:06                                               ` Randy Brukardt
2018-07-04 19:02                                       ` G. B.
2018-07-04 19:16                                         ` Dmitry A. Kazakov
2018-07-04 20:40                                           ` G. B.
2018-07-04 20:55                                             ` Dmitry A. Kazakov
2018-07-04 21:21                                               ` G.B.
2018-07-05  7:55                                                 ` Dmitry A. Kazakov
2018-07-06  8:28                                                   ` G.B.
2018-07-06  8:57                                                     ` Dmitry A. Kazakov
2018-07-02  8:31               ` Lucretia
2018-06-30 14:34       ` Lucretia
