GNOGA - RFC UXStrings package.

comp.lang.ada

* GNOGA - RFC UXStrings package.
@ 2020-05-11  8:59 Blady
  2020-05-11 17:44 ` Jere
  0 siblings, 1 reply; 4+ messages in thread
From: Blady @ 2020-05-11  8:59 UTC (permalink / raw)


Hello,

Gnoga's (https://sourceforge.net/p/gnoga) internal character string
implementation is based on both Ada types String and Unbounded_String.
The native Ada String encoding is Latin-1, whereas exchanges with the
JavaScript part use UTF-8 encoding.

This has some drawbacks, for instance for the internationalization of
programs (see the Localize Gnoga demo
https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/demo/localize):

• several conversions between String and Unbounded_String objects
• it is unusable outside the Latin-1 character set: characters outside
Latin-1 are blanked
• continuous conversions between Latin-1 and UTF-8, for each transaction
sent and received between the Ada and JavaScript parts
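
To make the last drawback concrete, here is a minimal sketch of the
Latin-1 <-> UTF-8 round trip, written with the standard Ada 2012
package Ada.Strings.UTF_Encoding.Strings. It is only an illustration:
the procedure and object names are hypothetical and this is not
Gnoga's actual code.

```ada
with Ada.Text_IO;
with Ada.Characters.Latin_1;
with Ada.Strings.UTF_Encoding.Strings;

procedure Latin_1_Round_Trip is
   package UTF renames Ada.Strings.UTF_Encoding;

   --  "soirée" held as a Latin-1 String: one byte per character.
   Native : constant String :=
     "soir" & Ada.Characters.Latin_1.LC_E_Acute & "e";

   --  Outgoing direction (Ada to JavaScript): Latin-1 to UTF-8.
   Wire : constant UTF.UTF_8_String := UTF.Strings.Encode (Native);

   --  Incoming direction (JavaScript to Ada): UTF-8 back to Latin-1.
   Back : constant String := UTF.Strings.Decode (Wire);
begin
   Ada.Text_IO.Put_Line ("Latin-1 bytes:" & Integer'Image (Native'Length));
   Ada.Text_IO.Put_Line ("UTF-8 bytes:" & Integer'Image (Wire'Length));
   Ada.Text_IO.Put_Line ("Round trip intact: " & Boolean'Image (Back = Native));
end Latin_1_Round_Trip;
```

Each direction builds a new string object, which is the cost paid on
every transaction.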

There are two areas for improvement: native dynamic-length handling and
Unicode support.

The first possibility is to use UTF-8 as the internal representation in
Unbounded_String objects. This is the simplest way, but Gnoga often uses
character indexes to parse JavaScript messages, which is not easy to
achieve with UTF-8, where one character may take a variable number of
bytes. String parsing would be time consuming, and some byte
combinations may produce an invalid UTF-8 sequence.
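
A minimal sketch of the indexing problem, using only the standard
package Ada.Strings.UTF_Encoding.Wide_Wide_Strings (the procedure and
object names are hypothetical): the character count and the byte count
differ, so a byte index no longer designates a character.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF8_Length_Demo is
   package UTF renames Ada.Strings.UTF_Encoding;

   --  "cœur": the ligature is U+0153, which needs two bytes in UTF-8.
   Text : constant Wide_Wide_String :=
     "c" & Wide_Wide_Character'Val (16#153#) & "ur";

   Encoded : constant UTF.UTF_8_String :=
     UTF.Wide_Wide_Strings.Encode (Text);
begin
   Ada.Text_IO.Put_Line ("Characters:" & Integer'Image (Text'Length));
   Ada.Text_IO.Put_Line ("UTF-8 bytes:" & Integer'Image (Encoded'Length));

   --  The second byte is only the first half of the ligature,
   --  not a whole character.
   Ada.Text_IO.Put_Line
     ("Second byte:"
      & Integer'Image (Character'Pos (Encoded (Encoded'First + 1))));
end UTF8_Length_Demo;
```

Here the text has 4 characters but its UTF-8 form has 5 bytes, so
finding the Nth character means scanning from the start.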

The second possibility is to use Unbounded_Wide_String or
Unbounded_Wide_Wide_String. Using Unbounded_Wide_String means stopping
in the middle of the river, so we might as well use
Unbounded_Wide_Wide_String. In that case, though, the memory penalty is
heavy for only a few accented characters. So back to
Unbounded_Wide_String, but then you'll miss the oh-so-essential
emojis ;-)

The third possibility is to make no choice between Latin-1, Wide and
Wide_Wide characters. The object adapts its inner representation to
the actual content. For instance, for English the most frequent case
will be the Latin-1 representation; for French it will mostly be
Latin-1, with occasional switches to the BMP (Unicode Basic
Multilingual Plane) representation, as for "cœur"; for Greek it will
mostly be the BMP representation. The programmer makes no
representation choice when, for example, receiving UTF-8 messages:

   S2 : UXString;
   ...
   S2 := "Received: " & From_UTF_8 (Message);

S2 automatically adapts its inner representation to the received
characters.
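
One way such an adaptation could be decided is sketched below: scan the
content once and pick the narrowest of the three widths. This is only
an assumption about a possible implementation; Char_Width and
Narrowest_Width are hypothetical names and are not part of the
UXStrings specification.

```ada
with Ada.Text_IO;

procedure Narrowest_Width_Demo is

   --  Hypothetical classification of the storage a string needs.
   type Char_Width is (Latin_1, BMP, Full_Unicode);

   --  Returns the narrowest width able to hold every character of Item.
   function Narrowest_Width (Item : Wide_Wide_String) return Char_Width is
      Result : Char_Width := Latin_1;
   begin
      for C of Item loop
         if Wide_Wide_Character'Pos (C) > 16#FFFF# then
            return Full_Unicode;  --  Cannot get any wider, stop early.
         elsif Wide_Wide_Character'Pos (C) > 16#FF# then
            Result := BMP;
         end if;
      end loop;
      return Result;
   end Narrowest_Width;

   --  "cœur": the ligature U+0153 lies outside Latin-1 but inside the BMP.
   Coeur : constant Wide_Wide_String :=
     "c" & Wide_Wide_Character'Val (16#153#) & "ur";
begin
   Ada.Text_IO.Put_Line (Char_Width'Image (Narrowest_Width ("hello")));
   Ada.Text_IO.Put_Line (Char_Width'Image (Narrowest_Width (Coeur)));
end Narrowest_Width_Demo;
```

This prints LATIN_1 for "hello" and BMP for "cœur". A real
implementation would also have to widen the storage when a wider
character is assigned or appended later.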

The package, named UXStrings (Unicode Extended Strings), contains the
following (see
https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads):

The first part contains renamings of Ada types. The Ada String type is
structurally an array of Latin-1 characters and is therefore renamed as
Latin_1_Character_Array, and so on.

The second part defines the UXString type as a tagged private type
with aspects such as Constant_Indexing, Variable_Indexing and
Iterable, so we can write:

   S1, S2, S3 : UXString;
   C          : Character;
   WC         : Wide_Character;
   WWC        : Wide_Wide_Character;
   ...
   C   := S1 (3);
   WC  := S1 (2);
   WWC := S1 (1);
   S1 (3) := WWC;
   S1 (2) := WC;
   S1 (1) := C;
   for I in S3 loop
      C   := S3 (I);
      WC  := S3 (I);
      WWC := S3 (I);
      Put_Line (Character'pos (C)'img & Wide_Character'pos (WC)'img &
                Wide_Wide_Character'pos (WWC)'img);
   end loop;

The third part defines conversion functions between UXString and various
encodings such as Latin-1, BMP (UCS-2), Unicode (UCS-4), UTF-8 or UTF-16,
so we can write:

   S1  := From_Latin_1 ("blah blah");
   S2  := From_BMP ("une soirée passée à étudier la physique ω=Δθ/Δt...");
   S3  := From_Unicode ("une soirée passée à étudier les mathématiques ℕ⊂𝕂...");
   Send (To_UTF_8 (S1) & To_UTF_8 (S3));

The fourth part defines an API modelled on Unbounded_String, with
Append, "&", Slice, "=", Index and so on.

The private part and the implementation are not yet defined. One idea is
to use XStrings from GNATColl
(see
https://github.com/AdaCore/gnatcoll-core/blob/master/src/gnatcoll-strings_impl.ads).

Feel free to send feedback on the UXStrings specification source code
(https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads)
on the forum or on the Gnoga mailing list
(https://sourceforge.net/p/gnoga/mailman/gnoga-list).

Thanks, Pascal.
https://blady.pagesperso-orange.fr


* Re: GNOGA - RFC UXStrings package.
  2020-05-11  8:59 GNOGA - RFC UXStrings package Blady
@ 2020-05-11 17:44 ` Jere
  2020-05-12  8:13   ` Blady
  0 siblings, 1 reply; 4+ messages in thread
From: Jere @ 2020-05-11 17:44 UTC (permalink / raw)


On Monday, May 11, 2020 at 4:59:33 AM UTC-4, Blady wrote:
> [...]
> The private and implementation parts are not yet defined. One idea is to 
> use the XStrings from GNATColl.
> (see 
> https://github.com/AdaCore/gnatcoll-core/blob/master/src/gnatcoll-strings_impl.ads)

I would be hesitant to use gnatcoll directly.  The one nice
thing about Gnoga is that (at least previously), it tried not
to fully rely on GNAT.  Using gnatcoll would be a step in the
wrong direction in that respect.  From personal experience, 
I used Gnoga to create a program for a GPU module using a variant
of Linux.  If Gnoga suddenly started requiring gnatcoll, then
that program would no longer work, as I was unable to get most
of AdaCore's additional libraries to even compile on that variant
of Linux.  This included gnatcoll.

Additionally, the library Gnoga already leverages (Simple
Components [1]) already has some UTF-8 functionality you 
might be able to leverage.  You might check that out.

One other thing, if you are interested, you might send a
message to a fellow who goes by Entomy on GitHub.  His
area of expertise is text parsing, localization, etc. and
he is experienced in Ada.  He might have some libraries
or tools you could leverage.  You could probably catch
him on Twitter pretty easily:
https://twitter.com/pkell7


[1]: http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Maps.Unicode_Set


* Re: GNOGA - RFC UXStrings package.
  2020-05-11 17:44 ` Jere
@ 2020-05-12  8:13   ` Blady
  2020-05-12  9:35     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 4+ messages in thread
From: Blady @ 2020-05-12  8:13 UTC (permalink / raw)


On 11/05/2020 at 19:44, Jere wrote:
> I would be hesitant to use gnatcoll directly.  The one nice
> thing about Gnoga is that (at least previously), it tried not
> to fully rely on GNAT.  Using gnatcoll would be a step in the
> wrong direction in that respect.  From personal experience,
> I used Gnoga to create a program for a GPU module using a variant
> of linux.  If Gnoga suddenly started requiring gnatcoll, then
> that program would no longer work as I was unable to get most
> of Adacore's additional libraries to even compile in that variant
> of linux.  This included gnatcoll.
> 
> Additionally, the library Gnoga already leverages (Simple
> Components [1]) already has some UTF-8 functionality you
> might be able to leverage.  You might check that out.
> 
> One other thing, if you are interested, you might send a
> message to a fellow who goes by Entomy on github.  His
> area of expertise is text parsing, localization, etc. and
> he is experienced in Ada.  He might have some libraries
> or tools you could leverage.  You could probably catch
> him at his twitter pretty easily:
> https://twitter.com/pkell7
> 
> 
> [1]: http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Maps.Unicode_Set
> 

Hello Jere, I agree that a GNATColl dependency would be too heavy for
Gnoga. At least GNATColl might serve as inspiration for an
implementation.

I've checked Simple Components; it could be complemented with some
parsing functions in order to fulfill all of Gnoga's needs. But I think
that a UTF-8 (or UTF-16) internal representation would carry too high a
penalty in execution time, which is critical for Gnoga as a server.

That's why I would like to experiment with a data structure with smart
character size (1, 2 or 4 bytes) and smart string length (either static
or dynamic).

Thanks for the Entomy pointer.

Regards, Pascal.


* Re: GNOGA - RFC UXStrings package.
  2020-05-12  8:13   ` Blady
@ 2020-05-12  9:35     ` Dmitry A. Kazakov
  0 siblings, 0 replies; 4+ messages in thread
From: Dmitry A. Kazakov @ 2020-05-12  9:35 UTC (permalink / raw)

On 2020-05-12 10:13, Blady wrote:

> I've checked Simple Components, it might be completed with some parsing 
> functions in order to fulfill all Gnoga needs.

You are welcome to ask.

I am not sure what kind of parsing you mean; most of the nightmarish
legacy encodings are supported already, e.g.

    http://www.dmitry-kazakov.de/ada/strings_edit.htm#7.10

> But I think that UTF-8 
> (or UTF-16) internal representation would make too much penalties in 
> term of execution time which is critical for Gnoga as server.

Well, whatever minor overhead UTF-8 may have, it is many orders of
magnitude less than what Unbounded_String, or what you do in your code
for UXStrings, would inflict.

> That's why I would like to experiment some data structure with smart 
> character size (1, 2 or 4 bytes) and smart string length (either static 
> or dynamic).

When I am concerned about performance:

1. I make all content in UTF-8. I convert anything to UTF-8 first, if I 
get it from outside.

2. I never use dynamically allocated strings in any form, never in the
standard memory pool. If I really, really need a pool, I use a custom
arena pool and allocate a String there. As a nice side effect, the
server will be resilient to all sorts of something-is-too-large attacks:
no space in the arena, drop connection, bye.

3. I never copy anything. Thus, again, never Unbounded_String, only 
String and its slices.

4. I never tokenize anything. I walk down the string in a single pass, 
notice start/stop indices of a token, pass a string slice down to a 
semantic callback, better, pass it straight to a look-up table. No 
string copies.

5. I never use Wide or Wide_Wide. They are a mess and require
conversions => copying => a lot of resources.
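
As an illustration of points 3 and 4, here is a minimal sketch of a
single-pass walk that hands slices to a callback. The message format
("click|button_1|42") and the names For_Each_Token and Show are
hypothetical, chosen only for this example.

```ada
with Ada.Text_IO;

procedure Slice_Walk_Demo is

   --  Walks Item once and hands each Separator-delimited token to
   --  On_Token as a slice of Item; no token is stored in a new
   --  string object.
   procedure For_Each_Token
     (Item      : String;
      Separator : Character;
      On_Token  : not null access procedure (Token : String))
   is
      Start : Integer := Item'First;  --  First index of the current token.
   begin
      for I in Item'Range loop
         if Item (I) = Separator then
            On_Token (Item (Start .. I - 1));
            Start := I + 1;
         end if;
      end loop;
      --  Whatever follows the last separator, possibly an empty slice.
      On_Token (Item (Start .. Item'Last));
   end For_Each_Token;

   --  Stand-in for the semantic callback or table look-up.
   procedure Show (Token : String) is
   begin
      Ada.Text_IO.Put_Line ("[" & Token & "]");
   end Show;

begin
   For_Each_Token ("click|button_1|42", '|', Show'Access);
end Slice_Walk_Demo;
```

The only bookkeeping is the pair of start/stop indices; the callback
sees the token through the slice bounds.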

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
