From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED.XmkwtIszsEpVP9G8FvDgzw.user.gioia.aioe.org!not-for-mail
From: Blady <p.p11@orange.fr>
Newsgroups: comp.lang.ada
Subject: GNOGA - RFC UXStrings package.
Date: Mon, 11 May 2020 10:59:30 +0200
Organization: Aioe.org NNTP Server
Message-ID: <r9b45i$412$1@gioia.aioe.org>
NNTP-Posting-Host: XmkwtIszsEpVP9G8FvDgzw.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0)
 Gecko/20100101 Thunderbird/68.8.0
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US
X-Mozilla-News-Host: news://nntp.aioe.org:119
Xref: reader01.eternal-september.org comp.lang.ada:58647
Date: 2020-05-11T10:59:30+02:00
List-Id: <comp.lang.ada>

Hello,

The internal character string implementation of Gnoga 
(https://sourceforge.net/p/gnoga) is based on the Ada types String and 
Unbounded_String. The native Ada String encoding is Latin-1, whereas 
exchanges with the JavaScript part are encoded in UTF-8.

This leads to some drawbacks, for instance with internationalization of 
programs (see the Localize Gnoga demo 
https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/demo/localize):

• several conversions between String and Unbounded_String objects
• it isn't usable outside the Latin-1 character set: characters outside 
Latin-1 are blanked
• continuous conversions between Latin-1 and UTF-8, for each transaction 
sent or received between the Ada and JavaScript parts

Two areas for improvement: native dynamic length handling and Unicode 
support.

The first possibility is to use UTF-8 as the internal encoding in 
Unbounded_String objects. This is the simplest way, but Gnoga makes 
heavy use of character indexes to parse JavaScript messages, which is 
not easy to achieve with UTF-8, where one character may be represented 
by a variable number of bytes. String parsing would be time consuming. 
Some combinations may also produce an invalid UTF-8 sequence.
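
To illustrate the indexing problem, here is a minimal sketch using only 
the standard Ada 2012 package Ada.Strings.UTF_Encoding. It is not taken 
from Gnoga; the procedure name is made up for the example. The word 
"cœur" is spelled out as explicit bytes so that the result doesn't 
depend on the source file encoding.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF8_Lengths is

   --  "cœur" in UTF-8: 'œ' (U+0153) is encoded as the two bytes C5 93.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     "c" & Character'Val (16#C5#) & Character'Val (16#93#) & "ur";

   --  Decoding yields one Wide_Wide_Character per Unicode code point.
   Decoded : constant Wide_Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Encoded);

begin
   --  5 bytes in the encoded form, but only 4 characters.
   Ada.Text_IO.Put_Line
     ("UTF-8 bytes:" & Integer'Image (Encoded'Length));
   Ada.Text_IO.Put_Line
     ("Characters :" & Integer'Image (Decoded'Length));

   --  Encoded (2) is the first byte of 'œ', not a character of its own,
   --  and 'u' is found at index 4 instead of index 3.
   Ada.Text_IO.Put_Line
     ("Code of Encoded (2):"
      & Integer'Image (Character'Pos (Encoded (2))));
end UTF8_Lengths;
```

A slice such as Encoded (1 .. 2) would cut 'œ' in half, which is one way 
to end up with an invalid UTF-8 sequence.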

The second possibility is to use Unbounded_Wide_String or 
Unbounded_Wide_Wide_String. Using Unbounded_Wide_String means stopping 
in the middle of the river, so we might as well use 
Unbounded_Wide_Wide_String. In the latter case, however, the memory 
penalty is heavy for only a few accented character occurrences. So back 
to Unbounded_Wide_String, but then you'll miss the oh-so-essential 
emojis ;-)
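
To put rough numbers on the memory penalty, here is a small sketch that 
prints the storage needed for the character data of a text of 1000 
characters in each of the three standard string types. The length and 
the procedure names are hypothetical. Component_Size is implementation 
defined; with GNAT I would expect 8, 16 and 32 bits respectively. The 
bookkeeping overhead of the unbounded variants is not counted.

```ada
with Ada.Text_IO;

procedure Storage_Cost is

   --  Hypothetical message length, in characters.
   Length : constant := 1_000;

   --  Print the bytes needed for Length components of the given size.
   procedure Report (Name : String; Bits_Per_Character : Natural) is
   begin
      Ada.Text_IO.Put_Line
        (Name & ":"
         & Natural'Image (Length * Bits_Per_Character / 8) & " bytes");
   end Report;

begin
   Report ("String          ", String'Component_Size);
   Report ("Wide_String     ", Wide_String'Component_Size);
   Report ("Wide_Wide_String", Wide_Wide_String'Component_Size);
end Storage_Cost;
```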

The third possibility is to make no choice between Latin-1, Wide and 
Wide_Wide characters. The object adapts its inner implementation to its 
actual content. For instance, with English the most frequent case will 
be the Latin-1 inner implementation; with French it will mostly be 
Latin-1, with occasional switches to the BMP (Unicode Basic Multilingual 
Plane) implementation, as for "cœur"; with Greek it will mostly be the 
BMP implementation. The programmer doesn't have to choose a 
representation when, for example, receiving UTF-8 messages:

   S2 : UXString;
   ...
   S2 := "Received: " & From_UTF8 (Message);

S2 will automatically adapt its inner representation to the received 
characters.
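
As a sketch of how such an adaptation could be decided (an assumption 
on my side, since the implementation isn't defined yet), the narrowest 
sufficient representation can be derived from the highest code point in 
the content. The type Representation and the function Narrowest are 
hypothetical names, not part of uxstrings.ads.

```ada
with Ada.Text_IO;

procedure Pick_Representation is

   --  Hypothetical names, for illustration only.
   type Representation is (Latin_1, BMP, Full_Unicode);

   --  Return the narrowest representation able to hold every character
   --  of Item: 8 bits up to U+00FF, 16 bits up to U+FFFF, else 32 bits.
   function Narrowest (Item : Wide_Wide_String) return Representation is
      Result : Representation := Latin_1;
   begin
      for C of Item loop
         if Wide_Wide_Character'Pos (C) > 16#FFFF# then
            return Full_Unicode;  --  cannot get any wider
         elsif Wide_Wide_Character'Pos (C) > 16#FF# then
            Result := BMP;
         end if;
      end loop;
      return Result;
   end Narrowest;

   --  Characters given by code point to stay independent of the
   --  source file encoding.
   OE    : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#0153#);   --  'œ'
   Smile : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1F600#);  --  an emoji

begin
   Ada.Text_IO.Put_Line
     (Representation'Image (Narrowest ("hello")));
   Ada.Text_IO.Put_Line
     (Representation'Image (Narrowest ("c" & OE & "ur")));
   Ada.Text_IO.Put_Line
     (Representation'Image (Narrowest ("ok " & Smile)));
end Pick_Representation;
```

The three calls should select Latin_1, BMP and Full_Unicode in that 
order. A real implementation would also have to decide when to widen on 
assignment or Append, and whether to ever narrow again.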

The package named UXStrings (Unicode Extended String) is made up of 
four parts (see 
https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads):

The first part contains renamings of Ada types. The Ada String type is 
structurally an array of Latin-1 characters and is thus renamed 
Latin_1_Character_Array, and so on.

The second part defines the UXString type as a tagged private type 
which has aspects such as Constant_Indexing, Variable_Indexing and 
Iterable, so we can write:

   S1, S2, S3 : UXString;
   C          : Character;
   WC         : Wide_Character;
   WWC        : Wide_Wide_Character;
   ...
   C   := S1 (3);
   WC  := S1 (2);
   WWC := S1 (1);
   S1 (3) := WWC;
   S1 (2) := WC;
   S1 (1) := C;
   for I in S3 loop
      C   := S3 (I);
      WC  := S3 (I);
      WWC := S3 (I);
      Put_Line
        (Character'pos (C)'img & Wide_Character'pos (WC)'img &
         Wide_Wide_Character'pos (WWC)'img);
   end loop;

The third part defines conversion functions between UXString and various 
encodings such as Latin-1, BMP (UCS-2), Unicode (UCS-4), UTF-8 or 
UTF-16, so we can write:

   S1  := From_Latin_1 ("blah blah");
   S2  := From_BMP ("une soirée passée à étudier la physique ω=Δθ/Δt...");
   S3  := From_Unicode
     ("une soirée passée à étudier les mathématiques ℕ⊂𝕂...");
   Send (To_UTF_8 (S1) & To_UTF_8 (S3));

The fourth part defines various subprograms taken from the 
Unbounded_String API, such as Append, "&", Slice, "=", Index and so on.

The private part and the implementation are not yet defined. One idea is 
to use XStrings from GNATColl (see 
https://github.com/AdaCore/gnatcoll-core/blob/master/src/gnatcoll-strings_impl.ads).

Feel free to send feedback about the UXStrings specification source code 
(https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads) 
on the forum or on the Gnoga mailing list 
(https://sourceforge.net/p/gnoga/mailman/gnoga-list).

Thanks, Pascal.
https://blady.pagesperso-orange.fr