From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.9 required=5.0 tests=BAYES_00,FORGED_GMAIL_RCVD,
	FREEMAIL_FROM autolearn=no autolearn_force=no version=3.4.4
X-Received: by 2002:a0c:ba2e:: with SMTP id
 w46mr11413111qvf.120.1589219095522;
        Mon, 11 May 2020 10:44:55 -0700 (PDT)
X-Received: by 2002:aca:a845:: with SMTP id r66mr19635679oie.44.1589219095079;
 Mon, 11 May 2020 10:44:55 -0700 (PDT)
Path: 
 eternal-september.org!reader01.eternal-september.org!feeder.eternal-september.org!news.gegeweb.eu!gegeweb.org!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Mon, 11 May 2020 10:44:54 -0700 (PDT)
In-Reply-To: <r9b45i$412$1@gioia.aioe.org>
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=70.109.61.2;
 posting-account=QF6XPQoAAABce2NyPxxDAaKdAkN6RgAf
NNTP-Posting-Host: 70.109.61.2
References: <r9b45i$412$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4065382c-ef1f-47c4-a0ea-74d736536447@googlegroups.com>
Subject: Re: GNOGA - RFC UXStrings package.
From: Jere <jhb.chat@gmail.com>
Injection-Date: Mon, 11 May 2020 17:44:55 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Xref: reader01.eternal-september.org comp.lang.ada:58648
Date: 2020-05-11T10:44:54-07:00
List-Id: <comp.lang.ada>

On Monday, May 11, 2020 at 4:59:33 AM UTC-4, Blady wrote:
> Hello,
>
> Gnoga (https://sourceforge.net/p/gnoga) internal character strings
> implementation is based on both Ada types String and Unbounded_String.
> The native Ada String encoding is Latin-1 whereas transactions with the
> Javascript part are in UTF-8 encoding.
>
> Some drawbacks come up, for instance, with internationalization of
> programs (see Localize Gnoga demo
> https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/demo/localize):
>
> • several conversions between String and Unbounded_String objects
> • it isn't usable out of Latin-1 character set, characters out of
> Latin-1 set are blanked
> • continuous conversions between Latin-1 and UTF-8, each sent and
> received transaction between Ada and Javascript parts
>
> Two ways of improvement: native dynamic length handling and Unicode support.
>
> First possibility is using UTF-8 as internal implementation in
> Unbounded_String objects. The simplest way but Gnoga uses many times
> character indexes to parse Javascript messages that is not easy to
> achieved with UTF-8 which may have several lengths to represent one
> character. String parsing will be time consuming. Some combinations may
> lead to incorrect UTF-8 representation.
>
> Second possibility is to use Unbounded_Wide_String or
> Unbounded_Wide_Wide_String. Using Unbounded_Wide_String is quite being
> in the middle of the river might as well use Unbounded_Wide_Wide_String.
> In this latter case the memory penalty is heavy for only few accentuated
> character occurrences. So back to Unbounded_Wide_String but you'll miss
> the so essential emojis ;-)
>
> Third possibility is to make no choice between Latin-1, Wide and
> Wide_Wide characters. The object shall adapt its inner implementation to
> the actual content. For instance with English language the most often
> use case will be Latin-1 inner implementation, for French language the
> most often will be Latin-1 with some exceptions with BMP (Unicode Basic
> Multilingual Plane) implementation such as in "cœur", for Greek language
> the most often will be BMP implementation. The programmer won't make any
> representation choice when for example receiving UTF-8 messages:
>
>    S2 : UXString;
>    ...
>    S2 := "Received: " & From_UTF8 (Message);
>
> Automatically S2 will adapt its inner representation to the received
> characters.
>
> Package named UXStrings (Unicode Extended String) is containing :
> (see
> https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads)
>
> The first part contains renaming statements of Ada types. Ada String
> type is structurally an array of Latin-1 characters thus is renamed as
> Latin_1_Character_Array. And so on.
>
> The second part defines the UXString type as a tagged private type
> which has got aspects such as Constant_Indexing, Variable_Indexing and
> Iterable, so we can write:
>
>    S1, S2, S3 : UXString;
>    C          : Character;
>    WC         : Wide_Character;
>    WWC        : Wide_Wide_Character;
>    ...
>    C   := S1 (3);
>    WC  := S1 (2);
>    WWC := S1 (1);
>    S1 (3) := WWC;
>    S1 (2) := WC;
>    S1 (1) := C;
>    for I in S3 loop
>       C   := S3 (I);
>       WC  := S3 (I);
>       WWC := S3 (I);
>       Put_Line (Character'pos (C)'img & Wide_Character'pos (WC)'img &
>                 Wide_Wide_Character'pos (WWC)'img);
>    end loop;
>
> The third part defines conversion functions between UXString and various
> encoding such as Latin-1, BMP (UCS-2), Unicode (UCS-4), UTF-8 or UTF-16,
> so we can write:
>
>    S1  := From_Latin_1 ("blah blah");
>    S2  := From_BMP ("une soirée passée à étudier la physique ω=Δθ/Δt...");
>    S3  := From_Unicode ("une soirée passée à étudier les mathématiques ℕ⊂𝕂...");
>    Send (To_UTF_8 (S1) & To_UTF_8 (S3));
>
> The fourth part defines various API coming from Unbounded_String such as
> Append, "&", Slice, "=", Index and so on.
>
> The private and implementation parts are not yet defined. One idea is to
> use the XStrings from GNATColl.
> (see
> https://github.com/AdaCore/gnatcoll-core/blob/master/src/gnatcoll-strings_impl.ads)
>
> Feel free to send feedback about UXStrings
> (https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads)
> specification source code on the forum or on Gnoga mailing list
> (https://sourceforge.net/p/gnoga/mailman/gnoga-list).
>
> Thanks, Pascal.
> https://blady.pagesperso-orange.fr

I would be hesitant to use gnatcoll directly.  The one nice
thing about Gnoga is that (at least previously) it tried not
to rely fully on GNAT.  Using gnatcoll would be a step in the
wrong direction in that respect.  From personal experience:
I used Gnoga to create a program for a GPU module running a
variant of Linux.  If Gnoga suddenly started requiring gnatcoll,
that program would no longer work, as I was unable to get most
of AdaCore's additional libraries, gnatcoll included, to even
compile on that variant of Linux.

Additionally, Simple Components [1], the library Gnoga already
depends on, has some UTF-8 functionality you might be able to
leverage.  You might check that out.

One other thing, if you are interested: you might send a
message to a fellow who goes by Entomy on GitHub.  His
area of expertise is text parsing, localization, etc., and
he is experienced in Ada.  He might have some libraries
or tools you could leverage.  You could probably catch
him on Twitter pretty easily:
https://twitter.com/pkell7


[1]: http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Maps.Unicode_Set