From: Jere
Newsgroups: comp.lang.ada
Subject: Re: GNOGA - RFC UXStrings package.
Date: Mon, 11 May 2020 10:44:54 -0700 (PDT)
Message-ID: <4065382c-ef1f-47c4-a0ea-74d736536447@googlegroups.com>

On Monday, May 11, 2020 at 4:59:33 AM UTC-4, Blady wrote:
> Hello,
>
> Gnoga (https://sourceforge.net/p/gnoga) internal character strings
> implementation is based on both Ada types String and Unbounded_String.
> The native Ada String encoding is Latin-1 whereas transactions with the
> Javascript part are in UTF-8 encoding.
>
> Some drawbacks come up, for instance, with internationalization of
> programs (see Localize Gnoga demo
> https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/demo/localize):
>
> • several conversions between String and Unbounded_String objects
> • it isn't usable outside the Latin-1 character set; characters outside
>   the Latin-1 set are blanked
> • continuous conversions between Latin-1 and UTF-8, for each transaction
>   sent and received between the Ada and Javascript parts
>
> Two ways of improvement: native dynamic length handling and Unicode support.
>
> First possibility is using UTF-8 as internal implementation in
> Unbounded_String objects. This is the simplest way, but Gnoga often uses
> character indexes to parse Javascript messages, which is not easy to
> achieve with UTF-8, which may use a variable number of bytes to represent
> one character. String parsing will be time consuming. Some combinations
> may lead to incorrect UTF-8 representation.
>
> Second possibility is to use Unbounded_Wide_String or
> Unbounded_Wide_Wide_String. Using Unbounded_Wide_String is rather like
> stopping in the middle of the river; one might as well use
> Unbounded_Wide_Wide_String. In this latter case the memory penalty is
> heavy for only a few accented character occurrences. So back to
> Unbounded_Wide_String, but you'll miss the so essential emojis ;-)
>
> Third possibility is to make no choice between Latin-1, Wide and
> Wide_Wide characters. The object shall adapt its inner implementation to
> the actual content. For instance with English language the most often
> use case will be Latin-1 inner implementation, for French language the
> most often will be Latin-1 with some exceptions with BMP (Unicode Basic
> Multilingual Plane) implementation such as in "cœur", for Greek language
> the most often will be BMP implementation.
> The programmer won't make any representation choice when, for example,
> receiving UTF-8 messages:
>
>    S2 : UXString;
>    ...
>    S2 := "Received: " & From_UTF8 (Message);
>
> Automatically S2 will adapt its inner representation to the received
> characters.
>
> The package named UXStrings (Unicode Extended String) contains the
> following (see
> https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads):
>
> The first part contains renaming statements of Ada types. Ada String
> type is structurally an array of Latin-1 characters thus is renamed as
> Latin_1_Character_Array. And so on.
>
> The second part defines the UXString type as a tagged private type
> which has aspects such as Constant_Indexing, Variable_Indexing and
> Iterable, so we can write:
>
>    S1, S2, S3 : UXString;
>    C : Character;
>    WC : Wide_Character;
>    WWC : Wide_Wide_Character;
>    ...
>    C := S1 (3);
>    WC := S1 (2);
>    WWC := S1 (1);
>    S1 (3) := WWC;
>    S1 (2) := WC;
>    S1 (1) := C;
>    for I in S3 loop
>       C := S3 (I);
>       WC := S3 (I);
>       WWC := S3 (I);
>       Put_Line (Character'pos (C)'img & Wide_Character'pos (WC)'img &
>                 Wide_Wide_Character'pos (WWC)'img);
>    end loop;
>
> The third part defines conversion functions between UXString and various
> encodings such as Latin-1, BMP (UCS-2), Unicode (UCS-4), UTF-8 or UTF-16,
> so we can write:
>
>    S1 := From_Latin_1 ("blah blah");
>    S2 := From_BMP ("une soirée passée à étudier la physique ω=Δθ/Δt...");
>    S3 := From_Unicode ("une soirée passée à étudier les mathématiques ℕ⊂𝕂...");
>    Send (To_UTF_8 (S1) & To_UTF_8 (S3));
>
> The fourth part defines various API coming from Unbounded_String such as
> Append, "&", Slice, "=", Index and so on.
>
> The private and implementation parts are not yet defined. One idea is to
> use the XStrings from GNATColl.
> (see
> https://github.com/AdaCore/gnatcoll-core/blob/master/src/gnatcoll-strings_impl.ads)
>
> Feel free to send feedback about UXStrings
> (https://sourceforge.net/p/gnoga/code/ci/dev_1.6/tree/deps/uxstrings/src/uxstrings.ads)
> specification source code on the forum or on Gnoga mailing list
> (https://sourceforge.net/p/gnoga/mailman/gnoga-list).
>
> Thanks, Pascal.
> https://blady.pagesperso-orange.fr

I would be hesitant to use gnatcoll directly. One nice thing about
Gnoga is that, at least previously, it tried not to rely fully on GNAT.
Using gnatcoll would be a step in the wrong direction in that respect.
From personal experience, I used Gnoga to create a program for a GPU
module running a variant of Linux. If Gnoga suddenly started requiring
gnatcoll, that program would no longer work, as I was unable to get most
of AdaCore's additional libraries, gnatcoll included, to even compile on
that variant of Linux.

Additionally, the library Gnoga already depends on (Simple Components [1])
has some UTF-8 functionality you might be able to leverage. You might
check that out.

One other thing: if you are interested, you might send a message to a
fellow who goes by Entomy on GitHub. His area of expertise is text
parsing, localization, etc., and he is experienced in Ada. He might have
some libraries or tools you could leverage. You could probably catch him
on Twitter pretty easily:

https://twitter.com/pkell7

[1]: http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Maps.Unicode_Set