From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
Date: Tue, 16 Nov 2021 14:23:28 -0600
Organization: JSA Research & Innovation
References: <1c6b151b-f017-496d-b381-ba08bef1bbb7n@googlegroups.com>

"Dmitry A. Kazakov" wrote in message news:sn08jf$pkq$1@gioia.aioe.org...
> On 2021-11-16 12:55, Marius Amado-Alves wrote:
>> I'm worried. I need the concept of character, for proper text processing.
>
> Simply ignore or reject decomposed characters.

Unicode calls that "requiring Normalization Form C". ("Form D" is all decomposed characters.) You'll note that what Ada compilers do with text not in Normalization Form C is implementation-defined; in particular, a compiler could reject such text.

My understanding is that various Internet standards also require Normalization Form C. For instance, web pages are supposed to always be in that format. Whether browsers actually enforce that is unknown (they should enforce a lot of stuff about web pages, but generally just try to muddle through, which causes all kinds of security issues).

Randy.
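
To illustrate the Form C / Form D distinction and the "reject decomposed characters" policy discussed above, here is a minimal sketch. It is in Python only because its standard library ships `unicodedata.normalize`; it is not taken from the thread. The names `is_nfc` and `require_nfc` are hypothetical helpers made up for this example.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    # Text is in NFC exactly when normalizing it to NFC changes nothing.
    return unicodedata.normalize("NFC", text) == text


def require_nfc(text: str) -> str:
    """Hypothetical policy helper: reject any text that is not in NFC."""
    if not is_nfc(text):
        raise ValueError("input is not in Normalization Form C")
    return text


# U+00E9 is the precomposed "e with acute" (Form C).
composed = "\u00e9"
# "e" followed by U+0301 COMBINING ACUTE ACCENT is the decomposed spelling (Form D).
decomposed = "e\u0301"

if __name__ == "__main__":
    print(is_nfc(composed))
    print(is_nfc(decomposed))
    print(unicodedata.normalize("NFC", decomposed) == composed)
```

A reader that processes a long file incrementally could apply such a check chunk by chunk, with the caveat that a chunk boundary must not fall between a base character and its combining marks; that boundary handling is left out of this sketch.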