From: Vadim Godunko
Newsgroups: comp.lang.ada
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
Date: Wed, 3 Nov 2021 00:43:02 -0700 (PDT)
Message-ID: <90818857-b379-4c2a-81a2-f988ce8598ban@googlegroups.com>

On Tuesday, November 2, 2021 at 8:42:38 PM UTC+3, amado...@gmail.com wrote:
> As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.
>
> Now, Unicode files usually are in UTF-8.
>
> One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.
>
> If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.
>
> So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

There is a dedicated library for processing Unicode text, VSS: https://github.com/AdaCore/VSS. Its 'contrib' directory contains the VSS.Utils.File_IO package, which loads a file into a Virtual_String. However, loading the whole file into memory is usually a bad decision.
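
The incremental approach asked about does not depend on the language: read the file in fixed-size chunks of bytes, and let the decoder hold back any multi-byte sequence that a chunk boundary cuts in half. The sketch below is Python rather than Ada, and it is not the VSS API. It only illustrates the technique, and the names iter_chars and chunk_size are made up for the example.

```python
import codecs
import io


def iter_chars(stream, chunk_size=4096):
    """Yield decoded characters from a binary stream, one chunk at a time.

    Hypothetical helper for illustration: only chunk_size bytes are held in
    memory at once, never the whole file.
    """
    # The incremental decoder buffers an incomplete 1-4 byte sequence until
    # the remaining bytes arrive in a later chunk.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield from decoder.decode(chunk)
    # final=True raises UnicodeDecodeError if the input ended mid-character.
    yield from decoder.decode(b"", final=True)


if __name__ == "__main__":
    # An in-memory stream stands in for a file opened in binary mode.
    sample = io.BytesIO("aé€😀".encode("utf-8"))
    # A deliberately tiny chunk size forces splits inside multi-byte characters.
    for ch in iter_chars(sample, chunk_size=3):
        print(f"U+{ord(ch):04X}")
```

The same loop works on a real file opened with open(path, "rb"). The chunk size affects only memory use and the number of reads, not the decoded result.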