From mboxrd@z Thu Jan  1 00:00:00 1970
X-Received: by 2002:a1c:4d0b:: with SMTP id o11mr13035116wmh.68.1635925383246;
        Wed, 03 Nov 2021 00:43:03 -0700 (PDT)
X-Received: by 2002:a25:4d83:: with SMTP id a125mr45466206ybb.277.1635925382583;
 Wed, 03 Nov 2021 00:43:02 -0700 (PDT)
Path: eternal-september.org!reader02.eternal-september.org!news.mixmin.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Wed, 3 Nov 2021 00:43:02 -0700 (PDT)
In-Reply-To: <d1c5ba75-bc0a-4e7b-a2df-394bc710cbcen@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=87.117.51.48; posting-account=niG3UgoAAAD7iQ3takWjEn_gw6D9X3ww
NNTP-Posting-Host: 87.117.51.48
References: <d1c5ba75-bc0a-4e7b-a2df-394bc710cbcen@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <90818857-b379-4c2a-81a2-f988ce8598ban@googlegroups.com>
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
From: Vadim Godunko <vgodunko@gmail.com>
Injection-Date: Wed, 03 Nov 2021 07:43:03 +0000
Content-Type: text/plain; charset="UTF-8"
Xref: reader02.eternal-september.org comp.lang.ada:63092
List-Id: <comp.lang.ada>

On Tuesday, November 2, 2021 at 8:42:38 PM UTC+3, amado...@gmail.com wrote:
> As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything. 
> 
> Now, Unicode files usually are in UTF-8. 
> 
> One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text. 
> 
> If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines; it can be a long XML object, for example. 
> 
> So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions? 
> 
There is a dedicated library for processing Unicode text, VSS: https://github.com/AdaCore/VSS. Its 'contrib' directory contains the VSS.Utils.File_IO package, which loads a file into a Virtual_String. However, loading a whole file into memory is usually a bad decision.
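
For the incremental case asked about, here is a minimal sketch that uses only the standard library, not VSS. It reads one lead byte through Ada.Streams.Stream_IO, works out the sequence length from it, reads the continuation bytes, and hands the 1 to 4 byte String to Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode. The names Read_UTF8_Incrementally, Sequence_Length and Get are made up for illustration, and taking the file name from the command line is an assumption.

```ada
with Ada.Command_Line;
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Read_UTF8_Incrementally is
   use Ada.Streams;

   package SIO renames Ada.Streams.Stream_IO;
   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   Encoding_Error : exception
     renames Ada.Strings.UTF_Encoding.Encoding_Error;

   --  Number of bytes in a UTF-8 sequence, judged from its lead byte.
   --  Returns 0 for a byte that cannot start a sequence.
   function Sequence_Length (Lead : Stream_Element) return Natural is
   begin
      if Lead < 16#80# then
         return 1;
      elsif Lead in 16#C2# .. 16#DF# then
         return 2;
      elsif Lead in 16#E0# .. 16#EF# then
         return 3;
      elsif Lead in 16#F0# .. 16#F4# then
         return 4;
      else
         return 0;
      end if;
   end Sequence_Length;

   --  Read exactly one code point from File (hypothetical helper).
   function Get (File : SIO.File_Type) return Wide_Wide_Character is
      Buffer : Stream_Element_Array (1 .. 4);
      Last   : Stream_Element_Offset;
      Length : Natural;
   begin
      SIO.Read (File, Buffer (1 .. 1), Last);
      if Last /= 1 then
         raise Ada.Text_IO.End_Error;
      end if;

      Length := Sequence_Length (Buffer (1));
      if Length = 0 then
         raise Encoding_Error;
      end if;

      if Length > 1 then
         SIO.Read (File, Buffer (2 .. Stream_Element_Offset (Length)), Last);
         if Last /= Stream_Element_Offset (Length) then
            raise Encoding_Error;  --  file ends in the middle of a sequence
         end if;
      end if;

      declare
         Encoded : String (1 .. Length);
      begin
         for I in Encoded'Range loop
            Encoded (I) :=
              Character'Val (Buffer (Stream_Element_Offset (I)));
         end loop;

         declare
            --  Decode raises Encoding_Error on malformed input.
            Result : constant Wide_Wide_String := UTF.Decode (Encoded);
         begin
            if Result'Length = 0 then
               --  Decode may skip a leading BOM, so a lone U+FEFF
               --  can come back as an empty string.
               return Wide_Wide_Character'Val (16#FEFF#);
            end if;
            return Result (Result'First);
         end;
      end;
   end Get;

   File      : SIO.File_Type;
   Total     : Natural := 0;
   Non_ASCII : Natural := 0;
begin
   SIO.Open (File, SIO.In_File, Ada.Command_Line.Argument (1));
   while not SIO.End_Of_File (File) loop
      declare
         Item : constant Wide_Wide_Character := Get (File);
      begin
         --  Process Item here; this sketch only counts code points.
         Total := Total + 1;
         if Wide_Wide_Character'Pos (Item) > 127 then
            Non_ASCII := Non_ASCII + 1;
         end if;
      end;
   end loop;
   SIO.Close (File);
   Ada.Text_IO.Put_Line
     ("Code points:" & Natural'Image (Total)
      & ", non-ASCII:" & Natural'Image (Non_ASCII));
end Read_UTF8_Incrementally;
```

Run it with the file name as the only argument. Memory use stays constant regardless of file size, since at most four bytes are held at a time. Reading a byte per call is simple but not fast; for large inputs it is probably better to read a bigger Stream_Element_Array and carry any incomplete trailing sequence over to the next chunk.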