From mboxrd@z Thu Jan  1 00:00:00 1970
Path: eternal-september.org!reader02.eternal-september.org!aioe.org!Lx7EM+81f32E0bqku+QpCA.user.46.165.242.75.POSTED!not-for-mail
From: "Luke A. Guest" <laguest@archeia.com>
Newsgroups: comp.lang.ada
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
Date: Wed, 3 Nov 2021 08:48:58 +0000
Organization: Aioe.org NNTP Server
Message-ID: <sltigk$43o$1@gioia.aioe.org>
References: <d1c5ba75-bc0a-4e7b-a2df-394bc710cbcen@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="4216"; posting-host="Lx7EM+81f32E0bqku+QpCA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.14.0
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-GB
Xref: reader02.eternal-september.org comp.lang.ada:63093
List-Id: <comp.lang.ada>

On 02/11/2021 17:42, Marius Amado-Alves wrote:
> As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

You can take a look at my simple lib: https://github.com/Lucretia/uca

> Now, Unicode files usually are in UTF-8.
> 
> One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

The lib can read a file into a large string buffer.

> If the files has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. Can be a long XML object, for example.

It can also break that buffer up into lines. There are no Unicode consistency checks, though.

The lib is a bit hacky, but it seems to work for now. It does nothing
beyond what I've mentioned so far.
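
For the incremental case asked about above, here is a minimal sketch that uses only the standard library (Ada 2012), not the uca API. It reads the file in fixed-size chunks through Ada.Streams.Stream_IO, holds back any UTF-8 sequence that is cut off at the end of a chunk, and decodes the rest with Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode. The names Read_UTF8_Chunks, Complete_Prefix, Carry and the 4096-byte buffer are made up for illustration, the file name is taken from the first command-line argument, and the code assumes Stream_Element is 8 bits wide (true for GNAT, not guaranteed by the language).

```ada
with Ada.Command_Line;
with Ada.Streams;           use Ada.Streams;
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Read_UTF8_Chunks is
   package SIO  renames Ada.Streams.Stream_IO;
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   File   : SIO.File_Type;
   Buffer : Stream_Element_Array (1 .. 4096);  --  arbitrary chunk size
   Carry  : Stream_Element_Offset := 0;  --  bytes held over from last chunk
   Last   : Stream_Element_Offset;
   Count  : Natural := 0;                --  code points decoded so far

   --  Index of the last byte of Data that does not leave a UTF-8
   --  sequence cut in half. Malformed tails are returned whole so
   --  that Decode reports them.
   function Complete_Prefix
     (Data : Stream_Element_Array) return Stream_Element_Offset
   is
      I    : Stream_Element_Offset := Data'Last;
      Need : Stream_Element_Offset;
   begin
      --  Step back over at most three continuation bytes (2#10xx_xxxx#).
      while I >= Data'First
        and then Data'Last - I < 3
        and then (Data (I) and 16#C0#) = 16#80#
      loop
         I := I - 1;
      end loop;

      if I < Data'First then
         return Data'Last;
      end if;

      --  Length of the sequence announced by the lead byte.
      if (Data (I) and 16#E0#) = 16#C0# then
         Need := 2;
      elsif (Data (I) and 16#F0#) = 16#E0# then
         Need := 3;
      elsif (Data (I) and 16#F8#) = 16#F0# then
         Need := 4;
      else
         return Data'Last;  --  ASCII, or malformed input
      end if;

      if Data'Last - I + 1 < Need then
         return I - 1;      --  sequence incomplete: keep it for next time
      else
         return Data'Last;
      end if;
   end Complete_Prefix;

begin
   SIO.Open (File, SIO.In_File, Ada.Command_Line.Argument (1));

   while not SIO.End_Of_File (File) loop
      --  Append new bytes after whatever was carried over.
      SIO.Read (File, Buffer (Carry + 1 .. Buffer'Last), Last);

      declare
         Cut  : constant Stream_Element_Offset :=
           Complete_Prefix (Buffer (1 .. Last));
         Text : String (1 .. Natural (Cut));
      begin
         for J in 1 .. Cut loop
            Text (Positive (J)) := Character'Val (Buffer (J));
         end loop;

         --  Decode raises Encoding_Error on invalid UTF-8.
         declare
            Decoded : constant Wide_Wide_String := Conv.Decode (Text);
         begin
            Count := Count + Decoded'Length;  --  real processing goes here
         end;

         --  Move the incomplete tail to the front of the buffer.
         Carry := Last - Cut;
         Buffer (1 .. Carry) := Buffer (Cut + 1 .. Last);
      end;
   end loop;

   SIO.Close (File);

   if Carry > 0 then
      Ada.Text_IO.Put_Line ("warning: file ends inside a UTF-8 sequence");
   end if;
   Ada.Text_IO.Put_Line ("code points read:" & Natural'Image (Count));
end Read_UTF8_Chunks;
```

Only the current chunk is ever held in memory, so this also works for input without line breaks, such as one long XML object. Note that Decode skips a leading BOM, so the count may be one lower than a raw code point count for files that start with one.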