From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
	ip-172-31-74-118.ec2.internal
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=3.0 tests=BAYES_50,FREEMAIL_FROM
	autolearn=ham autolearn_force=no version=3.4.6
X-Received: by 2002:ad4:56a4:: with SMTP id bd4mr23957647qvb.16.1635874957635;
        Tue, 02 Nov 2021 10:42:37 -0700 (PDT)
X-Received: by 2002:a25:3787:: with SMTP id e129mr38373579yba.91.1635874957334;
 Tue, 02 Nov 2021 10:42:37 -0700 (PDT)
Path: eternal-september.org!reader02.eternal-september.org!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Tue, 2 Nov 2021 10:42:37 -0700 (PDT)
Injection-Info: google-groups.googlegroups.com; posting-host=94.60.6.132; posting-account=3cDqWgoAAAAZXc8D3pDqwa77IryJ2nnY
NNTP-Posting-Host: 94.60.6.132
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d1c5ba75-bc0a-4e7b-a2df-394bc710cbcen@googlegroups.com>
Subject: How to read in a (long) UTF-8 file, incrementally?
From: Marius Amado-Alves <amado.alves@gmail.com>
Injection-Date: Tue, 02 Nov 2021 17:42:37 +0000
Content-Type: text/plain; charset="UTF-8"
Xref: reader02.eternal-september.org comp.lang.ada:63089
List-Id: <comp.lang.ada>

As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

Now, Unicode files are usually in UTF-8.

One solution is to read the entire file in one gulp into a String, then convert it to Wide_Wide_String. This solution is not memory efficient, and it may not be possible for some tasks, e.g. real-time processing of lines of text.

If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be one long XML object, for example.

So it should be possible to read a single UTF-8 character, right? It might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly into a Wide_Wide_Character? Are there such functions?
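
To make the question concrete, here is a rough sketch of the kind of thing I have in mind. It is not a claim about how this should be done. It assumes the file is opened with Ada.Streams.Stream_IO, that the byte count of each character is taken from its first byte, and that Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode does the actual conversion. The names Sequence_Length and Next_Character and the file name "input.txt" are made up for illustration.

```ada
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Read_UTF8_Incrementally is
   use Ada.Streams.Stream_IO;
   package UTF    renames Ada.Strings.UTF_Encoding;
   package UTF_WW renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Number of bytes in a UTF-8 sequence, judged from its first byte.
   --  (Hypothetical helper, not a library function.)
   function Sequence_Length (Lead : Character) return Positive is
      B : constant Natural := Character'Pos (Lead);
   begin
      if B < 16#80# then
         return 1;
      elsif B in 16#C2# .. 16#DF# then
         return 2;
      elsif B in 16#E0# .. 16#EF# then
         return 3;
      elsif B in 16#F0# .. 16#F4# then
         return 4;
      else
         raise UTF.Encoding_Error with "invalid UTF-8 lead byte";
      end if;
   end Sequence_Length;

   --  Read exactly one code point from the stream and decode it.
   --  Raises End_Error if the file ends in the middle of a sequence,
   --  and Encoding_Error if the bytes are not valid UTF-8.
   --  (Hypothetical helper, not a library function.)
   function Next_Character (S : Stream_Access) return Wide_Wide_Character is
      Buffer : UTF.UTF_8_String (1 .. 4);
      Length : Positive;
   begin
      Character'Read (S, Buffer (1));
      Length := Sequence_Length (Buffer (1));
      for I in 2 .. Length loop
         Character'Read (S, Buffer (I));
      end loop;
      return UTF_WW.Decode (Buffer (1 .. Length)) (1);
   end Next_Character;

   File  : File_Type;
   Count : Natural := 0;
   Item  : Wide_Wide_Character;
begin
   Open (File, In_File, "input.txt");  --  assumed file name
   while not End_Of_File (File) loop
      Item  := Next_Character (Stream (File));
      Count := Count + 1;               --  process Item here instead
   end loop;
   Close (File);
   Ada.Text_IO.Put_Line ("Characters read:" & Natural'Image (Count));
end Read_UTF8_Incrementally;
```

This only ever holds four bytes of the file in memory, but it reads one byte per call, which is probably slow, and it does not deal with a byte order mark at the start of the file. Hence the question whether something like this already exists in the standard library.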

Thanks a lot.