From: Marius Amado-Alves
Newsgroups: comp.lang.ada
Subject: How to read in a (long) UTF-8 file, incrementally?
Date: Tue, 2 Nov 2021 10:42:37 -0700 (PDT)

As I understand it, to work with Unicode text inside a program it is better to use the Wide_Wide (UTF-32) variants of everything. Unicode files, however, are usually in UTF-8.

One solution is to read the entire file in one gulp into a String, then convert it to Wide_Wide_String. This is not memory efficient, and it may not be possible for some tasks, e.g. real-time processing of lines of text.

If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

So it should be possible to read a single UTF-8 character, right? It might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly into a Wide_Wide_Character. Are there such functions?

Thanks a lot.
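
To make the question concrete, here is a rough sketch of what reading one UTF-8 character at a time could look like, using Ada.Streams.Stream_IO to fetch bytes and decoding them by hand. The names Next_Byte, Next_Character, Code_Point and Malformed_Input are made up for this illustration and are not library entities. The sketch assumes Stream_Element is an 8-bit byte (true for GNAT, but implementation-defined), and it does not reject overlong encodings or surrogate code points. The standard package Ada.Strings.UTF_Encoding.Wide_Wide_Strings offers Decode, but that converts a whole UTF_8_String rather than working incrementally.

```ada
with Ada.Command_Line;
with Ada.Streams;            use Ada.Streams;
with Ada.Streams.Stream_IO;
with Ada.Text_IO;

procedure Read_UTF8_Incrementally is

   package SIO renames Ada.Streams.Stream_IO;

   --  Hypothetical names, introduced only for this sketch.
   Malformed_Input : exception;
   type Code_Point is range 0 .. 16#1F_FFFF#;

   --  Read exactly one byte; fail if the file ends in mid-character.
   function Next_Byte (File : SIO.File_Type) return Stream_Element is
      Buffer : Stream_Element_Array (1 .. 1);
      Last   : Stream_Element_Offset;
   begin
      SIO.Read (File, Buffer, Last);
      if Last < Buffer'First then
         raise Malformed_Input with "unexpected end of file";
      end if;
      return Buffer (Buffer'First);
   end Next_Byte;

   --  Read 1 to 4 bytes and return the character they encode.
   function Next_Character
     (File : SIO.File_Type) return Wide_Wide_Character
   is
      Lead  : constant Stream_Element := Next_Byte (File);
      Code  : Code_Point;
      Extra : Natural;  --  number of continuation bytes to expect
   begin
      if Lead < 16#80# then
         Code := Code_Point (Lead);             Extra := 0;
      elsif Lead in 16#C2# .. 16#DF# then
         Code := Code_Point (Lead and 16#1F#);  Extra := 1;
      elsif Lead in 16#E0# .. 16#EF# then
         Code := Code_Point (Lead and 16#0F#);  Extra := 2;
      elsif Lead in 16#F0# .. 16#F4# then
         Code := Code_Point (Lead and 16#07#);  Extra := 3;
      else
         raise Malformed_Input with "invalid lead byte";
      end if;

      for I in 1 .. Extra loop
         declare
            Byte : constant Stream_Element := Next_Byte (File);
         begin
            --  Continuation bytes have the bit pattern 10xxxxxx.
            if (Byte and 16#C0#) /= 16#80# then
               raise Malformed_Input with "invalid continuation byte";
            end if;
            Code := Code * 64 + Code_Point (Byte and 16#3F#);
         end;
      end loop;

      if Code > 16#10_FFFF# then
         raise Malformed_Input with "code point out of range";
      end if;
      return Wide_Wide_Character'Val (Code);
   end Next_Character;

   File       : SIO.File_Type;
   Item       : Wide_Wide_Character;
   Characters : Natural := 0;
   Line_Feeds : Natural := 0;

begin
   if Ada.Command_Line.Argument_Count /= 1 then
      Ada.Text_IO.Put_Line ("usage: read_utf8_incrementally FILE");
      return;
   end if;

   SIO.Open (File, SIO.In_File, Ada.Command_Line.Argument (1));
   while not SIO.End_Of_File (File) loop
      Item := Next_Character (File);
      --  Process Item here; only one character is held at a time.
      Characters := Characters + 1;
      if Item = Wide_Wide_Character'Val (10) then
         Line_Feeds := Line_Feeds + 1;
      end if;
   end loop;
   SIO.Close (File);

   Ada.Text_IO.Put_Line ("characters:" & Natural'Image (Characters));
   Ada.Text_IO.Put_Line ("line feeds:" & Natural'Image (Line_Feeds));
end Read_UTF8_Incrementally;
```

Memory use stays constant regardless of file size, which is the point of the exercise. Fetching one byte per Read call keeps the sketch short; a real program would probably read larger blocks into a buffer and decode from there.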