From: Vadim Godunko
Newsgroups: comp.lang.ada
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
Date: Tue, 16 Nov 2021 09:38:13 -0800 (PST)
Message-ID: <0a3065fd-1d17-416d-b640-427aca3a090bn@googlegroups.com>
References: <1c6b151b-f017-496d-b381-ba08bef1bbb7n@googlegroups.com>

On Tuesday, November 16, 2021 at 2:55:06 PM UTC+3, amado...@gmail.com wrote:
> I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position.
> Any tips/references on how to deal with combining characters, or any other perturbating feature of Unicode, greatly appreciated.
>
> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

You can use VSS and its Grapheme_Cluster_Iterator to look up the grapheme cluster at a given position and to obtain the position of that grapheme cluster in the string (as well as its offsets in UTF-8/UTF-16 code units).

However, the concept of grapheme clusters doesn't handle special cases like tabulation stops; TAB is just a single grapheme cluster.
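
To make the idea concrete, here is a rough sketch of what "position of a grapheme cluster, plus its UTF-8/UTF-16 offsets" means. It is not the VSS API: it is plain Python using only the standard library, and the names `clusters` and `cluster_at` are made up for illustration. It also assumes a deliberately simplified rule (a base code point followed by any combining marks forms one cluster), which is only an approximation of the full Unicode segmentation rules that a real grapheme cluster iterator would apply.

```python
import unicodedata


def clusters(text):
    """Yield (cluster, code_point_index, utf8_offset, utf16_offset).

    Simplifying assumption: a cluster is one base code point plus every
    immediately following code point whose general category is a Mark
    (Mn, Mc, Me). This is NOT the full Unicode segmentation algorithm;
    emoji sequences, Hangul jamo, CR LF and so on are not handled.
    """
    utf8_offset = 0   # offset in UTF-8 code units (bytes)
    utf16_offset = 0  # offset in UTF-16 code units
    start = 0
    n = len(text)
    while start < n:
        end = start + 1
        # Absorb trailing combining marks into the current cluster.
        while end < n and unicodedata.category(text[end]).startswith("M"):
            end += 1
        cluster = text[start:end]
        yield cluster, start, utf8_offset, utf16_offset
        utf8_offset += len(cluster.encode("utf-8"))
        utf16_offset += len(cluster.encode("utf-16-le")) // 2
        start = end


def cluster_at(text, position):
    """Return the tuple for the cluster at the given 0-based cluster position."""
    for index, item in enumerate(clusters(text)):
        if index == position:
            return item
    raise IndexError("cluster position out of range")


if __name__ == "__main__":
    sample = "e\u0301a\tb"  # 'e' + COMBINING ACUTE ACCENT, 'a', TAB, 'b'
    for item in clusters(sample):
        print(item)
```

In this sketch the TAB comes out as one cluster like any other, so anything resembling tab stops or display columns would have to be computed separately on top of the cluster positions.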