Re: How to read in a (long) UTF-8 file, incrementally?

comp.lang.ada

From: Vadim Godunko <vgodunko@gmail.com>
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
Date: Tue, 16 Nov 2021 09:38:13 -0800 (PST)
Message-ID: <0a3065fd-1d17-416d-b640-427aca3a090bn@googlegroups.com>
In-Reply-To: <f0d17e38-58c7-4914-ab9c-8632cecc8215n@googlegroups.com>

On Tuesday, November 16, 2021 at 2:55:06 PM UTC+3, amado...@gmail.com wrote:
> I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbating feature of Unicode, greatly appreciated. 
> 
> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

You can use VSS and its Grapheme_Cluster_Iterator to look up the grapheme cluster at a given position and to obtain the position of a grapheme cluster in the string (as well as in UTF-8/UTF-16 code units).

However, the concept of grapheme clusters doesn't handle special cases like tab stops; TAB is just a single grapheme cluster.
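
To make the distinction between code points, UTF-8 code units, and user-perceived characters concrete, here is a rough sketch in Python rather than Ada. It does not use VSS and does not reproduce its API. The helper name simple_clusters is hypothetical, and the segmentation is a deliberate simplification: it only attaches combining marks to the preceding base character and ignores the other rules of Unicode's full grapheme cluster algorithm (UAX #29).

```python
import unicodedata


def simple_clusters(text):
    """Split text into approximate grapheme clusters.

    Simplification (an assumption of this sketch): a cluster is one base
    code point plus any following combining marks (general category
    Mn, Mc, or Me). Hangul syllables, emoji ZWJ sequences, CR LF and the
    other UAX #29 rules are not handled.

    Returns a list of (cluster, code_point_index, utf8_byte_offset).
    """
    clusters = []
    cp_index = 0      # position counted in code points
    byte_offset = 0   # position counted in UTF-8 code units (bytes)
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if is_mark and clusters:
            # Extend the previous cluster; its start offsets stay the same.
            prev, start_cp, start_byte = clusters[-1]
            clusters[-1] = (prev + ch, start_cp, start_byte)
        else:
            clusters.append((ch, cp_index, byte_offset))
        cp_index += 1
        byte_offset += len(ch.encode("utf-8"))
    return clusters


# 'e' + COMBINING ACUTE ACCENT, then TAB, then 'z'
sample = "e\u0301\tz"
result = simple_clusters(sample)
for position, (cluster, cp, off) in enumerate(result):
    print(position, repr(cluster), "code point index:", cp, "UTF-8 offset:", off)
```

In this sample the base letter and its combining accent form one cluster at position 0, and TAB counts as exactly one cluster, so any column or tab-stop logic has to be layered on top of cluster positions.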


Thread overview: 15+ messages
2021-11-02 17:42 How to read in a (long) UTF-8 file, incrementally? Marius Amado-Alves
2021-11-02 18:17 ` Dmitry A. Kazakov
2021-11-03  7:43 ` Vadim Godunko
2021-11-03  8:48 ` Luke A. Guest
2021-11-04 11:43   ` Marius Amado-Alves
2021-11-04 12:13     ` Dmitry A. Kazakov
2021-11-04 14:30     ` Luke A. Guest
2021-11-05 10:56       ` Marius Amado-Alves
2021-11-05 19:55         ` Simon Wright
2021-11-16 11:55           ` Marius Amado-Alves
2021-11-16 12:36             ` Dmitry A. Kazakov
2021-11-16 13:52               ` Marius Amado-Alves
2021-11-16 20:23               ` Randy Brukardt
2021-11-16 15:25             ` Luke A. Guest
2021-11-16 17:38             ` Vadim Godunko [this message]
