From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
	ip-172-31-74-118.ec2.internal
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=3.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.6
Path: eternal-september.org!reader02.eternal-september.org!aioe.org!Hx95GBhnJb0Xc8StPhH8AA.user.46.165.242.91.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
Date: Tue, 16 Nov 2021 13:36:00 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sn08jf$pkq$1@gioia.aioe.org>
References: <d1c5ba75-bc0a-4e7b-a2df-394bc710cbcen@googlegroups.com>
 <sltigk$43o$1@gioia.aioe.org>
 <c1973b0d-7f3e-487f-8766-586b2d8c69edn@googlegroups.com>
 <sm0qss$1vl7$1@gioia.aioe.org>
 <1c6b151b-f017-496d-b381-ba08bef1bbb7n@googlegroups.com>
 <lymtmixtqi.fsf@pushface.org>
 <f0d17e38-58c7-4914-ab9c-8632cecc8215n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="26266"; posting-host="Hx95GBhnJb0Xc8StPhH8AA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Thunderbird/91.3.1
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US
Xref: reader02.eternal-september.org comp.lang.ada:63125
List-Id: <comp.lang.ada>

On 2021-11-16 12:55, Marius Amado-Alves wrote:
> I'm worried. I need the concept of character, for proper text processing.

Simply ignore or reject decomposed characters.

> For example, I want to reference characters in a text file by their position.

That is no problem either. There are two alternatives:

1. Fixed font representation. Reduce everything to normal glyphs and use 
the string position corresponding to the beginning of a UTF-8 sequence.

2. Proportional font. Use a graphical user interface like GTK. The GTK 
text buffer has a data type (iterator) to indicate a place in the 
buffer, e.g. when a selection happens. These iterators are consistent 
with the glyphs as rendered on the screen and you can convert between 
them and string position.
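
As a rough illustration of alternative 1, here is a minimal sketch in Python (chosen only for brevity) that maps each code point to the octet offset where its UTF-8 sequence begins. The function name code_point_starts is hypothetical, and the input is assumed to be well-formed UTF-8:

```python
def code_point_starts(data):
    """Return the octet offset at which each code point begins.

    Assumes `data` is well-formed UTF-8. Continuation octets have the
    bit pattern 10xxxxxx and never start a sequence, so every other
    octet marks the beginning of a code point.
    """
    return [i for i, octet in enumerate(data) if (octet & 0xC0) != 0x80]


# "A", precomposed "É" (U+00C9, two octets in UTF-8), "B"
text = "A\u00C9B".encode("utf-8")
starts = code_point_starts(text)
print(starts)  # offsets of A, É and B within the octet string
```

The N-th character (counting from 0) then lives at octet offset starts[N], and the same scan works incrementally on a file read in chunks, as long as a chunk boundary that falls on continuation octets is carried over to the next chunk.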

> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

No, Unicode disagrees, e.g. É can be composed from E and a combining 
acute accent. But my advice is just to ignore all this nonsense and consider:

    code point = character
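
To make the difference concrete, here is a small sketch using Python's standard unicodedata module. The helper name has_combining_marks is hypothetical, and the check is only approximate, since unicodedata.combining reports the canonical combining class and a few combining marks have class 0:

```python
import unicodedata

composed = "\u00C9"      # LATIN CAPITAL LETTER E WITH ACUTE, one code point
decomposed = "E\u0301"   # E followed by COMBINING ACUTE ACCENT, two code points


def has_combining_marks(text):
    """Rough test for decomposed input: any code point with a
    non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


print(len(composed), len(decomposed))
print(has_combining_marks(composed), has_combining_marks(decomposed))

# NFC normalization composes the pair back into the single code point,
# which is one way to "ignore" the decomposed form instead of rejecting it.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Under the "code point = character" rule, input for which has_combining_marks returns True is either rejected or normalized first, and everything after that can count and index code points directly.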

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de