comp.lang.ada
* How to read in a (long) UTF-8 file, incrementally?
@ 2021-11-02 17:42 Marius Amado-Alves
  2021-11-02 18:17 ` Dmitry A. Kazakov
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Marius Amado-Alves @ 2021-11-02 17:42 UTC


As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

Now, Unicode files usually are in UTF-8.

One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

Thanks a lot.



* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-02 17:42 How to read in a (long) UTF-8 file, incrementally? Marius Amado-Alves
@ 2021-11-02 18:17 ` Dmitry A. Kazakov
  2021-11-03  7:43 ` Vadim Godunko
  2021-11-03  8:48 ` Luke A. Guest
  2 siblings, 0 replies; 15+ messages in thread
From: Dmitry A. Kazakov @ 2021-11-02 18:17 UTC


On 2021-11-02 18:42, Marius Amado-Alves wrote:

> So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

You simply read a stream of Characters into a buffer. Never ever use 
Wide or Wide_Wide, they are useless. Inside the buffer you must have 4 
Characters ahead unless the file end is reached. Usually you read until 
some separator like line end.

Then you call this:

http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Get

That will give you a code point and advance the cursor to the next UTF-8 
character.

However, normally, no text processing task needs that. Whatever you want 
to do, you can accomplish it using normal String operations and normal 
String-based data structures like maps and tables. You need not care 
about any UTF-8 character boundaries ever.
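
A minimal sketch of the decoding step described above, hand-written for illustration. It is not the Strings_Edit.UTF8.Get implementation; the procedure name Get, its parameter profile, and the sample buffer are assumptions. It decodes the code point at a cursor in a String buffer and advances the cursor, without checking for overlong forms or surrogates:

```ada
with Ada.Text_IO;

procedure Decode_Demo is

   --  Hypothetical helper: decodes the UTF-8 sequence that starts at
   --  Buffer (Pointer) and moves Pointer past it.
   procedure Get
     (Buffer  : String;
      Pointer : in out Positive;
      Value   : out Natural)
   is
      Lead   : constant Natural := Character'Pos (Buffer (Pointer));
      Length : Positive;
   begin
      --  The leading byte gives the sequence length and the top bits.
      if Lead < 16#80# then
         Length := 1;
         Value  := Lead;
      elsif Lead in 16#C0# .. 16#DF# then
         Length := 2;
         Value  := Lead mod 16#20#;
      elsif Lead in 16#E0# .. 16#EF# then
         Length := 3;
         Value  := Lead mod 16#10#;
      elsif Lead in 16#F0# .. 16#F7# then
         Length := 4;
         Value  := Lead mod 16#08#;
      else
         raise Constraint_Error with "not a UTF-8 leading byte";
      end if;

      if Pointer + Length - 1 > Buffer'Last then
         raise Constraint_Error with "truncated UTF-8 sequence";
      end if;

      --  Each continuation byte contributes its low six bits.
      for I in Pointer + 1 .. Pointer + Length - 1 loop
         declare
            Next : constant Natural := Character'Pos (Buffer (I));
         begin
            if Next not in 16#80# .. 16#BF# then
               raise Constraint_Error with "bad continuation byte";
            end if;
            Value := Value * 64 + Next mod 64;
         end;
      end loop;

      Pointer := Pointer + Length;
   end Get;

   --  Sample buffer: 'a', U+00B1 and U+20AC as explicit UTF-8 bytes.
   Text : constant String :=
     "a"
     & Character'Val (16#C2#) & Character'Val (16#B1#)
     & Character'Val (16#E2#) & Character'Val (16#82#)
     & Character'Val (16#AC#);

   Pointer : Positive := Text'First;
   Code    : Natural;
begin
   while Pointer <= Text'Last loop
      Get (Text, Pointer, Code);
      Ada.Text_IO.Put_Line (Natural'Image (Code));
   end loop;
end Decode_Demo;
```

On this sample buffer the loop prints 97, 177 and 8364. When the buffer is refilled incrementally from a file, the caller has to make sure a complete sequence (up to 4 bytes) is present before decoding, which is the reason for the 4-characters-ahead rule.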

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-02 17:42 How to read in a (long) UTF-8 file, incrementally? Marius Amado-Alves
  2021-11-02 18:17 ` Dmitry A. Kazakov
@ 2021-11-03  7:43 ` Vadim Godunko
  2021-11-03  8:48 ` Luke A. Guest
  2 siblings, 0 replies; 15+ messages in thread
From: Vadim Godunko @ 2021-11-03  7:43 UTC


On Tuesday, November 2, 2021 at 8:42:38 PM UTC+3, amado...@gmail.com wrote:
> As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything. 
> 
> Now, Unicode files usually are in UTF-8. 
> 
> One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text. 
> 
> If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.
> 
> So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions? 
> 
There is a special library for processing Unicode text, see https://github.com/AdaCore/VSS; its 'contrib' directory contains the VSS.Utils.File_IO package, which loads a file into a Virtual_String. However, loading a whole file into memory is usually a bad decision.


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-02 17:42 How to read in a (long) UTF-8 file, incrementally? Marius Amado-Alves
  2021-11-02 18:17 ` Dmitry A. Kazakov
  2021-11-03  7:43 ` Vadim Godunko
@ 2021-11-03  8:48 ` Luke A. Guest
  2021-11-04 11:43   ` Marius Amado-Alves
  2 siblings, 1 reply; 15+ messages in thread
From: Luke A. Guest @ 2021-11-03  8:48 UTC


On 02/11/2021 17:42, Marius Amado-Alves wrote:
> As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

You can take a look at my simple lib: https://github.com/Lucretia/uca

> Now, Unicode files usually are in UTF-8.
> 
> One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

It can read into a large string buffer.

> If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

And it can break it up into lines. There are no Unicode consistency checks.

The lib is a bit hacky, but seems to work for now. There's nothing more 
than what I've mentioned so far.


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-03  8:48 ` Luke A. Guest
@ 2021-11-04 11:43   ` Marius Amado-Alves
  2021-11-04 12:13     ` Dmitry A. Kazakov
  2021-11-04 14:30     ` Luke A. Guest
  0 siblings, 2 replies; 15+ messages in thread
From: Marius Amado-Alves @ 2021-11-04 11:43 UTC


Great libraries, thanks.

It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.

   if C = '±' then ...

And Wide_Wide_Character'Pos should give the codepoint.
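
A small sketch of that idea using the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings. The procedure and constant names are illustrative assumptions, and the character is written as explicit byte and code point values so that the example does not depend on the source file encoding:

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Pos_Demo is

   --  The two UTF-8 bytes of U+00B1 PLUS-MINUS SIGN.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Character'Val (16#C2#) & Character'Val (16#B1#);

   --  Decode the UTF-8 text into one Wide_Wide_Character per code point.
   Decoded : constant Wide_Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Encoded);

   Plus_Minus : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#B1#);

begin
   for C of Decoded loop
      if C = Plus_Minus then
         --  'Pos yields the code point of the character.
         Ada.Text_IO.Put_Line
           ("plus-minus, code point"
            & Integer'Image (Wide_Wide_Character'Pos (C)));
      end if;
   end loop;
end Pos_Demo;
```

This prints the code point 177 (16#B1#) for the single decoded character.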


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-04 11:43   ` Marius Amado-Alves
@ 2021-11-04 12:13     ` Dmitry A. Kazakov
  2021-11-04 14:30     ` Luke A. Guest
  1 sibling, 0 replies; 15+ messages in thread
From: Dmitry A. Kazakov @ 2021-11-04 12:13 UTC


On 2021-11-04 12:43, Marius Amado-Alves wrote:
> Great libraries, thanks.
> 
> It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.
> 
>     if C = '±' then ...

If the source supports Unicode, it should do UTF-8 as well. So, you 
would write

    if C = "±" then ...

where C is String.
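
A hedged sketch of the String-based approach, with hypothetical names. The UTF-8 encoding of the character is spelled out as explicit bytes, so the result does not depend on how the compiler interprets the source encoding. Because UTF-8 is self-synchronizing, a match of a valid sequence inside valid UTF-8 text always falls on a character boundary:

```ada
with Ada.Strings.Fixed;
with Ada.Text_IO;

procedure Find_Demo is

   --  UTF-8 encoding of U+00B1 PLUS-MINUS SIGN as a two-byte String.
   Plus_Minus : constant String :=
     Character'Val (16#C2#) & Character'Val (16#B1#);

   Line : constant String := "5 " & Plus_Minus & " 0.1";

   --  Ordinary substring search; no decoding is involved.
   Found_At : constant Natural :=
     Ada.Strings.Fixed.Index (Source => Line, Pattern => Plus_Minus);

begin
   if Found_At /= 0 then
      Ada.Text_IO.Put_Line
        ("found at byte index" & Natural'Image (Found_At));
   end if;

   --  Comparing a slice against the encoded character also works.
   if Line (3 .. 4) = Plus_Minus then
      Ada.Text_IO.Put_Line ("slice comparison matches");
   end if;
end Find_Demo;
```

Here the search reports byte index 3, and the slice comparison succeeds.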

> And Wide_Wide_Character'Pos should give the codepoint.

Yes, but you need no Wide_Wide to get an integer value and if your 
objective is Unicode categorization, that is too complicated for manual 
comparisons. Use a library function [generated from UnicodeData.txt] 
instead.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-04 11:43   ` Marius Amado-Alves
  2021-11-04 12:13     ` Dmitry A. Kazakov
@ 2021-11-04 14:30     ` Luke A. Guest
  2021-11-05 10:56       ` Marius Amado-Alves
  1 sibling, 1 reply; 15+ messages in thread
From: Luke A. Guest @ 2021-11-04 14:30 UTC


On 04/11/2021 11:43, Marius Amado-Alves wrote:
> Great libraries, thanks.
> 
> It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.
> 
>     if C = '±' then ...
> 
> And Wide_Wide_Character'Pos should give the codepoint.
> 

Characters no longer exist as a thing as one can even be represented as 
multiple UTF-32 code points.


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-04 14:30     ` Luke A. Guest
@ 2021-11-05 10:56       ` Marius Amado-Alves
  2021-11-05 19:55         ` Simon Wright
  0 siblings, 1 reply; 15+ messages in thread
From: Marius Amado-Alves @ 2021-11-05 10:56 UTC


> Characters no longer exist as a thing as one can even be represented as 
> multiple UTF-32 code points.

You're alluding to combining characters?


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-05 10:56       ` Marius Amado-Alves
@ 2021-11-05 19:55         ` Simon Wright
  2021-11-16 11:55           ` Marius Amado-Alves
  0 siblings, 1 reply; 15+ messages in thread
From: Simon Wright @ 2021-11-05 19:55 UTC


Marius Amado-Alves <amado.alves@gmail.com> writes:

>> Characters no longer exist as a thing as one can even be represented as 
>> multiple UTF-32 code points.
>
> You're alluding to combining characters?

Fun & games on macOS[1]:

> $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> gcc -c páck3.ads
> páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
> 
> The reason for this apparently-bizarre message is that macOS takes the
> composed form (lowercase a acute) and converts it under the hood to
> what HFS+ insists on, the fully decomposed form (lowercase a,
> combining acute); thus the names are actually different even though
> they _look_ the same.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114#c1
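
The difference between the two forms in the quoted explanation shows up as soon as the text is viewed as code points. A small sketch with hypothetical names, using only the standard Wide_Wide_String type:

```ada
with Ada.Text_IO;

procedure Forms_Demo is

   --  Composed form: U+00E1 LATIN SMALL LETTER A WITH ACUTE.
   Composed : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#E1#));

   --  Decomposed form: 'a' followed by U+0301 COMBINING ACUTE ACCENT.
   Decomposed : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#61#),
      Wide_Wide_Character'Val (16#301#));

begin
   Ada.Text_IO.Put_Line
     ("composed length:" & Integer'Image (Composed'Length));
   Ada.Text_IO.Put_Line
     ("decomposed length:" & Integer'Image (Decomposed'Length));

   --  Plain array comparison: no normalization is applied.
   Ada.Text_IO.Put_Line
     ("equal: " & Boolean'Image (Composed = Decomposed));
end Forms_Demo;
```

The lengths are 1 and 2 and the comparison yields FALSE, although both strings render as the same letter.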


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-05 19:55         ` Simon Wright
@ 2021-11-16 11:55           ` Marius Amado-Alves
  2021-11-16 12:36             ` Dmitry A. Kazakov
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Marius Amado-Alves @ 2021-11-16 11:55 UTC


I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.

(For me, a combining character is not a character, the combination is. Unicode agrees, right?)


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-16 11:55           ` Marius Amado-Alves
@ 2021-11-16 12:36             ` Dmitry A. Kazakov
  2021-11-16 13:52               ` Marius Amado-Alves
  2021-11-16 20:23               ` Randy Brukardt
  2021-11-16 15:25             ` Luke A. Guest
  2021-11-16 17:38             ` Vadim Godunko
  2 siblings, 2 replies; 15+ messages in thread
From: Dmitry A. Kazakov @ 2021-11-16 12:36 UTC


On 2021-11-16 12:55, Marius Amado-Alves wrote:
> I'm worried. I need the concept of character, for proper text processing.

Simply ignore or reject decomposed characters.

> For example, I want to reference characters in a text file by their position.

That is no problem either. There are two alternatives:

1. Fixed font representation. Reduce everything to normal glyphs, use 
string position corresponding to the beginning of a UTF-8 sequence.

2. Proportional font. Use a graphical user interface like GTK. The GTK 
text buffer has a data type (iterator) to indicate a place in the 
buffer, e.g. when a selection happens. These iterators are consistent 
with the glyphs as rendered on the screen and you can convert between 
them and string position.
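
For the first alternative, the string positions that begin a UTF-8 sequence can be found without decoding anything, because continuation bytes always have the form 2#10xxxxxx#. A minimal sketch, assuming valid UTF-8 input; the function name and the sample text are hypothetical:

```ada
with Ada.Text_IO;

procedure Starts_Demo is

   --  True when C begins a UTF-8 sequence, i.e. it is not a
   --  continuation byte in the range 16#80# .. 16#BF#.
   function Starts_Sequence (C : Character) return Boolean is
     (Character'Pos (C) not in 16#80# .. 16#BF#);

   --  Sample text: 'a', U+00B1 and U+20AC as explicit UTF-8 bytes.
   Text : constant String :=
     "a"
     & Character'Val (16#C2#) & Character'Val (16#B1#)
     & Character'Val (16#E2#) & Character'Val (16#82#)
     & Character'Val (16#AC#);

begin
   for I in Text'Range loop
      if Starts_Sequence (Text (I)) then
         Ada.Text_IO.Put_Line
           ("sequence starts at" & Integer'Image (I));
      end if;
   end loop;
end Starts_Demo;
```

For this sample the reported positions are 1, 2 and 4, one per code point.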

> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

No, Unicode disagrees, e.g. É can be composed from E and acute accent. 
But it is advised just to ignore all this nonsense and consider:

    code point = character

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-16 12:36             ` Dmitry A. Kazakov
@ 2021-11-16 13:52               ` Marius Amado-Alves
  2021-11-16 20:23               ` Randy Brukardt
  1 sibling, 0 replies; 15+ messages in thread
From: Marius Amado-Alves @ 2021-11-16 13:52 UTC


> Simply ignore or reject decomposed characters.

Brilliant!

> 1. Fixed font representation. Reduce everything to normal glyphs, use 
> string position corresponding to the beginning of an UTF-8 sequence.

I am indeed resorting to byte position in UTF-8 files as the character position. Treating UTF-8 entities as the strings that they are:-)

(Not dealing with fonts nor graphics yet, just plain text.)

Thanks a lot.


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-16 11:55           ` Marius Amado-Alves
  2021-11-16 12:36             ` Dmitry A. Kazakov
@ 2021-11-16 15:25             ` Luke A. Guest
  2021-11-16 17:38             ` Vadim Godunko
  2 siblings, 0 replies; 15+ messages in thread
From: Luke A. Guest @ 2021-11-16 15:25 UTC


On 16/11/2021 11:55, Marius Amado-Alves wrote:
> I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.
> 
> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)
> 

You can't. The concept of character is dead; the new concept is 
grapheme clusters.


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-16 11:55           ` Marius Amado-Alves
  2021-11-16 12:36             ` Dmitry A. Kazakov
  2021-11-16 15:25             ` Luke A. Guest
@ 2021-11-16 17:38             ` Vadim Godunko
  2 siblings, 0 replies; 15+ messages in thread
From: Vadim Godunko @ 2021-11-16 17:38 UTC


On Tuesday, November 16, 2021 at 2:55:06 PM UTC+3, amado...@gmail.com wrote:
> I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.
> 
> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

You can use VSS and its Grapheme_Cluster_Iterator to look up the grapheme cluster at a given position and to obtain the position of a grapheme cluster in the string (as well as UTF-8/UTF-16 code units).

However, the concept of grapheme clusters doesn't handle special cases like tabulation stops; TAB is just a single grapheme cluster.


* Re: How to read in a (long) UTF-8 file, incrementally?
  2021-11-16 12:36             ` Dmitry A. Kazakov
  2021-11-16 13:52               ` Marius Amado-Alves
@ 2021-11-16 20:23               ` Randy Brukardt
  1 sibling, 0 replies; 15+ messages in thread
From: Randy Brukardt @ 2021-11-16 20:23 UTC


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:sn08jf$pkq$1@gioia.aioe.org...
> On 2021-11-16 12:55, Marius Amado-Alves wrote:
>> I'm worried. I need the concept of character, for proper text processing.
>
> Simply ignore or reject decomposed characters.

Unicode calls that "requiring Normalization Form C". ("Form D" is all 
decomposed characters.) You'll note that what Ada compilers do with text not 
in Normalization Form C is implementation-defined; in particular, a compiler 
could reject such text.
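
A rough sketch of what "rejecting decomposed characters" could look like in a program. This is an assumption-laden filter, not a real Normalization Form C test: it only looks for code points in the Combining Diacritical Marks block (U+0300 .. U+036F), whereas a proper check needs the Unicode character database. All names are hypothetical:

```ada
with Ada.Text_IO;

procedure Reject_Demo is

   --  Crude filter: True if Text contains a code point from the
   --  Combining Diacritical Marks block. Not a full NFC check.
   function Has_Combining_Mark (Text : Wide_Wide_String) return Boolean is
   begin
      for C of Text loop
         if Wide_Wide_Character'Pos (C) in 16#0300# .. 16#036F# then
            return True;
         end if;
      end loop;
      return False;
   end Has_Combining_Mark;

   --  U+00C9 LATIN CAPITAL LETTER E WITH ACUTE (composed).
   Composed : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#C9#));

   --  'E' followed by U+0301 COMBINING ACUTE ACCENT (decomposed).
   Decomposed : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#45#),
      Wide_Wide_Character'Val (16#301#));

begin
   Ada.Text_IO.Put_Line
     ("composed rejected:   "
      & Boolean'Image (Has_Combining_Mark (Composed)));
   Ada.Text_IO.Put_Line
     ("decomposed rejected: "
      & Boolean'Image (Has_Combining_Mark (Decomposed)));
end Reject_Demo;
```

The composed string passes (FALSE) and the decomposed one is flagged (TRUE). Text in Normalization Form C can still legitimately contain combining marks where no precomposed character exists, so a filter like this is stricter than Form C in those cases.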

My understanding is that various Internet standards also require 
Normalization Form C. For instance, web pages are supposed to always be in 
that format. Whether browsers actually enforce that is unknown (they should 
enforce a lot of stuff about web pages, but generally just try to muddle 
through, which causes all kinds of security issues).

                            Randy.



Thread overview: 15+ messages
2021-11-02 17:42 How to read in a (long) UTF-8 file, incrementally? Marius Amado-Alves
2021-11-02 18:17 ` Dmitry A. Kazakov
2021-11-03  7:43 ` Vadim Godunko
2021-11-03  8:48 ` Luke A. Guest
2021-11-04 11:43   ` Marius Amado-Alves
2021-11-04 12:13     ` Dmitry A. Kazakov
2021-11-04 14:30     ` Luke A. Guest
2021-11-05 10:56       ` Marius Amado-Alves
2021-11-05 19:55         ` Simon Wright
2021-11-16 11:55           ` Marius Amado-Alves
2021-11-16 12:36             ` Dmitry A. Kazakov
2021-11-16 13:52               ` Marius Amado-Alves
2021-11-16 20:23               ` Randy Brukardt
2021-11-16 15:25             ` Luke A. Guest
2021-11-16 17:38             ` Vadim Godunko
