strange behaviour of utf-8 files

comp.lang.ada

* strange behaviour of utf-8 files
@ 2013-11-16 13:12 Stoik
  2013-11-16 13:34 ` Dmitry A. Kazakov
  0 siblings, 1 reply; 33+ messages in thread
From: Stoik @ 2013-11-16 13:12 UTC (permalink / raw)


I am using GPS 5.2.1 with UTF-8 encoding in the editor. I tried to write a simple routine to strip the diacritical marks from Polish texts. When executing a test program, I got the "translation_error" message, and it turned out that the string consisting of Polish letters was treated as double the proper length. You can try for yourself: with
s: string := "ó";
we get s'length=2. Where is the catch? Is it a compiler error, a GPS error, or my own?
Stoik


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: strange behaviour of utf-8 files
  2013-11-16 13:12 strange behaviour of utf-8 files Stoik
@ 2013-11-16 13:34 ` Dmitry A. Kazakov
  2013-11-16 15:09   ` Stoik
  2013-11-16 15:12   ` Stoik
  0 siblings, 2 replies; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-16 13:34 UTC (permalink / raw)


On Sat, 16 Nov 2013 05:12:29 -0800 (PST), Stoik wrote:

> I am using gps 5.2.1 with utf-8 encoding in the editor. I tried to write a
> simple routine to strip the diacritical marks from Polish texts. When
> executing a test program, I got the "translation_error" message, and it
> turned out that the string consisting of Polish letters was treated as
> double the proper length. You can try for yourself: with
> s: string := "ó";
> we get s'length=2. Where is the hook? Is it a compiler error, gps error, or my own one?

Without source code it is impossible to say. But "ó" in UTF-8 is two
octets: 16#C3# 16#B3#. Packed into a String, that makes it 2 characters
long, taking octet = Character (which formally it is not, but whatever).

P.S. I would not use Latin-1 or anything beyond 7-bit ASCII in the source
code in order to make it portable across different systems.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: strange behaviour of utf-8 files
  2013-11-16 13:34 ` Dmitry A. Kazakov
@ 2013-11-16 15:09   ` Stoik
  2013-11-16 15:55     ` Dmitry A. Kazakov
  2013-11-16 17:01     ` Georg Bauhaus
  2013-11-16 15:12   ` Stoik
  1 sibling, 2 replies; 33+ messages in thread
From: Stoik @ 2013-11-16 15:09 UTC (permalink / raw)


On Saturday, 16 November 2013 14:34:43 UTC+1, Dmitry A. Kazakov wrote:
> On Sat, 16 Nov 2013 05:12:29 -0800 (PST), Stoik wrote:
>
> > I am using gps 5.2.1 with utf-8 encoding in the editor. I tried to write a
> > simple routine to strip the diacritical marks from Polish texts. When
> > executing a test program, I got the "translation_error" message, and it
> > turned out that the string consisting of Polish letters was treated as
> > double the proper length. You can try for yourself: with
> > s: string := "ó";
> > we get s'length=2. Where is the hook? Is it a compiler error, gps error, or my own one?
>
> Without source code it is impossible to say. But "ó" in UTF-8 is two
> octets: 16#C3# 16#B3#. When packed into a string that must be 2 characters
> long, considering octet=Character (which formally is not, but whatever).
>
> P.S. I would not use Latin-1 or anything beyond 7-bit ASCII in the source
> code in order to make it portable across different systems.
>
> --
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de

Thanks for the answer. Your advice is certainly sound, but not very satisfactory. The whole purpose of UTF-8 is to make things portable across platforms. If the compiler cannot deal properly with source code written in the UTF-8 encoding, then the whole effort that went into all the wide_ and wide_wide_ packages and the new packages that deal with various encodings is lost (all the Latin-x possibilities are useless anyway, at least on the Windows platform). I am attaching a trivial program which behaves differently depending on the encoding (UTF-8 or ISO-8859-1) of the source code, printing 2 or 1 respectively as the answer.

with ada.text_io; use ada.text_io;
procedure example is
   S : String := "ó";
begin
   Put_Line (S'Length'Img);
end;



* Re: strange behaviour of utf-8 files
  2013-11-16 13:34 ` Dmitry A. Kazakov
  2013-11-16 15:09   ` Stoik
@ 2013-11-16 15:12   ` Stoik
  2013-11-16 15:57     ` Dmitry A. Kazakov
  2013-11-16 20:06     ` Peter C. Chapin
  1 sibling, 2 replies; 33+ messages in thread
From: Stoik @ 2013-11-16 15:12 UTC (permalink / raw)


By the way, nothing changes if I use wide_character and wide_string instead of character and string. Even if character=octet, certainly wide_character is not an octet!



* Re: strange behaviour of utf-8 files
  2013-11-16 15:09   ` Stoik
@ 2013-11-16 15:55     ` Dmitry A. Kazakov
  2013-11-17 13:32       ` Georg Bauhaus
  2013-11-16 17:01     ` Georg Bauhaus
  1 sibling, 1 reply; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-16 15:55 UTC (permalink / raw)


On Sat, 16 Nov 2013 07:09:48 -0800 (PST), Stoik wrote:

> If the compiler cannot deal properly with the source code
> written in the utf-8 encoding,

The compiler can. I believe there are GCC switches which, together with
the locale, control that.

> then the whole effort that went into all
> the wide_ and wide_wide_ packages and the new packages that deal with
> various encodings is lost (all the Latin-x possibilities are useless
> anyway, at least on Windows platform).

Not at all. Ada's String directly corresponds to the A-functions of Windows
API. Windows W-functions are UTF-16.

And the issue has nothing to do with the language. It is about using one
encoding in the editor and another with the compiler.

> with ada.text_io; use ada.text_io;
> procedure example is
>    S : String := "ó";
> begin
>    Put_Line (S'Length'Img);
> end;

As I said, in order to avoid trouble, don't use anything but ASCII. Do
this:

  SMALL_LETTER_O_WITH_ACUTE_UTF8 : constant String :=
     Character'Val (16#C3#) & Character'Val (16#B3#);

  SMALL_LETTER_O_WITH_ACUTE_Latin1 : constant String :=
     (1 => Character'Val (16#F3#));
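
A minimal sketch of how the two constants above behave, wrapped in a complete program. The procedure name Show_Lengths and the shortened constant names are made up for illustration; because the values are spelled as code positions, the result should not depend on how the source file is encoded.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Lengths is
   --  "o with acute" as its two UTF-8 octets (hypothetical name).
   O_Acute_UTF8 : constant String :=
     Character'Val (16#C3#) & Character'Val (16#B3#);

   --  The same letter as a single Latin-1 character (hypothetical name).
   O_Acute_Latin1 : constant String :=
     (1 => Character'Val (16#F3#));
begin
   --  'Length counts Character elements, not letters as a reader sees them.
   Put_Line ("UTF-8 form:  " & Integer'Image (O_Acute_UTF8'Length));
   Put_Line ("Latin-1 form:" & Integer'Image (O_Acute_Latin1'Length));
end Show_Lengths;
```

The source of this program is pure ASCII, so compiling it with or without -gnatW8 should make no difference to the lengths it reports.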

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: strange behaviour of utf-8 files
  2013-11-16 15:12   ` Stoik
@ 2013-11-16 15:57     ` Dmitry A. Kazakov
  2013-11-17 11:12       ` Stoik
  2013-11-16 20:06     ` Peter C. Chapin
  1 sibling, 1 reply; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-16 15:57 UTC (permalink / raw)


On Sat, 16 Nov 2013 07:12:20 -0800 (PST), Stoik wrote:

> By the way, nothing changes if I use wide_character and wide_string
> instead of character and string. Even if character=octet, certainly
> wide_character is not an octet!

String = Latin1
Wide_String = UCS-2

There is no built-in type for UTF-8, though customarily one uses String for
it (and Wide_String for UTF-16).

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: strange behaviour of utf-8 files
  2013-11-16 15:09   ` Stoik
  2013-11-16 15:55     ` Dmitry A. Kazakov
@ 2013-11-16 17:01     ` Georg Bauhaus
  2013-11-17 10:38       ` Stoik
  1 sibling, 1 reply; 33+ messages in thread
From: Georg Bauhaus @ 2013-11-16 17:01 UTC (permalink / raw)


On 16.11.13 16:09, Stoik wrote:

> Thanks for the answer. Your advice is certainly sound, but not very satisfactory. The whole purpose of utf-8 is to make
> things portable across platforms. If the compiler cannot deal properly with the
> source code written in the utf-8 encoding, then the whole effort that went into
> all the wide_ and wide_wide_ packages and the new packages that deal with various encodings is lost (all the Latin-x possibilities are useless anyway, at least on Windows platform). I am adjoining a trivial program which works differently according to the encoding (UTF-8 or ISO-8859-1) of the source code, printing 1 or 2 as the answer.
>
> with ada.text_io; use ada.text_io;
> procedure example is
>     S : String := "ó";
> begin
>     Put_Line (S'Length'Img);
> end;

GNAT has two switches that affect its way of looking at
coded characters in source text:

for identifiers in source text, specify -gnatiC
  where C is one of the characters listed in section 3.2.10
  of the GNAT User's Guide accompanying the compiler;

for the wide character encoding method, specify -gnatWE
  where E is one of the characters listed in the
  same document.

With switch -gnatW8, I get

$ ./example
  1
$

That is, the source text is understood to be encoded
in UTF-8, and 'ó' becomes Character'Val (243), viz. LC_O_Acute.


* Re: strange behaviour of utf-8 files
  2013-11-16 15:12   ` Stoik
  2013-11-16 15:57     ` Dmitry A. Kazakov
@ 2013-11-16 20:06     ` Peter C. Chapin
  2013-11-17 10:34       ` Stoik
  2013-11-22  0:53       ` Randy Brukardt
  1 sibling, 2 replies; 33+ messages in thread
From: Peter C. Chapin @ 2013-11-16 20:06 UTC (permalink / raw)


On Sat, 16 Nov 2013, Stoik wrote:

> By the way, nothing changes if I use wide_character and wide_string 
> instead of character and string. Even if character=octet, certainly 
> wide_character is not an octet!

It sounds like you want something like

     function UTF8_String_To_Wide_String(S : String) return Wide_String;

UTF-8 is a variable length encoding and thus not the same beast as 
Wide_String. String literals are going to be encoded in the same manner as 
the rest of the source text, of course.
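
A conversion of this kind is available in Ada 2012 as Ada.Strings.UTF_Encoding.Wide_Strings.Decode (RM A.4.11). The following is only a sketch, assuming an Ada 2012 compiler; the procedure name Decode_Example and the object names are invented for the example, and the UTF-8 octets are written as code positions so the source stays ASCII.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Decode_Example is
   --  "o with acute" as its two UTF-8 octets.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#B3#);

   --  Decode turns the variable-length octet sequence into one
   --  Wide_Character per code point.
   Decoded : constant Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Decode (Encoded);
begin
   Put_Line ("Octets before decoding:  " & Integer'Image (Encoded'Length));
   Put_Line ("Characters after decoding:" & Integer'Image (Decoded'Length));
end Decode_Example;
```

The matching Encode function in the same package goes the other way, from Wide_String to a UTF-8 encoded String.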

Peter



* Re: strange behaviour of utf-8 files
  2013-11-16 20:06     ` Peter C. Chapin
@ 2013-11-17 10:34       ` Stoik
  2013-11-22  0:53       ` Randy Brukardt
  1 sibling, 0 replies; 33+ messages in thread
From: Stoik @ 2013-11-17 10:34 UTC (permalink / raw)


On Saturday, 16 November 2013 21:06:28 UTC+1, Peter C. Chapin wrote:
> On Sat, 16 Nov 2013, Stoik wrote:
>
> > By the way, nothing changes if I use wide_character and wide_string
> > instead of character and string. Even if character=octet, certainly
> > wide_character is not an octet!
>
> It sounds like you want something like
>
>      function UTF8_String_To_Wide_String(S : String) return Wide_String;
>
> UTF-8 is a variable length encoding and thus not the same beast as
> Wide_String. String literals are going to be encoded in the same manner as
> the rest of the source text, of course.
>
> Peter

Thank you, I always use the switches, this time I forgot to add them :(
This solves the problem, everything works fine.



* Re: strange behaviour of utf-8 files
  2013-11-16 17:01     ` Georg Bauhaus
@ 2013-11-17 10:38       ` Stoik
  0 siblings, 0 replies; 33+ messages in thread
From: Stoik @ 2013-11-17 10:38 UTC (permalink / raw)


On Saturday, 16 November 2013 18:01:07 UTC+1, Georg Bauhaus wrote:
> On 16.11.13 16:09, Stoik wrote:
>
> > Thanks for the answer. Your advice is certainly sound, but not very satisfactory. The whole purpose of utf-8 is to make
> > things portable across platforms. If the compiler cannot deal properly with the
> > source code written in the utf-8 encoding, then the whole effort that went into
> > all the wide_ and wide_wide_ packages and the new packages that deal with various encodings is lost (all the Latin-x possibilities are useless anyway, at least on Windows platform). I am adjoining a trivial program which works differently according to the encoding (UTF-8 or ISO-8859-1) of the source code, printing 1 or 2 as the answer.
> >
> > with ada.text_io; use ada.text_io;
> > procedure example is
> >     S : String := "ó";
> > begin
> >     Put_Line (S'Length'Img);
> > end;
>
> GNAT has two switches that affect its way of looking at
> coded characters in source text:
>
> for identifiers in source text, specify -gnatiC
>   where C is one of the characters listed 3.2.10
>   of the GNAT UG accompanying the compiler;
>
> for the wide character encoding method, specify -gnatWE
>   where E is one of the characters listed in the
>   same document.
>
> With switch -gnatW8, I get
>
> $ ./example
>   1
> $
>
> That is, the source text is understood to be encoded
> in UTF-8, and 'ó' becomes Character'Val (243), viz. LC_O_Acute.

Thank you for solving the problem; by mistake I thanked another author first.



* Re: strange behaviour of utf-8 files
  2013-11-16 15:57     ` Dmitry A. Kazakov
@ 2013-11-17 11:12       ` Stoik
  2013-11-22  1:03         ` Randy Brukardt
  0 siblings, 1 reply; 33+ messages in thread
From: Stoik @ 2013-11-17 11:12 UTC (permalink / raw)

On Saturday, 16 November 2013 16:57:56 UTC+1, Dmitry A. Kazakov wrote:
> On Sat, 16 Nov 2013 07:12:20 -0800 (PST), Stoik wrote:
>
> > By the way, nothing changes if I use wide_character and wide_string
> > instead of character and string. Even if character=octet, certainly
> > wide_character is not an octet!
>
> String = Latin1
> Wide_String = UCS-2
>
> There is no built-in type for UTF-8, though customary one uses String for
> it (and Wide_String for UTF-16).
>
> --
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de

Thanks for your comments. It is obviously a question of having a different encoding in the editor and the compiler. I forgot to add the -gnatW8 switch to the compiler (this should be the default, I believe). Nevertheless, there are still some misunderstandings connected with String, Wide_String and Wide_Wide_String. They do not correspond to any encodings; they just correspond to the character repertoires of the encodings you mentioned: String to the first 256 characters of Unicode (or ISO 10646), Wide_String to the BMP, and Wide_Wide_String to the whole of Unicode. In particular, Wide_String can be encoded internally using any of UTF-8, UTF-16 or UTF-32; the programmer does not need to know anything about it.
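
The three repertoires can be checked directly by asking for the position number of the last value of each character type. This is only a sketch, assuming an Ada 2005 or later compiler (for Wide_Wide_Character); the procedure name Repertoires and the helper type Code_Point are invented for the example.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Repertoires is
   --  Hypothetical helper type, wide enough for every position below,
   --  so the example does not rely on the size of Integer.
   type Code_Point is range 0 .. 2**31 - 1;
begin
   Put_Line ("Character'Last:          "
             & Code_Point'Image (Character'Pos (Character'Last)));
   Put_Line ("Wide_Character'Last:     "
             & Code_Point'Image (Wide_Character'Pos (Wide_Character'Last)));
   Put_Line ("Wide_Wide_Character'Last:"
             & Code_Point'Image
                 (Wide_Wide_Character'Pos (Wide_Wide_Character'Last)));
end Repertoires;
```

Note that Wide_Wide_Character has more positions than Unicode actually assigns, since Unicode code points stop at 16#10FFFF#.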

I do not believe one should avoid using characters from outside ASCII in the source code. I tried it in Python and Java with no problems whatsoever. Using some strange constants instead of usual glyphs for characters outside ASCII when using subprograms from ada.(wide_)strings.maps, for example to_mapping, would be gruesome. 

In any case, GNAT is prepared to deal with the problem properly, although the number of steps the user must remember is a bit too high (setting the environment variable charset to utf-8, choosing utf-8 in the source editor, adding -gnatW8 to the compiler switches and -W8 to the pretty printer switches). And UTF-8 is the only encoding that solves the problem of non-Latin-1 characters at all.

Regards


* Re: strange behaviour of utf-8 files
  2013-11-16 15:55     ` Dmitry A. Kazakov
@ 2013-11-17 13:32       ` Georg Bauhaus
  2013-11-17 14:07         ` Dmitry A. Kazakov
  0 siblings, 1 reply; 33+ messages in thread
From: Georg Bauhaus @ 2013-11-17 13:32 UTC (permalink / raw)


On 16.11.13 16:55, Dmitry A. Kazakov wrote:
> As I said in order to avoid troubles, don't use anything but ASCII.

ASCII-ism is the soil in which dangerous bugs keep many things
from working.(*)

With an attitude of denial towards encoding basics, would anyone
ever approach *numbers* in the same way?  I doubt it.

The best medication against chronic character FUD is to

(a) see how some unambiguous encoding does work everywhere
     (e.g. the universally supported UTF-16)  (**),
(b) understand that single units of text and single octets
     are not in general isomorphic; this leads to bugs just
     as harmless or harmful as erroneous execution in the
     presence of not 'Valid,
(c) understand that maybe wasting 9 bits of 16 bit characters
     (or a few bits per octet sequence in UTF-8)
     is not worth mentioning these days, considering source text.

Part (b) will not come to be as long as most programmers are
fine thinking that text is always 7bit characters in real life.
If, instead, programmers start learning about further bits---
that Character is a type, not an encoding---integrating software
will start working better.

__
(*) A big one of these ASCII bugs yields Google's infrastructure
     stuck with Python 2.7.
(**) I understand that even the US Navy has officially started
     using more characters than ASCII. So, can I maintain hope
     that GNAT will one day read source files that use UTF-NN, which
     GNAT does support?




* Re: strange behaviour of utf-8 files
  2013-11-17 13:32       ` Georg Bauhaus
@ 2013-11-17 14:07         ` Dmitry A. Kazakov
  2013-11-17 17:19           ` Dennis Lee Bieber
                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-17 14:07 UTC (permalink / raw)

On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:

> On 16.11.13 16:55, Dmitry A. Kazakov wrote:
>> As I said in order to avoid troubles, don't use anything but ASCII.
> 
> ASCII-ism is the soil in which dangerous bugs keep many things
> from working.(*)

On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
Windows) incapable of handling encoding safely [*]. The OP just ran into
that. If he followed the advice he would never have any problems of this
kind.

Using full Unicode in source files is a recipe for bugs intractable for
many program readers, like ones who would not guess 'a' and 'а' are
different letters.

-------
* Preventing a file encoded as X, being read and written as if it were
encoded as Y.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: strange behaviour of utf-8 files
  2013-11-17 14:07         ` Dmitry A. Kazakov
@ 2013-11-17 17:19           ` Dennis Lee Bieber
  2013-11-17 18:07             ` Dmitry A. Kazakov
  2013-11-17 19:05           ` Georg Bauhaus
  2013-11-18  0:34           ` Stoik
  2 siblings, 1 reply; 33+ messages in thread
From: Dennis Lee Bieber @ 2013-11-17 17:19 UTC (permalink / raw)


On Sun, 17 Nov 2013 15:07:18 +0100, "Dmitry A. Kazakov"
<mailbox@dmitry-kazakov.de> declaimed the following:

>On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:
>
>> On 16.11.13 16:55, Dmitry A. Kazakov wrote:
>>> As I said in order to avoid troubles, don't use anything but ASCII.
>> 
>> ASCII-ism is the soil in which dangerous bugs keep many things
>> from working.(*)
>
>On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
>Windows) incapable of handling encoding safely [*]. The OP just ran into
>that. If he followed the advice he would never have any problems of this
>kind.
>
>Using full Unicode in source files is a recipe for bugs intractable for
>many program readers, like ones who would not guess 'a' and 'а' are
>different letters.
>

	5-bit BAUDOT should be good enough for any programming!

<G>
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



* Re: strange behaviour of utf-8 files
  2013-11-17 17:19           ` Dennis Lee Bieber
@ 2013-11-17 18:07             ` Dmitry A. Kazakov
  0 siblings, 0 replies; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-17 18:07 UTC (permalink / raw)


On Sun, 17 Nov 2013 12:19:23 -0500, Dennis Lee Bieber wrote:

> On Sun, 17 Nov 2013 15:07:18 +0100, "Dmitry A. Kazakov"
> <mailbox@dmitry-kazakov.de> declaimed the following:
> 
>>On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:
>>
>>> On 16.11.13 16:55, Dmitry A. Kazakov wrote:
>>>> As I said in order to avoid troubles, don't use anything but ASCII.
>>> 
>>> ASCII-ism is the soil in which dangerous bugs keep many things
>>> from working.(*)
>>
>>On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
>>Windows) incapable of handling encoding safely [*]. The OP just ran into
>>that. If he followed the advice he would never have any problems of this
>>kind.
>>
>>Using full Unicode in source files is a recipe for bugs intractable for
>>many program readers, like ones who would not guess 'a' and 'а' are
>>different letters.
> 
> 	5-bit BAUDOT should be good enough for any programming!

On the other hand, there exists an alphabet in which any eligible program is
just one symbol long. Both Unicode and ASCII are in between.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: strange behaviour of utf-8 files
  2013-11-17 14:07         ` Dmitry A. Kazakov
  2013-11-17 17:19           ` Dennis Lee Bieber
@ 2013-11-17 19:05           ` Georg Bauhaus
  2013-11-17 20:38             ` Dmitry A. Kazakov
  2013-11-18  0:34           ` Stoik
  2 siblings, 1 reply; 33+ messages in thread
From: Georg Bauhaus @ 2013-11-17 19:05 UTC (permalink / raw)

On 17.11.13 15:07, Dmitry A. Kazakov wrote:

>> ASCII-ism is the soil in which dangerous bugs keep many things
>> from working.(*)
>
> On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
> Windows) incapable of handling encoding safely [*]. The OP just ran into
> that. If he followed the advice he would never have any problems of this
> kind.

> -------
> * Preventing a file encoded as X, being read and written as if it were
> encoded as Y.

Precaution? ASCII could just as well be EBCDIC. When the OS's programming
interface does not suggest studying the file type, then the best
thing one can do when reading a text file is to rely on the data---UTF-NN has
a BOM, which is better than nothing, and certainly is better than
any 7bit (or 8bit) ambiguities.

It is unfortunate that 7bit engineers can't swallow their pride
and use extended file attributes available with all semi-modern
and modern file systems and archive formats.


* Re: strange behaviour of utf-8 files
  2013-11-17 19:05           ` Georg Bauhaus
@ 2013-11-17 20:38             ` Dmitry A. Kazakov
  2013-11-18  8:38               ` Georg Bauhaus
  2013-11-18  8:44               ` Georg Bauhaus
  0 siblings, 2 replies; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-17 20:38 UTC (permalink / raw)

On Sun, 17 Nov 2013 20:05:26 +0100, Georg Bauhaus wrote:

> On 17.11.13 15:07, Dmitry A. Kazakov wrote:
> 
>>> ASCII-ism is the soil in which dangerous bugs keep many things
>>> from working.(*)
>>
>> On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
>> Windows) incapable of handling encoding safely [*]. The OP just ran into
>> that. If he followed the advice he would never have any problems of this
>> kind.
> 
>> -------
>> * Preventing a file encoded as X, being read and written as if it were
>> encoded as Y.
> 
> Precaution? ASCII could just as well be EBCDIC.

Firstly, EBCDIC is practically dead. Secondly, you simply cannot compile
any Ada program encoded in EBCDIC as if it were ASCII. No chance.

UTF-8 was intentionally designed to be compatible with ASCII, which is why
there is trouble with Latin-1, which was also an extension of ASCII.
Similarly if somebody used KOI-8 thinking it were Latin-1 or UTF-8.

The problem is that the common part (ASCII) is sufficient for Ada
programming while the varying part is subtle enough to cause
difficult-to-detect bugs in string literals. Bugs that cannot be detected by
the compiler.

> It is unfortunate that 7bit engineers can't swallow their pride
> and use extended file attributes available with all semi-modern
> and modern file systems and archive formats.

What for? In order to get the silly bugs the OP did?

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: strange behaviour of utf-8 files
  2013-11-17 14:07         ` Dmitry A. Kazakov
  2013-11-17 17:19           ` Dennis Lee Bieber
  2013-11-17 19:05           ` Georg Bauhaus
@ 2013-11-18  0:34           ` Stoik
  2 siblings, 0 replies; 33+ messages in thread
From: Stoik @ 2013-11-18  0:34 UTC (permalink / raw)


On Sunday, 17 November 2013 15:07:18 UTC+1, Dmitry A. Kazakov wrote:
> On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:
>
> > On 16.11.13 16:55, Dmitry A. Kazakov wrote:
> >> As I said in order to avoid troubles, don't use anything but ASCII.
> >
> > ASCII-ism is the soil in which dangerous bugs keep many things
> > from working.(*)
>
> On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
> Windows) incapable of handling encoding safely [*]. The OP just ran into
> that. If he followed the advice he would never have any problems of this
> kind.

The advice is: do not use cars, they are often badly manufactured and drivers can be mad. Go on foot, far from busy roads! 

I suspect it is much better to press the companies to produce better OSes and compilers. People do use various languages and/or scripts. And seeing dozens of packages for handling strange characters without the possibility of using them in a natural manner is a bit frustrating.



* Re: strange behaviour of utf-8 files
  2013-11-17 20:38             ` Dmitry A. Kazakov
@ 2013-11-18  8:38               ` Georg Bauhaus
  2013-11-18  9:01                 ` Dmitry A. Kazakov
  2013-11-18  8:44               ` Georg Bauhaus
  1 sibling, 1 reply; 33+ messages in thread
From: Georg Bauhaus @ 2013-11-18  8:38 UTC (permalink / raw)

On 17.11.13 21:38, Dmitry A. Kazakov wrote:
> The problem is that the common part (ASCII) is sufficient for Ada
> programming while the varying part is subtle enough to cause difficult to
> detect bugs in string literals. Bugs that cannot be detected by the
> compiler.

UTF-8 can actually be so checked (and is checked by typical implementations)
that accidentally mistaking some octets of a string literal for Latin-1
coded characters is impossible: this is a consequence of the design of
UTF-8, as you know: the {1}+0 prefix rules.

Actually, a compiler---GNAT having a helpful spell checker already---could
detect occurrences in string literals of

    String'(N   => Character'Val (195),
            N+1 => Character'Val (179))

as very likely being the valid UTF-8 sequence representing "ó". It will
then emit a warning saying that source text might be UTF-8 rather than
Latin-1, and suggest a compiler switch accordingly. Of course, the presence
of a BOM can add further support to this warning.
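
To make the idea concrete, here is a rough sketch of such a check written as ordinary Ada rather than as anything inside a compiler. GNAT is not claimed to do this; the names Detect_UTF8 and Has_Two_Octet_Sequence are invented, and the test only looks for the two-octet pattern 110xxxxx 10xxxxxx, ignoring longer sequences.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Detect_UTF8 is

   --  True if S contains a lead octet in 16#C2# .. 16#DF# directly
   --  followed by a continuation octet in 16#80# .. 16#BF#, that is,
   --  something that looks like a two-octet UTF-8 sequence.
   function Has_Two_Octet_Sequence (S : String) return Boolean is
   begin
      for I in S'First .. S'Last - 1 loop
         declare
            Lead : constant Natural := Character'Pos (S (I));
            Next : constant Natural := Character'Pos (S (I + 1));
         begin
            if Lead in 16#C2# .. 16#DF#
              and then Next in 16#80# .. 16#BF#
            then
               return True;
            end if;
         end;
      end loop;
      return False;
   end Has_Two_Octet_Sequence;

   --  The pair of octets from the example above.
   Suspect : constant String :=
     Character'Val (195) & Character'Val (179);
begin
   Put_Line (Boolean'Image (Has_Two_Octet_Sequence (Suspect)));
   Put_Line (Boolean'Image (Has_Two_Octet_Sequence ("plain ASCII")));
end Detect_UTF8;
```

A match is only a hint, since the same two octets are also a legal pair of Latin-1 characters.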


* Re: strange behaviour of utf-8 files
  2013-11-17 20:38             ` Dmitry A. Kazakov
  2013-11-18  8:38               ` Georg Bauhaus
@ 2013-11-18  8:44               ` Georg Bauhaus
  2013-11-18 10:24                 ` Dmitry A. Kazakov
  1 sibling, 1 reply; 33+ messages in thread
From: Georg Bauhaus @ 2013-11-18  8:44 UTC (permalink / raw)


On 17.11.13 21:38, Dmitry A. Kazakov wrote:
>> It is unfortunate that 7bit engineers can't swallow their pride
>> and use extended file attributes available with all semi-modern
>> and modern file systems and archive formats.

> What for? In order to get the silly bugs the OP did?

In order to be able to integrate software (libraries, sources) that
use international characters.

Also, indirectly, in order to help programmers getting acquainted
with the effects of an 8th bit, and with encodings in general.


* Re: strange behaviour of utf-8 files
  2013-11-18  8:38               ` Georg Bauhaus
@ 2013-11-18  9:01                 ` Dmitry A. Kazakov
  2013-11-18 10:06                   ` Georg Bauhaus
  0 siblings, 1 reply; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-18  9:01 UTC (permalink / raw)

On Mon, 18 Nov 2013 09:38:06 +0100, Georg Bauhaus wrote:

> On 17.11.13 21:38, Dmitry A. Kazakov wrote:
>> The problem is that the common part (ASCII) is sufficient for Ada
>> programming while the varying part is subtle enough to cause difficult to
>> detect bugs in string literals. Bugs that cannot be detected by the
>> compiler.
> 
> UTF-8 can actually be so checked (and is checked by typical implementations)

1. The share of illegal UTF-8 sequences is negligible. The one among Ada
programs is even less than that.

2. Latin1 sequences are all legal.

Now, carefully observe that the program in question was dealt with as if it
were encoded in Latin1. So much for your theory.

---------------
P.S. In order to make a point you should take a set of legal [and
practical] Ada programs encoded in X and then reinterpreted in Y. Then you
compare how many of them become:

1. illegal
2. remain legal keeping the semantics
3. remain legal breaking the semantics

The last case is the worst possible scenario, which the OP experienced.

P.P.S. Also important when it comes to keeping the source in sane ASCII:
Ada provides a standard package that defines the Latin-1 characters:

Ada.Characters.Latin_1 (RM A.3.3)
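
A short sketch of how that package keeps a source file ASCII-only while still producing a Latin-1 string. The procedure name Latin_1_Example and the object Word are made up for illustration; LC_O_Acute is the constant the package declares for position 243.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Characters.Latin_1;

procedure Latin_1_Example is
   use Ada.Characters;

   --  The Polish word "gora" with an acute accent on the "o", built
   --  from a named constant so that no non-ASCII octet appears here.
   Word : constant String := "g" & Latin_1.LC_O_Acute & "ra";
begin
   Put_Line ("Length:" & Integer'Image (Word'Length));
   Put_Line (Boolean'Image (Word (2) = Character'Val (16#F3#)));
end Latin_1_Example;
```

The string holds Latin-1, not UTF-8, so how it displays still depends on the terminal it is written to.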

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: strange behaviour of utf-8 files
  2013-11-18  9:01                 ` Dmitry A. Kazakov
@ 2013-11-18 10:06                   ` Georg Bauhaus
  0 siblings, 0 replies; 33+ messages in thread
From: Georg Bauhaus @ 2013-11-18 10:06 UTC (permalink / raw)

On 18.11.13 10:01, Dmitry A. Kazakov wrote:

>> UTF-8 can actually be so checked (and is checked by typical implementations)
>
> 1. The share of illegal UTF-8 sequences is negligible. The one among Ada
> programs is even less than that.

The share of illegal UTF-8 sequences in source text stays low as
long as policies prevent use of anything but ASCII. But! OTOH,
the difficulty of adapting to use of limited character sets stays
high, unnerving, and costly.

(I know because the source text used here and elsewhere is full of
ASCII-sequences representing ubiquitous Unicode characters. These are
characters that users expect to see. If 1234 codes some common
international character, then having to write

   "abc \x{1234}"
all over the place is a PITA. The need to write

   "abc ["1234"]"
GNAT style does not change that.)

> 2. Latin1 sequences are all legal.

Legality of (only) almost all octets interpreted as Latin-1 characters
does not make the interpretation of string literals correct.
Correctness involves the problem specification, not just Ada.

Which is what matters most: The *user*, the raison d'être of programming,
is not really satisfied when legal programs will actually malfunction
because of legal ambiguity of legal octets. Would anyone be at ease with
similar ambiguity of number literals?

> Now, carefully observe that the program in question was dealt with as if it
> were encoded in Latin1. So much for your theory.

My theory involves programmers, foreign software, and users,
in addition to the mere formalism that you mention.

> ---------------
> P.S. In order to make a point you should take a set of legal [and
> practical] Ada programs encoded in X and then reinterpreted in Y. Then you
> compare how many of them become:

0. useful

> 1. illegal
> 2. remain legal keeping the semantics
> 3. remain legal breaking the semantics

Note that legality can always be established together with 0,
and automatically is, once programmers can easily  specify
character encoding to be something unambiguous.

The stubbornness of 7bit engineering in OSs and in other circumstances
calls for a

   pragma Source_Text_Encoding (...);

With this warning sign in place, both old and new generations of programmers
can do their job.

* Re: strange behaviour of utf-8 files
  2013-11-18  8:44               ` Georg Bauhaus
@ 2013-11-18 10:24                 ` Dmitry A. Kazakov
  2013-11-18 13:05                   ` G.B.
  0 siblings, 1 reply; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-18 10:24 UTC (permalink / raw)


On Mon, 18 Nov 2013 09:44:05 +0100, Georg Bauhaus wrote:

> On 17.11.13 21:38, Dmitry A. Kazakov wrote:
>>> It is unfortunate that 7bit engineers can't swallow their pride
>>> and use extended file attributes available with all semi-modern
>>> and modern file systems and archive formats.
> 
>> What for? In order to get silly bugs the OP did?
> 
> In order to be able to integrate software (libraries, sources) that
> use international characters.

Why cannot it be integrated without these bugs?

It is like saying that Ada programs must use System.Address in order to be
integrated with machine code. We don't want such kind of integration. That
is why we are using Ada instead of Assembler.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
  2013-11-18 10:24                 ` Dmitry A. Kazakov
@ 2013-11-18 13:05                   ` G.B.
  2013-11-18 15:25                     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 33+ messages in thread
From: G.B. @ 2013-11-18 13:05 UTC (permalink / raw)

On 18.11.13 11:24, Dmitry A. Kazakov wrote:

>> In order to be able to integrate software (libraries, sources) that
>> use international characters.
>
> Why cannot it be integrated without these bugs?

Character literals are not bugs. Ada lacks means of
expressing programmer's intent here, that much is true.
Encoding could be specified by an aspect, just like 'Size.
The language is buggy here, when matched against ubiquitous
real world programming situations.

> It is like saying that Ada programs must use System.Address in order to be
> integrated with machine code.

Yes. Machine code without machine addresses would be magic.

> We don't want such kind of integration.

We can't say: we don't want character literals, or string literals.

People use international character literals. Compiling programs
that use international characters as per the Ada LRM
should work without much ado and without all the FUD-induced
avoidance, and without compiler difficulties.

To suggest using only ASCII is rather like suggesting
not to use floating-point types (FPT), arguing that using them leads
to results that can differ when switching from Intel to ARM or to PowerPC.

* Re: strange behaviour of utf-8 files
  2013-11-18 13:05                   ` G.B.
@ 2013-11-18 15:25                     ` Dmitry A. Kazakov
  2013-11-18 15:51                       ` G.B.
  0 siblings, 1 reply; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-18 15:25 UTC (permalink / raw)

On Mon, 18 Nov 2013 14:05:45 +0100, G.B. wrote:

> Character literals are not bugs. Ada lacks means of
> expressing programmer's intent here, that much is true.
> Encoding could be specified by an aspect, just like 'Size.
> The language is buggy here, when matched against ubiquitous
> real world programming situations.

You are fundamentally wrong here. Encoding is not an aspect, encoding is a
type.

Compare:

123 is a literal of Integer, mod 341, Unsigned_16, ... types

"A" is a literal of String (Latin1), Wide_String (UCS-2), Wide_Wide_String
(UCS-4)

Ada can and surely must have UTF-8 and whatever other encoded strings,
characters and slices. The reason this is not done lies in other
language problems that are irrelevant here. [It would cause a combinatorial
explosion of standard libraries.]
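
The "Compare" above can be tried directly. This is only a sketch; the names Literal_Types and Mod_341 are made up, and Unsigned_16 is taken from package Interfaces:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Interfaces;

procedure Literal_Types is
   type Mod_341 is mod 341;

   --  The same numeric literal, typed by its context.
   I : constant Integer                := 123;
   M : constant Mod_341                := 123;
   U : constant Interfaces.Unsigned_16 := 123;

   --  The same string literal, typed by its context.
   S   : constant String           := "A";
   WS  : constant Wide_String      := "A";
   WWS : constant Wide_Wide_String := "A";
begin
   Put_Line (Integer'Image (I) & Mod_341'Image (M)
             & Interfaces.Unsigned_16'Image (U));
   --  Each string holds exactly one element; only the element type differs.
   Put_Line (Integer'Image (S'Length) & Integer'Image (WS'Length)
             & Integer'Image (WWS'Length));
end Literal_Types;
```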

Note, with all and any thinkable additions, the problem OP had will still
be present, because it has nothing to do with the language itself.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
  2013-11-18 15:25                     ` Dmitry A. Kazakov
@ 2013-11-18 15:51                       ` G.B.
  2013-11-18 17:34                         ` Dmitry A. Kazakov
  0 siblings, 1 reply; 33+ messages in thread
From: G.B. @ 2013-11-18 15:51 UTC (permalink / raw)

On 18.11.13 16:25, Dmitry A. Kazakov wrote:
> Compare:
>
> 123 is a literal of Integer, mod 341, Unsigned_16, ... types

Compare

   type My_Int is range 1 .. 10
     with Size => 16;

to

   type My_Int is range 1 .. 10
     with Size => 32;

Now

   type My_Char is range 'A' .. 'Z'
     with size => 31;

to

   type My_Char is range 'A' .. 'Z'
     with size => 15;

And

   type My_Float is digits 6 range 0.0 .. 1_000.0
     with Mode => Round_To_Nearest_Even;

These are representation issues. They direct the compiler
to choose (a) a number of bits and (b) a set of operations.
"+" will be affected by 'Size, though not at the level of
abstract operations.

I guess there is a view that says rounding is a type? So what:

An encoding aspect would just be a means that allows programmers
to say what they mean. It may not be as useful as -gnatW*, it may even
be confusing, but it may finally kick the * of compiler makers and
project leads, and make them sort out this encoding nonsense once
and forever. Without rhetoric. And *then* they can say that
specifying aspects is fundamentally wrong.

After all, this is about interpreting a bit pattern at compile
time, and supplying information about the bits explicitly should
help. It should help fixing C's underlying char* issues, too.

Even when using 7bit ASCII, there is *no* information in the
source that explicitly states what is meant in a given ASCII
String literal. So, saying that ASCII works is just accidentally
right. A good argument in a C camp.

* Re: strange behaviour of utf-8 files
  2013-11-18 15:51                       ` G.B.
@ 2013-11-18 17:34                         ` Dmitry A. Kazakov
  0 siblings, 0 replies; 33+ messages in thread
From: Dmitry A. Kazakov @ 2013-11-18 17:34 UTC (permalink / raw)

On Mon, 18 Nov 2013 16:51:50 +0100, G.B. wrote:

> On 18.11.13 16:25, Dmitry A. Kazakov wrote:
>> Compare:
>>
>> 123 is a literal of Integer, mod 341, Unsigned_16, ... types
> 
> Compare
> 
>    type My_Int is range 1 .. 10
>      with Size => 16;
> 
> to
> 
>    type My_Int is range 1 .. 10
>      with Size => 32;
> 
> 
> Now
> 
>    type My_Char is range 'A' .. 'Z'
>      with size => 31;
> 
> to
> 
>    type My_Char is range 'A' .. 'Z'
>      with size => 15;
> 
> 
> And
> 
>    type My_Float is digits 6 range 0.0 .. 1_000.0
>      with Mode => Round_To_Nearest_Even;
> 
> These are representation issues.

Representation is a property of a type. Whether representation is relevant
to the semantics of the type depends on the domain space. The type
determines the semantics and relevant aspects of the representation. Not
otherwise.

> I guess there is a view that says rounding is a type?

Certainly. Rounding behavior is determined by the semantics of some type
operations.

> So what:
> 
> An encoding aspect would just be a means that allows programmers
> to say what they mean.

If the aspect maps to a distinct type, then yes. But it is futile to talk
about aspects as "Ada's RM aspects" because it is unclear whether they are
related to the semantics or are implementation artefacts. Ada 2012 blurred
it beyond recognition.

> After all, this is about interpreting a bit pattern at compile
> time,

No. There are no bit patterns at compile time. The compiler operates on an
alphabet. Ada's alphabet is unambiguous: RM 2.1.

Consider an Ada source packed using LZH. Does that produce another program?
No, the program is the same. You can use punched cards instead, or write it
down on a coaster, it is still the same program.

> Even when using 7bit ASCII, there is *no* information in the
> source that explicitly states what is meant in a given ASCII
> String literal.

Again, it has nothing to do with encoding. Which is why errors here are so
dangerous.

> So, saying that ASCII works is just accidentally
> right.

Nothing accidental in the case of UTF-8 reinterpreted as Latin1 and
conversely. It works because ASCII is an integral part of either. Thus an
Ada program written in ASCII is invariant to UTF-8, Latin1 and many other
8-bit encodings.
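
A small sketch of that invariance, assuming an Ada 2012 compiler. The procedure name is made up, and the sample text is an arbitrary ASCII-only string:

```ada
with Ada.Characters.Conversions;
with Ada.Characters.Handling;
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO; use Ada.Text_IO;

procedure ASCII_Invariance is
   Source : constant String := "procedure Main is begin null; end Main;";

   --  The same octets read as UTF-8 ...
   As_UTF_8 : constant Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Decode (Source);

   --  ... and read as Latin-1.
   As_Latin_1 : constant Wide_String :=
     Ada.Characters.Conversions.To_Wide_String (Source);
begin
   --  All characters are in the 7-bit range.
   Put_Line (Boolean'Image (Ada.Characters.Handling.Is_ISO_646 (Source)));
   --  Both readings yield the same characters.
   Put_Line (Boolean'Image (As_UTF_8 = As_Latin_1));
end ASCII_Invariance;
```

Both lines print TRUE. As soon as Source contains an octet above 127, the two readings can differ or the UTF-8 decoding can fail.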

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
  2013-11-16 20:06     ` Peter C. Chapin
  2013-11-17 10:34       ` Stoik
@ 2013-11-22  0:53       ` Randy Brukardt
  1 sibling, 0 replies; 33+ messages in thread
From: Randy Brukardt @ 2013-11-22  0:53 UTC (permalink / raw)


"Peter C. Chapin" <PChapin@vtc.vsc.edu> wrote in message 
news:alpine.DEB.2.02.1311161503000.6074@whirlwind...
> On Sat, 16 Nov 2013, Stoik wrote:
>
>> By the way, nothing changes if I use wide_character and wide_string 
>> instead of character and string. Even if character=octet, certainly 
>> wide_character is not an octet!
>
> It sounds like you want something like
>
>     function UTF8_String_To_Wide_String(S : String) return Wide_String;
>
> UTF-8 is a variable length encoding and thus not the same beast as 
> Wide_String. String literals are going to be encoded in the same manner as 
> the rest of the source text, of course.

Ada 2012 has Ada.Strings.UTF_Encoding for run-time encoding conversions. 
(See A.4.11.) We might be able to do better in the next version of Ada 
(whenever that is), but I wouldn't hold my breath.
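
A minimal sketch of such a run-time conversion, assuming an Ada 2012 compiler. The procedure name Decode_Demo is made up, and the two octets of "ó" are written with Character'Val so that the example itself stays pure ASCII:

```ada
with Ada.Strings.UTF_Encoding;              use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO;                           use Ada.Text_IO;

procedure Decode_Demo is
   --  UTF-8 encoding of U+00F3 (LATIN SMALL LETTER O WITH ACUTE).
   Encoded : constant UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#B3#);

   Decoded : constant Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Decode (Encoded);
begin
   Put_Line ("octets:    " & Integer'Image (Encoded'Length));  --  2
   Put_Line ("characters:" & Integer'Image (Decoded'Length));  --  1
   Put_Line ("code point:"
             & Integer'Image (Wide_Character'Pos (Decoded (Decoded'First))));
end Decode_Demo;
```

This reproduces the length of 2 from the original question on the encoded side and the expected length of 1 after decoding.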

                                Randy.


* Re: strange behaviour of utf-8 files
  2013-11-17 11:12       ` Stoik
@ 2013-11-22  1:03         ` Randy Brukardt
  2013-11-22  3:02           ` Shark8
  0 siblings, 1 reply; 33+ messages in thread
From: Randy Brukardt @ 2013-11-22  1:03 UTC (permalink / raw)


"Stoik" <staszek.goldstein@gmail.com> wrote in message 
news:7464679c-6b98-4e23-a337-83b671473553@googlegroups.com...
> Thanks for your comments. It is obviously a question of having a different 
> encoding in the
> editor and the compiler. I forgot to add the -gnatW8 switch to the 
> compiler (this should be
> a default, I believe).

Ada 2012 requires compilers to accept UTF-8 source code. But given that Ada 
source code historically is Latin-1, it's very unlikely that compilers would 
change the default setting. The effect would be to break the compilation of 
much existing source, a step that most compiler vendors would never take.
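
For GNAT specifically, the encoding can also be stated in the source file itself with the implementation-defined pragma Wide_Character_Encoding, as an alternative to passing -gnatW8 on the command line. A sketch only: Length_Demo is a made-up name, other compilers will not recognize the pragma, and the file has to be saved as UTF-8 for this to make sense:

```ada
pragma Wide_Character_Encoding (UTF8);  --  GNAT-specific, not standard Ada

with Ada.Text_IO; use Ada.Text_IO;

procedure Length_Demo is
   --  With the pragma in effect, the two UTF-8 octets in the source
   --  should be read as the single character U+00F3.
   S : constant Wide_String := "ó";
begin
   Put_Line (Integer'Image (S'Length));
end Length_Demo;
```

The pragma documents the intent next to the literals, but it does not help anyone who compiles the same file with a different compiler.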

Speaking as a vendor, Janus/Ada has a number of default switches that would 
never be the default choices today. But changing the defaults breaks 
*everyone's* build scripts; it's just so disruptive that it's not something 
that we would do unless there was no other choice. It makes command line use 
of compilers with an extensive history harder than we would like, but that's 
the price of having customers that go way back.

If UTF-8 files were somehow identified as such, we could have friendlier 
defaults -- but since the use of the BOM is optional (and discouraged in 
recent Unicode standards), and there are no encoding attributes in common 
file systems (Windows, Linux) -- there really isn't much that we can do. 
This is going to remain a mess for a long time to come, I fear.

                                          Randy.

P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no 
support for any other encoding (of course it supports Wide_String at 
runtime). That will have to change as we migrate to Ada 2012, but it 
probably will be a while before that happens (not a lot of demand).





* Re: strange behaviour of utf-8 files
  2013-11-22  1:03         ` Randy Brukardt
@ 2013-11-22  3:02           ` Shark8
  2013-11-22 11:54             ` Georg Bauhaus
  2013-11-23  4:14             ` Randy Brukardt
  0 siblings, 2 replies; 33+ messages in thread
From: Shark8 @ 2013-11-22  3:02 UTC (permalink / raw)

On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
> 
> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no 
> support for any other encoding (of course it supports Wide_String at 
> runtime). That will have to change as we migrate to Ada 2012, but it 
> probably will be a while before that happens (not a lot of demand).

Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]?

(Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.)

* Re: strange behaviour of utf-8 files
  2013-11-22  3:02           ` Shark8
@ 2013-11-22 11:54             ` Georg Bauhaus
  2013-11-23  4:14             ` Randy Brukardt
  1 sibling, 0 replies; 33+ messages in thread
From: Georg Bauhaus @ 2013-11-22 11:54 UTC (permalink / raw)


On 22.11.13 04:02, Shark8 wrote:
> On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
>>
>> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no
>> support for any other encoding (of course it supports Wide_String at
>> runtime). That will have to change as we migrate to Ada 2012, but it
>> probably will be a while before that happens (not a lot of demand).
>
> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]?
>
> (Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.)

For literals, in general, I think that static expression
functions will be valuable. I wonder why these have not
yet been defined?

For example, an implementation such as Janus/Ada reads
string literals as Latin-1; static expression functions
could then test properties of the literal.
(Length checks being another useful, though less reliable
option.)

Then, when read as Latin-1, the literal String_3'("§ 1")
in the Subject parameter of

    Is_UTF_8 (First => 1, Subject => "§ 1")

would form part of a static expression that is checked at
compile time. In a static predicate, say.


package UTF_8_Checks is

    pragma Pure (UTF_8_Checks);

    --  (Not working statically, in current Ada.)

    --  If:
    --    - static functions include expression functions of only
    --      static expressions,
    --
    --  then function Is_UTF_8 below can test a string literal
    --  at compile time.

    U0 : constant := 0;
    U1 : constant := 2#1000_0000#;
    U2 : constant := 2#1100_0000#;
    U3 : constant := 2#1110_0000#;
    U4 : constant := 2#1111_0000#;
    U5 : constant := 2#1111_1000#;
    UX : constant := 255;

    subtype XString is String (1 .. 12)
       with Static_Predicate => XString'Last < Positive'Last;
    --  for string_literals of a static string subtype

    type XInteger is range 0 .. 255;

    function Is_UTF_8_Follow (C : Character) return Boolean is
       --  an octet that has its most significant bit set, but
       --  not the next one:
       (Character'Pos (C) in U1 .. U2 - 1);

    function Is_UTF_8 (First : Positive; Subject : XString) return Boolean is
       --  every sequence of characters from Subject is a valid UTF-8
       --  sequence, assuming code points up to 16#10_FFFF#.
      (if First > Subject'Last then True
       else
         (case XInteger (Character'Pos (Subject (First))) is
             when 0 .. U1 - 1 =>
                --  "ASCII 7 bit"
               Is_UTF_8 (First + 1, Subject),

             when U1 .. U2 - 1 =>
                --  handled by Is_UTF_8_Follow
                False,

             when U2 .. U3 - 1 =>
               (if First > Subject'Last - 1 then False
                else
                  (for all j in 1 .. 1 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 2, Subject)),

             when U3 .. U4 - 1 =>
               (if First > Subject'Last - 2 then False
                else
                  (for all j in 1 .. 2 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 3, Subject)),

             when U4 .. U5 - 1 =>
               (if First > Subject'Last - 3 then False
                else
                  (for all j in 1 .. 3 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 4, Subject)),

             when U5 .. UX =>
                False));

end UTF_8_Checks;
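
Until something like the above works statically, a run-time counterpart can lean on the Ada 2012 library instead. This is a sketch under that assumption, not a replacement for the compile-time check; the function name Is_Valid_UTF_8 is made up:

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

--  Run-time check: True if Item decodes cleanly as UTF-8.
function Is_Valid_UTF_8 (Item : String) return Boolean is
   use Ada.Strings.UTF_Encoding;
begin
   declare
      --  Decode raises Encoding_Error on an invalid encoding sequence.
      --  The nested block lets the handler below catch it.
      Decoded : constant Wide_Wide_String :=
        Wide_Wide_Strings.Decode (Item);
   begin
      --  A decoded string never has more characters than octets.
      return Decoded'Length <= Item'Length;
   end;
exception
   when Encoding_Error =>
      return False;
end Is_Valid_UTF_8;
```

Called with Character'Val (16#C3#) & Character'Val (16#B3#) it should return True. A lone follow octet such as (1 => Character'Val (16#B3#)) should make Decode raise Encoding_Error, so the result is False.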

* Re: strange behaviour of utf-8 files
  2013-11-22  3:02           ` Shark8
  2013-11-22 11:54             ` Georg Bauhaus
@ 2013-11-23  4:14             ` Randy Brukardt
  2013-12-06  2:17               ` Georg Bauhaus
  1 sibling, 1 reply; 33+ messages in thread
From: Randy Brukardt @ 2013-11-23  4:14 UTC (permalink / raw)


"Shark8" <onewingedshark@gmail.com> wrote in message 
news:672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com...
> On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
>>
>> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has 
>> no
>> support for any other encoding (of course it supports Wide_String at
>> runtime). That will have to change as we migrate to Ada 2012, but it
>> probably will be a while before that happens (not a lot of demand).
>
> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from 
> the customers]?

Not a lot of demand for UTF-8 or wide characters in general. As far as Ada 
2012 goes, if I want to use a feature, it somehow gets in the compiler. :-) 
Customer demand not required (but it always helps).

                             Randy.




* Re: strange behaviour of utf-8 files
  2013-11-23  4:14             ` Randy Brukardt
@ 2013-12-06  2:17               ` Georg Bauhaus
  0 siblings, 0 replies; 33+ messages in thread
From: Georg Bauhaus @ 2013-12-06  2:17 UTC (permalink / raw)

On 23.11.13 05:14, Randy Brukardt wrote:
> "Shark8" <onewingedshark@gmail.com> wrote
>> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from
>> the customers]?
>
> Not a lot of demand for UTF-8 or wide characters in general. As far as Ada
> 2012 goes, if I want to use a feature, it somehow gets in the compiler. :-)
> Customer demand not required (but it always helps).

Actually, programmers seem to suppress existing demand.

Equating "customers" to "consumers" of software for the moment
(who pays?), customers suffer from ASCII-fied communication in
ways that would not be accepted if written on paper.
I got a terribly malformed computer-generated message from no lesser
a company than DHL (inspiring this follow-up).

"??" in the mail text quoted below has obviously been put in place
of what was perfectly UTF-8 encoded character data. (In the mail's
source text, to be sure.)  The non-ASCII character is 'ü' (16#FC#)
in both cases (L.8, L.10):

+======================================================================+
Subject: Ihre Sendung wurde in eine FILIALE umgeleitet
MIME-Version: 1.0
Content-Type: text/plain; charset=ANSI_X3.4-1968
Content-Transfer-Encoding: 7bit

Guten Tag Herr Georg Bauhaus,

leider konnte Ihre Sendung  NICHT in die gew??nschte PACKSTATION eingestellt werden.

Die Sendung liegt f??r Sie in der FILIALE (...)
+======================================================================+
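
The two question marks per character are what one would expect if each UTF-8 octet above 127 is replaced separately. A sketch that reproduces the effect, assuming an Ada 2012 compiler; the procedure name is made up and the replacement rule is a guess at what the mail software does:

```ada
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO; use Ada.Text_IO;

procedure Mangle_Demo is
   U_Umlaut : constant Wide_Character := Wide_Character'Val (16#FC#);
   Text     : constant Wide_String    := "gew" & U_Umlaut & "nschte";

   --  UTF-8 encodes U+00FC as the two octets 16#C3# 16#BC#.
   Octets : String := Ada.Strings.UTF_Encoding.Wide_Strings.Encode (Text);
begin
   --  Squeeze the text into 7 bits, one octet at a time.
   for C of Octets loop
      if Character'Pos (C) > 127 then
         C := '?';
      end if;
   end loop;
   Put_Line (Octets);  --  prints "gew??nschte"
end Mangle_Demo;
```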

Ironically, the messages are produced using an industry standard Java
framework while Java's char data are not 7bit ASCII:

Message-ID: <...48667.JavaMail.ypqbson@HANPQ021>

These messages used to be o.K. in the past. Judging by the count of excess
spaces and long and empty lines in the message, I guess they are having some
competitive programming shop streamline their software.

Character set support can be a real issue when the use of ASCII leads
to misprints of addresses, or to ambiguity in legal documents. Consider
families
    Joseph Müller (16#FC#)
and
   Joseph Möller (16#F6#)
each owning a flat in the same house. If rendered

   Fam. Joseph M??ller
   X Str. 15
   ...

and

   Fam. Joseph M??ller
   X Str. 15
   ...

respectively, what is the postman to do?

Proper support for encoding all characters is a necessity!
