comp.lang.ada
* Toy computational "benchmark" in Ada (new blog post)
@ 2019-06-06 11:05 David Trudgett
  2019-06-06 17:48 ` Olivier Henley
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: David Trudgett @ 2019-06-06 11:05 UTC (permalink / raw)


For what it's worth, you may exercise your laughing gear perusing my toy example just for fun at: 

http://www.eclecticse.com.au/2019/06/procedural-map-reduce-and-parallelism.html

Any tips or suggestions from Ada old hands out there for other ideas to try would be gratefully considered! :-)

Enjoy!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 11:05 Toy computational "benchmark" in Ada (new blog post) David Trudgett
@ 2019-06-06 17:48 ` Olivier Henley
  2019-06-06 23:14   ` David Trudgett
  2019-06-06 20:31 ` Jeffrey R. Carter
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Olivier Henley @ 2019-06-06 17:48 UTC (permalink / raw)


On Thursday, June 6, 2019 at 7:05:02 AM UTC-4, David Trudgett wrote:
> For what it's worth, you may exercise your laughing gear perusing my toy example just for fun at: 
> 
> http://www.eclecticse.com.au/2019/06/procedural-map-reduce-and-parallelism.html
> 
> Any tips or suggestions from Ada old hands out there for other ideas to try would be gratefully considered! :-)
> 
> Enjoy!

Interesting but did you run the C version on your machine alongside your Ada version?

In the text, from what I understand, you are comparing the Ada computation time to the C performance reported by the OP (17 ms), and I fail to understand how the comparison can be valid if performed on different machines.

Note: I tried to find the original C code, but in vain, and from my quick reading of your blog post you do not mention having run that C code yourself.

Thank you,

Olivier


* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 11:05 Toy computational "benchmark" in Ada (new blog post) David Trudgett
  2019-06-06 17:48 ` Olivier Henley
@ 2019-06-06 20:31 ` Jeffrey R. Carter
  2019-06-06 23:02   ` David Trudgett
  2019-06-07  1:42 ` johnscpg
  2019-06-08  1:14 ` johnscpg
  3 siblings, 1 reply; 27+ messages in thread
From: Jeffrey R. Carter @ 2019-06-06 20:31 UTC (permalink / raw)


On 6/6/19 1:05 PM, David Trudgett wrote:
> For what it's worth, you may exercise your laughing gear perusing my toy example just for fun at:
> 
> http://www.eclecticse.com.au/2019/06/procedural-map-reduce-and-parallelism.html

There are some easy things you can do that will probably improve the performance 
of the parallel version. If I could download the code I'd make the changes and 
see what effect they have, but there doesn't seem to be any way to do that.

-- 
Jeff Carter
"He nevere yet no vileynye ne sayde
In al his lyf unto no maner wight."
Canterbury Tales
156



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 20:31 ` Jeffrey R. Carter
@ 2019-06-06 23:02   ` David Trudgett
  2019-06-07  0:13     ` Paul Rubin
  2019-06-07 17:55     ` Jeffrey R. Carter
  0 siblings, 2 replies; 27+ messages in thread
From: David Trudgett @ 2019-06-06 23:02 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 06:31:29 UTC+10, Jeffrey R. Carter ha scritto:
> 
> There are some easy things you can do that will probably improve the performance 
> of the parallel version. If I could download the code I'd make the changes and 
> see what effect they have, but there doesn't seem to be any way to do that.
> 
> -- 
> Jeff Carter
> "He nevere yet no vileynye ne sayde
> In al his lyf unto no maner wight."
> Canterbury Tales
> 156

Thanks, Jeff! I have put the code into a publicly accessible GitLab repo here:

https://gitlab.com/DataLinkDroid/ada-2012-procedural-map-reduce/tree/master/src

Cheers,
David



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 17:48 ` Olivier Henley
@ 2019-06-06 23:14   ` David Trudgett
  2019-06-06 23:27     ` Paul Rubin
  0 siblings, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-06 23:14 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 03:48:19 UTC+10, Olivier Henley ha scritto:
> 
> Interesting but did you run the C version on your machine alongside your Ada version?
> 
> In the text, from what I understand, you are comparing the Ada computation time to the C performance reported by the OP (17 ms), and I fail to understand how the comparison can be valid if performed on different machines.
> 
> Note: I tried to find the original C code, but in vain, and from my quick reading of your blog post you do not mention having run that C code yourself.
> 
> Thank you,
> 
> Olivier

Good point, Olivier. It would have been ideal to run the same C code on my machine, but alas, there are difficulties:

o I would have to run the example on Windows (maybe a particular version?)

o I would have to run Visual Studio C compiler (maybe a particular version?)

o I would have to research how to enable AVX2 instructions in that environment.

o I would have to guess the exact configurations, etc., used originally.

None of those points is "fun", which was the only point of the exercise, really.

Having said that, I do believe that the machine I ran this on, being quite a few years old by now, is probably within the ballpark of the original hardware. But that is difficult to say, because it is not just (or even mostly) a processor and clock speed that will determine the result of this exercise, but also RAM speed, cache speed and size, bus bandwidths, and lots of other things. (Which is why, as you suggest, it would have been ideal to run the same code on my setup.)

If it makes any difference (you may wish to try it yourself), I have made the Ada code available here:

https://gitlab.com/DataLinkDroid/ada-2012-procedural-map-reduce/tree/master/src

Regards,
David


* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 23:14   ` David Trudgett
@ 2019-06-06 23:27     ` Paul Rubin
  2019-06-07  5:24       ` David Trudgett
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Rubin @ 2019-06-06 23:27 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
> Good point, Olivier. It would have been ideal to run the same C code
> on my machine,... I have made the Ada code available here: ...

Can you post the C code?  That will let people try both.

The Ada program really seems like a horrendous amount of code for such a
simple task.  I might try a few alternatives.



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 23:02   ` David Trudgett
@ 2019-06-07  0:13     ` Paul Rubin
  2019-06-07  4:50       ` Paul Rubin
  2019-06-07  5:28       ` David Trudgett
  2019-06-07 17:55     ` Jeffrey R. Carter
  1 sibling, 2 replies; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  0:13 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
> Thanks, Jeff! I have put the code into a publicly accessible GitLab repo here:

I managed to build this with GNAT 6.3.0 on an Intel i5-3570S (3.4 GHz)
with 16GB of memory.  Results:

    SINGLE-THREADED V2
    Data points    :  1024000000
    Sum of squares :  1.01064747680946E+10
    Duration (ms)  :  2456.406644100

    MULTI-THREADED V2
    Data points    :  1024000000
    Worker tasks   :  4
    Sum of squares :  1.01064749025115E+10
    Duration (ms)  :  689.120666560

I don't know what those ms durations measure.  Actual elapsed times were
2 min 5 sec for the single threaded version and 36.484s for the parallel
version.  The reported usermode cpu times were 2m3.2s for the single
threaded version and 2m18s (across the 4 cores) for the parallel version.

I'll see if I can code and time a C++ version.



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 11:05 Toy computational "benchmark" in Ada (new blog post) David Trudgett
  2019-06-06 17:48 ` Olivier Henley
  2019-06-06 20:31 ` Jeffrey R. Carter
@ 2019-06-07  1:42 ` johnscpg
  2019-06-07  5:34   ` David Trudgett
  2019-06-08  1:14 ` johnscpg
  3 siblings, 1 reply; 27+ messages in thread
From: johnscpg @ 2019-06-07  1:42 UTC (permalink / raw)


On Thursday, June 6, 2019 at 12:05:02 PM UTC+1, David Trudgett wrote:
> For what it's worth, you may exercise your laughing gear perusing my toy example just for fun at: 
> 
> http://www.eclecticse.com.au/2019/06/procedural-map-reduce-and-parallelism.html
> 
> Any tips or suggestions from Ada old hands out there for other ideas to try would be gratefully considered! :-)
> 
> Enjoy!

On my machine I get a nice improvement over -O3 when I
take the arrays off the heap, and then use the following 2 flags:

   -march=native -funroll-loops

Modifying the programs is easy:

 --Values_Array : Values_Array_Access := new Values_Array_Type;
   Values_Array : Values_Array_Type;

In the parallel version, change the loop in the task body:

--       declare
--          Val : Float64 renames Values_Array (Idx);
--       begin
            My_Sum := My_Sum + Values_Array (Idx) ** 2;
--       end;

The  -funroll-loops  gave me a nice improvement on the parallel
program, less so on the serial version. (Makes no sense to me
at all!) If you are running in a Unix shell, you usually need
to tell the system if you're going to put giant arrays on the
stack. I type this on the command line: ulimit -s unlimited.

Jonathan



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  0:13     ` Paul Rubin
@ 2019-06-07  4:50       ` Paul Rubin
  2019-06-07  5:41         ` David Trudgett
  2019-06-07  5:28       ` David Trudgett
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  4:50 UTC (permalink / raw)


Paul Rubin <no.email@nospam.invalid> writes:
> Actual elapsed times were 2 min 5 sec for the single threaded [Ada]
> version and 36.484s for the parallel version.  The reported usermode
> cpu times were 2m3.2s for the single threaded version and 2m18s
> (across the 4 cores) for the parallel version.
>
> I'll see if I can code and time a C++ version.

C++ version results: single threaded 45.73 seconds cpu, 47.36 sec elapsed.
Multi-threaded (4 threads): 84.34 sec cpu, 23.11 sec elapsed.

That's using GCC 6.3, with threading done via std::async returning
futures.  So both versions are a fair amount faster than the Ada
version, but the threading speedup is nowhere near as good.  I wonder
what's going on with that.  At each of the 50 calculation runs, I launch
4 threads for slices of the array, then wait for them to complete and
sum the results, so there might be a little bit of idle time if the
threads don't all use the exact same amount of time.

I also want to try std::transform_reduce, and maybe the new Intel TBB
library with GCC 9.  It is possible that transform_reduce can use
Intel SIMD intrinsics, but otherwise the intrinsics can be called
directly with some hassle.  Also I can run on a machine with AVX-512.

I guess the next thing after that would be OpenCL or CUDA and a graphics
card.



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 23:27     ` Paul Rubin
@ 2019-06-07  5:24       ` David Trudgett
  2019-06-07  5:36         ` Paul Rubin
  0 siblings, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-07  5:24 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 09:27:19 UTC+10, Paul Rubin ha scritto:
> 
> Can you post the C code?  That will let people try both.

Please see my response to Olivier. No doubt you hadn't seen that at the time you wrote the above.

> 
> The Ada program really seems like a horrendous amount of code for such a
> simple task.  I might try a few alternatives.

Ada is optimised for non-hello-world programs, which this one is. For brevity, nothing can beat the Common Lisp program 'hello-world!, which is a valid expression executable in the REPL. The fact that Ada spells things out where other languages do not is an advantage in real, long-lived applications.

But you obviously like C++ better, so that's fine.

Regards,
David



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  0:13     ` Paul Rubin
  2019-06-07  4:50       ` Paul Rubin
@ 2019-06-07  5:28       ` David Trudgett
  2019-06-07  5:57         ` Paul Rubin
  1 sibling, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-07  5:28 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 10:14:18 UTC+10, Paul Rubin ha scritto:
> 
> I managed to build this with GNAT 6.3.0 on an Intel i5-3570S (3.4 GHz)
> with 16GB of memory.  Results:
> 
>     SINGLE-THREADED V2
>     Data points    :  1024000000
>     Sum of squares :  1.01064747680946E+10
>     Duration (ms)  :  2456.406644100
> 
>     MULTI-THREADED V2
>     Data points    :  1024000000
>     Worker tasks   :  4
>     Sum of squares :  1.01064749025115E+10
>     Duration (ms)  :  689.120666560
> 
Those times are amazingly slow. Perhaps you would like to post the disassembly of the inner loop to see what the issue is?


> I don't know what those ms durations measure.

This means that you did not read either the article or the source code.


>  Actual elapsed times were
> 2 min 5 sec for the single threaded version and 36.484s for the parallel
> version.  The reported usermode cpu times were 2m3.2s for the single
> threaded version and 2m18s (across the 4 cores) for the parallel version.
> 
> I'll see if I can code and time a C++ version.

Righto.

Cheers,
David



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  1:42 ` johnscpg
@ 2019-06-07  5:34   ` David Trudgett
  2019-06-08 10:17     ` David Trudgett
  0 siblings, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-07  5:34 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 11:42:07 UTC+10, john...@googlemail.com ha scritto:
> 
> On my machine I get a nice improvement over -O3 when I
> take the arrays off the heap, and then use the following 2 flags:
> 
>    -march=native -funroll-loops

That's interesting. Thank you. I'll try that (and your mods below) over the weekend and see what the result is for me.

I thought the -O3 would unroll loops where appropriate. Is that not the case?

I assume that native arch means it will generate optimal instructions for the particular architecture on which the compile is running?

> 
> Modifying the programs is easy:
> 
>  --Values_Array : Values_Array_Access := new Values_Array_Type;
>    Values_Array : Values_Array_Type;
> 
> In the parallel version, change the loop in the task body:
> 
> --       declare
> --          Val : Float64 renames Values_Array (Idx);
> --       begin
>             My_Sum := My_Sum + Values_Array (Idx) ** 2;
> --       end;
> 
> The  -funroll-loops  gave me a nice improvement on the parallel
> program, less so on the serial version. (Makes no sense to me
> at all!) If you are running in a Unix shell, you usually need
> to tell the system if you're going to put giant arrays on the
> stack. I type this on the command line: ulimit -s unlimited.

Ah yes. I used the heap because I didn't want to use such a huge stack (and got the expected error message when I tried anyway). But I wonder why the heap should be any slower? I can't see any reason why it would be.


Regards,
David


* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  5:24       ` David Trudgett
@ 2019-06-07  5:36         ` Paul Rubin
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  5:36 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
> But you obviously like C++ better, so that's fine.

I wouldn't say I like C++ better; I just wanted to benchmark it.  I got
the C++ working but it was pretty painful and there is a lot of code.
I've been hanging out here and maintaining some interest in Ada as a
refuge from the perils of C++.  But so far, C++ is easier for me to use
because I'm more familiar with it.

In your reply to Olivier about the C code, you mentioned various
problems in the way of your being able to compile and run it.  But if
you have a copy, I wonder if you could post it for the rest of us to try
even if you can't try it yourself.

I'd also like to try Rust (which I have never tried at all so far), and
Haskell's Accelerate library:

  https://hackage.haskell.org/package/accelerate

I haven't tried Accelerate yet but am an intermediate-level Haskell user
so with luck I can figure it out.

Julia is also supposed to be good for this sort of thing.  I thought of
trying NumPy, but I figured that the obvious implementations would either
invoke the Python interpreter too much (slow) or else make extra copies
of the very large array (memory bloat), either way making it
uncompetitive with Ada and C++.  But I've had only slight exposure to
NumPy so far.


* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  4:50       ` Paul Rubin
@ 2019-06-07  5:41         ` David Trudgett
  2019-06-07  6:00           ` Paul Rubin
  0 siblings, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-07  5:41 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 14:50:07 UTC+10, Paul Rubin ha scritto:
> 
> C++ version results: single threaded 45.73 seconds cpu, 47.36 sec elapsed.
> Multi-threaded (4 threads): 84.34 sec cpu, 23.11 sec elapsed.

You have not coded the same thing as the Ada code. Do you know any Ada at all?

> 
> That's using GCC 6.03 with threading done by std::future's async
> function.  So both versions are a fair amount faster than the Ada
> version, but the threading speedup is nowhere near as good.  I wonder
> what's going on with that.  At each of the 50 calculation runs, I launch
> 4 threads 

As you should know, the Ada code does not do this.


> for slices of the array, then wait for them to complete and
> sum the results, so there might be a little bit of idle time if the
> threads don't all use the exact same amount of time.

While eye-balling the Ada test runs, it seems that each of the eight tasks takes virtually identical time. I didn't bother outputting the individual run times, however.

> 
> I also want to try with transform_map_reduce and maybe the new Intel TBB
> library with GCC 9.  It is possible that transform_map_reduce can use
> Intel SIMD intrinsics but otherwise they can be called directly with
> some hassle.  Also I can run on a machine with AVX512.
> 
> I guess the next thing after that would be OpenCL or CUDA and a graphics
> card.

Good luck with that! :-)

Cheers,
David



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  5:28       ` David Trudgett
@ 2019-06-07  5:57         ` Paul Rubin
  2019-06-07  6:21           ` David Trudgett
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  5:57 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
> Those times are amazingly slow. Perhaps you would like to post the
> disassembly of the inner loop to see what the issue is?

I think this is it below.  I don't understand why it's accumulating the
values into a stack slot instead of a register, but I don't know the x86
all that well.

    .L17:
            addq    $1, %rax
    .LEHB4:
            movq    416(%r12), %rdx
            movsd   -8(%rdx,%rax,8), %xmm0
    .LEHE4:
            mulsd   %xmm0, %xmm0
            cmpq    %rax, %rbp
            addsd   8(%rsp), %xmm0
            movsd   %xmm0, 8(%rsp)
            jne     .L17
            jmp     .L19



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  5:41         ` David Trudgett
@ 2019-06-07  6:00           ` Paul Rubin
  2019-06-07  6:25             ` David Trudgett
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  6:00 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
> You have not coded the same thing as the Ada code. Do you know any Ada
> at all?

Not much.  I followed the single threaded code pretty closely though,
then used what I think is the current idiomatic C++ method of doing this
stuff with threads, since C++ doesn't have Ada-like tasks.

>> At each of the 50 calculation runs, I launch 4 threads
> As you should know, the Ada code does not do this.

Yes I'll see if I can rework the C++ code to not launch new threads at
each run.  The total number of threads launched isn't all that large
though, and there are never more than 4 running concurrently.



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  5:57         ` Paul Rubin
@ 2019-06-07  6:21           ` David Trudgett
  2019-06-07  6:22             ` Paul Rubin
  0 siblings, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-07  6:21 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 15:57:36 UTC+10, Paul Rubin ha scritto:
> 
> I think this is it below.  I don't understand why it's accumulating the
> values into a stack slot instead of a register, but I don't know the x86
> all that well.
> 
>     .L17:
>             addq    $1, %rax
>     .LEHB4:
>             movq    416(%r12), %rdx
>             movsd   -8(%rdx,%rax,8), %xmm0
>     .LEHE4:
>             mulsd   %xmm0, %xmm0
>             cmpq    %rax, %rbp
>             addsd   8(%rsp), %xmm0
>             movsd   %xmm0, 8(%rsp)
>             jne     .L17
>             jmp     .L19

Instead, you should have something like:

.L5:
    movsd (%rax), %xmm0
    addq $8, %rax
    cmpq %rbx, %rax
    mulsd %xmm0, %xmm0
    addsd %xmm0, %xmm1
    jne .L5

It would appear you did not turn on optimisations, since, as you noticed, the code uses the stack-frame variable directly rather than keeping the sum in a register.

Cheers,
David


* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  6:21           ` David Trudgett
@ 2019-06-07  6:22             ` Paul Rubin
  2019-06-07  6:29               ` David Trudgett
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  6:22 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
> It would appear you did not turn on optimisations since, as you
> noticed, it is directly using the stack frame variable rather than
> keeping it in a register.

I used your .gpr file which has -O3 in it.  Is there something else I
should do?



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  6:00           ` Paul Rubin
@ 2019-06-07  6:25             ` David Trudgett
  2019-06-07  6:38               ` Paul Rubin
  0 siblings, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-07  6:25 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 16:00:31 UTC+10, Paul Rubin ha scritto:

> David Trudgett writes:
> > You have not coded the same thing as the Ada code. Do you know any Ada
> > at all?
> 
> Not much.  I followed the single threaded code pretty closely though,
> then used what I think is the current idiomatic C++ method of doing this
> stuff with threads, since C++ doesn't have Ada-like tasks.

Okay. Even with the single threaded one, you will see that no setup time is included in the timing of the work.

> 
> >> At each of the 50 calculation runs, I launch 4 threads
> > As you should know, the Ada code does not do this.
> 
> Yes I'll see if I can rework the C++ code to not launch new threads at
> each run.  The total number of threads launched isn't all that large
> though, and there are never more than 4 running concurrently.

It doesn't matter if you don't time that part of the execution. But I'm guessing that would be hard in C++.

Cheers,
David




* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  6:22             ` Paul Rubin
@ 2019-06-07  6:29               ` David Trudgett
  2019-06-07  6:42                 ` Paul Rubin
  0 siblings, 1 reply; 27+ messages in thread
From: David Trudgett @ 2019-06-07  6:29 UTC (permalink / raw)


Il giorno venerdì 7 giugno 2019 16:22:26 UTC+10, Paul Rubin ha scritto:
> David Trudgett writes:
> > It would appear you did not turn on optimisations since, as you
> > noticed, it is directly using the stack frame variable rather than
> > keeping it in a register.
> 
> I used your .gpr file which has -O3 in it.  Is there something else I
> should do?

If you used the project file, it should be optimised, but it obviously isn't, so I don't know what you did. Try GNAT Community 2019 instead, perhaps?

Cheers,
David


* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  6:25             ` David Trudgett
@ 2019-06-07  6:38               ` Paul Rubin
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  6:38 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
>> Yes I'll see if I can rework the C++ code to not launch new threads at
>> each run.

> It doesn't matter if you don't time that part of the execution. But
> I'm guessing that would be hard in C++.

I used the C++ futures/async feature which in principle could use a
thread pool, but in GCC 6 I'm told that it launches a new thread for
each async.  Otherwise I used what I think is the same strategy as the
Ada code:

      for Worker in Worker_Range loop
         Tasks (Worker).Run
           (Start_Idx  => Idx_Ranges (Worker, Low),
            Finish_Idx => Idx_Ranges (Worker, High));
      end loop;

It looks like you have a static array of worker threads, and I'm
guessing that the .Run call sends some kind of event to a worker to tell it
to run the task body.  I didn't try to figure it all out though.  Does
Ada automatically create an event loop in each task to receive those
messages?  If yes, that's pretty cool.

Is there a request queue between the main program and each worker?  What
happens if it gets too large?  Is there a way to share a single queue
between a bunch of workers?

I read "Ada Distilled" some time back but iirc it didn't say much about
this.  I've been wanting to get a more detailed book.



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  6:29               ` David Trudgett
@ 2019-06-07  6:42                 ` Paul Rubin
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Rubin @ 2019-06-07  6:42 UTC (permalink / raw)


David Trudgett <dktrudgett@gmail.com> writes:
> If you used the project file, it should be optimised, but it obviously
> isn't, so I don't know what you did. Try GNAT Community 2019 instead,
> perhaps?

With -O3 removed from the .gpr, I see this:

    .L58:
            movq    -72(%rbp), %rax
            cmpq    %rax, -40(%rbp)
            jg      .L69
            movq    -40(%rbp), %rax
            movq    432(%rbx), %rdx
            leaq    0(,%rax,8), %rcx
            subq    $8, %rcx
            addq    %rcx, %rdx
            movq    %rdx, -80(%rbp)
            movq    432(%rbx), %rdx
            movsd   -8(%rdx,%rax,8), %xmm1
            movq    432(%rbx), %rdx
            movsd   -8(%rdx,%rax,8), %xmm0
    .LEHE26:
            mulsd   %xmm1, %xmm0
            movsd   -112(%rbp), %xmm1
            addsd   %xmm1, %xmm0
            movsd   %xmm0, -112(%rbp)
            addq    $1, -40(%rbp)
            jmp     .L58

I probably can't do much more tonight, but might try a newer GNAT
version in the coming days.  I'll also try GCC 9 for the C++ version.



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 23:02   ` David Trudgett
  2019-06-07  0:13     ` Paul Rubin
@ 2019-06-07 17:55     ` Jeffrey R. Carter
  2019-06-08 11:00       ` David Trudgett
  1 sibling, 1 reply; 27+ messages in thread
From: Jeffrey R. Carter @ 2019-06-07 17:55 UTC (permalink / raw)


On 6/7/19 1:02 AM, David Trudgett wrote:
> 
> Thanks, Jeff! I have put the code into a publicly accessible GitLab repo here:

Thanks. Often entry calls create unnecessary overhead, but I made a version 
without them and it didn't make a significant difference (about 1 ms less for 
256M values with 4 processors).

-- 
Jeff Carter
"You couldn't catch clap in a brothel, silly English K...niggets."
Monty Python & the Holy Grail
19



* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-06 11:05 Toy computational "benchmark" in Ada (new blog post) David Trudgett
                   ` (2 preceding siblings ...)
  2019-06-07  1:42 ` johnscpg
@ 2019-06-08  1:14 ` johnscpg
  2019-06-08 10:56   ` David Trudgett
  3 siblings, 1 reply; 27+ messages in thread
From: johnscpg @ 2019-06-08  1:14 UTC (permalink / raw)


> I thought the -O3 would unroll loops where appropriate. Is that not the case?

Not on gcc. Unrolling doesn't seem to help much though.

> I assume that native arch means it will generate optimal instructions for the particular architecture on which the compile is running?

Sometimes it makes things worse! Though that's rare. Sometimes it helps a little. That's my experience, which is pretty limited.

> Ah yes. I used the heap because I didn't want to use such a huge stack (and got the expected error message when I tried anyway). But I wonder why the heap should be any slower? I can't see any reason why it would be.

CPUs and compilers are so complex now that I never know
for sure what's going on. The interesting thing here is
that the array is almost entirely in RAM, which makes floating
point desperately slow.

If you compile the 2 programs below with the -S switch,
and read the .s file, then you find that gcc produces SSE code
for both the C and Ada programs.  In other words you
see instructions like:
   vmulsd  %xmm0, %xmm0, %xmm0
   vaddsd  %xmm0, %xmm1, %xmm1
That won't help much if fetching memory from RAM is too slow
to keep the multipliers busy. 

If you compile with the -mfpmath=387 switch, then no SSE code
is generated, and the running time is about the same. (On my
machine.)

When you compare programs in different languages, you need to
write them the same way. See below! I get identical run times from
the two with all the compiler switches I try, as long as they
are the same compiler switches. You can try various combinations
of O2, O3, -mfpmath=387 etc:

  gnatmake -O3 -march=native -funroll-loops map.adb
  gcc -O3 -march=native -funroll-loops map.c

and remember to make room for the arrays on the stack. On the
bash shell, it's ulimit -s unlimited. On Linux, timing
with 'time ./a.out' and 'time ./map' works OK, but run them
repeatedly, and remove any background processes (like browsers!).

#include <stdio.h>
int main()
{
    int Calculation_Runs = 100;
    int Data_Points = 320000000;
    int i, j;
    double s = 0.0;   /* must be initialised before accumulating */
    double v[Data_Points];

    for (i=0; i<Data_Points; i++){
      v[i] = 3.14159265358979323846;
    }

    for (j=0; j<Calculation_Runs; j++){
        for (i=0; i<Data_Points; i++){
          s = s + v[i] * v[i];
        }
    }
    printf("Sum = %f\n",s);
    return 0;
}

with Ada.Text_IO; use Ada.Text_IO;
procedure Map is
   Calculation_Runs : constant := 100;
   Data_Points : constant := 320_000_000;

   type Values_Index is range 1 .. Data_Points;
   type Float64 is digits 15;
   type Values_Array_Type is array (Values_Index) of Float64;
   Values_Array : Values_Array_Type;
   Sum : Float64 := 0.0;
begin
   for i in Values_Index loop
      Values_Array (i) := 3.14159265358979323846;
   end loop;

   for j in 1 .. Calculation_Runs loop
      for i in Values_Index loop
         Sum := Sum + Values_Array (i) * Values_Array (i);
      end loop;
   end loop;
   Put_Line ("Sum = " & Sum'Image);
end Map;



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07  5:34   ` David Trudgett
@ 2019-06-08 10:17     ` David Trudgett
  0 siblings, 0 replies; 27+ messages in thread
From: David Trudgett @ 2019-06-08 10:17 UTC (permalink / raw)


On Friday, June 7, 2019 at 3:34:22 PM UTC+10, David Trudgett wrote:
> On Friday, June 7, 2019 at 11:42:07 AM UTC+10, john...@googlemail.com wrote:
> > 
> > On my machine I get a nice improvement over -O3 when I
> > take the arrays off the heap, and then use the following 2 flags:
> > 
> >    -march=native -funroll-loops
> 
> That's interesting. Thank you. I'll try that (and your mods below) over the weekend and see what the result is for me.
> 
> I thought the -O3 would unroll loops where appropriate. Is that not the case?
> 
> I assume that native arch means it will generate optimal instructions for the particular architecture on which the compile is running?
> 
> > 
> > Modifying the programs is easy:
> > 
> >  --Values_Array : Values_Array_Access := new Values_Array_Type;
> >    Values_Array : Values_Array_Type;
> > 
> > In the parallel version, change the loop in the task body:
> > 
> > --       declare
> > --          Val : Float64 renames Values_Array (Idx);
> > --       begin
> >             My_Sum := My_Sum + Values_Array (Idx) ** 2;
> > --       end;
> > 
> > The  -funroll-loops  gave me a nice improvement on the parallel
> > program, less so on the serial version. (Makes no sense to me
> > at all!) If you are running in a Unix shell, you usually need
> > to tell the system if you're going to put giant arrays on the
> > stack. I type this on the command line: ulimit -s unlimited.
> 
> Ah yes. I used the heap because I didn't want to use such a huge stack (and got the expected error message when I tried anyway). But I wonder why the heap should be any slower? I can't see any reason why it would be.
> 
> 

Okay, I have tried (a) using stack allocation instead of heap, and (b) arch=native compilation, and I have compared the resulting timings with the original. The timings below are averages over three consecutive runs of each program, so each average effectively covers 150 calculation runs all together (50 calc runs per program run).

Original program: 434.718 ms

Stack allocation: 435.667 ms

Native arch flag: 435.745 ms

As you can see, there is virtually no difference, and I did verify that the native architecture compilation did, in fact, use AVX instructions rather than SSE (but not AVX2).

It's interesting that AVX instructions did not cause any improvement in run time (it technically added 1 ms, but I expect that to be statistically insignificant).

Cheers,
David


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-08  1:14 ` johnscpg
@ 2019-06-08 10:56   ` David Trudgett
  0 siblings, 0 replies; 27+ messages in thread
From: David Trudgett @ 2019-06-08 10:56 UTC (permalink / raw)


On Saturday, June 8, 2019 at 11:14:06 AM UTC+10, john wrote:
> > I thought the -O3 would unroll loops where appropriate. Is that not the case?
> 
> Not on gcc. Unrolling doesn't seem to help much though.

Yes, there is nothing to unroll, except a massive one-billion-iteration loop, which (a) would not be reasonable to unroll fully, and (b) would most likely cause a performance hit if it were, due to cache effects.

> 
> > I assume that native arch means it will generate optimal instructions for the
> > particular architecture on which the compile is running?
> 
> Sometimes it makes things worse! Though that's rare. Sometimes it helps a little. That's my experience, which is pretty limited.

In this case, I see that it generated AVX instructions instead of SSE, but there was no speed gain as a result.



> 
> > Ah yes. I used the heap because I didn't want to use such a huge stack (and got
> > the expected error message when I tried anyway). But I wonder why the heap
> > should be any slower? I can't see any reason why it would be.
> 
> CPUs and compilers are so complex now that I never know
> for sure what's going on. The interesting thing here is
> that the array is almost entirely in RAM, which makes floating
> point desperately slow.

With linear RAM access, I would expect the cache to be pre-populated by the CPU's predictive prefetching mechanisms. The fact that the CPU is pegged at 100% during the (single-threaded) calculation would seem to support this idea and indicate that RAM is supplying data at a fast enough rate. In the multithreaded version, each CPU was about 1% idle, presumably due to some SMP contention (maybe bus bandwidth limitations or something like that).

(It is my understanding as a non-hardware specialist that it is usually RAM latency that is the real performance killer, not the theoretical raw throughput potential, which is rarely achieved.)

It does seem to me that processing 8 GiB worth of floating point values, doing a multiply and add for each one in under half a second using SSE2 instructions, is pretty good, really.

> 
> If you compile the 2 programs below with the -S switch,
> and read the .s file, then you find that gcc produces SSE code
> for both the C and Ada programs.  In other words you
> see instructions like:
>    vmulsd  %xmm0, %xmm0, %xmm0
>    vaddsd  %xmm0, %xmm1, %xmm1

Those are AVX instructions, I think. (SSE would be MULSD and ADDSD, as I understand it.)


> That won't help much if fetching memory from RAM is too slow
> to keep the multipliers busy. 
> 
> If you compile with the -mfpmath=387 switch, then no SSE code
> is generated, and the running time is about the same. (On my
> machine.)
> 
> When you compare programs in different languages, you need to
> write them the same. See below! I get identical run times from
> the two with all the compiler switches I try, as long as they
> are the same compiler switches. You can try various combinations
> of O2, O3, -mfpmath=387 etc:

Yeah. We are not really comparing languages here, though, but the instructions generated by the compiler. I'm sure that if we got GNAT to generate AVX2 or AVX512 instructions, the performance would be the same as the AVX2 code generated by the MS C compiler.

We have to bear in mind, though, that there are limited reasons to want that level of processor-specific binary, because more often one wants a program to run on a variety of processors within the same family. Of course, a clever compiler could in theory compile several variants and choose between them at run time.

> 
>   gnatmake -O3 -march=native -funroll-loops map.adb
>   gcc -O3 -march=native -funroll-loops map.c
> 
> and remember to make room for the arrays on the stack. On the
> bash shell, it's ulimit -s unlimited. On Linux, timing
> with 'time ./a.out' and 'time ./map' works ok, but run them
> repeatedly, and close any background processes (like browsers!)
> 
> #include <stdio.h>
> int main(void)
> {
>     int Calculation_Runs = 100;
>     int Data_Points = 320000000;
>     int i, j;
>     double s = 0.0;
>     double v[Data_Points];
> 
>     for (i=0; i<Data_Points; i++){
>       v[i] = 3.14159265358979323846;
>     }
> 
>     for (j=0; j<Calculation_Runs; j++){
>         for (i=0; i<Data_Points; i++){
>           s = s + v[i] * v[i];
>         }
>     }
>     printf("Sum = %f\n", s);
>     return 0;
> }
> 
> with Ada.Text_IO; use Ada.Text_IO;
> procedure Map is
>    Calculation_Runs : constant := 100;
>    Data_Points : constant := 320_000_000;
> 
>    type Values_Index is range 1 .. Data_Points;
>    type Float64 is digits 15;
>    type Values_Array_Type is array (Values_Index) of Float64;
>    Values_Array : Values_Array_Type;
>    Sum : Float64 := 0.0;
> begin
>    for i in Values_Index loop
>       Values_Array (i) := 3.14159265358979323846;
>    end loop;
> 
>    for j in 1 .. Calculation_Runs loop
>       for i in Values_Index loop
>          Sum := Sum + Values_Array (i) * Values_Array (i);
>       end loop;
>    end loop;
>    Put_Line ("Sum = " & Sum'Image);
> end Map;

Note that there is no timing in either of those versions, so if you are using a shell timer (as in bash: "$ time ./myprog"), you are probably not getting good resolution and, more importantly, you are timing other things besides the "map reduce" calculation, which were specifically meant to be excluded.

Cheers,
David


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Toy computational "benchmark" in Ada (new blog post)
  2019-06-07 17:55     ` Jeffrey R. Carter
@ 2019-06-08 11:00       ` David Trudgett
  0 siblings, 0 replies; 27+ messages in thread
From: David Trudgett @ 2019-06-08 11:00 UTC (permalink / raw)


On Saturday, June 8, 2019 at 3:55:10 AM UTC+10, Jeffrey R. Carter wrote:
> On 6/7/19 1:02 AM, David Trudgett wrote:
> > 
> > Thanks, Jeff! I have put the code into a public-accessible Gitlab repo here:
> 
> Thanks. Often entry calls create unnecessary overhead, but I made a version 
> without them and it didn't make a significant difference (about 1 ms less for 
> 256M values with 4 processors).
> 
> -- 
> Jeff Carter
> "You couldn't catch clap in a brothel, silly English K...niggets."
> Monty Python & the Holy Grail
> 19

Yes, I don't think that the entry calls could have any measurable effect in this example, as there are only a few of them, there is no contention, and the only queueing occurs on Get_Results, which is expected and desired.

Regards,
David

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-06-08 11:00 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-06 11:05 Toy computational "benchmark" in Ada (new blog post) David Trudgett
2019-06-06 17:48 ` Olivier Henley
2019-06-06 23:14   ` David Trudgett
2019-06-06 23:27     ` Paul Rubin
2019-06-07  5:24       ` David Trudgett
2019-06-07  5:36         ` Paul Rubin
2019-06-06 20:31 ` Jeffrey R. Carter
2019-06-06 23:02   ` David Trudgett
2019-06-07  0:13     ` Paul Rubin
2019-06-07  4:50       ` Paul Rubin
2019-06-07  5:41         ` David Trudgett
2019-06-07  6:00           ` Paul Rubin
2019-06-07  6:25             ` David Trudgett
2019-06-07  6:38               ` Paul Rubin
2019-06-07  5:28       ` David Trudgett
2019-06-07  5:57         ` Paul Rubin
2019-06-07  6:21           ` David Trudgett
2019-06-07  6:22             ` Paul Rubin
2019-06-07  6:29               ` David Trudgett
2019-06-07  6:42                 ` Paul Rubin
2019-06-07 17:55     ` Jeffrey R. Carter
2019-06-08 11:00       ` David Trudgett
2019-06-07  1:42 ` johnscpg
2019-06-07  5:34   ` David Trudgett
2019-06-08 10:17     ` David Trudgett
2019-06-08  1:14 ` johnscpg
2019-06-08 10:56   ` David Trudgett
