See the top level README for information on where to find the
schematic and programmer's reference manual for the ARM processor
on the Raspberry Pi, and for information on how to load and run
these programs.

This example serves several purposes. First, it is a real-world
library/application demonstrated in a bare-metal embedded format. Like
extest, it demonstrates the use of the mmu so that the data cache can
be safely turned on. One of the main goals, though, comes from
questions I have fielded over the years, in person and in forums, that
lead me to believe many programmers live under the myth that there is
a one-to-one relationship between high level code and the compiled
binary: that if you want to change the performance of your code, you
must change your code. There is no truth to that. Likewise the belief
that any two compilers can or should be expected to produce the same
machine code from the same high level source code; there is no reason
to expect such a thing. Think about it this way: take 100 programmers,
or 100 students, and give them the same programming assignment, a
homework problem or a semester project. Do all 100 produce the exact
same source code? No, of course not. The solutions may fall into
definable categories, but you are going to see many solutions. It is
no different here: each compiler is created by different people, for
different reasons, with different skills and different goals, and the
output is not the same. Sure, for some simple programs there may be an
ideal solution that two compilers converge on, but on average,
especially as the program gets large, the solutions will vary.

What I have here is the zlib library. In the Linux world it is on a
par with gcc as a cornerstone holding the whole thing up. Not unlike
gcc, you really don't want to actually look at the code: the jumping
around between longs and ints, and the inconsistent way variables are
declared, make your compiler work at least a little harder. I have
some text that I am going to compress and then uncompress, making a
nice self-checking test. I am going to compare three compilers using
different compiler options, and turn the caches on and off.

With respect to the zlib licence: this zlib library is not mine,
though I have made modifications to it. I have commented out the
includes so that it won't interfere with compiling bare-metal (it just
makes it easier to know what I do and don't have to manage). I do not
use a C library, so a few C library functions had to be implemented:
malloc, calloc, memset, memcpy and free. The malloc is very, very
simple, and free does nothing; this is a once-through test, so there
is no need to actually manage memory allocation, just give the code
what it asks for. Fortunately the Raspberry Pi is well endowed with
memory.
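
The allocator described above can be sketched like this. This is an
illustrative reimplementation, not the example's actual code; the
names, pool size, and 8-byte alignment are my own choices:

```c
#include <stddef.h>
#include <string.h>

/* Fixed pool carved up by a bump allocator: malloc hands out the next
   chunk and free does nothing, which is fine for a once-through test. */
#define POOL_SIZE (1 << 20)
static unsigned char pool[POOL_SIZE];
static size_t next_free = 0;

void *my_malloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;               /* keep 8-byte alignment */
    if (next_free + n > POOL_SIZE) return NULL;
    void *p = &pool[next_free];
    next_free += n;
    return p;
}

void *my_calloc(size_t count, size_t size)
{
    void *p = my_malloc(count * size);
    if (p) memset(p, 0, count * size);      /* calloc must zero */
    return p;
}

void my_free(void *p) { (void)p; }          /* intentionally a no-op */
```

In the real bare-metal build memset itself is also hand-written, since
there is no C library to supply it.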

I have two versions of gcc: 4.6.1 from the CodeSourcery folks (now
Mentor Graphics), and 4.7.0 built from sources using the build script
in the buildgcc directory. The third compiler is clang/llvm, built
using the build script in the buildgcc directory. This is a completely
separate compiler from gcc, with different goals and different
construction. Thanks to Apple and the iPhone, the llvm compiler tools
have had a big boost in quality, and the code produced is approaching
the speed of gcc's. In tests I did a while back, other compilers,
pay-for compilers, blew gcc out of the water. Gcc is a good average
compiler across many targets, but not great at any particular target;
other compilers are better.

For gcc I am going to play with the obvious optimizations, -Ox, a few
of them, to show they actually do something. I will also leave the
default arm architecture setting in some builds and specify the proper
architecture in others, to see what that does. There are many, many
knobs you can play with; I don't even need this many combinations to
demonstrate that the same compiler, or different compilers, can create
binaries that execute at different speeds from the same high level
source code.

So the gcc knobs are -O1, -O2 and -O3, plus commenting and
uncommenting defines in the code that enable the instruction cache,
the mmu and the data cache (the mmu must be on to enable the data
cache).

COPS0 = -Wall -O1 -nostdlib -nostartfiles -ffreestanding
COPS1 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding
COPS2 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding
COPS3 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s
COPS4 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s
COPS5 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s

arm-none-eabi-gcc --version
arm-none-eabi-gcc (Sourcery CodeBench Lite 2011.09-69) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the 4.6.1 compiler, I am going to just dive in. The code
compresses some known data, then decompresses it. The data has been
pre-computed on the host computer so the expected compressed result is
known. A system timer is used to time how long the compress and
decompress take, then the results are checked after the timer has been
sampled. So in addition to looking at the time result you also need to
make sure the test actually passed.
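
The shape of that self-checking test can be sketched like this. The
hardware timer and the zlib calls are replaced by stand-ins here;
sample_timer, do_compress and do_uncompress are hypothetical names,
not functions from the example's sources:

```c
#include <string.h>

/* Stand-in for the free-running system timer on the board. */
static unsigned int fake_time;
static unsigned int sample_timer(void) { return fake_time += 100; }

/* Stand-ins for the zlib compress/uncompress calls. */
static void do_compress(const char *in, char *out)   { strcpy(out, in); }
static void do_uncompress(const char *in, char *out) { strcpy(out, in); }

/* Time the work first, then verify against pre-computed expected data,
   so the checking does not pollute the measurement. */
int run_test(const char *known, const char *expected, unsigned int *elapsed)
{
    char compressed[256], uncompressed[256];

    unsigned int start = sample_timer();
    do_compress(known, compressed);
    do_uncompress(compressed, uncompressed);
    unsigned int stop = sample_timer();

    *elapsed = stop - start;                     /* sample timer first */
    if (strcmp(compressed, expected)) return 1;  /* ...then verify */
    if (strcmp(uncompressed, known))  return 1;
    return 0;
}
```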

cops     icache  mmu  dcache  time

COPS0    no      no   no      0xD0D08A
COPS1    no      no   no      0xB92BA4
COPS2    no      no   no      0xA410E1
COPS3    no      no   no      0xB4DC8A
COPS4    no      no   no      0xB4D11B
COPS5    no      no   no      0xA450DB

COPS0    yes     no   no      0x9DC2DF
COPS1    yes     no   no      0x8F5ECF
COPS2    yes     no   no      0x8832F2
COPS3    yes     no   no      0x8F9C79
COPS4    yes     no   no      0x8F9ED4
COPS5    yes     no   no      0x8AE077

COPS3    yes     yes  no      0x174FA4
COPS4    yes     yes  no      0x175336
COPS5    yes     yes  no      0x162750

COPS3    yes     yes  yes     0x176068
COPS4    yes     yes  yes     0x175CB0
COPS5    yes     yes  yes     0x162590

The first interesting thing we see is that even though the data cache
is not enabled in the control register, it is obviously on in some
form or fashion. Another interesting thing is that at -O3 the generic
arm build (ARMv5 or whatever the generic binary targets) was a little
faster than -O3 specifying the actual processor. The reason is more
complicated than just the architecture; I will get to that in a bit.

Note that these results were all collected by hand, so there is a
possibility of human error. Ideally you will use this information to
learn from and not care too much about the specific numbers.

arm-none-eabi-gcc (GCC) 4.7.0
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cops     icache  mmu  dcache  time

COPS0    no      no   no      0xD1360B
COPS1    no      no   no      0xB7683E
COPS2    no      no   no      0xA83E6F
COPS3    no      no   no      0xBB1436
COPS4    no      no   no      0xBB0B60
COPS5    no      no   no      0xA6CF10

COPS0    yes     no   no      0x9CFAAD
COPS1    yes     no   no      0x8DA3D1
COPS2    yes     no   no      0x84B7C5
COPS3    yes     no   no      0x8E6FDB
COPS4    yes     no   no      0x8E73A6
COPS5    yes     no   no      0x86156D

COPS3    yes     yes  no      0x17EA80
COPS4    yes     yes  no      0x17FA4B
COPS5    yes     yes  no      0x15B210

COPS3    yes     yes  yes     0x17F0C8
COPS4    yes     yes  yes     0x17FB53
COPS5    yes     yes  yes     0x15B55E

arm-none-eabi-gcc (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cops     icache  mmu  dcache  time

COPS0    no      no   no      0xD48A9F
COPS1    no      no   no      0xB994DC
COPS2    no      no   no      memset
COPS3    no      no   no      0xC4CEBC
COPS4    no      no   no      0xC4D880
COPS5    no      no   no      memset

COPS0    yes     no   no      0x9D6225
COPS1    yes     no   no      0x90D014
COPS2    yes     no   no      memset
COPS3    yes     no   no      0x921680
COPS4    yes     no   no      0x920FF1
COPS5    yes     no   no      memset

COPS3    yes     yes  no      0x194A48
COPS4    yes     yes  no      0x19379E
COPS5    yes     yes  no      memset

COPS3    yes     yes  yes     0x1948C2
COPS4    yes     yes  yes     0x19409E
COPS5    yes     yes  yes     memset

memset means the build failed with:

twain.o: In function `xmemset':
twain.c:(.text+0x1dc): undefined reference to `memset'
twain.o: In function `xcalloc':
twain.c:(.text+0x23c): undefined reference to `memset'
trees.o: In function `build_tree':
trees.c:(.text+0x1028): undefined reference to `memset'
make: *** [twain.gcc.elf] Error 1

I don't want to deal with that right now. (Most likely the newer gcc
recognizes memset-like loops at higher optimization levels and emits
calls to memset even in a freestanding build, which then fails to link
without a C library.)

Some things to talk about at this point. First off, I was using jtag
to load and run these programs. Think about what that means when the
instruction cache is on and the data cache is off: you stop the ARM
and use data transactions to load the new program into ram, uncached,
but whatever instructions were in the instruction cache when you
stopped are still there, lying around. If you have changed the
program, the instructions in the instruction cache no longer match the
new program, and when you start up you are mixing two programs;
nothing good comes from that. Although my start_l1cache init
invalidates the whole cache, and I stop the instruction cache when the
program ends, there is evidence, if you run the same program over and
over, that the instruction cache is playing a role. Ideally, when
doing something like this with any sort of bootloader, serial or jtag
(using data cycles to place instructions), you want to completely stop
the icache and clean it before starting again. If that does not appear
to be working, just power cycle between each test so you come up the
same way every time.

The mmu and data cache are worse than the icache in this respect; just
power cycle between each test. This stop, reload, stop, reload is not
normal use of a system. It is normal development, sure, but how much
do you put into your code for something that is not runtime? Another
factor during development is making sure you are actually initializing
everything. Say the 12th build of the afternoon set some bit somewhere
to make something work. You didn't think that code was working
(because some other bit somewhere else was not set), so you removed
it. Later, without power cycling, you find that other bit; now you
think you have it figured out. Eventually you power cycle and it
doesn't work. One thing your code must do is come up out of a reset.
Being able to re-run hot is not as important; it might save
development time and may have some value, but it is nowhere near as
important as power-on initialization.

If you read up on the Raspberry Pi you know the memory is SDRAM, which
at its core is DRAM, and the first thing to know about DRAM is that it
has to be refreshed. Think of each bit as a rechargeable battery: if
you want to remember that a bit is charged, you have to give it a
little boost every so often, and if it discharges too much you might
not notice that the bit has changed. The processor cannot access that
memory while the DRAM controller is refreshing it. Basically, DRAM
performance is not deterministic; it changes a little. Which means if
you run the same benchmark several times you should not be surprised
if the results are not the same, even bare metal like this where there
are no interrupts and no other code running. You can make dram
deterministic if you access it slowly enough to ensure you get the
result in the same number of clock cycles every time. For all we know
the gpu or other parts of the chip may be sharing a bus or interfering
with each other, affecting performance and certainly making it a bit
random. If you run the same binary over and over you will see that it
varies some, but not a huge amount, so the times are reasonably
consistent as far as this code/task goes. If you were to run on, say,
a microcontroller, or even a GameBoy Advance or another system where
the flash and ram access times are a known number of clocks every
time, then you should at worst see a difference of one count either
way. For some of the runs below I ran them multiple times to show the
times were not exactly the same.
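
When eyeballing repeated runs like that, the useful number is the
spread between the fastest and slowest run. A minimal helper, assuming
you have collected the tick counts into an array by hand as above:

```c
#include <limits.h>

/* Given tick counts from several runs of the same binary, report the
   low, high, and spread.  On DRAM-backed systems a nonzero spread is
   expected; on deterministic memory it should be at most one count. */
unsigned int spread(const unsigned int *times, int n,
                    unsigned int *lo, unsigned int *hi)
{
    *lo = UINT_MAX;
    *hi = 0;
    for (int i = 0; i < n; i++) {
        if (times[i] < *lo) *lo = times[i];
        if (times[i] > *hi) *hi = times[i];
    }
    return *hi - *lo;   /* run-to-run variation in timer ticks */
}
```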

Another topic related to this kind of benchmarking is how caches work.
Quite simply, programs tend to run a number of instructions in a row
before they branch, and when you do stuff with data you either tend to
reuse the same data or access it in chunks or in some linear order.
The cache reads a number of words at a time from slow memory into fast
memory (the fast memory is very deterministic, BTW), so that if you
happen to read that instruction again in the relatively near future
you don't suffer a slow memory cycle but get a relatively fast one.
With a few instructions in a row, the first one in the cache line (the
chunk of ram the cache fetches in one shot) causes the line to be
read, really really slow, but the second and third are really really
fast. If that code is used again, say in a loop, before it is kicked
out of the cache to make room for other code, it remains really fast
until it finally gets replaced with other code.

So let's pretend our cache has a cache line of 4 instructions aligned
on a 4-word boundary; say 0x1000, 0x1004, 0x1008 and 0x100C are all in
the same cache line. Now say we have two instructions that are called
often: something branches to the first one, it does something, then
the second one is a branch elsewhere. Maybe not the most efficient,
but it happens. If the first instruction is at address 0x1000, then on
a cache miss the branch reads the four instructions from main memory,
really slow, and the second instruction is really fast; we basically
read 4 instructions to execute two (let's not think about prefetch
right now). Now if I were to remove one instruction somewhere just
before this code, say I optimized one instruction out of something,
even the startup code, it can actually cause many changes when you
link, but let's assume what it does to these instructions is put one
at 0xFFC and the other at 0x1000. Now every time I hit these two
instructions I have to read two cache lines, one at 0xFF0 and the
other at 0x1000; I have to read 8 instructions to execute 2. If those
two instructions are used often enough, and they happen to line up
with other code used often enough to evict one or both of these cache
lines, then for these two instructions the cache can be making life
worse. If we were to put a nop back somewhere to move these two into
the same cache line then, focusing on those two only, our performance
would improve.

Of course it is not at all that simple. In a big program there are
tons of cache lines evicting each other and this alignment problem is
all over the place; by aligning one group of instructions you might
move another group, or many other groups, so they are no longer
optimally aligned. The bottom line to this very long subject is that
by simply adding and removing nops at the beginning of the program and
re-compiling and re-linking, you move the instructions relative to
their addresses, change this cache alignment, and as a result change
the performance. Even if the instructions were all position
independent and identical but moved over one address location, the
performance of your program, even with a deterministic memory system,
will vary. Let's try it: the fifth column below is the number of nops
added to vectors.s. This sample ships with three; you can add or
remove them at will. With these numbers the changes are actually less
than one percent, but despite that you can still see that something is
going on, and that something has to do with how things line up
relative to the cache lines of the instruction cache.
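
The 0xFFC/0x1000 example above reduces to simple arithmetic. A tiny
sketch, assuming the 16-byte (4-instruction) line used in the example:

```c
/* Which cache line an address falls in, for a 16-byte line. */
#define LINE_BYTES 16u

unsigned int cache_line_of(unsigned int addr)
{
    return addr / LINE_BYTES;
}

/* How many line fills two 4-byte instructions cost on a cold cache:
   one if they share a line, two if they straddle a line boundary. */
int lines_touched(unsigned int first, unsigned int second)
{
    return (cache_line_of(first) == cache_line_of(second)) ? 1 : 2;
}
```

At 0x1000 and 0x1004 the pair costs one fill; nudged to 0xFFC and
0x1000 it costs two, which is the whole alignment effect in miniature.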

cops     icache  mmu  dcache  nops  time

COPS4    yes     no   no      0     0x8E7665 0x8E711F 0x8E73CB
COPS4    yes     no   no      1     0x8E735E
COPS4    yes     no   no      2     0x8E29A6 0x8E2DB9
COPS4    yes     no   no      3     0x8E220D
COPS4    yes     no   no      4     0x8E2859
COPS4    yes     no   no      5     0x8E6691 0x8E68CB
COPS4    yes     no   no      6     0x8E6ACC
COPS4    yes     no   no      7     0x8E7713 0x8E7786
COPS4    yes     no   no      8     0x8E735A

Another topic is the mmu. How does an mmu work? Ideally an mmu is
there to take a virtual address, the address the processor thinks it
is using, and convert it to a physical address, the real address in
memory. You can, for example, have all your linux programs compiled
for the same address; all linux programs think they start at the same
address, but it is virtual. One program may think it is running code
at 0x9000 when it is really at 0x12345678; another running at 0x9000
might really be at 0x200100. The mmu also helps with protecting
programs from each other, among other things. You can also tune what
is cached or not. In our case here, we don't want hardware registers
like the timer and uart to be cached, so we use the mmu to mark that
address space as non-cached and our program's memory space as cached.
In this case the virtual and physical addresses are the same, to make
things easier.

Now how does the mmu do what it does? It has tables; think of them as
nested arrays (an array whose index comes from another array with some
other index). In addition to all the cache business going on when you
have a memory cycle, there is another kind of cache in the mmu that
remembers some small number of virtual to physical address
conversions. If your new address is not in that list, the mmu has to
compute an offset into the first-level table and perform that memory
cycle. Waiting... waiting... Then some bits in that first table entry
tell you where to go in the next table, so you compute another address
and do another slow memory read. Now, finally, three memory cycles
later, you can actually go after the thing you were first looking for.
This is repeated for almost everything. Just as the repetitive nature
of programs makes the cache's additional reads (fetching a whole cache
line even when only one item is needed) faster overall, the cache-like
table in the mmu cuts down on the constant table lookups to smooth
things out. As the results show, with the mmu enabled a program like
this, which is loop heavy and data heavy, runs quite a bit faster when
the data cache is on.
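
The nested-array lookup can be sketched as a toy model. The bit split
here (4+4+8 bits of a 16-bit toy address) is illustrative only, not
the ARM table format:

```c
/* Toy two-level translation: a first-level table indexed by the top
   bits of the virtual address, whose entries point at second-level
   tables indexed by the next bits.  Each level costs a memory read
   before the actual access can happen. */
#define L1_ENTRIES 16
#define L2_ENTRIES 16

typedef struct { unsigned int base[L2_ENTRIES]; } l2_table;

unsigned int translate(l2_table *l1[], unsigned int va)
{
    unsigned int i1  = (va >> 12) & 0xF;   /* first slow memory read  */
    unsigned int i2  = (va >> 8)  & 0xF;   /* second slow memory read */
    unsigned int off = va & 0xFF;
    return l1[i1]->base[i2] + off;         /* third: the real access  */
}
```

The small translation cache in the mmu (the TLB) exists precisely so
that those first two reads can usually be skipped.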

clang version 3.0 (branches/release_30 152644)
Target: x86_64-unknown-linux-gnu
Thread model: posix

cops     icache  mmu  dcache  time

LLCOPS0  no      no   no      0xDF11BB
LLCOPS1  no      no   no      0xDEF420

LLCOPS0  yes     no   no      0xABA6AC
LLCOPS1  yes     no   no      0xAB97C6

LLCOPS0  yes     yes  no      0x1A49FE
LLCOPS1  yes     yes  no      0x19F911

clang version 3.3 (branches/release_33 189603)
Target: x86_64-unknown-linux-gnu
Thread model: posix

cops     icache  mmu  dcache  time

LLCOPS0  no      no   no      0xE6EF36
LLCOPS1  no      no   no      0xF550AC

LLCOPS0  yes     no   no      0xAC25D7
LLCOPS1  yes     no   no      0xAC2B1C

LLCOPS0  yes     yes  no      0x1CA6C5
LLCOPS1  yes     yes  no      0x1C4F53

LLCOPS0  yes     yes  yes     0x1CA16C
LLCOPS1  yes     yes  yes     0x1C5B56

A simple experiment to show the mmu overhead: changing the code from
this

if(add_one(0x00000000,0x0000|8|4)) return(1);
if(add_one(0x00100000,0x0000|8|4)) return(1);
if(add_one(0x00200000,0x0000|8|4)) return(1);

to this

if(add_one(0x00000000,0x0000)) return(1);
if(add_one(0x00100000,0x0000)) return(1);
if(add_one(0x00200000,0x0000)) return(1);

disables the data cache for our program space, basically no data cache
anywhere. The time goes from 0x19Fxxx to 0xE5Bxxx, which is slower
than the slowest clang time by quite a bit, about 885% slower.
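
Those 8 and 4 values line up with the C (cacheable, bit 3) and B
(bufferable, bit 2) flags of an ARM first-level section descriptor,
so dropping them marks the 1MB section uncached. A hedged sketch of
building such a descriptor; this is an illustration of the descriptor
bit layout, not the example's actual add_one code:

```c
/* Build an ARM first-level 1MB section descriptor: the top 12 bits
   hold the section base address, bits [1:0] = 0b10 mark it a section,
   and flag bits such as C (8) and B (4) are ORed in by the caller. */
unsigned int section_descriptor(unsigned int pa, unsigned int flags)
{
    unsigned int entry = pa & 0xFFF00000;   /* 1MB section base */
    entry |= flags;                         /* e.g. C=8, B=4    */
    entry |= 2;                             /* type: section    */
    return entry;
}
```

With flags of 8|4 the section is cached and buffered; with 0 every
load and store goes all the way to DRAM, hence the 885% slowdown.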

Llvm and clang are getting better. They have not quite caught up to
gcc, but compared to, say, version 2.7 against whatever gcc was
current at the time, they are converging. The best clang time is 20%
slower than the best gcc time for this particular benchmark.

As mentioned at the beginning, this demonstrates that the same source
code compiled with different compiler options, with different versions
of the same compiler, and with different compilers shows dramatically
different results. The worst clang time is 858% slower (8.5 times
slower) than the fastest clang time. The worst gcc time is 964% slower
than the fastest gcc time. A newer version of gcc does not
automatically produce faster code, nor the same speed code; something
changes, and it is not always better. Newer does not mean better. We
also saw what the mmu does, what the cache does, what tiny changes to
the location of the same sets of instructions can do, and so on.
Notice that we did not have to do any disassembly and comparison to
understand that there are definite differences running the same source
code on the same computer. When you go to tomshardware and look at
benchmarks, those are a single binary run one time, one way, on top of
an operating system. Yes, the hard drive may be different and
everything else held the same, but did they run that test 100 times
and average, or one time? Had they run it more than once, what is the
run-to-run difference on the same system? Is that difference greater
than, say, the same system with a different hard drive? It probably
says somewhere in the fine print; the point is that you need to
understand the nature of benchmarking.

Another thing that is hopefully obvious now: take any other library or
program, basically change the source code, and the performance changes
again. There are no doubt programs that llvm/clang is better at
compiling, and a benchmark like this would show it. Take the same
code, compile it several different ways, with different compilers,
etc., and see how they play out: you will find code llvm is good at
and code gcc is good at, and some programs may run at X instructions
per second on average and others at Y instructions per second on
average on the same hardware.

What this benchmark has shown is that the same source code compiled
with different compilers and different settings produces dramatically
different performance results, which implies the generated code is not
the same: there is no one-to-one relationship between high level
programs and the machine code generated. We also got to play with the
caches and mmu in a simplified fashion.

Based on new discoveries, I changed the start address for the whole
thing to 0x8000; that messes with the cache lines, so the results have
some differences from the numbers above (for the reasons described
above).