See the top level README for information on where to find the
schematic and programmers reference manual for the ARM processor
on the raspberry pi. Also find information on how to load and run
these programs.

This example serves many purposes. First, it is a real-world library/
application demonstrated in a bare-metal embedded format. This example,
as with extest, demonstrates the use of the mmu so that the data cache
can be safely turned on. One of the main goals though is this: over the
years, all too often, based on questions in person or in forums, I have
come to believe that many programmers live under the myth that there is
a one to one relationship between high level code and the compiled
binary, that if you want to change the performance of your code you
need to change your code. There is no truth of any kind to that.
Likewise there is a belief that any two compilers can or should be
expected to produce the same compiled machine code from the same high
level source code. There is no reason to expect such a thing either.
Think about it this way: take 100 programmers, 100 students, whatever.
Give them the same programming assignment, a homework project or
semester project, whatever. Do you end up with all 100 producing the
exact same source code? No, of course not. You might have solutions
that fall into definable categories, but you are going to see many
solutions. It is no different here: each compiler is created by
different people for different reasons with different skills and
different goals, and the output is not exactly the same. Sure, for some
simple programs there may be an ideal solution and two compilers might
converge on that solution, but on average, especially as the project
gets large, the solutions will vary.

What I have here is the zlib library. In the Linux world it is on a
par with gcc as far as being a cornerstone holding the whole thing up
in the air. Not unlike gcc, you really don't want to actually look at
the code. The jumping around from longs to ints and the way variables
are declared are not consistent; basically it makes your compiler work
a little harder at least. I have some text that I am going to compress
and then uncompress, making a nice self-checking test. I am going to
compare three compilers using different compiler options, and turn the
caches on and off.

With respect to the zlib licence, this zlib library is not mine; I have
though made modifications to it. I have commented out the includes
so that they won't interfere with trying to compile bare-metal (it just
makes it easier to know what I have to manage and not manage). I do not
use a C library, so there are a few C library functions that had to be
implemented: malloc, calloc, memset, memcpy, free. It is a very very
simple malloc, and free does nothing. This is a once through test, so
there is no need to actually manage memory allocation, just need to
give the code what it asks for. Fortunately the raspberry pi is well
endowed with memory.

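To make the "very simple malloc, free does nothing" idea concrete, here
is a minimal sketch of a bump allocator with the helper functions around
it. This is not the code from this directory. The bm_ prefix, HEAP_SIZE
and the static heap array are assumptions made so the sketch can be
compiled on a host without clashing with a C library; on the pi the
functions would carry the standard names and the heap would be a chunk
of otherwise unused ram.

```c
#include <stddef.h>

/* Hypothetical heap size and storage, chosen only for illustration. */
#define HEAP_SIZE (1024u * 1024u)

/* The union forces the byte array to start on an 8 byte boundary. */
static union {
    unsigned long long align;
    unsigned char bytes[HEAP_SIZE];
} heap;

static size_t heap_used = 0;

void *bm_memset(void *dst, int value, size_t len)
{
    unsigned char *d = dst;
    while (len--) *d++ = (unsigned char)value;
    return dst;
}

void *bm_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len--) *d++ = *s++;
    return dst;
}

/* Hand out the next free chunk, rounded up to a multiple of 8 bytes. */
void *bm_malloc(size_t size)
{
    size_t aligned = (size + 7u) & ~(size_t)7u;
    void *p;

    /* Refuse on rounding overflow or when the heap is exhausted. */
    if (aligned < size || aligned > HEAP_SIZE - heap_used) return NULL;
    p = &heap.bytes[heap_used];
    heap_used += aligned;
    return p;
}

void *bm_calloc(size_t count, size_t size)
{
    void *p;

    if (size != 0 && count > (size_t)-1 / size) return NULL;
    p = bm_malloc(count * size);
    if (p) bm_memset(p, 0, count * size);
    return p;
}

/* Once through test: nothing is ever given back. */
void bm_free(void *ptr)
{
    (void)ptr;
}
```

The trade off is that memory only ever grows, which is fine for a single
compress and decompress pass but would not do for a long running
program.
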
I have two versions of gcc: 4.6.1 from the CodeSourcery folks (now
Mentor Graphics), and 4.7.0 built from sources using the build script
in the buildgcc directory. The third is clang/llvm built using the
build script in the buildgcc directory. This is a completely separate
compiler from gcc, with different goals and different construction.
Thanks to the Apple iPhone and Apple, the llvm compiler tools have had
a big boost in quality; the code produced is approaching the speed of
gcc. In tests I did a while back other compilers, pay-for compilers,
blew gcc out of the water. Gcc is an average good compiler for many
targets, but not great at any particular target. Other compilers are
better.

For gcc I am going to play with the obvious optimizations, -Ox, a few
of them to show they actually do something. I will also leave the
default arm arch setting in some builds and specify the proper
architecture in others to see what that does. There are many many knobs
you can play with; I don't even need to mess with this many
combinations to completely demonstrate that the same compiler or
different compilers can create binaries that execute at different
speeds from the same high level source code.

So the gcc knobs are -O1, -O2 and -O3, plus commenting and
uncommenting defines in the code that enable the instruction cache, mmu
and data cache (need the mmu on to enable the data cache).

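As a rough illustration of what those three defines end up controlling,
the sketch below builds the bit mask for the ARM1176 control register
(CP15 c1): bit 0 is the mmu enable, bit 2 the data cache enable and bit
12 the instruction cache enable. The function name and its arguments
are hypothetical and are not taken from the code in this directory. The
sketch also refuses to set the data cache bit unless the mmu bit is set,
mirroring the rule above.

```c
#include <stdint.h>

#define SCTLR_MMU    (1u << 0)   /* M bit: mmu enable */
#define SCTLR_DCACHE (1u << 2)   /* C bit: data cache enable */
#define SCTLR_ICACHE (1u << 12)  /* I bit: instruction cache enable */

/* Hypothetical helper: turn three yes/no choices into control bits.
 * On the target the result would be ORed into the value read with
 *   mrc p15, 0, r0, c1, c0, 0
 * and written back with
 *   mcr p15, 0, r0, c1, c0, 0
 */
uint32_t cache_control_bits(int icache, int mmu, int dcache)
{
    uint32_t bits = 0;

    if (icache) bits |= SCTLR_ICACHE;
    if (mmu) bits |= SCTLR_MMU;
    /* The data cache is only requested when the mmu is on. */
    if (dcache && mmu) bits |= SCTLR_DCACHE;
    return bits;
}
```

For example, the "yes yes yes" rows in the tables would correspond to
cache_control_bits(1, 1, 1), which sets all three bits.
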
COPS0 = -Wall -O1 -nostdlib -nostartfiles -ffreestanding
COPS1 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding
COPS2 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding
COPS3 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s
COPS4 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s
COPS5 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s

arm-none-eabi-gcc --version
arm-none-eabi-gcc (Sourcery CodeBench Lite 2011.09-69) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the 4.6.1 compiler, going to just dive in. The code compresses
some known data, then decompresses some known data. The data has
been pre-computed on the host computer to know what the expected
compressed result should be. A system timer is used to time how long
it takes to compress then decompress, then the code checks the results
after the timer has been sampled. So in addition to looking at the time
result you also need to make sure the test actually passed.

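The order of operations described above can be sketched as follows.
This is only an outline under assumptions, not the test code itself:
run_timed, timer_fn, step_fn and the fake_ functions are all
hypothetical names, and the fake timer simply advances by 100 on every
read so the sketch can be run on a host. On the pi the timer function
would read the system timer and the steps would call into zlib.

```c
#include <stdint.h>

typedef uint32_t (*timer_fn)(void);   /* returns the current timer count */
typedef int (*step_fn)(void);         /* returns nonzero on success */

struct timed_result {
    uint32_t ticks;   /* elapsed timer ticks for compress + decompress */
    int passed;       /* nonzero if everything matched the expected data */
};

struct timed_result run_timed(timer_fn now, step_fn compress,
                              step_fn decompress, step_fn verify)
{
    struct timed_result r;
    uint32_t start, end;
    int ok;

    start = now();
    ok = compress();
    ok = decompress() && ok;
    end = now();              /* sample the timer before any checking */

    r.ticks = end - start;    /* unsigned math stays correct across one wrap */
    r.passed = ok && verify();
    return r;
}

/* Host-side stand-ins so the sketch can be exercised off target. */
static uint32_t fake_count;
uint32_t fake_timer(void) { fake_count += 100u; return fake_count; }
int fake_step(void) { return 1; }
int fake_verify(void) { return 1; }
```

Keeping the verify step outside the timed region means the comparison
against the pre-computed data does not pollute the measurement, but it
also means a fast time from a failed run is meaningless, so both fields
have to be looked at.
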
cops   icache  mmu  dcache  time

COPS0  no      no   no      0xD0D08A
COPS1  no      no   no      0xB92BA4
COPS2  no      no   no      0xA410E1
COPS3  no      no   no      0xB4DC8A
COPS4  no      no   no      0xB4D11B
COPS5  no      no   no      0xA450DB

COPS0  yes     no   no      0x9DC2DF
COPS1  yes     no   no      0x8F5ECF
COPS2  yes     no   no      0x8832F2
COPS3  yes     no   no      0x8F9C79
COPS4  yes     no   no      0x8F9ED4
COPS5  yes     no   no      0x8AE077

COPS3  yes     yes  no      0x174FA4
COPS4  yes     yes  no      0x175336
COPS5  yes     yes  no      0x162750

COPS3  yes     yes  yes     0x176068
COPS4  yes     yes  yes     0x175CB0
COPS5  yes     yes  yes     0x162590

The first interesting thing we see is that even though the data cache
is not enabled in the control register, it is obviously on in some
form or fashion. Another interesting thing is that the -O3 build for a
generic ARMv5, or whatever the generic arm target is, was a little
faster than the -O3 build that specifies the actual processor (COPS2 vs
COPS5). The reason is more complicated than just architecture; I will
get to that in a bit.

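The hex tick counts are easier to compare as ratios. A small worked
example, using a hypothetical helper name and integer math only:
relative to COPS0 with no caches (0xD0D08A), COPS2 with no caches
(0xA410E1) takes roughly 78 percent of the time, and COPS5 with
everything on (0x162590) takes roughly 10 percent. COPS5 with the data
cache on takes about 99 percent of the time of COPS5 with only icache
and mmu on (0x162750), which is the "obviously on" observation in
numbers.

```c
#include <stdint.h>

/* Express a timer reading as a whole percentage of a baseline reading.
 * The result is truncated, not rounded. A zero baseline returns 0. */
uint32_t percent_of_baseline(uint32_t ticks, uint32_t baseline)
{
    if (baseline == 0) return 0;
    /* Widen to 64 bits so ticks * 100 cannot overflow. */
    return (uint32_t)(((uint64_t)ticks * 100u) / baseline);
}
```
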
Note that these results were all collected by hand so there is a
possibility of human error. Ideally you will use this info to learn
from and not actually care about the specific numbers.

arm-none-eabi-gcc (GCC) 4.7.0
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

COPS0  no      no   no      0xD1360B
COPS1  no      no   no      0xB7683E
COPS2  no      no   no      0xA83E6F
COPS3  no      no   no      0xBB1436
COPS4  no      no   no      0xBB0B60
COPS5  no      no   no      0xA6CF10

COPS0  yes     no   no      0x9CFAAD
COPS1  yes     no   no      0x8DA3D1
COPS2  yes     no   no      0x84B7C5
COPS3  yes     no   no      0x8E6FDB
COPS4  yes     no   no      0x8E73A6
COPS5  yes     no   no      0x86156D

COPS3  yes     yes  no      0x17EA80
COPS4  yes     yes  no      0x17FA4B
COPS5  yes     yes  no      0x15B210

COPS3  yes     yes  yes     0x17F0C8
COPS4  yes     yes  yes     0x17FB53
COPS5  yes     yes  yes     0x15B55E

work in progress...you are here...

COPS4  yes     no   no      0  0x8E7665 0x8E711F 0x8E73CB
COPS4  yes     no   no      1  0x8E735E
COPS4  yes     no   no      2  0x8E29A6 0x8E2DB9
COPS4  yes     no   no      3  0x8E220D
COPS4  yes     no   no      4  0x8E2859
COPS4  yes     no   no      5  0x8E6691 0x8E68CB
COPS4  yes     no   no      6  0x8E6ACC
COPS4  yes     no   no      7  0x8E7713 0x8E7786
COPS4  yes     no   no      8  0x8E735A

clang version 3.0 (branches/release_30 152644)
Target: x86_64-unknown-linux-gnu
Thread model: posix

LLCOPS0  no    no   no      0xDF11BB
LLCOPS1  no    no   no      0xDEF420

LLCOPS0  yes   no   no      0xABA6AC
LLCOPS1  yes   no   no      0xAB97C6

LLCOPS0  yes   yes  no      0x1A49FE
LLCOPS1  yes   yes  no      0x19F911