See the top level README for information on where to find the schematic and programmers reference manual for the ARM processor on the raspberry pi. Also find information on how to load and run these programs.

This example serves many purposes. First, it is a real world library/application demonstrated in a bare-metal embedded format. This example, as with extest, demonstrates the use of the mmu so that the data cache can be safely turned on.

One of the main goals though is to take on a myth. Over the years, all too often, based on questions in person or in forums, I have come to believe that many programmers live under the myth that there is a one to one relationship between high level code and the compiled binary: if you want to change the performance of your code you need to change your code. There is no truth of any kind to that. Likewise there is a belief that any two compilers can or should be expected to produce the same compiled machine code from the same high level source code. Likewise, there is no reason to expect such a thing.

Think about it this way: take 100 programmers, 100 students, whatever. Give them the same programming assignment, a homework project or semester project, whatever. Do you end up with all 100 producing the exact same source code? No, of course not. You might have solutions that fall into definable categories, but you are going to see many solutions. No different here, each compiler is created by different people for different reasons with different skills and different goals. And the output is not exactly the same. Sure, for some simple programs there may be an ideal solution and two compilers might converge on that solution, but on average, especially as the project gets large, the solutions will vary.

What I have here is the zlib library. In the Linux world it is on a par with gcc as far as being a cornerstone holding the whole thing up in the air. Not unlike gcc, you really don't want to actually look at the code.
The jumping around from longs to ints, and how variables are declared, is not consistent; basically it makes your compiler work a little harder at least. I have some text that I am going to compress and then uncompress, making a nice self-checking test. I am going to compare three compilers using different compiler options, and turn the caches on and off.

With respect to the zlib license, this zlib library is not mine; I have though made modifications to it. I have commented out the includes so that they won't interfere with trying to compile bare-metal (it just makes it easier to know what I have to manage and not manage). I do not use a C library, so there are a few C library functions that had to be implemented: malloc, calloc, memset, memcpy, free. It is a very very simple malloc, and free does nothing. This is a once through test, so there is no need to actually manage memory allocation, I just need to give the code what it asks for. Fortunately the raspberry pi is well endowed with memory.

I have two versions of gcc: 4.6.1 from the CodeSourcery folks (now Mentor Graphics), and 4.7.0 built from sources using the build script in the buildgcc directory. The third compiler is clang/llvm, built using the build script in the buildgcc directory. This is a completely separate compiler from gcc, with different goals and different construction. Thanks to Apple and the iPhone the llvm compiler tools have had a big boost in quality, and the code produced is approaching the speed of gcc. In tests I did a while back, other compilers, pay-for compilers, blew gcc out of the water. Gcc is an average good compiler for many targets, but not great at any particular target. Other compilers are better.

For gcc I am going to play with the obvious optimizations, -Ox, a few of them to show they actually do something. I am also going to compare leaving the default ARM architecture setting against specifying the proper architecture, to see what that does.
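
The "very very simple malloc, free does nothing" idea described above can be sketched roughly as follows. This is not the code from this example, just a minimal illustration of a once-through (bump) allocator. The names bm_malloc, bm_calloc, bm_free, bm_memset, bm_memcpy and the HEAP_SIZE value are made up for the sketch; on bare metal the heap would more likely be a fixed region of RAM than a static array.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 1MB of heap is plenty for a sketch. Not taken from the example. */
#define HEAP_SIZE (1024u * 1024u)

/* uint64_t storage keeps the heap 8-byte aligned without compiler extensions. */
static uint64_t heap_words[HEAP_SIZE / 8u];
static size_t heap_used = 0;

/* Hand out the next chunk of the heap, never reuse anything. */
void *bm_malloc(size_t size)
{
    size_t aligned = (size + 7u) & ~(size_t)7u;   /* round up to 8 bytes */
    void *p;

    if (aligned < size) return NULL;                  /* rounding overflowed */
    if (aligned > HEAP_SIZE - heap_used) return NULL; /* out of heap */
    p = (unsigned char *)heap_words + heap_used;
    heap_used += aligned;
    return p;
}

/* Once-through test: nothing is ever given back. */
void bm_free(void *p)
{
    (void)p;
}

void *bm_memset(void *dst, int c, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    while (n--) *d++ = (unsigned char)c;
    return dst;
}

void *bm_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n--) *d++ = *s++;
    return dst;
}

void *bm_calloc(size_t count, size_t size)
{
    void *p;

    if (size != 0 && count > (size_t)-1 / size) return NULL; /* count*size overflows */
    p = bm_malloc(count * size);
    if (p != NULL) bm_memset(p, 0, count * size);
    return p;
}
```

In a build with -nostdlib these would presumably be named malloc, calloc, free, memset and memcpy so the library code links against them unchanged. The trade-off is that memory only ever grows, which is fine for a single compress/decompress pass and not much else.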
There are many many knobs you can play with. I don't even need to mess with this many combinations to completely demonstrate that the same compiler or different compilers can create binaries that execute at different speeds from the same high level source code. So the gcc knobs are -O1, -O2 and -O3, plus, in the code, commenting and uncommenting defines that enable the instruction cache, mmu and data cache (need the mmu on to enable the data cache).

COPS0 = -Wall -O1 -nostdlib -nostartfiles -ffreestanding
COPS1 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding
COPS2 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding
COPS3 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s
COPS4 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s
COPS5 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s

arm-none-eabi-gcc --version
arm-none-eabi-gcc (Sourcery CodeBench Lite 2011.09-69) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the 4.6.1 compiler, going to just dive in. The code compresses some known data. Then it decompresses some known data. The data has been pre-computed on the host computer to know what the expected compressed result should be. A system timer is used to time how long it takes to compress then decompress, then the code checks the results after the timer has been sampled. So in addition to looking at the time result, you also need to make sure the test actually passed.
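
The measure-then-check order described above can be sketched like this. It is a simplified stand-in, not the test code from this example: run_timed, timer_fn, step_fn and the demo_* functions are made-up names, and the timer is passed in as a function pointer so the sketch can be tried on a host. On the raspberry pi the timer read would be a load from the free running system timer register instead.

```c
#include <stdint.h>

typedef uint32_t (*timer_fn)(void);  /* returns the current timer count */
typedef int (*step_fn)(void);        /* returns 0 on success */

struct test_result {
    uint32_t ticks;   /* elapsed timer ticks for the timed work only */
    int passed;       /* 1 if the work and the check both succeeded */
};

struct test_result run_timed(timer_fn read_timer, step_fn work, step_fn check)
{
    struct test_result r;
    uint32_t start, stop;
    int work_rc;

    start = read_timer();
    work_rc = work();            /* e.g. compress then decompress */
    stop = read_timer();         /* sample the timer BEFORE checking results */

    r.ticks = stop - start;      /* unsigned subtract survives one wraparound */
    r.passed = (work_rc == 0) && (check() == 0);
    return r;
}

/* Stand-ins for trying the sketch on a host, not real hardware access. */
uint32_t demo_now = 0;

uint32_t demo_timer(void)
{
    uint32_t t = demo_now;
    demo_now += 1000u;           /* pretend 1000 ticks pass between reads */
    return t;
}

int demo_work(void)  { return 0; }
int demo_check(void) { return 0; }
int demo_bad_check(void) { return 1; }
```

Calling run_timed(demo_timer, demo_work, demo_check) gives back both the tick count and a pass/fail flag, so a fast time from a broken build does not get mistaken for a good result. Keeping the comparison outside the timed window means the time for checking the data is not counted as compression time.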
cops   icache mmu dcache timer

COPS0  no     no  no     0xD0D08A
COPS1  no     no  no     0xB92BA4
COPS2  no     no  no     0xA410E1
COPS3  no     no  no     0xB4DC8A
COPS4  no     no  no     0xB4D11B
COPS5  no     no  no     0xA450DB

COPS0  yes    no  no     0x9DC2DF
COPS1  yes    no  no     0x8F5ECF
COPS2  yes    no  no     0x8832F2
COPS3  yes    no  no     0x8F9C79
COPS4  yes    no  no     0x8F9ED4
COPS5  yes    no  no     0x8AE077

COPS3  yes    yes no     0x174FA4
COPS4  yes    yes no     0x175336
COPS5  yes    yes no     0x162750

COPS3  yes    yes yes    0x176068
COPS4  yes    yes yes    0x175CB0
COPS5  yes    yes yes    0x162590

The first interesting thing we see is that even though the data cache is not enabled in the control register, it is obviously on in some form or fashion: turning on the mmu alone gives the big drop in time, and then setting the data cache bit changes almost nothing. Another interesting thing is that with the caches off or with only the icache on, -O3 using a generic ARMv5, or whatever the generic arm target is, produced a binary that was a little faster than -O3 when specifying the actual processor. The reason is more complicated than just architecture, will get to that in a bit.

Note that these results were all collected by hand so there is a possibility for human error. Ideally you will use this info to learn from and not actually care about the specific numbers.

arm-none-eabi-gcc (GCC) 4.7.0
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cops   icache mmu dcache timer

COPS0  no     no  no     0xD1360B
COPS1  no     no  no     0xB7683E
COPS2  no     no  no     0xA83E6F
COPS3  no     no  no     0xBB1436
COPS4  no     no  no     0xBB0B60
COPS5  no     no  no     0xA6CF10

COPS0  yes    no  no     0x9CFAAD
COPS1  yes    no  no     0x8DA3D1
COPS2  yes    no  no     0x84B7C5
COPS3  yes    no  no     0x8E6FDB
COPS4  yes    no  no     0x8E73A6
COPS5  yes    no  no     0x86156D

COPS3  yes    yes no     0x17EA80
COPS4  yes    yes no     0x17FA4B
COPS5  yes    yes no     0x15B210

COPS3  yes    yes yes    0x17F0C8
COPS4  yes    yes yes    0x17FB53
COPS5  yes    yes yes    0x15B55E

work in progress...you are here...

COPS4  yes    no  no     0 0x8E7665 0x8E711F 0x8E73CB
COPS4  yes    no  no     1 0x8E735E
COPS4  yes    no  no     2 0x8E29A6 0x8E2DB9
COPS4  yes    no  no     3 0x8E220D
COPS4  yes    no  no     4 0x8E2859
COPS4  yes    no  no     5 0x8E6691 0x8E68CB
COPS4  yes    no  no     6 0x8E6ACC
COPS4  yes    no  no     7 0x8E7713 0x8E7786
COPS4  yes    no  no     8 0x8E735A

clang version 3.0 (branches/release_30 152644)
Target: x86_64-unknown-linux-gnu
Thread model: posix

cops    icache mmu dcache timer

LLCOPS0 no     no  no     0xDF11BB
LLCOPS1 no     no  no     0xDEF420

LLCOPS0 yes    no  no     0xABA6AC
LLCOPS1 yes    no  no     0xAB97C6

LLCOPS0 yes    yes no     0x1A49FE
LLCOPS1 yes    yes no     0x19F911
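
The hex timer counts in the tables above are easier to compare as ratios. A small host-side helper along these lines does the arithmetic. The function name speedup and the macro names are made up for this sketch; the four constants are copied from the gcc 4.6.1 results above.

```c
#include <stdint.h>

/* Timer counts copied from the gcc 4.6.1 table. */
#define GCC461_COPS0_NOCACHE     0xD0D08Au  /* -O1, icache off, mmu off */
#define GCC461_COPS2_NOCACHE     0xA410E1u  /* -O3, icache off, mmu off */
#define GCC461_COPS5_ICACHE      0x8AE077u  /* -O3 -mcpu/-mtune, icache on */
#define GCC461_COPS5_ICACHE_MMU  0x162750u  /* -O3 -mcpu/-mtune, icache and mmu on */

/* How many times faster a run is than the baseline run.
   A smaller tick count means a faster run. Returns 0.0 for a zero count. */
double speedup(uint32_t baseline_ticks, uint32_t ticks)
{
    if (ticks == 0u) return 0.0;
    return (double)baseline_ticks / (double)ticks;
}
```

For example speedup(GCC461_COPS0_NOCACHE, GCC461_COPS2_NOCACHE) shows what going from -O1 to -O3 alone bought, while speedup(GCC461_COPS0_NOCACHE, GCC461_COPS5_ICACHE_MMU) comes out a bit over 9, same source code in every case. Since the numbers were collected by hand, treat the ratios as rough.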