See the top level README for information on where to find the
schematic and programmer's reference manual for the ARM processor
on the Raspberry Pi, and for information on how to load and run
these programs.

This example serves several purposes. First, it is a real-world
library/application demonstrated in a bare-metal embedded format. Like
extest, it demonstrates the use of the mmu so that the data cache can
be safely turned on. One of the main goals, though, comes from
questions I have fielded over the years, in person and in forums, that
lead me to believe many programmers live under the myth that there is
a one-to-one relationship between high level code and the compiled
binary: that if you want to change the performance of your code, you
must change your code. There is no truth to that. Likewise the belief
that any two compilers can or should be expected to produce the same
machine code from the same high level source code; there is no reason
to expect such a thing. Think about it this way: take 100 programmers,
or 100 students, and give them the same programming assignment, a
homework problem or a semester project. Do all 100 produce the exact
same source code? No, of course not. The solutions may fall into
definable categories, but you are going to see many solutions. It is
no different here: each compiler is created by different people, for
different reasons, with different skills and different goals, and the
output is not the same. Sure, for some simple programs there may be an
ideal solution that two compilers converge on, but on average,
especially as the program gets large, the solutions will vary.

What I have here is the zlib library. In the Linux world it is on a
par with gcc as a cornerstone holding the whole thing up. Not unlike
gcc, you really don't want to actually look at the code: the jumping
around between longs and ints, and the inconsistent way variables are
declared, make your compiler work at least a little harder. I have
some text that I am going to compress and then uncompress, making a
nice self-checking test. I am going to compare three compilers using
different compiler options, and turn the caches on and off.

With respect to the zlib licence: this zlib library is not mine,
though I have made modifications to it. I have commented out the
includes so that it won't interfere with compiling bare-metal (it just
makes it easier to know what I do and don't have to manage). I do not
use a C library, so a few C library functions had to be implemented:
malloc, calloc, memset, memcpy and free. The malloc is very, very
simple, and free does nothing; this is a once-through test, so there
is no need to actually manage memory allocation, just give the code
what it asks for. Fortunately the Raspberry Pi is well endowed with
memory.
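
The allocator described above can be sketched like this. This is an
illustrative reimplementation, not the example's actual code; the
names, pool size, and 8-byte alignment are my own choices:

```c
#include <stddef.h>
#include <string.h>

/* Fixed pool carved up by a bump allocator: malloc hands out the next
   chunk and free does nothing, which is fine for a once-through test. */
#define POOL_SIZE (1 << 20)
static unsigned char pool[POOL_SIZE];
static size_t next_free = 0;

void *my_malloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;               /* keep 8-byte alignment */
    if (next_free + n > POOL_SIZE) return NULL;
    void *p = &pool[next_free];
    next_free += n;
    return p;
}

void *my_calloc(size_t count, size_t size)
{
    void *p = my_malloc(count * size);
    if (p) memset(p, 0, count * size);      /* calloc must zero */
    return p;
}

void my_free(void *p) { (void)p; }          /* intentionally a no-op */
```

In the real bare-metal build memset itself is also hand-written, since
there is no C library to supply it.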

I have two versions of gcc: 4.6.1 from the CodeSourcery folks (now
Mentor Graphics), and 4.7.0 built from sources using the build script
in the buildgcc directory. The third compiler is clang/llvm, built
using the build script in the buildgcc directory. This is a completely
separate compiler from gcc, with different goals and different
construction. Thanks to Apple and the iPhone, the llvm compiler tools
have had a big boost in quality, and the code produced is approaching
the speed of gcc's. In tests I did a while back, other compilers,
pay-for compilers, blew gcc out of the water. Gcc is a good average
compiler across many targets, but not great at any particular target;
other compilers are better.

For gcc I am going to play with the obvious optimizations, -Ox, a few
of them, to show they actually do something. I will also leave the
default arm architecture setting in some builds and specify the proper
architecture in others, to see what that does. There are many, many
knobs you can play with; I don't even need this many combinations to
demonstrate that the same compiler, or different compilers, can create
binaries that execute at different speeds from the same high level
source code.

So the gcc knobs are -O1, -O2 and -O3, plus commenting and
uncommenting defines in the code that enable the instruction cache,
the mmu and the data cache (the mmu must be on to enable the data
cache).

COPS0 = -Wall -O1 -nostdlib -nostartfiles -ffreestanding
COPS1 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding
COPS2 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding
COPS3 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s
COPS4 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s
COPS5 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s

arm-none-eabi-gcc --version
arm-none-eabi-gcc (Sourcery CodeBench Lite 2011.09-69) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the 4.6.1 compiler, I am going to just dive in. The code
compresses some known data, then decompresses it. The data has been
pre-computed on the host computer so the expected compressed result is
known. A system timer is used to time how long the compress and
decompress take, then the results are checked after the timer has been
sampled. So in addition to looking at the time result you also need to
make sure the test actually passed.
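
The shape of that self-checking test can be sketched like this. The
hardware timer and the zlib calls are replaced by stand-ins here;
sample_timer, do_compress and do_uncompress are hypothetical names,
not functions from the example's sources:

```c
#include <string.h>

/* Stand-in for the free-running system timer on the board. */
static unsigned int fake_time;
static unsigned int sample_timer(void) { return fake_time += 100; }

/* Stand-ins for the zlib compress/uncompress calls. */
static void do_compress(const char *in, char *out)   { strcpy(out, in); }
static void do_uncompress(const char *in, char *out) { strcpy(out, in); }

/* Time the work first, then verify against pre-computed expected data,
   so the checking does not pollute the measurement. */
int run_test(const char *known, const char *expected, unsigned int *elapsed)
{
    char compressed[256], uncompressed[256];

    unsigned int start = sample_timer();
    do_compress(known, compressed);
    do_uncompress(compressed, uncompressed);
    unsigned int stop = sample_timer();

    *elapsed = stop - start;                     /* sample timer first */
    if (strcmp(compressed, expected)) return 1;  /* ...then verify */
    if (strcmp(uncompressed, known))  return 1;
    return 0;
}
```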

cops     icache  mmu  dcache  time

COPS0    no      no   no      0xD0D08A
COPS1    no      no   no      0xB92BA4
COPS2    no      no   no      0xA410E1
COPS3    no      no   no      0xB4DC8A
COPS4    no      no   no      0xB4D11B
COPS5    no      no   no      0xA450DB

COPS0    yes     no   no      0x9DC2DF
COPS1    yes     no   no      0x8F5ECF
COPS2    yes     no   no      0x8832F2
COPS3    yes     no   no      0x8F9C79
COPS4    yes     no   no      0x8F9ED4
COPS5    yes     no   no      0x8AE077

COPS3    yes     yes  no      0x174FA4
COPS4    yes     yes  no      0x175336
COPS5    yes     yes  no      0x162750

COPS3    yes     yes  yes     0x176068
COPS4    yes     yes  yes     0x175CB0
COPS5    yes     yes  yes     0x162590

The first interesting thing we see is that even though the data cache
is not enabled in the control register, it is obviously on in some
form or fashion. Another interesting thing is that at -O3 the generic
arm build (ARMv5 or whatever the generic binary targets) was a little
faster than -O3 specifying the actual processor. The reason is more
complicated than just the architecture; I will get to that in a bit.

Note that these results were all collected by hand, so there is a
possibility of human error. Ideally you will use this information to
learn from and not care too much about the specific numbers.

arm-none-eabi-gcc (GCC) 4.7.0
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cops     icache  mmu  dcache  time

COPS0    no      no   no      0xD1360B
COPS1    no      no   no      0xB7683E
COPS2    no      no   no      0xA83E6F
COPS3    no      no   no      0xBB1436
COPS4    no      no   no      0xBB0B60
COPS5    no      no   no      0xA6CF10

COPS0    yes     no   no      0x9CFAAD
COPS1    yes     no   no      0x8DA3D1
COPS2    yes     no   no      0x84B7C5
COPS3    yes     no   no      0x8E6FDB
COPS4    yes     no   no      0x8E73A6
COPS5    yes     no   no      0x86156D

COPS3    yes     yes  no      0x17EA80
COPS4    yes     yes  no      0x17FA4B
COPS5    yes     yes  no      0x15B210

COPS3    yes     yes  yes     0x17F0C8
COPS4    yes     yes  yes     0x17FB53
COPS5    yes     yes  yes     0x15B55E

arm-none-eabi-gcc (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cops     icache  mmu  dcache  time

COPS0    no      no   no      0xD48A9F
COPS1    no      no   no      0xB994DC
COPS2    no      no   no      memset
COPS3    no      no   no      0xC4CEBC
COPS4    no      no   no      0xC4D880
COPS5    no      no   no      memset

COPS0    yes     no   no      0x9D6225
COPS1    yes     no   no      0x90D014
COPS2    yes     no   no      memset
COPS3    yes     no   no      0x921680
COPS4    yes     no   no      0x920FF1
COPS5    yes     no   no      memset

COPS3    yes     yes  no      0x194A48
COPS4    yes     yes  no      0x19379E
COPS5    yes     yes  no      memset

COPS3    yes     yes  yes     0x1948C2
COPS4    yes     yes  yes     0x19409E
COPS5    yes     yes  yes     memset

memset means the build failed with:

twain.o: In function `xmemset':
twain.c:(.text+0x1dc): undefined reference to `memset'
twain.o: In function `xcalloc':
twain.c:(.text+0x23c): undefined reference to `memset'
trees.o: In function `build_tree':
trees.c:(.text+0x1028): undefined reference to `memset'
make: *** [twain.gcc.elf] Error 1

I don't want to deal with that right now. (Most likely the newer gcc
recognizes memset-like loops at higher optimization levels and emits
calls to memset even in a freestanding build, which then fails to link
without a C library.)

Some things to talk about at this point. First off, I was using jtag
to load and run these programs. Think about what that means when the
instruction cache is on and the data cache is off: you stop the ARM
and use data transactions to load the new program into ram, uncached,
but whatever instructions were in the instruction cache when you
stopped are still there, lying around. If you have changed the
program, the instructions in the instruction cache no longer match the
new program, and when you start up you are mixing two programs;
nothing good comes from that. Although my start_l1cache init
invalidates the whole cache, and I stop the instruction cache when the
program ends, there is evidence, if you run the same program over and
over, that the instruction cache is playing a role. Ideally, when
doing something like this with any sort of bootloader, serial or jtag
(using data cycles to place instructions), you want to completely stop
the icache and clean it before starting again. If that does not appear
to be working, just power cycle between each test so you come up the
same way every time.

The mmu and data cache are worse than the icache in this respect; just
power cycle between each test. This stop, reload, stop, reload is not
normal use of a system. It is normal development, sure, but how much
do you put into your code for something that is not runtime? Another
factor during development is making sure you are actually initializing
everything. Say the 12th build of the afternoon set some bit somewhere
to make something work. You didn't think that code was working
(because some other bit somewhere else was not set), so you removed
it. Later, without power cycling, you find that other bit; now you
think you have it figured out. Eventually you power cycle and it
doesn't work. One thing your code must do is come up out of a reset.
Being able to re-run hot is not as important; it might save
development time and may have some value, but it is nowhere near as
important as power-on initialization.

If you read up on the Raspberry Pi you know the memory is SDRAM, which
at its core is DRAM, and the first thing to know about DRAM is that it
has to be refreshed. Think of each bit as a rechargeable battery: if
you want to remember that a bit is charged, you have to give it a
little boost every so often, and if it discharges too much you might
not notice that the bit has changed. The processor cannot access that
memory while the DRAM controller is refreshing it. Basically, DRAM
performance is not deterministic; it changes a little. Which means if
you run the same benchmark several times you should not be surprised
if the results are not the same, even bare metal like this where there
are no interrupts and no other code running. You can make dram
deterministic if you access it slowly enough to ensure you get the
result in the same number of clock cycles every time. For all we know
the gpu or other parts of the chip may be sharing a bus or interfering
with each other, affecting performance and certainly making it a bit
random. If you run the same binary over and over you will see that it
varies some, but not a huge amount, so the times are reasonably
consistent as far as this code/task goes. If you were to run on, say,
a microcontroller, or even a GameBoy Advance or another system where
the flash and ram access times are a known number of clocks every
time, then you should at worst see a difference of one count either
way. For some of the runs below I ran them multiple times to show the
times were not exactly the same.
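
When eyeballing repeated runs like that, the useful number is the
spread between the fastest and slowest run. A minimal helper, assuming
you have collected the tick counts into an array by hand as above:

```c
#include <limits.h>

/* Given tick counts from several runs of the same binary, report the
   low, high, and spread.  On DRAM-backed systems a nonzero spread is
   expected; on deterministic memory it should be at most one count. */
unsigned int spread(const unsigned int *times, int n,
                    unsigned int *lo, unsigned int *hi)
{
    *lo = UINT_MAX;
    *hi = 0;
    for (int i = 0; i < n; i++) {
        if (times[i] < *lo) *lo = times[i];
        if (times[i] > *hi) *hi = times[i];
    }
    return *hi - *lo;   /* run-to-run variation in timer ticks */
}
```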

Another topic related to this kind of benchmarking is how caches work.
Quite simply, programs tend to run a number of instructions in a row
before they branch, and when you do stuff with data you either tend to
reuse the same data or access it in chunks or in some linear order.
The cache reads a number of words at a time from slow memory into fast
memory (the fast memory is very deterministic, BTW), so that if you
happen to read that instruction again in the relatively near future
you don't suffer a slow memory cycle but get a relatively fast one.
With a few instructions in a row, the first one in the cache line (the
chunk of ram the cache fetches in one shot) causes the line to be
read, really really slow, but the second and third are really really
fast. If that code is used again, say in a loop, before it is kicked
out of the cache to make room for other code, it remains really fast
until it finally gets replaced with other code.

So let's pretend our cache has a cache line of 4 instructions aligned
on a 4-word boundary; say 0x1000, 0x1004, 0x1008 and 0x100C are all in
the same cache line. Now say we have two instructions that are called
often: something branches to the first one, it does something, then
the second one is a branch elsewhere. Maybe not the most efficient,
but it happens. If the first instruction is at address 0x1000, then on
a cache miss the branch reads the four instructions from main memory,
really slow, and the second instruction is really fast; we basically
read 4 instructions to execute two (let's not think about prefetch
right now). Now if I were to remove one instruction somewhere just
before this code, say I optimized one instruction out of something,
even the startup code, it can actually cause many changes when you
link, but let's assume what it does to these instructions is put one
at 0xFFC and the other at 0x1000. Now every time I hit these two
instructions I have to read two cache lines, one at 0xFF0 and the
other at 0x1000; I have to read 8 instructions to execute 2. If those
two instructions are used often enough, and they happen to line up
with other code used often enough to evict one or both of these cache
lines, then for these two instructions the cache can be making life
worse. If we were to put a nop back somewhere to move these two into
the same cache line then, focusing on those two only, our performance
would improve.

Of course it is not at all that simple. In a big program there are
tons of cache lines evicting each other and this alignment problem is
all over the place; by aligning one group of instructions you might
move another group, or many other groups, so they are no longer
optimally aligned. The bottom line to this very long subject is that
by simply adding and removing nops at the beginning of the program and
re-compiling and re-linking, you move the instructions relative to
their addresses, change this cache alignment, and as a result change
the performance. Even if the instructions were all position
independent and identical but moved over one address location, the
performance of your program, even with a deterministic memory system,
will vary. Let's try it: the fifth column below is the number of nops
added to vectors.s. This sample ships with three; you can add or
remove them at will. With these numbers the changes are actually less
than one percent, but despite that you can still see that something is
going on, and that something has to do with how things line up
relative to the cache lines of the instruction cache.
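
The 0xFFC/0x1000 example above reduces to simple arithmetic. A tiny
sketch, assuming the 16-byte (4-instruction) line used in the example:

```c
/* Which cache line an address falls in, for a 16-byte line. */
#define LINE_BYTES 16u

unsigned int cache_line_of(unsigned int addr)
{
    return addr / LINE_BYTES;
}

/* How many line fills two 4-byte instructions cost on a cold cache:
   one if they share a line, two if they straddle a line boundary. */
int lines_touched(unsigned int first, unsigned int second)
{
    return (cache_line_of(first) == cache_line_of(second)) ? 1 : 2;
}
```

At 0x1000 and 0x1004 the pair costs one fill; nudged to 0xFFC and
0x1000 it costs two, which is the whole alignment effect in miniature.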

cops     icache  mmu  dcache  nops  time

COPS4    yes     no   no      0     0x8E7665 0x8E711F 0x8E73CB
COPS4    yes     no   no      1     0x8E735E
COPS4    yes     no   no      2     0x8E29A6 0x8E2DB9
COPS4    yes     no   no      3     0x8E220D
COPS4    yes     no   no      4     0x8E2859
COPS4    yes     no   no      5     0x8E6691 0x8E68CB
COPS4    yes     no   no      6     0x8E6ACC
COPS4    yes     no   no      7     0x8E7713 0x8E7786
COPS4    yes     no   no      8     0x8E735A

Another topic is the mmu. How does an mmu work? Ideally an mmu is
there to take a virtual address, the address the processor thinks it
is using, and convert it to a physical address, the real address in
memory. You can, for example, have all your linux programs compiled
for the same address; all linux programs think they start at the same
address, but it is virtual. One program may think it is running code
at 0x9000 when it is really at 0x12345678; another running at 0x9000
might really be at 0x200100. The mmu also helps with protecting
programs from each other, among other things. You can also tune what
is cached or not. In our case here, we don't want hardware registers
like the timer and uart to be cached, so we use the mmu to mark that
address space as non-cached and our program's memory space as cached.
In this case the virtual and physical addresses are the same, to make
things easier.

Now how does the mmu do what it does? It has tables; think of them as
nested arrays (an array whose index comes from another array with some
other index). In addition to all the cache business going on when you
have a memory cycle, there is another kind of cache in the mmu that
remembers some small number of virtual to physical address
conversions. If your new address is not in that list, the mmu has to
compute an offset into the first-level table and perform that memory
cycle. Waiting... waiting... Then some bits in that first table entry
tell you where to go in the next table, so you compute another address
and do another slow memory read. Now, finally, three memory cycles
later, you can actually go after the thing you were first looking for.
This is repeated for almost everything. Just as the repetitive nature
of programs makes the cache's additional reads (fetching a whole cache
line even when only one item is needed) faster overall, the cache-like
table in the mmu cuts down on the constant table lookups to smooth
things out. As the results show, with the mmu enabled a program like
this, which is loop heavy and data heavy, runs quite a bit faster when
the data cache is on.
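
The nested-array lookup can be sketched as a toy model. The bit split
here (4+4+8 bits of a 16-bit toy address) is illustrative only, not
the ARM table format:

```c
/* Toy two-level translation: a first-level table indexed by the top
   bits of the virtual address, whose entries point at second-level
   tables indexed by the next bits.  Each level costs a memory read
   before the actual access can happen. */
#define L1_ENTRIES 16
#define L2_ENTRIES 16

typedef struct { unsigned int base[L2_ENTRIES]; } l2_table;

unsigned int translate(l2_table *l1[], unsigned int va)
{
    unsigned int i1  = (va >> 12) & 0xF;   /* first slow memory read  */
    unsigned int i2  = (va >> 8)  & 0xF;   /* second slow memory read */
    unsigned int off = va & 0xFF;
    return l1[i1]->base[i2] + off;         /* third: the real access  */
}
```

The small translation cache in the mmu (the TLB) exists precisely so
that those first two reads can usually be skipped.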

clang version 3.0 (branches/release_30 152644)
Target: x86_64-unknown-linux-gnu
Thread model: posix

cops     icache  mmu  dcache  time

LLCOPS0  no      no   no      0xDF11BB
LLCOPS1  no      no   no      0xDEF420

LLCOPS0  yes     no   no      0xABA6AC
LLCOPS1  yes     no   no      0xAB97C6

LLCOPS0  yes     yes  no      0x1A49FE
LLCOPS1  yes     yes  no      0x19F911

clang version 3.3 (branches/release_33 189603)
Target: x86_64-unknown-linux-gnu
Thread model: posix

cops     icache  mmu  dcache  time

LLCOPS0  no      no   no      0xE6EF36
LLCOPS1  no      no   no      0xF550AC

LLCOPS0  yes     no   no      0xAC25D7
LLCOPS1  yes     no   no      0xAC2B1C

LLCOPS0  yes     yes  no      0x1CA6C5
LLCOPS1  yes     yes  no      0x1C4F53

LLCOPS0  yes     yes  yes     0x1CA16C
LLCOPS1  yes     yes  yes     0x1C5B56

A simple experiment to show the mmu overhead: changing the code from
this

if(add_one(0x00000000,0x0000|8|4)) return(1);
if(add_one(0x00100000,0x0000|8|4)) return(1);
if(add_one(0x00200000,0x0000|8|4)) return(1);

to this

if(add_one(0x00000000,0x0000)) return(1);
if(add_one(0x00100000,0x0000)) return(1);
if(add_one(0x00200000,0x0000)) return(1);

disables the data cache for our program space, basically no data cache
anywhere. The time goes from 0x19Fxxx to 0xE5Bxxx, which is slower
than the slowest clang time by quite a bit, about 885% slower.
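
Those 8 and 4 values line up with the C (cacheable, bit 3) and B
(bufferable, bit 2) flags of an ARM first-level section descriptor,
so dropping them marks the 1MB section uncached. A hedged sketch of
building such a descriptor; this is an illustration of the descriptor
bit layout, not the example's actual add_one code:

```c
/* Build an ARM first-level 1MB section descriptor: the top 12 bits
   hold the section base address, bits [1:0] = 0b10 mark it a section,
   and flag bits such as C (8) and B (4) are ORed in by the caller. */
unsigned int section_descriptor(unsigned int pa, unsigned int flags)
{
    unsigned int entry = pa & 0xFFF00000;   /* 1MB section base */
    entry |= flags;                         /* e.g. C=8, B=4    */
    entry |= 2;                             /* type: section    */
    return entry;
}
```

With flags of 8|4 the section is cached and buffered; with 0 every
load and store goes all the way to DRAM, hence the 885% slowdown.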

Llvm and clang are getting better. They have not quite caught up to
gcc, but compared to, say, version 2.7 against whatever gcc was
current at the time, they are converging. The best clang time is 20%
slower than the best gcc time for this particular benchmark.

As mentioned at the beginning, this demonstrates that the same source
code compiled with different compiler options, with different versions
of the same compiler, and with different compilers shows dramatically
different results. The worst clang time is 858% slower (8.5 times
slower) than the fastest clang time. The worst gcc time is 964% slower
than the fastest gcc time. A newer version of gcc does not
automatically produce faster code, nor the same speed code; something
changes, and it is not always better. Newer does not mean better. We
also saw what the mmu does, what the cache does, what tiny changes to
the location of the same sets of instructions can do, and so on.
Notice that we did not have to do any disassembly and comparison to
understand that there are definite differences running the same source
code on the same computer. When you go to tomshardware and look at
benchmarks, those are a single binary run one time, one way, on top of
an operating system. Yes, the hard drive may be different and
everything else held the same, but did they run that test 100 times
and average, or one time? Had they run it more than once, what is the
run-to-run difference on the same system? Is that difference greater
than, say, the same system with a different hard drive? It probably
says somewhere in the fine print; the point is that you need to
understand the nature of benchmarking.

Another thing that is hopefully obvious now: take any other library or
program, basically change the source code, and the performance changes
again. There are no doubt programs that llvm/clang is better at
compiling, and a benchmark like this would show it. Take the same
code, compile it several different ways, with different compilers,
etc., and see how they play out: you will find code llvm is good at
and code gcc is good at, and some programs may run at X instructions
per second on average and others at Y instructions per second on
average on the same hardware.

What this benchmark has shown is that the same source code compiled
with different compilers and different settings produces dramatically
different performance results, which implies the generated code is not
the same: there is no one-to-one relationship between high level
programs and the machine code generated. We also got to play with the
caches and mmu in a simplified fashion.

Based on new discoveries, I changed the start address for the whole
thing to 0x8000; that messes with the cache lines, so the results have
some differences from the numbers above (for the reasons described
above).