From f4b7737ae356615c12d47f3dadb40402248db7af Mon Sep 17 00:00:00 2001
From: David Welch
Date: Tue, 5 Jun 2012 23:45:43 -0400
Subject: [PATCH] added a lot to the twain README, need to go through it and prune/edit.

---
 twain/README  | 201 +++++++++++++++++++++++++++++++++++++++++++++++++-
 twain/twain.c |   2 +-
 2 files changed, 201 insertions(+), 2 deletions(-)

diff --git a/twain/README b/twain/README
index 37b8d57..3b83cb2 100644
--- a/twain/README
+++ b/twain/README
@@ -153,8 +153,115 @@

COPS3 yes yes yes 0x17F0C8
COPS4 yes yes yes 0x17FB53
COPS5 yes yes yes 0x15B55E

-work in progress...you are here...

Some things to talk about at this point. First off, I was using JTAG to
load and run these programs. Think about what that means when the
instruction cache is on and the data cache is off. If you halt the ARM
and use data transactions to load a program into RAM, those writes are
not cached, but whatever instructions were in the instruction cache when
you halted are still sitting there. If you have changed the program you
want to run, the instructions in the icache no longer match the new
program, and when you start up you are mixing two programs; nothing good
comes from that. Although my start_l1cache init invalidates the whole
cache, and I turn the instruction cache off when the program ends, there
is evidence when you run the same program over and over that the icache
is still playing a role. Ideally, when loading with any sort of
bootloader, be it serial or JTAG (which uses data cycles to place
instructions), you want to completely disable the icache and invalidate
it before starting again. If that does not appear to be working, just
power cycle between each test so you come up the same way every time.
The MMU and data cache are worse than the icache in this respect: just
power cycle between each test. This stop, reload, stop, reload pattern
is not normal use of a system. It is normal during development, sure,
but how much do you put into your code for something that is not a
runtime situation? Another factor during development is making sure you
are actually initializing everything. Say the 12th build of the
afternoon set some bit somewhere that made something work. You didn't
think that code was working (because some other bit somewhere else was
not set) so you removed it. Later, without power cycling, you find that
other bit and now you think you have it figured out. Eventually you
power cycle and it doesn't work. One thing your code must do is come up
correctly out of a reset. Being able to re-run hot is not as important;
it might save development time and may have some value, but it is
nowhere near as important as power-on initialization.

If you read up on the Raspberry Pi you know that the memory is SDRAM,
which is at its core DRAM, and the first thing to know about DRAM is
that it has to be refreshed. Think of each bit as a rechargeable
battery: if you want to remember that a bit is charged you have to give
it a little boost every so often, otherwise it discharges and you can no
longer tell what the bit was. The processor cannot access that memory
while the DRAM controller is refreshing it, so DRAM performance is not
deterministic; it changes a little from run to run. That means if you
run the same benchmark several times you should not be surprised when
the results are not identical, even bare metal like this where there are
no interrupts and no other code running.
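
If you want to see that run-to-run variation for yourself, one simple
way is to time the same routine several times in a row and print each
count. The following is only a minimal sketch, not the code this
benchmark uses: it assumes the BCM2835 free-running system timer counter
at 0x20003004 and reuses the GET32 helper from this code, while
benchmark_body() and hexstring() are hypothetical placeholders for the
loop under test and a hex print routine.

    #define SYSTIMER_CLO 0x20003004

    extern unsigned int GET32 ( unsigned int );
    extern void hexstring ( unsigned int ); /* hypothetical hex print */
    extern void benchmark_body ( void );    /* hypothetical loop under test */

    void show_run_to_run ( void )
    {
        unsigned int run;
        unsigned int before,after;

        for(run=0;run<5;run++)
        {
            before=GET32(SYSTIMER_CLO);
            benchmark_body();
            after=GET32(SYSTIMER_CLO);
            hexstring(after-before); /* counts differ slightly run to run */
        }
    }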

You can make DRAM deterministic if you access it slowly enough to ensure
you get the result in the same number of clock cycles every time. For
all we know the GPU or other parts of the chip may be sharing a bus or
otherwise interfering, affecting performance and certainly adding some
randomness. If you run the same binary over and over you will see that
the count varies some, but not by a huge amount, so at least it is
somewhat consistent as far as this code/task goes. If you were to run on
say a microcontroller, or even a Gameboy Advance, or another system
where the flash and RAM access times are a known number of clocks every
time, then you should at worst see a difference of one count either way.
For some of the runs below I ran them multiple times to show the times
were not exactly the same.

Another topic related to this kind of benchmarking is how caches work.
Quite simply, programs tend to run a number of instructions in a row
before they branch, and when you work with data you tend to either reuse
the same data or access data in chunks or in some linear order. A cache
reads a number of words at a time from slow memory into fast memory (the
fast memory is very deterministic, by the way) so that if you happen to
need that instruction again in the relatively near future you do not
have to suffer a slow memory cycle but can have a fast one. If you have
a few instructions in a row, the first one in the cache line (the chunk
of RAM the cache fetches in one shot) causes the line to be read, which
is really slow, but the second and third are then really fast. If that
code is used again, say in a loop, before it is kicked out of the cache
to make room for other code, it stays fast until it finally gets
replaced.

So let's pretend our cache has a line of 4 instructions aligned on a 4
word boundary, say 0x1000, 0x1004, 0x1008 and 0x100C all being in the
same cache line. Now say we have two instructions that are executed
often: something branches to the first one, it does something, and the
second one is a branch elsewhere. Maybe not the most efficient code, but
it happens. If the first instruction is at address 0x1000, then when you
branch to it and there is a cache miss, the cache reads four
instructions from main memory, really slow; the second instruction is
then really fast, but we basically read 4 instructions to execute two
(let's not think about the prefetch right now). Now say I remove one
instruction somewhere just before this code, say I optimize one
instruction out of something, even the startup code. That can actually
cause many changes when you link, but let's assume what it does to these
two instructions is put one at 0xFFC and the other at 0x1000. Now every
time I hit these two instructions I have to read two cache lines, one at
0xFF0 and the other at 0x1000; I have to read 8 instructions to execute
2. If those two instructions are used often enough, and they happen to
line up with other frequently used code that evicts one or both of those
cache lines, then for these two instructions the cache can be making
life worse. If we were to put a nop back somewhere to move the pair into
the same cache line again, then, looking at those two instructions only,
performance would improve.
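
To make that concrete, here is a small sketch (not part of the benchmark
itself) that checks how many line fills it takes to fetch a pair of
instruction addresses, using the pretend 16 byte (4 instruction) line
from the example above; the real ARM1176 L1 caches use a 32 byte line.

    #define LINE_SIZE 16 /* pretend 4 word line, use 32 for the ARM1176 */

    /* strip the offset within the line to get the line's base address */
    static unsigned int line_base ( unsigned int addr )
    {
        return(addr&(~(LINE_SIZE-1)));
    }

    /* how many cache line fills does it take to fetch two instructions? */
    unsigned int fills_for_pair ( unsigned int first, unsigned int second )
    {
        if(line_base(first)==line_base(second)) return(1);
        return(2);
    }

    /* fills_for_pair(0x1000,0x1004) is 1, one 4 word read covers both.
       fills_for_pair(0x0FFC,0x1000) is 2, 8 words read to execute 2. */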

Of course it is not at all that simple. In a big program there are tons
of cache lines evicting each other, and you have this alignment problem
all over the place; by aligning one group of instructions to optimize it
you might move another group, or many other groups, so they are no
longer optimally aligned. The bottom line to this very long subject is
that by simply adding and removing nops at the beginning of the program
and re-compiling and re-linking, you move the instructions relative to
their addresses, change the cache alignment, and as a result change the
performance. Even if the instructions are all position independent and
identical, just moving them over one address location will make the
performance of your program vary, even with a deterministic memory
system. Let's try it. The fifth column is the number of nops added to
vectors.s; this sample ships with three and you can add or remove them
at will. With these numbers the changes are actually less than one
percent, but despite that you can still see that something is going on,
and that something has to do with how things line up relative to the
cache lines of the instruction cache.

COPS4 yes no no 0 0x8E7665 0x8E711F 0x8E73CB
COPS4 yes no no 1 0x8E735E
@@ -166,6 +273,38 @@
COPS4 yes no no 6 0x8E6ACC
COPS4 yes no no 7 0x8E7713 0x8E7786
COPS4 yes no no 8 0x8E735A

Another topic is the MMU. How does an MMU work? An MMU is there to take
a virtual address, the address the processor thinks it is using, and
convert it to a physical address, the real address in memory. For
example, all your Linux programs can be compiled for, and think they
start at, the same address, but that address is virtual. One program may
think it is running code at 0x9000 when it is really at 0x12345678;
another running at 0x9000 might really be at 0x200100. The MMU also
helps with protecting programs from each other, among other things. You
can also tune what is cached and what is not. In our case here we do not
want hardware registers like the timer and uart to be cached, so we use
the MMU to mark that address space as non-cached and our program's
memory space as cached. In this case the virtual and physical addresses
are kept the same to make things easier.

Now how does the MMU do what it does? It has tables; think of them as
nested arrays, where an entry in the first table points you at a second
table. On top of all the cache business going on when you perform a
memory cycle, there is another kind of cache in the MMU (the TLB) that
remembers a small number of recent virtual to physical conversions. If
your new address is not in that list, the MMU has to compute an offset
into the first level table and perform that memory cycle.
Waiting...waiting... Then some bits in that first level entry tell it
where to go in the next table, so it computes another address and does
another slow memory read. Finally, three memory cycles later, it can
actually go after the thing you were looking for in the first place.
This is repeated for almost everything. Just as the repetitive nature of
code and data makes the cache's extra reads (fetching a whole line even
if only one item is needed) a win overall, the cache-like TLB in the MMU
cuts down on the constant table lookups and smooths things out. As the
results show, turning on the MMU with the data cache enabled makes a
loop-heavy, data-heavy program like this run quite a bit faster.
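
To give a feel for what those first level table entries look like, here
is a sketch of filling in one ARMv6 "section" entry, which maps 1MB of
virtual space. This is not the add_one() routine from this code, just an
assumed illustration: the table base address, the map_section() name and
the choice of values are mine; the layout (physical base in the top
bits, AP bits, the C and B cacheable/bufferable bits 0x8 and 0x4, and
the low bits marking a section) comes from the ARMv6 short descriptor
format.

    #define MMUTABLEBASE 0x00004000 /* assumed 16KB aligned first level table */

    extern void PUT32 ( unsigned int, unsigned int );

    /* one word per 1MB of virtual space, so the top 12 bits of the
       virtual address pick the entry */
    void map_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
    {
        unsigned int ra;

        ra=MMUTABLEBASE+((vadd>>20)<<2);
        PUT32(ra,(padd&0xFFF00000)|0xC00|flags|2);
        /* 0xC00 sets the AP bits for full access, 2 marks a section entry,
           flags of 0x8|0x4 set the C and B bits (cacheable/bufferable) */
    }

    /* map_section(0x00000000,0x00000000,0x8|0x4) would mark the first
       megabyte (our program) cacheable, virtual equal to physical;
       map_section(0x20000000,0x20000000,0x0000) would leave the
       peripheral space (timer, uart) uncached. */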

clang version 3.0 (branches/release_30 152644)
Target: x86_64-unknown-linux-gnu
Thread model: posix
@@ -179,4 +318,64 @@
LLCOPS1 yes no no 0xAB97C6
LLCOPS0 yes yes no 0x1A49FE
LLCOPS1 yes yes no 0x19F911

A simple experiment shows the MMU overhead. Changing the code from this

 if(add_one(0x00000000,0x0000|8|4)) return(1);
 if(add_one(0x00100000,0x0000|8|4)) return(1);
 if(add_one(0x00200000,0x0000|8|4)) return(1);

to this

 if(add_one(0x00000000,0x0000)) return(1);
 if(add_one(0x00100000,0x0000)) return(1);
 if(add_one(0x00200000,0x0000)) return(1);

disables the data cache for our program's address space, so basically no
data cache anywhere. The program goes from 0x19Fxxx to 0xE5Bxxx, about
8.85 times (885% of) the cached count, and quite a bit slower than even
the slowest clang program.

LLVM and clang are getting better. They have not quite caught up to gcc,
but compare, say, llvm 2.7 against whatever gcc was current at the time
and you can see them converging. The best clang time is 20% slower than
the best gcc time for this particular benchmark.

As mentioned at the beginning, this demonstrates that the same source
code compiled with different options on the same compiler, with
different versions of the same compiler, and with different compilers
altogether produces dramatically different results. The worst clang time
is about 8.6 times (858% of) the fastest clang time. The worst gcc time
is about 9.6 times (964% of) the fastest gcc time. A newer version of
gcc does not automatically produce faster code, nor does it produce the
same speed code; something changes, and it is not always for the better.
Newer does not mean better. We also saw what the MMU does, what the
cache does, what tiny changes in the placement of the same instructions
can do, and so on. Notice that we did not have to do any disassembly and
comparison to see that there are definite differences when running the
same source code on the same computer. When you go to tomshardware and
look at benchmarks, those are a single binary run one time, one way, on
top of an operating system. Yes, the hard drive may be different with
everything else held the same, but did they run that test 100 times and
average, or just once? Had they run it more than once, what is the run
to run difference on the same system? Is that difference greater than,
say, the difference between the same system with a different hard drive?
It probably says somewhere in the fine print; the point is that you need
to understand the nature of benchmarking.

Another thing that is hopefully obvious now: take any other library or
program, that is, change the source code, and the performance changes
again. There are no doubt programs that llvm/clang is better at
compiling, and a benchmark like this would show it: take the same code,
compile it several different ways with different compilers, and see how
they play out. You will find code llvm is good at and code gcc is good
at, and some programs may average X instructions per second while others
average Y instructions per second on the same hardware.

What this benchmark has shown is that the same source code compiled with
different compilers and different settings produces dramatically
different performance results, which implies that the generated code is
not the same; there is no one to one relationship between a high level
program and the machine code generated from it. We also got to play with
the caches and the MMU in a simplified fashion.
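
For reference, the percentage comparisons above are just ratios of the
raw counts. Here is a trivial host-side sketch (nothing to do with the
benchmark binary itself), using the cached LLCOPS1 count from the table
and 0xE5B000 standing in for the 0xE5Bxxx figure from the experiment,
since the low digits were not recorded:

    #include <stdio.h>

    int main ( void )
    {
        unsigned int fast=0x19F911; /* mmu + dcache on, from the table */
        unsigned int slow=0xE5B000; /* dcache off, low digits assumed  */

        /* prints roughly 8.84 times, i.e. about 884% of the fast count */
        printf("%.2f times the fast count (%.0f%% of it)\n",
            (double)slow/(double)fast,
            100.0*((double)slow/(double)fast));
        return(0);
    }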

(yes this is too wordy and I need to proof/rewrite)

diff --git a/twain/twain.c b/twain/twain.c
index e729f09..f1b2f45 100644
--- a/twain/twain.c
+++ b/twain/twain.c
@@ -3,7 +3,7 @@
 //-------------------------------------------------------------------------
 #define ICACHE
 #define MMU
-//#define DCACHE
+#define DCACHE
 
 extern void PUT32 ( unsigned int, unsigned int );
 extern unsigned int GET32 ( unsigned int );