diff --git a/boards/pizero/asmdelay/README b/boards/pizero/asmdelay/README
index 3631993..2626bee 100644
--- a/boards/pizero/asmdelay/README
+++ b/boards/pizero/asmdelay/README
@@ -20,7 +20,7 @@ this or not?
 Here is the punch line
 
 min max difference
-00016DDE 003E025D 003C947F
+00016DDE 003E025D 003C947F
 
 Yes! The minimum is 0.71 clocks per loop on average, less than one
 clock per instruction! How is that possible?
@@ -55,9 +55,9 @@ to adjust its alignment.
 Using the disassembly of the loop in start.s
 
 0000802c <ASMDELAY>:
- 802c: e2500001  subs r0, r0, #1
- 8030: 1afffffd  bne 802c <ASMDELAY>
- 8034: e12fff1e  bx lr
+ 802c: e2500001  subs r0, r0, #1
+ 8030: 1afffffd  bne 802c <ASMDELAY>
+ 8034: e12fff1e  bx lr
 
 We can see the raw instructions, the conditional branch is pc relative
 not absolute, basically position independent so can be used as is.
@@ -72,178 +72,926 @@ didnt know there was a prefetch flush you needed to do. I went way
 overboard and used flushes and dmbs and dsbs liberally, needed or not.
 Prefetch flush made it so that the pi worked.
 
-should I dive into this or not? hmm...
+---
+
+Cache. For what we care about here a cache is a relatively small
+amount of memory that is faster than the main memory. Being smaller it
+can only hold some things. Ideally it holds the things you are using
+more than once, or things you are about to use: programs tend to do
+things linearly, running instructions sequentially at least for a
+little while before needing to do a branch, and when we read data,
+parsing strings, etc, we often (enough) read memory in order for at
+least a little while. The cache has tables (tags) used to know what is
+in the cache. Read transactions marked as cacheable are compared
+against those tags to see if the answer is in the cache; if so the
+processor does not have to wait as long, since the cache is faster
+than main memory. If there is a cache miss, meaning the item is not in
+the cache, then the cache will do a read, but it does not necessarily
+read just the item you want, it reads the amount of memory needed to
+fill a "cache line". A cache line is an aligned amount of data, often
+larger than a normal sized access, the idea being, as above, that if
+you are executing code you often have linear chunks, and if you are
+reading data to process you often have linear chunks. So if you were
+to read two things back to back that are in the same cache line, the
+first one, if there is a miss, is pretty slow because the whole line
+has to be read in. The line fill is not grossly inefficient with
+respect to a read from main memory, probably slower than a smaller
+sized read, but probably faster than multiple separate reads to gather
+the same amount. So the first item read in a line is slow and the
+second is significantly faster, so even if you only read two things
+you might be faster than if you had no cache.
+
+This example is not doing anything with data, at least nothing that
+matters as far as the performance test goes. As shown above there is a
+two instruction loop; these are instructions, and instructions, when
+the (instruction) cache is enabled, are marked as cacheable when
+fetched. So the first interesting thing we see is one of these two
+loops.
+
+    invalidate_l1cache();
+    for(ra=0;ra<4;ra++)
+    {
+        beg=GET32(ARM_TIMER_CNT);
+        ASMDELAY(10);
+        end=GET32(ARM_TIMER_CNT);
+        hexstring(end-beg);
+    }
+
+The invalidate basically erases the cache in the sense that it forgets
+all the tags. ASMDELAY(10) runs the two instruction loop 10 times, so
+the first time those instructions are fetched they come from main
+memory. The remaining 9 times ideally come from cache.
+The outer loop runs 4 times with no invalidate in between, so after
+the first pass all 10 ASMDELAY loops should come from cache. Assuming
+that
+
+0000004A
+00000031
+00000031
+00000031
+
+00000041
+00000031
+00000031
+00000031
+
+0x31 = 49
+49 / 10 = 4.9
+
+we are averaging 4.9 timer ticks per loop for the cached loops.
+
+0x4A = 74
+
+4.9 * 9 = 44
+74 - 44 = 30
+
+So based on those assumptions, the first time through the loop took
+about 30 ticks.
+
+0x41 = 65
+65 - 44 = 21
+
+For some reason the second time we run this experiment the first pass
+is faster. This could be as simple as dram accesses not being
+deterministic; we are also sharing the dram with the GPU, so maybe
+there was contention for that resource and one run took longer.
+
+00045C3F
+00045C28
+00045C27
+00045C28
+
+Note that so far we are talking about the L1 cache inside the ARM
+core. We see here, with 0x20000 loops, that the first pass appears to
+be a little longer than the rest.
+
+That is 2.18 timer ticks per loop on average for the latter passes.
+If you work the math, the first fetch on the first pass cost about
+25.18 ticks, which is on par with the 10 loop experiments. It is just
+much more dramatic with fewer loops, which matters depending on what
+you are doing. Timing a hundred thousand or so times through this
+loop is done to get an average; running it multiple times hopes to
+erase the first fetch, or you can do a million or a billion loops so
+the first loop time gets swamped by the average. But if you are
+wanting to use this code as a timed loop of only a few times through,
+it is important to know what the best and worst times are. If you are
+bit banging i2c or spi or whatever, and you cannot go faster than
+some time period, you need to determine the best possible loop time
+and use that as the tuning value, because for that bit banging you
+can usually go slower, up to several times slower, but cannot go
+faster even once.
+
+Our understanding from the Broadcom ARM manual for this part, the
+only public one we have for the original pi processor, which is the
+same one in the pi-zero, is that the ARM address space at 0xC0000000
+and up is uncached. There is a cache outside the ARM but in front of
+the dram; in theory that cache is shared between us and the GPU, but
+who knows? Like any other cache though, especially one like this that
+likely does not distinguish ARM instruction fetches from data reads,
+it should be caching our instruction reads all the time. And I dont
+know how to invalidate it.
+
+So these initial loops
+
+0019F158
+0019F149
+0019F0FE
+0019F142
+0019F1C6
+
+One would have expected the first to be slower than the others.
+Perhaps code that preceded this caused the cache to fill. Perhaps we
+can create experiments to get a feel for 0xC0000000 being uncached,
+while assuming that the 0x00000000 arm space we are using is cached.
+It is pretty easy to write a small program that writes to some offset
+in our memory near 0x00000000, say 0x00001000 for example, then reads
+0x40001000 and 0x80001000 and 0xC0001000; you will see the same value
+you wrote to 0x00001000, demonstrating that, at least as far as the
+ARM is concerned, the address space does wrap around and the same ram
+shows up at all four aliases.
+
+Note this is using the ARM TIMER, which blinker03 shows is 250MHz
+based, while the ARM is in theory running at 1000MHz. So four
+processor clocks per timer tick.
+
+So based on what we have seen so far we would assume that once in the
+instruction cache we always get the same performance, yes?
+Well then why does this happen (and why did I do this test)?
+
+C0006000 0005B72B 0005B72B 0005B72B 00000000
+C0006000 0005B6F1 0005B6F1 0005B72B 0000003A
+C000601C 0005B731 0005B6F1 0005B731 00000040
+C0006058 0005B732 0005B6F1 0005B732 00000041
+C0006078 0005B73B 0005B6F1 0005B73B 0000004A
+
+What this is telling us is that, for at least the range I tried, with
+the instructions most likely in cache, our loop time still varies
+between 0005B6F1 and 0005B73B, a difference of 0000004A timer ticks.
+That is not a lot, but run this test again and again and you will see
+these strange boundaries where the timing changes. How is this
+possible? It is only two instructions, and the only thing, in theory,
+that is changing is what addresses they live at.
+
+Well, think about this: this is a pipelined processor, and a pipeline
+is basically the same as an assembly line. Instead of one employee or
+set of employees putting together a product like a car in one place,
+with all the tools and parts having to weave around each other to get
+to that one location, you move the car from station to station. Each
+station performs one or a few relatively simple tasks, putting the
+tires on, mounting the doors, etc. The tools for that station and no
+others are in that station, and the supplies for that station are fed
+to it faster, on average, than the assembly line is moving. You might
+build a single car a little faster this way than keeping it
+stationary, but more importantly you can AVERAGE significantly more
+cars over time. It may take an hour to build one car from beginning
+to end, but the factory may pump out a new car every so many seconds.
+A processor pipeline is similar: the steps are broken out and
+performed per clock so that, for linear code, the average is much
+faster than operating on only one instruction at a time.
+
+Processors like this do not have the old fashioned bus of the 8088/86
+for example, where you sent out the address, and the data if it was a
+write, asserted a write signal or a read signal and some enables, and
+the memory responded the next clock cycle; the whole system ran at a
+speed that did not exceed what the processor or the SRAM could do. At
+some point came the thought of wait states: you could add slower ram
+or peripherals that couldnt keep up all the time, and some sort of
+wait scheme allowed a peripheral to say please wait while the rest of
+the system tried to keep running. What we use now with the
+AMBA/AXI/AHB busses on ARMs is a whole different strategy; it takes a
+few clock cycles even for the simplest thing (the L1 cache is buried
+in the core and doesnt necessarily need as many clocks as a
+transaction at the edge of the core). The AXI bus will say I would
+like to do a read, it is an instruction fetch, here is the address,
+here is how much data I want, and here is a transaction id. The ARM
+has the ability to keep multiple transactions in flight; it might
+start a data read generated by code doing a load, then the next cycle
+start an instruction fetch. Eventually the memory or peripheral
+responds, that feeds back into the AXI bus, and the associated id is
+put on the return bus along with the data. I mentioned you specify
+the size. The bus might be 64 bits wide or 32 bits wide and the size
+is likely in units of 32 bits for a processor like this, so in theory
+you can do a 1 word read, a 2 word read, a 3 word read, and so on up
+to probably something like 8 words per read.
+If you have a 64 bit bus, and depending on how it is designed (often
+it is based on 64 bit width alignments), a two word read and a one
+word read might take the same amount of time. But two one word reads
+should take longer than a single two word read, aligned or not.
+Busses like this, once they have the data ready, deliver it every
+clock cycle. So for an 8 word read there are the opening clock cycles
+to ask for the transaction, then some time passes as the data is
+located and/or gathered, then when it starts coming back it takes 4
+clock cycles, assuming aligned. Had it been 6 words, 64 bit aligned,
+then the difference between 6 and 8 is ideally one clock cycle. But
+three or four separate 2 word transactions should take longer, since
+each has the up front transaction handshake.
+
+So why bother to go through that? Well the pipeline only works if we
+can keep it fed with instructions. The pipeline is some depth, which
+can change from one core design/architecture to another and may or
+may not be documented outside ARM. One would expect that the logic
+fetches enough instructions to feed the pipe, and one would expect
+that fetches are transactions of multiple instructions, say for
+example 4 words per fetch, and that is probably on 4 word aligned
+boundaries. So say we branch to address 0x1000, and lets pretend
+there is a 6 deep pipeline. One would expect the logic to bang out
+two 4 word instruction fetches, one at address 0x1000 and one at
+address 0x1010. As those instructions roll in they start to feed the
+pipe; there would also need to be some storage to hold those 8
+instructions as they land, a cache or prefetch buffer or whatever you
+want to call it. Once we get through 4 or maybe 6 of those
+instructions, none of them so far being branches, one would expect
+the logic to do another 4 word fetch to keep the prefetch buffer or
+pipe full. Once there is a branch it starts all over, one or two
+immediate fetches to start to fill the pipe up again. One would
+expect that even with two instructions in a loop, the logic would
+still need to perform those fetches every pass around the loop. But
+since all of this happens inside the chip we cant see it; without all
+the legal stuff to gain access to an ARM core and the tools to
+simulate it, we are not going to know for sure what is going on.
+Stealing from the term cache line, I like to call these fetches fetch
+lines. Just as a two instruction loop that lands on the last word of
+one cache line and the first word of the next needs two cache line
+reads to cover the fetching of those two instructions, where earlier
+in the cache line only one cache line read is needed, we should be
+able to see situations where we hit that two cache line boundary, and
+likewise we should be able to see the effects of fetch lines and
+where the branches land, sometimes needing to fetch an extra fetch
+line.
+
+So even with the cache enabled and filled, something is happening
+when we branch to address 0xC000601C, extra fetch transactions are
+apparently needed, and likewise there is sensitivity at the other
+addresses. I wouldnt necessarily get worked up over a one timer tick
+difference; that could be due to the non-deterministic nature of
+using something like dram and sharing resources with another
+processor, where every so often we may have to wait longer for
+something.
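+
+As an aside, the boundary effect itself is easy to reason about on
+paper. The little host side C sketch below is not part of this test,
+and the 16 byte line size is just an assumed number for illustration
+(the real fetch/cache line sizes are not documented here); it simply
+shows which starting addresses make the 8 byte, two instruction loop
+straddle a line boundary and so need an extra line read per pass.
+
+    #include <stdio.h>
+
+    int main ( void )
+    {
+        unsigned int addr;
+        unsigned int line=16; /* assumed line size, illustration only */
+
+        for(addr=0x6000;addr<0x6040;addr+=4)
+        {
+            /* the loop body is 8 bytes, the subs plus the bne */
+            if((addr/line)==((addr+8-1)/line))
+            {
+                printf("%08X fits in one line\n",addr);
+            }
+            else
+            {
+                printf("%08X straddles two lines\n",addr);
+            }
+        }
+        return(0);
+    }
+
+Under that assumed size the straddling addresses are the ones ending
+in C, which is at least suggestive given addresses like 0xC000601C in
+the tables, but since we cant see inside the chip it is only a guess.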
+
+C0006000 0005B72B 0005B72B 0005B72B 00000000
+C0006000 0005B6F1 0005B6F1 0005B72B 0000003A
+C000601C 0005B731 0005B6F1 0005B731 00000040
+C0006058 0005B732 0005B6F1 0005B732 00000041
+C0006078 0005B73B 0005B6F1 0005B73B 0000004A
+
+We see that from the first pass at 0x6000 to the second our time got
+faster; that is likely the filling of the cache. After that point we
+never get faster, just as we saw up above with the early tests. The
+code only triggers a print if min or max changes, so the rest of
+these output lines are due to the max getting bigger.
+
+Branch prediction. Think about that processor pipeline: each stage
+can do some stuff, but doesnt do everything, otherwise what is the
+point? So even for our simple loop we have a subtraction and then a
+branch that relies on the result of that subtraction. And that
+branch, if taken, "flushes" the pipe, meaning we just toss the
+instructions behind it, and then it takes time to fetch the new
+instructions at the branch destination and serially feed the pipe
+(assuming a serial pipeline, read on). All that time the processing
+part of the processor, the assembly line, is idle until instructions
+start moving in and moving from one stage of the pipe to another.
+Branch prediction is looking at instructions that have not yet
+reached the execution stage to see if they are branches, and to see
+if we can determine whether they are going to be taken. Say we have a
+5 stage pipeline A,B,C,D,E, where A is where instructions enter the
+pipe and E is the last step, where we are finished with them. And
+lets say D is where we would normally figure out that this is a
+branch and act on it. If we were to look at stage A and see that it
+is a branch, even better an unconditional branch, and also see that
+the instructions at B and C and D are not unconditional branches
+(they might be branches, but not unconditional ones), we might want
+to start an instruction fetch for the branch destination as the
+branch is going into B, saving us two clock cycles in starting that
+fetch. Now we could also have a design that starts a fetch during A
+whether the branch is conditional or unconditional. There would be a
+lot of unused fetch bandwidth going on, but depending on our cache to
+processor performance and main ram to cache performance we might end
+up going faster overall; in our little test case that conditional
+branch is taken almost every time, so fetching for every branch we
+see would ideally put that code in the cache much earlier. It is
+likely that the logic is not going to start fetches for every single
+possible branch that might happen; the logic is going to want a bit
+more complication so as to not have too many fetches. If we save a
+clock or a few here and there, and do not cost more clocks than we
+save, then it is a win, so what if we cant accurately predict
+everything. So, using our A,B,C,D,E model above: if we see that A is
+a conditional branch that relies on flags, our logic is smart enough
+to see that B and C do not have instructions that affect the flags,
+and D has one that does, then it is possible that as D completes and
+the conditional branch in A moves into the B stage we know at that
+point whether the branch is going to happen, and if so we can start
+fetching the branch destination, assuming we can determine the branch
+destination, which depends on the instruction. It could be an
+unconditional bx r1, but if the instruction in B is a load of r1 we
+cant figure out where to branch until we finish that load or move or
+whatever.
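+
+To put rough numbers on that reasoning, here is a tiny back of the
+envelope sketch in C. It is not a model of this core; the pipeline
+bubble counts are made up purely to show the arithmetic: a taken
+branch that is only resolved late in the pipe costs a refill of
+several stages every pass, a branch whose target fetch is started
+early costs fewer.
+
+    #include <stdio.h>
+
+    int main ( void )
+    {
+        unsigned int loops=0x20000;
+        unsigned int body=2;  /* subs plus bne */
+        unsigned int late=3;  /* assumed bubbles, branch resolved late */
+        unsigned int early=1; /* assumed bubbles, target fetched early */
+
+        printf("resolved late ~0x%X clocks\n",loops*(body+late));
+        printf("fetched early ~0x%X clocks\n",loops*(body+early));
+        return(0);
+    }
+
+The absolute numbers mean nothing, the point is that the per pass
+penalty gets multiplied by 0x20000, which is why a one or two clock
+difference in when the branch target fetch starts shows up so clearly
+in these measurements.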
+So what if we were to start adding nops in our loop
+
+ASMDELAY:
+    subs r0,r0,#1
+    nop
+    ...
+    nop
+    bne ASMDELAY
+    bx lr
+
+Eventually we would have so many that the pipeline is full of nops
+and the thing that determines the branch and the thing that does the
+branch are not in the pipe at the same time. But with nops we can at
+least hope/ensure that once the pipe is full of these nops and the
+branch comes in, when that branch reaches the magic point in the pipe
+everything in front of it is a nop, so the branch predictor should
+have everything it needs to fetch early. Now add to this the herky
+jerky fetching due to fetch lines and cache lines.
+
+The first batch is with the cache on but with branch prediction
+disabled.
+
+C0006000 0005B72B 0005B72B 0005B72B 00000000
+C0006000 0005B6F1 0005B6F1 0005B72B 0000003A
+C000601C 0005B731 0005B6F1 0005B731 00000040
+C0006058 0005B732 0005B6F1 0005B732 00000041
+C0006078 0005B73B 0005B6F1 0005B73B 0000004A
+00051078
+00051878
+
+This batch is with branch prediction enabled.
+
+C0006000 00016E12 00016E12 00016E12 00000000
+C0006000 00016DDE 00016DDE 00016E12 00000034
+C0006004 000224E4 00016DDE 000224E4 0000B706
+C000601C 000224F0 00016DDE 000224F0 0000B712
+
+Much faster, much faster than expected.
+
+And yes, if you are doing the math, we are well within the realm of
+it taking fewer timer ticks than we have instructions per loop, so in
+theory we are executing two instructions in less time than it takes
+to execute one. This processor is superscalar, meaning it has
+multiple execution units. The pipeline has forks in it. The
+instructions coming in the front door are examined and sorted into
+separate lines; as with branch prediction this is not perfect, but
+the idea is to try to sort out instructions that dont have to happen
+in a certain order. For example, say we were to throw in a useful
+instruction, but one that doesnt affect our loop:
+
+ASMDELAY:
+    subs r0,r0,#1
+    add r3,r3,#3
+    bne ASMDELAY
+    bx lr
+
+Ideally the logic will determine that the subs modifies flags that
+the bne needs, so the bne must wait for the subs to complete far
+enough before the bne can execute. The add is not using the result of
+the subs nor is it affecting the bne, so ideally it gets sorted out
+into a separate execution pipe and it can possibly execute at the
+same time as the subs, or maybe even before in a more complicated
+loop. Pipeline implementations are also buried deep in the processor,
+something that likely changes or improves from one architecture to
+another as years go by and new designs come out (ARMv4 to ARMv5 to
+ARMv6 and so on). It may be that instructions are dealt out like
+cards to different execution pipes, with tags of some sort associated
+with them so that the execution pipes can talk to each other to say
+"you cant do that one until I am finished", while pipes that dont
+have that baggage can push their instructions through as fast as they
+can. So in a superscalar design I would expect to be able to insert
+that add in there and not see a performance hit other than the cost
+of the extra fetch clock cycles.
+But if I were to instead insert:
+
+ASMDELAY:
+    subs r0,r0,#1
+    and r0,r0,#0xFF
+    bne ASMDELAY
+    bx lr
+
+Unlike the add above, the and here uses the result of the subs, so
+the and has to wait for the subtract, and the bne still has to wait
+for the flags from the subtract; there is nothing the processor can
+sort out into a separate pipe and run in parallel. Now obviously this
+loop cannot count down much more than 255, not enough counts for our
+experiments, but it demonstrates the relationships that a superscalar
+processor looks for. Like branch prediction, this is not expected in
+any way to be perfect, but if you can sometimes save one or a few
+clocks here and there those clocks add up.
+
+I did not do this here, but you could also do some performance tests
+by adding that bunch of nops
+
+ASMDELAY:
+    subs r0,r0,#1
+    nop
+    ...
+    nop
+    bne ASMDELAY
+    bx lr
+
+and pushing on the difference between fetch performance and execution
+performance; you could also see if there are any herky jerky motions
+related to fetching and how the prefetch feeds the pipe, etc.
+
+Without actually seeing (in simulation) how the processor works per
+clock we can only guess as to what is going on by performing
+experiments like this.
+
+So I think we can see in this example the L1 caching, the first fetch
+through the loop having to go to main memory, which is dram so pretty
+slow, and then the rest of the loops fetching from the L1 cache which
+is the fastest/closest memory we have to the processor core. Even
+with the code in cache we can see differences based on the alignment
+of the loop, and we can see differences with the branch prediction
+on.
+
+00016DDE 003E025D 003C947F
+
+The fastest 0x20000 count loop was 00016DDE total, or 0.71 timer
+ticks per loop on average. The worst was 003E025D, or 31 timer ticks
+per loop on average.
+
+The first dump below is based on having no config.txt, again this
+is a raspberry pi zero.
+
+config.txt contains
+DISABLE_L2CACHE=1
+
+Some subtle changes, but not much to note.
+
+Now changing the arm frequency to 250MHz is quite useful, as the
+timer we are using and the arm clock are then in theory the same, not
+necessarily in phase or anything, but both 250MHz, so we no longer
+have four processor clocks per timer tick.
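+
+For reference, the tick to clock arithmetic used in this README,
+collected in one place as a host side sketch (the 4x factor assumes
+the default 1000MHz arm clock against the 250MHz timer):
+
+    #include <stdio.h>
+
+    int main ( void )
+    {
+        unsigned int loops=0x20000;
+        unsigned int best=0x00016DDE;  /* fastest total seen above */
+        unsigned int worst=0x003E025D; /* slowest total seen above */
+
+        printf("best  %.2f ticks/loop, %.2f cpu clocks/loop\n",
+            (double)best/loops,4.0*best/loops);
+        printf("worst %.2f ticks/loop, %.2f cpu clocks/loop\n",
+            (double)worst/loops,4.0*worst/loops);
+        return(0);
+    }
+
+With arm_freq=250 in config.txt one timer tick is one cpu clock and
+no conversion is needed.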
+ +So the dump after this one is with the reduced arm clock, see comments +there 12345678 12345678 12345678 12345678 12345678 -0019F158 -0019F149 -0019F0FE -0019F142 -0019F1C6 -00045C3F -00045C28 -00045C27 -00045C28 -0000004A -00000031 -00000031 -00000031 -00000041 -00000031 -00000031 -00000031 -C0000000 C0000000 C0000000 C0000000 -00050078 -00050078 -C0006000 002200D2 002200D2 002200D2 00000000 -C0006000 002200A6 002200A6 002200D2 0000002C -C0006000 00220145 002200A6 00220145 0000009F -C0006008 00220173 002200A6 00220173 000000CD -C0006010 00280096 002200A6 00280096 0005FFF0 -C0006010 00280104 002200A6 00280104 0006005E -C000601C 003E015C 002200A6 003E015C 001C00B6 -C000601C 003E01AA 002200A6 003E01AA 001C0104 -C000602C 0022009D 0022009D 003E01AA 001C010D -C000603C 003E01BC 0022009D 003E01BC 001C011F -C000603C 003E0211 0022009D 003E0211 001C0174 -C0006060 0022005E 0022005E 003E0211 001C01B3 -C00060FC 003E024D 0022005E 003E024D 001C01EF -00050078 -00050878 -C0006000 001E0119 001E0119 001E0119 00000000 -C0006000 001E00FB 001E00FB 001E0119 0000001E -C0006000 001E00C0 001E00C0 001E0119 00000059 -C0006004 00200101 001E00C0 00200101 00020041 -C0006008 001E00AD 001E00AD 00200101 00020054 -C000600C 0020015F 001E00AD 0020015F 000200B2 -C0006010 001E00A0 001E00A0 0020015F 000200BF -C0006014 00200177 001E00A0 00200177 000200D7 -C000601C 003C010A 001E00A0 003C010A 001E006A -C000601C 003C01C0 001E00A0 003C01C0 001E0120 -C0006028 001E008D 001E008D 003C01C0 001E0133 -C000603C 003C01EC 001E008D 003C01EC 001E015F -C0006040 001E0065 001E0065 003C01EC 001E0187 -C000605C 003C0252 001E0065 003C0252 001E01ED -C000609C 003C0258 001E0065 003C0258 001E01F3 -C00060B0 001E0064 001E0064 003C0258 001E01F4 -00050878 -00050078 -C0006000 0005B72B 0005B72B 0005B72B 00000000 -C0006000 0005B6F1 0005B6F1 0005B72B 0000003A -C000601C 0005B731 0005B6F1 0005B731 00000040 -C0006058 0005B732 0005B6F1 0005B732 00000041 -C0006078 0005B73B 0005B6F1 0005B73B 0000004A -00051078 -00051878 -C0006000 00016E12 00016E12 00016E12 00000000 -C0006000 00016DDE 00016DDE 00016E12 00000034 -C0006004 000224E4 00016DDE 000224E4 0000B706 -C000601C 000224F0 00016DDE 000224F0 0000B712 -00051878 -00051078 -80000000 80000000 80000000 80000000 -00050078 -00050078 -80006000 002200E1 002200E1 002200E1 00000000 -80006000 002200C5 002200C5 002200E1 0000001C -80006000 002200B8 002200B8 002200E1 00000029 -80006000 002200E7 002200B8 002200E7 0000002F -80006004 002200E9 002200B8 002200E9 00000031 -80006004 002200AE 002200AE 002200E9 0000003B -80006004 0022018A 002200AE 0022018A 000000DC -80006008 00220075 00220075 0022018A 00000115 -8000600C 0022005F 0022005F 0022018A 0000012B -80006010 00280105 0022005F 00280105 000600A6 -8000601C 003E0168 0022005F 003E0168 001C0109 -8000601C 003E01B7 0022005F 003E01B7 001C0158 -8000603C 003E024B 0022005F 003E024B 001C01EC -800060FC 003E025A 0022005F 003E025A 001C01FB -00050078 -00050878 -80006000 001E00B2 001E00B2 001E00B2 00000000 -80006000 001E00CD 001E00B2 001E00CD 0000001B -80006000 001E0158 001E00B2 001E0158 000000A6 -80006004 00200102 001E00B2 00200102 00020050 -80006004 0020010F 001E00B2 0020010F 0002005D -80006004 002001FC 001E00B2 002001FC 0002014A -80006008 001E006F 001E006F 002001FC 0002018D -80006008 001E005C 001E005C 002001FC 000201A0 -8000601C 003C0161 001E005C 003C0161 001E0105 -8000601C 003C0267 001E005C 003C0267 001E020B -8000603C 003C026C 001E005C 003C026C 001E0210 -80006048 001E005B 001E005B 003C026C 001E0211 -00050878 -00050078 -80006000 0005B711 0005B711 0005B711 00000000 -80006000 0005B6F3 0005B6F3 
0005B711 0000001E -80006004 0005B721 0005B6F3 0005B721 0000002E -80006018 0005B732 0005B6F3 0005B732 0000003F -80006018 0005B6F1 0005B6F1 0005B732 00000041 -80006058 0005B733 0005B6F1 0005B733 00000042 -00051078 -00051878 -80006000 00016E0A 00016E0A 00016E0A 00000000 -80006000 00016DDF 00016DDF 00016E0A 0000002B -80006000 00016DDE 00016DDE 00016E0A 0000002C -80006004 000224E4 00016DDE 000224E4 0000B706 -8000601C 000224F0 00016DDE 000224F0 0000B712 -00051878 -00051078 -40000000 40000000 40000000 40000000 -00050078 -00050078 -40006000 002200C8 002200C8 002200C8 00000000 -40006000 00220118 002200C8 00220118 00000050 -40006004 002200BB 002200BB 00220118 0000005D -40006004 00220190 002200BB 00220190 000000D5 -40006008 002200A2 002200A2 00220190 000000EE -4000600C 00220073 00220073 00220190 0000011D -40006010 0028009C 00220073 0028009C 00060029 -40006010 002800AF 00220073 002800AF 0006003C -40006010 002800BC 00220073 002800BC 00060049 -40006014 002800DD 00220073 002800DD 0006006A -4000601C 003E014D 00220073 003E014D 001C00DA -4000601C 003E015F 00220073 003E015F 001C00EC -4000601C 003E0175 00220073 003E0175 001C0102 -4000601C 003E0255 00220073 003E0255 001C01E2 -4000603C 003E025D 00220073 003E025D 001C01EA -400060AC 0022005F 0022005F 003E025D 001C01FE -00050078 -00050878 -40006000 001E010C 001E010C 001E010C 00000000 -40006000 001E0109 001E0109 001E010C 00000003 -40006000 001E00DD 001E00DD 001E010C 0000002F -40006004 002000D4 001E00DD 002000D4 0001FFF7 -40006004 00200103 001E00DD 00200103 00020026 -40006004 00200196 001E00DD 00200196 000200B9 -40006008 001E00AD 001E00AD 00200196 000200E9 -40006010 001E007C 001E007C 00200196 0002011A -4000601C 003C025F 001E007C 003C025F 001E01E3 -40006020 001E0073 001E0073 003C025F 001E01EC -40006020 001E006F 001E006F 003C025F 001E01F0 -4000603C 003C0267 001E006F 003C0267 001E01F8 -40006040 001E0069 001E0069 003C0267 001E01FE -400060B0 001E0066 001E0066 003C0267 001E0201 -400060D0 001E0057 001E0057 003C0267 001E0210 -00050878 -00050078 -40006000 0005B712 0005B712 0005B712 00000000 -40006000 0005B6F3 0005B6F3 0005B712 0000001F -40006000 0005B6F1 0005B6F1 0005B712 00000021 -40006008 0005B716 0005B6F1 0005B716 00000025 -4000600C 0005B71E 0005B6F1 0005B71E 0000002D -40006018 0005B729 0005B6F1 0005B729 00000038 -4000601C 0005B72F 0005B6F1 0005B72F 0000003E -4000605C 0005B730 0005B6F1 0005B730 0000003F -40006078 0005B733 0005B6F1 0005B733 00000042 -00051078 -00051878 -40006000 00016E0A 00016E0A 00016E0A 00000000 -40006000 00016DDE 00016DDE 00016E0A 0000002C -40006004 000224E5 00016DDE 000224E5 0000B707 -4000601C 000224F0 00016DDE 000224F0 0000B712 -4000603C 000224F2 00016DDE 000224F2 0000B714 -00051878 -00051078 -00016DDE 003E025D 003C947F -12345678 +0019F158 +0019F149 +0019F0FE +0019F142 +0019F1C6 +00045C3F +00045C28 +00045C27 +00045C28 +0000004A +00000031 +00000031 +00000031 +00000041 +00000031 +00000031 +00000031 +C0000000 C0000000 C0000000 C0000000 +00050078 +00050078 +C0006000 002200D2 002200D2 002200D2 00000000 +C0006000 002200A6 002200A6 002200D2 0000002C +C0006000 00220145 002200A6 00220145 0000009F +C0006008 00220173 002200A6 00220173 000000CD +C0006010 00280096 002200A6 00280096 0005FFF0 +C0006010 00280104 002200A6 00280104 0006005E +C000601C 003E015C 002200A6 003E015C 001C00B6 +C000601C 003E01AA 002200A6 003E01AA 001C0104 +C000602C 0022009D 0022009D 003E01AA 001C010D +C000603C 003E01BC 0022009D 003E01BC 001C011F +C000603C 003E0211 0022009D 003E0211 001C0174 +C0006060 0022005E 0022005E 003E0211 001C01B3 +C00060FC 003E024D 0022005E 003E024D 001C01EF +00050078 
+00050878 +C0006000 001E0119 001E0119 001E0119 00000000 +C0006000 001E00FB 001E00FB 001E0119 0000001E +C0006000 001E00C0 001E00C0 001E0119 00000059 +C0006004 00200101 001E00C0 00200101 00020041 +C0006008 001E00AD 001E00AD 00200101 00020054 +C000600C 0020015F 001E00AD 0020015F 000200B2 +C0006010 001E00A0 001E00A0 0020015F 000200BF +C0006014 00200177 001E00A0 00200177 000200D7 +C000601C 003C010A 001E00A0 003C010A 001E006A +C000601C 003C01C0 001E00A0 003C01C0 001E0120 +C0006028 001E008D 001E008D 003C01C0 001E0133 +C000603C 003C01EC 001E008D 003C01EC 001E015F +C0006040 001E0065 001E0065 003C01EC 001E0187 +C000605C 003C0252 001E0065 003C0252 001E01ED +C000609C 003C0258 001E0065 003C0258 001E01F3 +C00060B0 001E0064 001E0064 003C0258 001E01F4 +00050878 +00050078 +C0006000 0005B72B 0005B72B 0005B72B 00000000 +C0006000 0005B6F1 0005B6F1 0005B72B 0000003A +C000601C 0005B731 0005B6F1 0005B731 00000040 +C0006058 0005B732 0005B6F1 0005B732 00000041 +C0006078 0005B73B 0005B6F1 0005B73B 0000004A +00051078 +00051878 +C0006000 00016E12 00016E12 00016E12 00000000 +C0006000 00016DDE 00016DDE 00016E12 00000034 +C0006004 000224E4 00016DDE 000224E4 0000B706 +C000601C 000224F0 00016DDE 000224F0 0000B712 +00051878 +00051078 +80000000 80000000 80000000 80000000 +00050078 +00050078 +80006000 002200E1 002200E1 002200E1 00000000 +80006000 002200C5 002200C5 002200E1 0000001C +80006000 002200B8 002200B8 002200E1 00000029 +80006000 002200E7 002200B8 002200E7 0000002F +80006004 002200E9 002200B8 002200E9 00000031 +80006004 002200AE 002200AE 002200E9 0000003B +80006004 0022018A 002200AE 0022018A 000000DC +80006008 00220075 00220075 0022018A 00000115 +8000600C 0022005F 0022005F 0022018A 0000012B +80006010 00280105 0022005F 00280105 000600A6 +8000601C 003E0168 0022005F 003E0168 001C0109 +8000601C 003E01B7 0022005F 003E01B7 001C0158 +8000603C 003E024B 0022005F 003E024B 001C01EC +800060FC 003E025A 0022005F 003E025A 001C01FB +00050078 +00050878 +80006000 001E00B2 001E00B2 001E00B2 00000000 +80006000 001E00CD 001E00B2 001E00CD 0000001B +80006000 001E0158 001E00B2 001E0158 000000A6 +80006004 00200102 001E00B2 00200102 00020050 +80006004 0020010F 001E00B2 0020010F 0002005D +80006004 002001FC 001E00B2 002001FC 0002014A +80006008 001E006F 001E006F 002001FC 0002018D +80006008 001E005C 001E005C 002001FC 000201A0 +8000601C 003C0161 001E005C 003C0161 001E0105 +8000601C 003C0267 001E005C 003C0267 001E020B +8000603C 003C026C 001E005C 003C026C 001E0210 +80006048 001E005B 001E005B 003C026C 001E0211 +00050878 +00050078 +80006000 0005B711 0005B711 0005B711 00000000 +80006000 0005B6F3 0005B6F3 0005B711 0000001E +80006004 0005B721 0005B6F3 0005B721 0000002E +80006018 0005B732 0005B6F3 0005B732 0000003F +80006018 0005B6F1 0005B6F1 0005B732 00000041 +80006058 0005B733 0005B6F1 0005B733 00000042 +00051078 +00051878 +80006000 00016E0A 00016E0A 00016E0A 00000000 +80006000 00016DDF 00016DDF 00016E0A 0000002B +80006000 00016DDE 00016DDE 00016E0A 0000002C +80006004 000224E4 00016DDE 000224E4 0000B706 +8000601C 000224F0 00016DDE 000224F0 0000B712 +00051878 +00051078 +40000000 40000000 40000000 40000000 +00050078 +00050078 +40006000 002200C8 002200C8 002200C8 00000000 +40006000 00220118 002200C8 00220118 00000050 +40006004 002200BB 002200BB 00220118 0000005D +40006004 00220190 002200BB 00220190 000000D5 +40006008 002200A2 002200A2 00220190 000000EE +4000600C 00220073 00220073 00220190 0000011D +40006010 0028009C 00220073 0028009C 00060029 +40006010 002800AF 00220073 002800AF 0006003C +40006010 002800BC 00220073 002800BC 00060049 +40006014 002800DD 
00220073 002800DD 0006006A +4000601C 003E014D 00220073 003E014D 001C00DA +4000601C 003E015F 00220073 003E015F 001C00EC +4000601C 003E0175 00220073 003E0175 001C0102 +4000601C 003E0255 00220073 003E0255 001C01E2 +4000603C 003E025D 00220073 003E025D 001C01EA +400060AC 0022005F 0022005F 003E025D 001C01FE +00050078 +00050878 +40006000 001E010C 001E010C 001E010C 00000000 +40006000 001E0109 001E0109 001E010C 00000003 +40006000 001E00DD 001E00DD 001E010C 0000002F +40006004 002000D4 001E00DD 002000D4 0001FFF7 +40006004 00200103 001E00DD 00200103 00020026 +40006004 00200196 001E00DD 00200196 000200B9 +40006008 001E00AD 001E00AD 00200196 000200E9 +40006010 001E007C 001E007C 00200196 0002011A +4000601C 003C025F 001E007C 003C025F 001E01E3 +40006020 001E0073 001E0073 003C025F 001E01EC +40006020 001E006F 001E006F 003C025F 001E01F0 +4000603C 003C0267 001E006F 003C0267 001E01F8 +40006040 001E0069 001E0069 003C0267 001E01FE +400060B0 001E0066 001E0066 003C0267 001E0201 +400060D0 001E0057 001E0057 003C0267 001E0210 +00050878 +00050078 +40006000 0005B712 0005B712 0005B712 00000000 +40006000 0005B6F3 0005B6F3 0005B712 0000001F +40006000 0005B6F1 0005B6F1 0005B712 00000021 +40006008 0005B716 0005B6F1 0005B716 00000025 +4000600C 0005B71E 0005B6F1 0005B71E 0000002D +40006018 0005B729 0005B6F1 0005B729 00000038 +4000601C 0005B72F 0005B6F1 0005B72F 0000003E +4000605C 0005B730 0005B6F1 0005B730 0000003F +40006078 0005B733 0005B6F1 0005B733 00000042 +00051078 +00051878 +40006000 00016E0A 00016E0A 00016E0A 00000000 +40006000 00016DDE 00016DDE 00016E0A 0000002C +40006004 000224E5 00016DDE 000224E5 0000B707 +4000601C 000224F0 00016DDE 000224F0 0000B712 +4000603C 000224F2 00016DDE 000224F2 0000B714 +00051878 +00051078 +00016DDE 003E025D 003C947F +12345678 + +config.txt contains +DISABLE_L2CACHE=1 + +Nothing major to note + +config.txt contains +arm_freq=250 + +00040046 0062022A 005E01E4 + +At best 2 clocks per loop and worst 49 clocks per loop. 
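+
+Checking that against the totals (0x20000 loops, and at arm_freq=250
+one timer tick is one arm clock):
+
+0x40046 = 262214
+262214 / 0x20000 = 2.0
+
+0x62022A = 6423082
+6423082 / 0x20000 = 49.0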
+ +12345678 12345678 12345678 12345678 12345678 +002F4E30 +002F4E98 +002F4E13 +002F4E27 +002F4E08 +000C3558 +000C3526 +000C3525 +000C3526 +000000A2 +00000075 +00000075 +00000075 +0000008E +00000075 +00000075 +00000075 +C0000000 C0000000 C0000000 C0000000 +00050078 +00050078 +C0006000 003A00E4 003A00E4 003A00E4 00000000 +C0006000 003A011E 003A00E4 003A011E 0000003A +C0006000 003A00DE 003A00DE 003A011E 00000040 +C0006004 003A01A2 003A00DE 003A01A2 000000C4 +C0006004 003A00C1 003A00C1 003A01A2 000000E1 +C000600C 003E012A 003A00C1 003E012A 00040069 +C000601C 00620180 003A00C1 00620180 002800BF +C000601C 00620193 003A00C1 00620193 002800D2 +C0006038 003A00BD 003A00BD 00620193 002800D6 +C000603C 006201D5 003A00BD 006201D5 00280118 +C0006040 003A00BB 003A00BB 006201D5 0028011A +C0006078 003A00B7 003A00B7 006201D5 0028011E +C00060DC 00620209 003A00B7 00620209 00280152 +C00060E4 003A00B5 003A00B5 00620209 00280154 +00050078 +00050878 +C0006000 002E010F 002E010F 002E010F 00000000 +C0006000 002E0151 002E010F 002E0151 00000042 +C0006000 002E00E7 002E00E7 002E0151 0000006A +C0006004 0034013E 002E00E7 0034013E 00060057 +C0006004 00340192 002E00E7 00340192 000600AB +C0006008 002E00E2 002E00E2 00340192 000600B0 +C0006010 002E00E1 002E00E1 00340192 000600B1 +C000601C 005C0144 002E00E1 005C0144 002E0063 +C000601C 005C01BC 002E00E1 005C01BC 002E00DB +C000601C 005C01E3 002E00E1 005C01E3 002E0102 +C0006020 002E00D8 002E00D8 005C01E3 002E010B +C0006020 002E00CE 002E00CE 005C01E3 002E0115 +C0006030 002E00C6 002E00C6 005C01E3 002E011D +C000605C 005C0203 002E00C6 005C0203 002E013D +C0006060 002E00C0 002E00C0 005C0203 002E0143 +C0006078 002E00BF 002E00BF 005C0203 002E0144 +00050878 +00050078 +C0006000 00100072 00100072 00100072 00000000 +C0006000 0010002B 0010002B 00100072 00000047 +C0006000 0010002A 0010002A 00100072 00000048 +C0006018 00100079 0010002A 00100079 0000004F +C0006018 00100029 00100029 00100079 00000050 +C0006038 0010007D 00100029 0010007D 00000054 +C00060B8 0010007F 00100029 0010007F 00000056 +00051078 +00051878 +C0006000 0004008C 0004008C 0004008C 00000000 +C0006000 00040047 00040047 0004008C 00000045 +C0006000 00040046 00040046 0004008C 00000046 +C0006004 0006008C 00040046 0006008C 00020046 +C000601C 0006009C 00040046 0006009C 00020056 +C000609C 0006009D 00040046 0006009D 00020057 +00051878 +00051078 +80000000 80000000 80000000 80000000 +00050078 +00050078 +80006000 003A00F2 003A00F2 003A00F2 00000000 +80006000 003A012B 003A00F2 003A012B 00000039 +80006000 003A00D4 003A00D4 003A012B 00000057 +80006004 003A0130 003A00D4 003A0130 0000005C +80006004 003A00CA 003A00CA 003A0130 00000066 +80006004 003A0147 003A00CA 003A0147 0000007D +80006008 003A00BF 003A00BF 003A0147 00000088 +8000600C 003E010D 003A00BF 003E010D 0004004E +8000600C 003E019B 003A00BF 003E019B 000400DC +80006018 003A00BC 003A00BC 003E019B 000400DF +8000601C 0062017F 003A00BC 0062017F 002800C3 +8000601C 0062022A 003A00BC 0062022A 0028016E +80006038 003A00AE 003A00AE 0062022A 0028017C +00050078 +00050878 +80006000 002E00FB 002E00FB 002E00FB 00000000 +80006000 002E0145 002E00FB 002E0145 0000004A +80006000 002E00EA 002E00EA 002E0145 0000005B +80006004 00340113 002E00EA 00340113 00060029 +80006004 00340132 002E00EA 00340132 00060048 +80006008 002E00E4 002E00E4 00340132 0006004E +80006010 002E00BE 002E00BE 00340132 00060074 +80006014 00340145 002E00BE 00340145 00060087 +8000601C 005C018D 002E00BE 005C018D 002E00CF +8000601C 005C01E4 002E00BE 005C01E4 002E0126 +8000603C 005C0217 002E00BE 005C0217 002E0159 +80006040 002E00BD 002E00BD 005C0217 
002E015A +80006068 002E00BC 002E00BC 005C0217 002E015B +800060DC 005C022A 002E00BC 005C022A 002E016E +00050878 +00050078 +80006000 00100060 00100060 00100060 00000000 +80006000 0010002B 0010002B 00100060 00000035 +80006004 00100061 0010002B 00100061 00000036 +80006008 0010002A 0010002A 00100061 00000037 +8000600C 00100062 0010002A 00100062 00000038 +80006018 00100076 0010002A 00100076 0000004C +8000601C 00100078 0010002A 00100078 0000004E +80006058 0010007D 0010002A 0010007D 00000053 +80006058 00100029 00100029 0010007D 00000054 +80006098 00100080 00100029 00100080 00000057 +00051078 +00051878 +80006000 0004008D 0004008D 0004008D 00000000 +80006000 00040047 00040047 0004008D 00000046 +80006000 00040046 00040046 0004008D 00000047 +80006004 0006008B 00040046 0006008B 00020045 +8000601C 0006009D 00040046 0006009D 00020057 +00051878 +00051078 +40000000 40000000 40000000 40000000 +00050078 +00050078 +40006000 003A0102 003A0102 003A0102 00000000 +40006000 003A0143 003A0102 003A0143 00000041 +40006000 003A00C5 003A00C5 003A0143 0000007E +40006004 003A0168 003A00C5 003A0168 000000A3 +40006008 003A00C1 003A00C1 003A0168 000000A7 +4000600C 003E010F 003A00C1 003E010F 0004004E +4000600C 003E0137 003A00C1 003E0137 00040076 +4000601C 00620118 003A00C1 00620118 00280057 +4000601C 00620199 003A00C1 00620199 002800D8 +4000601C 0062019E 003A00C1 0062019E 002800DD +40006028 003A00B2 003A00B2 0062019E 002800EC +4000603C 00620216 003A00B2 00620216 00280164 +40006078 003A00B1 003A00B1 00620216 00280165 +40006098 003A00AD 003A00AD 00620216 00280169 +00050078 +00050878 +40006000 002E0108 002E0108 002E0108 00000000 +40006000 002E0149 002E0108 002E0149 00000041 +40006000 002E00D7 002E00D7 002E0149 00000072 +40006004 003400EC 002E00D7 003400EC 00060015 +40006004 00340160 002E00D7 00340160 00060089 +4000600C 00340186 002E00D7 00340186 000600AF +40006010 002E00D1 002E00D1 00340186 000600B5 +40006018 002E00C8 002E00C8 00340186 000600BE +4000601C 005C014D 002E00C8 005C014D 002E0085 +4000601C 005C01AA 002E00C8 005C01AA 002E00E2 +4000601C 005C0209 002E00C8 005C0209 002E0141 +4000603C 005C0219 002E00C8 005C0219 002E0151 +400060A0 002E00C7 002E00C7 005C0219 002E0152 +00050878 +00050078 +40006000 00100061 00100061 00100061 00000000 +40006000 0010002B 0010002B 00100061 00000036 +40006008 00100062 0010002B 00100062 00000037 +40006008 0010002A 0010002A 00100062 00000038 +40006018 0010007A 0010002A 0010007A 00000050 +40006078 00100081 0010002A 00100081 00000057 +00051078 +00051878 +40006000 0004008D 0004008D 0004008D 00000000 +40006000 00040047 00040047 0004008D 00000046 +40006000 00040046 00040046 0004008D 00000047 +40006004 0006008B 00040046 0006008B 00020045 +4000601C 0006009D 00040046 0006009D 00020057 +00051878 +00051078 +00040046 0062022A 005E01E4 +12345678 + +So with the same hardware and the same machine code, well arguably +the time reading surrounding the HOP instruction, could vary. But +that is inthe noise, and probably the overhead where we get the +46 in a time like 00040046. Anyway, even with that, and a test +loop of the exact same two instructions in a loop, the large number +of different results we get is fascinating. 
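+
+As a rough illustration of the kind of test that produces these
+tables, here is a sketch. This is not the actual code in this
+directory: GET32, PUT32 and hexstring are used elsewhere in these
+examples, but the copy-the-loop-to-an-address mechanism, the
+0xC0006000 base, the timer address define and the function pointer
+call are assumptions made here just to show the shape of the test,
+and the real thing needs the cache/prefetch flushing discussed at the
+top of this README before executing freshly written instructions.
+
+    #define ARM_TIMER_CNT 0x2000B420 /* free running ARM timer counter */
+
+    extern void PUT32 ( unsigned int, unsigned int );
+    extern unsigned int GET32 ( unsigned int );
+    extern void hexstring ( unsigned int );
+
+    void alignment_test ( void )
+    {
+        unsigned int offset,addr;
+        unsigned int beg,end,delta;
+        unsigned int min=0xFFFFFFFF;
+        unsigned int max=0;
+
+        for(offset=0;offset<0x100;offset+=4)
+        {
+            addr=0xC0006000+offset;
+            PUT32(addr+0,0xE2500001); /* subs r0,r0,#1        */
+            PUT32(addr+4,0x1AFFFFFD); /* bne back to the subs */
+            PUT32(addr+8,0xE12FFF1E); /* bx lr                */
+            /* clean/invalidate caches and prefetch flush here */
+            beg=GET32(ARM_TIMER_CNT);
+            ((void (*)(unsigned int))addr)(0x20000);
+            end=GET32(ARM_TIMER_CNT);
+            delta=end-beg;
+            if((delta<min)||(delta>max))
+            {
+                if(delta<min) min=delta;
+                if(delta>max) max=delta;
+                hexstring(addr);
+                hexstring(delta);
+                hexstring(min);
+                hexstring(max);
+                hexstring(max-min);
+            }
+        }
+    }
+
+Only when min or max changes does anything get printed, which is why
+the tables are short compared to the number of addresses tried.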
+ +Think about that and then think about compiler variations for +the same source code: + +extern unsigned int more_fun ( unsigned int, unsigned int ); +unsigned int fun ( unsigned int a, unsigned int b ) +{ + return(more_fun(a+1,b+2)+3); +} + +this + +00000000 : + 0: e92d4800 push {fp, lr} + 4: e28db004 add fp, sp, #4 + 8: e24dd008 sub sp, sp, #8 + c: e50b0008 str r0, [fp, #-8] + 10: e50b100c str r1, [fp, #-12] + 14: e51b3008 ldr r3, [fp, #-8] + 18: e2832001 add r2, r3, #1 + 1c: e51b300c ldr r3, [fp, #-12] + 20: e2833002 add r3, r3, #2 + 24: e1a01003 mov r1, r3 + 28: e1a00002 mov r0, r2 + 2c: ebfffffe bl 0 + 30: e1a03000 mov r3, r0 + 34: e2833003 add r3, r3, #3 + 38: e1a00003 mov r0, r3 + 3c: e24bd004 sub sp, fp, #4 + 40: e8bd4800 pop {fp, lr} + 44: e12fff1e bx lr + +or this + +00000000 : + 0: e92d4010 push {r4, lr} + 4: e2811002 add r1, r1, #2 + 8: e2800001 add r0, r0, #1 + c: ebfffffe bl 0 + 10: e8bd4010 pop {r4, lr} + 14: e2800003 add r0, r0, #3 + 18: e12fff1e bx lr + +or this + + 00000000 : + 0: e92d4010 push {r4, lr} + 4: e2811002 add r1, r1, #2 + 8: e2800001 add r0, r0, #1 + c: ebfffffe bl 0 + 10: e2800003 add r0, r0, #3 + 14: e8bd8010 pop {r4, pc} + +or this + +00000000 : + 0: b510 push {r4, lr} + 2: 3102 adds r1, #2 + 4: 3001 adds r0, #1 + 6: f7ff fffe bl 0 + a: 3003 adds r0, #3 + c: bc10 pop {r4} + e: bc02 pop {r1} + 10: 4708 bx r1 + 12: 46c0 nop ; (mov r8, r8) + +or this + + 00000000 : + 0: b510 push {r4, lr} + 2: 3102 adds r1, #2 + 4: 3001 adds r0, #1 + 6: f7ff fffe bl 0 + a: 3003 adds r0, #3 + c: bd10 pop {r4, pc} + +or this using a different compiler + +00000000 : + 0: e92d4800 push {fp, lr} + 4: e1a0b00d mov fp, sp + 8: e2800001 add r0, r0, #1 + c: e2811002 add r1, r1, #2 + 10: ebfffffe bl 0 + 14: e2800003 add r0, r0, #3 + 18: e8bd4800 pop {fp, lr} + 1c: e1a0f00e mov pc, lr + +or this + +00000000 : + 0: e92d4800 push {fp, lr} + 4: e1a0b00d mov fp, sp + 8: e2800001 add r0, r0, #1 + c: e2811002 add r1, r1, #2 + 10: ebfffffe bl 0 + 14: e2800003 add r0, r0, #3 + 18: e8bd8800 pop {fp, pc} + +So we saw how vastly different execution times could be for the +same two instructions in machine code. Now take essentially one +line of C and look at how many machine code variations came from +that, and try to ponder how many different execution times we could +come up with those variations. Then ponder how it is possible to +actually come up with a benchmark, not only for one machine with +one test written in C or higher, but when comparing machines to each +other. Even the same binary used across them depending on the system +settings or cache sizes or speeds or ram or motherboard nuances. + +Ill just leave it at that.