See the top level README file for more information on documentation
and how to run these programs.

Demonstrating the performance differences of a two instruction loop.
Same machine code, but where you put it, with and without cache
and branch prediction, makes a vast difference in performance.

.globl ASMDELAY
ASMDELAY:
subs r0,r0,#1
bne ASMDELAY
bx lr

The two instructions in the loop are the subs and bne, so this is not
even a difference in compilers or options. The same two instructions,
131 thousand (0x20000) times in a loop.

Here is the punch line:

min      max      difference
00016DDE 003E025D 003C947F

Yes! The minimum is 0.71 timer ticks per loop on average, less than one
tick per instruction! How is that possible?

And the worst case I could get was 43 times slower! How could those
two instructions on the same chip/board execute at such vastly
different speeds? Do you really want to know just how bogus benchmarks
really are? This is only a small taste; apply these simple things to
any benchmark, then add compiler differences on the same source code.
Many folks don't realize that the same source code can execute several
times faster or slower simply by changing compiler options. Likewise
two different compilers, or two versions of the same compiler (or, in
the case of source distributions like gcc or llvm, simply building the
compiler differently can change what it outputs, even with identical
command line options), can/will/do produce different results.

Simple alignment tricks, like adding or removing a single instruction
in the right place, can/will move the whole binary up or down in
memory, changing where it falls in what I call fetch lines and in
cache lines (two separate but similar terms).

I have performed this stunt many times many ways, and there are things
that can be done to further widen the performance gap. Adding some
magic number of nops between the subs and bne should help with branch
prediction, saving time, and on the worse side cost more fetches per
loop. Not going to do that today; these two instructions are enough.

This time around I am using self modifying code; traditionally I would
re-assemble with more or fewer nops in front of the loop under test to
adjust its alignment.

Using the disassembly of the loop in start.s:

0000802c <ASMDELAY>:
    802c: e2500001  subs r0, r0, #1
    8030: 1afffffd  bne 802c <ASMDELAY>
    8034: e12fff1e  bx lr

We can see the raw instructions. The conditional branch is pc relative,
not absolute, basically position independent, so the loop can be copied
and used as is.

PUT32(ra+0x00,0xe2500001);
PUT32(ra+0x04,0x1afffffd);
PUT32(ra+0x08,0xe12fff1e);

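The placement arithmetic of those PUT32 calls can be modeled on the host. A minimal sketch, assuming PUT32 is just a 32 bit store; `ram`, `put32` and `place_loop` are illustrative names of mine, not part of the real program, and on the target you would still need the cache/prefetch maintenance described below before branching to the copied code.

```c
#include <stdint.h>
#include <assert.h>

/* Host-side model: a flat word array stands in for the target's ram.
   On the real target PUT32 is a store to a physical address. */
static uint32_t ram[0x4000];

static void put32(uint32_t addr, uint32_t data)
{
    ram[addr >> 2] = data;  /* word aligned backing store */
}

/* Place the three instruction loop at byte address ra, the same three
   stores shown above. The bne is pc relative so the encoding does not
   change with ra. */
static void place_loop(uint32_t ra)
{
    put32(ra + 0x00, 0xe2500001);  /* subs r0,r0,#1        */
    put32(ra + 0x04, 0x1afffffd);  /* bne back two words   */
    put32(ra + 0x08, 0xe12fff1e);  /* bx lr                */
}
```

Because the branch encoding is relative, `place_loop` can drop the same three words at any word aligned address under test.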
I learned something new on this one: another ARM was doing fine, but
the raspberry pi (zero) was hanging with branch prediction enabled. I
didn't know there was a prefetch flush you needed to do. I went way
overboard and used flushes and dmbs and dsbs liberally, needed or not.
The prefetch flush made it so that the pi worked.

---

Cache. For what we care about here, a cache is a relatively small
amount of memory that is faster than the main memory. Being smaller it
can only hold some things. Ideally it holds the things you are using
more than once, or, since programs tend to do things linearly, the
things you are about to use: programs run instructions sequentially at
least for a little while before needing to do a branch, and when we
read data, parsing strings, etc, we often (enough) read memory in
order for at least a little while. So the cache has tables (tags) used
to know what is in the cache; read transactions marked as cacheable
are compared against those tags to see if the answer is in the cache,
and if so the processor does not have to wait as long, as cache is
faster than main memory. If there is a cache miss, meaning the item is
not in the cache, then the cache will do a read, but it does not
necessarily read just the item you want; it reads the amount of memory
needed to fill a "cache line". A cache line is an aligned amount of
data, often larger than a normal sized access, the idea being as
above: if you are executing code you often have linear chunks, and if
you are reading data into the processor you often have linear chunks.
So if you were to read two things back to back that are in the same
cache line, the first one, if there is a miss, is pretty slow, as the
whole line has to be read in. The line read is not grossly inefficient
with respect to a read from main memory, probably slower than a
smaller sized read, but probably faster than multiple separate reads
gathering the same amount. So the first item read in a line is slow
but the second is significantly faster, so even if you only read two
things you might be faster than if you had no cache.

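The straddling effect can be put into numbers. A minimal sketch, assuming a 32 byte line size purely for illustration (the real line size depends on the core):

```c
#include <stdint.h>
#include <assert.h>

/* Number of cache lines an access of the given byte size touches.
   The 8 byte two instruction loop fits in one line at most addresses,
   but needs two lines when it straddles a line boundary. */
static unsigned lines_touched(uint32_t addr, uint32_t bytes, uint32_t line)
{
    uint32_t first = addr / line;               /* line holding first byte */
    uint32_t last  = (addr + bytes - 1) / line; /* line holding last byte  */
    return (unsigned)(last - first + 1);
}
```

With a 32 byte line, an 8 byte loop at ...6000 touches one line, while the same loop at ...601C (4 bytes before a boundary) touches two, so a miss there costs two line fills.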
This example is not doing anything with data, not anything that
matters as far as the performance test. As shown above there is a two
instruction loop; these are instructions, and instructions, when the
(instruction) cache is enabled, will be marked as cacheable when
fetched. So the first interesting thing we see is one of these two
loops.

invalidate_l1cache();
for(ra=0;ra<4;ra++)
{
    beg=GET32(ARM_TIMER_CNT);
    ASMDELAY(10);
    end=GET32(ARM_TIMER_CNT);
    hexstring(end-beg);
}

The invalidate basically erases the cache in the sense that it forgets
all the tags. ASMDELAY runs its loop 10 times, so the first time the
instructions are fetched they come from main memory; the remaining 9
times ideally come from cache. The outer loop runs 4 times without an
invalidate, so in those passes all 10 ASMDELAY loops are ideally
cached. Assuming that is what happens, the results:

0000004A
00000031
00000031
00000031

00000041
00000031
00000031
00000031

0x31 = 49
49 / 10 = 4.9

we are averaging 4.9 timer ticks per loop for the cached passes.

0x4A = 74

4.9 * 9 = 44
74 - 44 = 30

So based on those assumptions, that first time through the loop took
30 ticks.

0x41 = 65
65 - 44 = 21

For some reason the second time around the first pass is faster. This
could be as simple as dram accesses not being deterministic; also we
are sharing the dram with the GPU, so maybe there was contention for
that resource and one access took longer.

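The arithmetic above generalizes; a small sketch (the helper names are mine, not from the program):

```c
#include <assert.h>
#include <math.h>

/* Average ticks per loop for a fully cached run. */
static double avg_per_loop(unsigned total_ticks, unsigned loops)
{
    return (double)total_ticks / (double)loops;
}

/* Cost of the first, uncached pass: the measured total minus
   (loops - 1) passes at the cached average. */
static double first_pass_ticks(unsigned total_ticks, unsigned loops,
                               double cached_avg)
{
    return (double)total_ticks - cached_avg * (double)(loops - 1);
}
```

Feeding in the numbers above, 0x31 over 10 passes gives the 4.9 average, and 0x4A and 0x41 give first-pass costs of about 30 and 21 ticks.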
00045C3F
00045C28
00045C27
00045C28

Note so far we are talking about the L1 cache inside the ARM core. We
see here, with 0x20000 loops, that the first pass appears to be a
little longer than the rest.

That is 2.18 ticks per loop on average for the latter passes. If you
work the math, that first pass first instruction fetch was 25.18
ticks, which is on par with the 10 loop experiments; the effect is
just much more dramatic with fewer loops, which matters depending on
what you are doing. Timing a hundred thousand times through this loop
is done to get an average; doing it multiple times hopes to erase the
first fetch, or you do a million or a billion loops so the first loop
time gets swamped by the average. But if you want to use this code as
a timed loop of, say, a few times through, it is important to know
what the best and worst times are for whatever you are doing. If you
are bit banging i2c or spi or whatever, and you cannot go faster than
some time period, you need to determine the best possible loop time
and use that as the tuning value, because for that bit banging you can
usually go slower, up to several times slower, but cannot go faster
even once.

Our understanding from the Broadcom ARM manual for this part (the only
public one we have, for the original pi processor, which is the same
one in the pi-zero) is that the ARM address space above 0xC0000000 is
uncached. There is a cache outside the ARM but in front of the dram;
in theory that cache is shared between us and the GPU, but who knows?
Like any other cache, especially one like this that likely does not
distinguish ARM instruction fetches from data reads, it should be
caching our instruction reads all the time. And I don't know how to
invalidate it.

So these initial loops:

0019F158
0019F149
0019F0FE
0019F142
0019F1C6

One would have expected the first to be slower than the others;
perhaps code that preceded this caused the cache to fill. We can
create experiments to get a feel for the 0xC0000000 space being
uncached, assuming the 0x00000000 arm space we are using is cached. It
is pretty easy to write a small program that writes to some offset in
our memory around 0x00000000, say 0x00001000 for example, then reads
0x40001000 and 0x80001000 and 0xC0001000. You will see the same value
you wrote to 0x00001000, demonstrating that, at least as far as the
ARM address space goes, it does wrap around.

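A sketch of that aliasing arithmetic; the mask is my reading of the scheme, where the top two address bits select the cache behavior and the low 30 bits select the ram location:

```c
#include <stdint.h>
#include <assert.h>

/* Strip the alias-select bits: 0x00001000, 0x40001000, 0x80001000 and
   0xC0001000 all name the same ram location on this part; only the
   caching behavior differs. */
static uint32_t ram_offset(uint32_t addr)
{
    return addr & 0x3FFFFFFF;
}
```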
Note this is using the ARM TIMER, which blinker03 shows is 250MHz
based, while the ARM is in theory going 1000MHz. So there are four
processor clocks per timer tick.

So based on what we saw so far we would assume that, once in the
instruction cache, we always get the same performance, yes? Well then
why does this happen (and why did I do this test)?

C0006000 0005B72B 0005B72B 0005B72B 00000000
C0006000 0005B6F1 0005B6F1 0005B72B 0000003A
C000601C 0005B731 0005B6F1 0005B731 00000040
C0006058 0005B732 0005B6F1 0005B732 00000041
C0006078 0005B73B 0005B6F1 0005B73B 0000004A

What this is telling us, for at least the range I tried, is that with
the instructions most likely in cache our loop time still varies
between 0005B6F1 and 0005B73B, a difference of 0000004A ticks. That is
not a lot, but run this test again and again and you will see these
strange boundaries where the timing changes. How is this possible? It
is only two instructions; the only thing, in theory, that is changing
is what addresses they live at.

Well think about this: this is a pipelined processor, and a pipeline
is basically an assembly line. Instead of one employee or set of
employees putting together a product like a car in one place, with all
the tools and parts having to weave around each other to get to the
location where the car is, you move the car from station to station.
Each station performs one or a few relatively simple tasks: putting
the tires on, mounting the doors, etc. The tools for that station and
no others are in that station, and the supplies for that station are
fed to it faster, on average, than the assembly line is moving. Any
one car may not be built much faster than when it sat in one place,
but you can AVERAGE significantly more cars over time. It may take an
hour to build one car from beginning to end, but the factory may pump
out a new car every so many seconds. A processor pipeline is similar:
the steps are broken out and performed per clock, so that for linear
code the average is much faster than only operating on one instruction
at a time.

Processors like this do not have the old fashioned bus of the 8088/86,
for example, where you sent out the address, and the data if it was a
write, asserted a write signal or a read signal and some enables, and
the memory responded the next clock cycle; the whole system ran at a
speed that did not exceed what the processor or the SRAM could do. At
some point came the thought of adding wait states: you could add
slower ram or peripherals that couldn't keep up all the time but still
try to keep running, so some sort of wait scheme was added to allow a
peripheral to say please wait. What we use now with the AMBA/AXI/AHB
busses on ARMs is a whole different strategy; it takes a few clock
cycles even for the simplest thing (the L1 cache is buried in the core
and doesn't necessarily need as many clocks as the edge of the core).
The AXI bus will say: I would like to do a read, it is an instruction
fetch, here is the address, here is how much data I want, and here is
a transaction id. The ARM has the ability to keep multiple
transactions in flight; it might perform a data read generated by code
doing a data read, then the next cycle start an instruction fetch.
Eventually the memory or peripherals respond, that feeds back into the
AXI bus, and the tag associated with the transaction is put on the
return bus along with the data. I mentioned you specify the size. The
bus might be 64 bits wide or 32 bits wide, and the size is likely in
units of 32 bits for a processor like this, so in theory you can do a
1 word read, a 2 word read, a 3 word read, etc, on up to probably a
number like 8 words per read. If you have a 64 bit bus, and depending
on how it is designed (often it is based on 64 bit width alignments),
a two word read and a one word read might take the same amount of
time. But two separate one word reads should take longer than a
single two word read, aligned or not. Busses like this, once they have
the data ready, deliver it every clock cycle. So for an 8 word read
there are the opening clock cycles to ask for the transaction, then
some time passes as the data is located and/or gathered, then when it
starts coming back it takes 4 clock cycles, assuming aligned on a 64
bit bus. Had it been 6 words, 64 bit aligned, then the difference
between 6 and 8 is ideally one clock cycle. But three or four 2 word
transactions should take longer, as you pay the up front transaction
handshake each time.

So why bother to go through all that? Well, the pipeline only works if
we can keep it fed with instructions. The pipeline is some depth which
can change from one core design/architecture to another and may or may
not be documented outside ARM. One would expect the logic to fetch
enough instructions to feed the pipe, and one would expect fetches to
be transactions of multiple instructions, say for example 4 words per
fetch, probably on 4 word aligned boundaries. So say we branch to
address 0x1000, and let's pretend there is a 6 deep pipeline. One
would expect the logic to bang out two 4 word instruction fetches, one
at address 0x1000 and one at address 0x1010. As those instructions
roll in it starts to feed the pipe; there would also need to be some
storage to hold those 8 instructions as they land, a cache or prefetch
buffer or whatever you want to call it. Once we get through either 4
or maybe 6 of those instructions, none of them so far being branches,
one would expect the logic to do another 4 word fetch to keep the
prefetch buffer or pipe full. Once there is a branch it starts all
over: one or two immediate fetches to start to fill the pipe up again.
One would expect that even with two instructions in a loop, the logic
would still need to perform those fetches every loop. But since all of
this happens inside the chip we can't see it; without all the legal
stuff to gain access to an ARM core and the tools to simulate it, we
are not going to know for sure what is going on. Stealing from the
term cache line, I like to call these fetches fetch lines. Just like
the situation where even a two instruction loop lands on the last and
first words of a cache line, so that two cache lines must be read to
cover those two instructions where earlier in the first cache line
only one read is needed, we should be able to see situations where we
hit that two cache line boundary, and likewise we should be able to
see the effects of fetch lines and where the branches land, sometimes
needing to fetch an extra fetch line.

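Under the guessed parameters above (4 word, 16 byte aligned fetch lines, an assumption just as in the text), the fetch cost per loop iteration can be sketched the same way as the cache line math:

```c
#include <stdint.h>
#include <assert.h>

/* Fetch lines needed to cover the 8 byte loop each time the bne is
   taken, assuming 16 byte aligned fetch lines. */
static unsigned fetch_lines_per_iteration(uint32_t loop_addr)
{
    uint32_t first = loop_addr / 16;           /* line holding the subs */
    uint32_t last  = (loop_addr + 8 - 1) / 16; /* line holding the bne  */
    return (unsigned)(last - first + 1);
}
```

This predicts one fetch per iteration for a loop at ...6000 or ...6008 but two for ...600C, the kind of boundary where the dumps here show the timing change.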
So even with the cache enabled and filled, something is happening:
when we branch to address 0xC000601C extra fetch transactions are
needed, and likewise there is sensitivity at the other addresses. I
wouldn't get worked up over a one timer tick difference necessarily;
that could be due to the non-deterministic nature of using something
like dram and sharing resources with another processor, where every so
often we may have to wait longer for something.

C0006000 0005B72B 0005B72B 0005B72B 00000000
C0006000 0005B6F1 0005B6F1 0005B72B 0000003A
C000601C 0005B731 0005B6F1 0005B731 00000040
C0006058 0005B732 0005B6F1 0005B732 00000041
C0006078 0005B73B 0005B6F1 0005B73B 0000004A

We see from the first pass at 0x6000 to the second that our time got
faster; that is likely the filling of the cache. After that point we
never get faster, as we saw up above with the early tests. The code
only triggers a print if min or max changes, so the rest of these
output lines are due to the max getting bigger.

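That print-on-change behavior can be sketched like this; the names and the print counter are mine, the real code prints the address, time, min and max with hexstring:

```c
#include <stdint.h>
#include <assert.h>

static uint32_t tmin = 0xFFFFFFFF; /* best loop time seen so far  */
static uint32_t tmax = 0x00000000; /* worst loop time seen so far */
static unsigned prints = 0;        /* stands in for hexstring output */

/* Record one timed run; emit a line only when min or max moves. */
static void record(uint32_t ticks)
{
    unsigned changed = 0;
    if (ticks < tmin) { tmin = ticks; changed = 1; }
    if (ticks > tmax) { tmax = ticks; changed = 1; }
    if (changed) prints++;
}
```

A run of identical times after the first produces no output at all, which is why the dumps are short even though every alignment is tested.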
Branch prediction. Think about that processor pipeline: each step can
do some stuff, but doesn't do everything, otherwise what is the point?
So even for our simple loop we have a subtraction and then a branch
that relies on the result of that subtraction. And that branch, if
taken, "flushes" the pipe, meaning we just toss those instructions,
but it takes time to first fetch the new instructions at the branch
destination and then serially feed the pipe (assuming a serial
pipeline, read on). All that time the processing part of the
processor, the assembly line, is idle until instructions start moving
in and moving from one stage in the pipe to another. Branch prediction
is looking at instructions that have not yet reached the execution
stage to see if they are branches, and to see if we can determine
whether they are going to happen.

Say we have a 5 stage pipeline A,B,C,D,E where A is where instructions
enter the pipe and E is the last step, when we are finished with them.
And let's say D is where we would normally figure out this is a branch
and then act on it. If we look at stage A and see that it is a branch,
even better an unconditional branch, and we also see by looking at the
instructions in B and C and D that they are not unconditional branches
(they might be branches, but not unconditional), then we might want to
start an instruction fetch for the branch destination as the branch
moves into B, saving us two clock cycles on starting that fetch. We
could also have a design that, during A, starts a fetch for any
branch, unconditional or conditional. There would be a lot of unused
fetch bandwidth going on, but depending on our cache to processor
performance and our main ram to cache performance, we might end up
going faster overall; in our little test case that branch eventually
happens every loop, so fetching every branch we see would put that
code in the cache much earlier. It is likely that the logic is not
going to start fetches for every single possible branch that might
happen; the logic is going to want more complication to avoid too many
fetches. If we save a clock or a few here and there, and do not cost
more clocks than we save, then it is a win, so what if we can't
accurately predict everything.

So using our A,B,C,D,E model above: if we see that A is a conditional
branch that relies on flags, and our logic is smart enough to see that
B and C do not have instructions that affect flags, but D has one that
does, then it is possible that as D completes and the conditional
branch in A moves into the B stage, we could know at that time if the
branch is going to happen, and if so we can start fetching the branch
destination, assuming we can determine the branch destination, which
depends on the instruction set. It could be an unconditional bx r1,
but the instruction in B is a load of r1, so we can't figure out where
to branch until we finish that load or move or whatever.

So what if we were to start adding nops in our loop:

ASMDELAY:
subs r0,r0,#1
nop
...
nop
bne ASMDELAY
bx lr

Eventually we would have so many nops that the pipeline is full of
them, and the thing that determines the branch and the thing that does
the branch are not in the pipe at the same time. But with nops we can
at least hope/ensure that once the pipe is full of these nops and the
branch comes in, when that branch reaches the magic point in the pipe
everything in front of it is a nop, so the branch predictor should
have everything it needs to fetch early. Now add to this the herky
jerky fetching due to fetch lines and cache lines.

The first batch below is with cache on but with branch prediction
disabled.

C0006000 0005B72B 0005B72B 0005B72B 00000000
C0006000 0005B6F1 0005B6F1 0005B72B 0000003A
C000601C 0005B731 0005B6F1 0005B731 00000040
C0006058 0005B732 0005B6F1 0005B732 00000041
C0006078 0005B73B 0005B6F1 0005B73B 0000004A
00051078
00051878

This batch is with branch prediction enabled.

C0006000 00016E12 00016E12 00016E12 00000000
C0006000 00016DDE 00016DDE 00016E12 00000034
C0006004 000224E4 00016DDE 000224E4 0000B706
C000601C 000224F0 00016DDE 000224F0 0000B712

Much faster, much faster than expected.

And yes, if you are doing the math, we are well within the realm of it
taking fewer clocks than we have instructions per loop, so in theory
we are executing two instructions in less time than it takes to
execute one. This processor is super scalar, meaning it has multiple
execution units; the pipeline has forks in it. The instructions coming
in the front door are examined and sorted into separate lines. As with
branch prediction this is not perfect, but the idea is to try to sort
out instructions that don't have to happen in a certain order. For
example, suppose we were to throw in a useful instruction, but one
that doesn't affect our loop:

ASMDELAY:
subs r0,r0,#1
add r3,r3,#3
bne ASMDELAY
bx lr

Ideally the logic will determine that the subs modifies flags that the
bne needs, so the bne must wait for the subs to complete far enough to
start to execute. The add is not using the result of the subs, nor is
it affecting the bne, so ideally it gets sorted out into a separate
execution pipe where it can possibly execute at the same time as the
subs, or maybe even before it in a more complicated loop. Pipeline
implementations are also deep in the processor, something that likely
changes or improves from one architecture to another as years go by
and new designs come out (ARMv4 to ARMv5 to ARMv6 and so on). It may
be that every instruction is dealt out like cards to different
execution pipes, but with tags of some sort associated with them so
that the execution pipes can talk to each other to say "you can't do
that one until I am finished", while pipes that don't have that
baggage can push their instruction through as fast as they can. So in
a super scalar I would expect to be able to insert that add in there
and not see a performance hit other than the cost of the extra fetch
clock cycles. But what if I were to instead insert:

ASMDELAY:
subs r0,r0,#1
and r0,r0,#0xFF
bne ASMDELAY
bx lr

In the add r3 case the processor cannot figure out that I am never
going to use r3, so it still has to do that add, but the add depends
on nothing else in the loop. Here the and uses the result of the
subtract, so it has to wait for the subtract, and the next subtract
has to wait for the and, serializing the loop. Now obviously this loop
cannot count down from more than 255, so there are not enough counts
for our experiments, but it demonstrates the relationships that a
super scalar processor looks for. Like branch prediction, it is not
expected in any way to be perfect, but if you can sometimes save one
or a few clocks here and there, those clocks will add up.

I did not do this here, but you could also do some performance tests
by adding that bunch of nops:

ASMDELAY:
subs r0,r0,#1
nop
...
nop
bne ASMDELAY
bx lr

By pushing the difference between fetch performance and execution
performance, you can also see if there are any herky jerky motions
related to fetching and how the prefetch feeds the pipe, etc.

Without actually seeing (in simulation) how the processor works per
clock, we can only guess at what is going on by performing experiments
like this.

So I think we can see in this example the L1 caching: the first pass
through the loop having to fetch from main memory, which is dram so
pretty slow, and then the rest of the loops fetching from the L1
cache, which is the fastest/closest memory we have to the processor
core. Even with the code in cache we can see differences based on the
alignment of the loop, and we can see differences with branch
prediction on.

00016DDE 003E025D 003C947F

The fastest 0x20000 count loop was 00016DDE, or 0.71 timer ticks per
loop on average. And the worst was 31 ticks per loop on average.

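A quick check of those numbers (and of the arm_freq=250 run further down); the helper name is mine:

```c
#include <stdint.h>
#include <assert.h>
#include <math.h>

/* Timer ticks per pass for a counted loop. At the default clocks a
   tick is 4 cpu clocks; with arm_freq=250 a tick is one cpu clock. */
static double ticks_per_loop(uint32_t total_ticks, uint32_t loops)
{
    return (double)total_ticks / (double)loops;
}
```

0x16DDE ticks over 0x20000 loops is about 0.71 and 0x3E025D about 31, while the arm_freq=250 run's 0x40046 is about 2 and 0x62022A about 49.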
The first dump below is based on having no config.txt; again this is a
raspberry pi zero.

Then config.txt contains

DISABLE_L2CACHE=1

Some subtle changes, but not as much to note.

Now changing the arm frequency to 250MHz is quite useful, as the timer
we are using and the arm clock are then in theory the same, not
necessarily in phase or anything, but both 250MHz, so we no longer
have four processor clocks per timer tick.

So the dump after this one is with the reduced arm clock; see comments
there.

12345678 12345678 12345678 12345678 12345678
0019F158
0019F149
0019F0FE
0019F142
0019F1C6
00045C3F
00045C28
00045C27
00045C28
0000004A
00000031
00000031
00000031
00000041
00000031
00000031
00000031
C0000000 C0000000 C0000000 C0000000
00050078
00050078
C0006000 002200D2 002200D2 002200D2 00000000
C0006000 002200A6 002200A6 002200D2 0000002C
C0006000 00220145 002200A6 00220145 0000009F
C0006008 00220173 002200A6 00220173 000000CD
C0006010 00280096 002200A6 00280096 0005FFF0
C0006010 00280104 002200A6 00280104 0006005E
C000601C 003E015C 002200A6 003E015C 001C00B6
C000601C 003E01AA 002200A6 003E01AA 001C0104
C000602C 0022009D 0022009D 003E01AA 001C010D
C000603C 003E01BC 0022009D 003E01BC 001C011F
C000603C 003E0211 0022009D 003E0211 001C0174
C0006060 0022005E 0022005E 003E0211 001C01B3
C00060FC 003E024D 0022005E 003E024D 001C01EF
00050078
00050878
C0006000 001E0119 001E0119 001E0119 00000000
C0006000 001E00FB 001E00FB 001E0119 0000001E
C0006000 001E00C0 001E00C0 001E0119 00000059
C0006004 00200101 001E00C0 00200101 00020041
C0006008 001E00AD 001E00AD 00200101 00020054
C000600C 0020015F 001E00AD 0020015F 000200B2
C0006010 001E00A0 001E00A0 0020015F 000200BF
C0006014 00200177 001E00A0 00200177 000200D7
C000601C 003C010A 001E00A0 003C010A 001E006A
C000601C 003C01C0 001E00A0 003C01C0 001E0120
C0006028 001E008D 001E008D 003C01C0 001E0133
C000603C 003C01EC 001E008D 003C01EC 001E015F
C0006040 001E0065 001E0065 003C01EC 001E0187
C000605C 003C0252 001E0065 003C0252 001E01ED
C000609C 003C0258 001E0065 003C0258 001E01F3
C00060B0 001E0064 001E0064 003C0258 001E01F4
00050878
00050078
C0006000 0005B72B 0005B72B 0005B72B 00000000
C0006000 0005B6F1 0005B6F1 0005B72B 0000003A
C000601C 0005B731 0005B6F1 0005B731 00000040
C0006058 0005B732 0005B6F1 0005B732 00000041
C0006078 0005B73B 0005B6F1 0005B73B 0000004A
00051078
00051878
C0006000 00016E12 00016E12 00016E12 00000000
C0006000 00016DDE 00016DDE 00016E12 00000034
C0006004 000224E4 00016DDE 000224E4 0000B706
C000601C 000224F0 00016DDE 000224F0 0000B712
00051878
00051078
80000000 80000000 80000000 80000000
00050078
00050078
80006000 002200E1 002200E1 002200E1 00000000
80006000 002200C5 002200C5 002200E1 0000001C
80006000 002200B8 002200B8 002200E1 00000029
80006000 002200E7 002200B8 002200E7 0000002F
80006004 002200E9 002200B8 002200E9 00000031
80006004 002200AE 002200AE 002200E9 0000003B
80006004 0022018A 002200AE 0022018A 000000DC
80006008 00220075 00220075 0022018A 00000115
8000600C 0022005F 0022005F 0022018A 0000012B
80006010 00280105 0022005F 00280105 000600A6
8000601C 003E0168 0022005F 003E0168 001C0109
8000601C 003E01B7 0022005F 003E01B7 001C0158
8000603C 003E024B 0022005F 003E024B 001C01EC
800060FC 003E025A 0022005F 003E025A 001C01FB
00050078
00050878
80006000 001E00B2 001E00B2 001E00B2 00000000
80006000 001E00CD 001E00B2 001E00CD 0000001B
80006000 001E0158 001E00B2 001E0158 000000A6
80006004 00200102 001E00B2 00200102 00020050
80006004 0020010F 001E00B2 0020010F 0002005D
80006004 002001FC 001E00B2 002001FC 0002014A
80006008 001E006F 001E006F 002001FC 0002018D
80006008 001E005C 001E005C 002001FC 000201A0
8000601C 003C0161 001E005C 003C0161 001E0105
8000601C 003C0267 001E005C 003C0267 001E020B
8000603C 003C026C 001E005C 003C026C 001E0210
80006048 001E005B 001E005B 003C026C 001E0211
00050878
00050078
80006000 0005B711 0005B711 0005B711 00000000
80006000 0005B6F3 0005B6F3 0005B711 0000001E
80006004 0005B721 0005B6F3 0005B721 0000002E
80006018 0005B732 0005B6F3 0005B732 0000003F
80006018 0005B6F1 0005B6F1 0005B732 00000041
80006058 0005B733 0005B6F1 0005B733 00000042
00051078
00051878
80006000 00016E0A 00016E0A 00016E0A 00000000
80006000 00016DDF 00016DDF 00016E0A 0000002B
80006000 00016DDE 00016DDE 00016E0A 0000002C
80006004 000224E4 00016DDE 000224E4 0000B706
8000601C 000224F0 00016DDE 000224F0 0000B712
00051878
00051078
40000000 40000000 40000000 40000000
00050078
00050078
40006000 002200C8 002200C8 002200C8 00000000
40006000 00220118 002200C8 00220118 00000050
40006004 002200BB 002200BB 00220118 0000005D
40006004 00220190 002200BB 00220190 000000D5
40006008 002200A2 002200A2 00220190 000000EE
4000600C 00220073 00220073 00220190 0000011D
40006010 0028009C 00220073 0028009C 00060029
40006010 002800AF 00220073 002800AF 0006003C
40006010 002800BC 00220073 002800BC 00060049
40006014 002800DD 00220073 002800DD 0006006A
4000601C 003E014D 00220073 003E014D 001C00DA
4000601C 003E015F 00220073 003E015F 001C00EC
4000601C 003E0175 00220073 003E0175 001C0102
4000601C 003E0255 00220073 003E0255 001C01E2
4000603C 003E025D 00220073 003E025D 001C01EA
400060AC 0022005F 0022005F 003E025D 001C01FE
00050078
00050878
40006000 001E010C 001E010C 001E010C 00000000
40006000 001E0109 001E0109 001E010C 00000003
40006000 001E00DD 001E00DD 001E010C 0000002F
40006004 002000D4 001E00DD 002000D4 0001FFF7
40006004 00200103 001E00DD 00200103 00020026
40006004 00200196 001E00DD 00200196 000200B9
40006008 001E00AD 001E00AD 00200196 000200E9
40006010 001E007C 001E007C 00200196 0002011A
4000601C 003C025F 001E007C 003C025F 001E01E3
40006020 001E0073 001E0073 003C025F 001E01EC
40006020 001E006F 001E006F 003C025F 001E01F0
4000603C 003C0267 001E006F 003C0267 001E01F8
40006040 001E0069 001E0069 003C0267 001E01FE
400060B0 001E0066 001E0066 003C0267 001E0201
400060D0 001E0057 001E0057 003C0267 001E0210
00050878
00050078
40006000 0005B712 0005B712 0005B712 00000000
40006000 0005B6F3 0005B6F3 0005B712 0000001F
40006000 0005B6F1 0005B6F1 0005B712 00000021
40006008 0005B716 0005B6F1 0005B716 00000025
4000600C 0005B71E 0005B6F1 0005B71E 0000002D
40006018 0005B729 0005B6F1 0005B729 00000038
4000601C 0005B72F 0005B6F1 0005B72F 0000003E
4000605C 0005B730 0005B6F1 0005B730 0000003F
40006078 0005B733 0005B6F1 0005B733 00000042
00051078
00051878
40006000 00016E0A 00016E0A 00016E0A 00000000
40006000 00016DDE 00016DDE 00016E0A 0000002C
40006004 000224E5 00016DDE 000224E5 0000B707
4000601C 000224F0 00016DDE 000224F0 0000B712
4000603C 000224F2 00016DDE 000224F2 0000B714
00051878
00051078
00016DDE 003E025D 003C947F
12345678

config.txt contains

DISABLE_L2CACHE=1

Nothing major to note.

config.txt contains

arm_freq=250

00040046 0062022A 005E01E4

At best 2 clocks per loop and at worst 49 clocks per loop.

12345678 12345678 12345678 12345678 12345678

002F4E30
002F4E98
002F4E13
002F4E27
002F4E08
000C3558
000C3526
000C3525
000C3526
000000A2
00000075
00000075
00000075
0000008E
00000075
00000075
00000075

C0000000 C0000000 C0000000 C0000000
00050078
00050078
C0006000 003A00E4 003A00E4 003A00E4 00000000
C0006000 003A011E 003A00E4 003A011E 0000003A
C0006000 003A00DE 003A00DE 003A011E 00000040
C0006004 003A01A2 003A00DE 003A01A2 000000C4
C0006004 003A00C1 003A00C1 003A01A2 000000E1
C000600C 003E012A 003A00C1 003E012A 00040069
C000601C 00620180 003A00C1 00620180 002800BF
C000601C 00620193 003A00C1 00620193 002800D2
C0006038 003A00BD 003A00BD 00620193 002800D6
C000603C 006201D5 003A00BD 006201D5 00280118
C0006040 003A00BB 003A00BB 006201D5 0028011A
C0006078 003A00B7 003A00B7 006201D5 0028011E
C00060DC 00620209 003A00B7 00620209 00280152
C00060E4 003A00B5 003A00B5 00620209 00280154
00050078
00050878
C0006000 002E010F 002E010F 002E010F 00000000
C0006000 002E0151 002E010F 002E0151 00000042
C0006000 002E00E7 002E00E7 002E0151 0000006A
C0006004 0034013E 002E00E7 0034013E 00060057
C0006004 00340192 002E00E7 00340192 000600AB
C0006008 002E00E2 002E00E2 00340192 000600B0
C0006010 002E00E1 002E00E1 00340192 000600B1
C000601C 005C0144 002E00E1 005C0144 002E0063
C000601C 005C01BC 002E00E1 005C01BC 002E00DB
C000601C 005C01E3 002E00E1 005C01E3 002E0102
C0006020 002E00D8 002E00D8 005C01E3 002E010B
C0006020 002E00CE 002E00CE 005C01E3 002E0115
C0006030 002E00C6 002E00C6 005C01E3 002E011D
C000605C 005C0203 002E00C6 005C0203 002E013D
C0006060 002E00C0 002E00C0 005C0203 002E0143
C0006078 002E00BF 002E00BF 005C0203 002E0144
00050878
00050078
C0006000 00100072 00100072 00100072 00000000
C0006000 0010002B 0010002B 00100072 00000047
C0006000 0010002A 0010002A 00100072 00000048
C0006018 00100079 0010002A 00100079 0000004F
C0006018 00100029 00100029 00100079 00000050
C0006038 0010007D 00100029 0010007D 00000054
C00060B8 0010007F 00100029 0010007F 00000056
00051078
00051878
C0006000 0004008C 0004008C 0004008C 00000000
C0006000 00040047 00040047 0004008C 00000045
C0006000 00040046 00040046 0004008C 00000046
C0006004 0006008C 00040046 0006008C 00020046
C000601C 0006009C 00040046 0006009C 00020056
C000609C 0006009D 00040046 0006009D 00020057
00051878
00051078

80000000 80000000 80000000 80000000
00050078
00050078
80006000 003A00F2 003A00F2 003A00F2 00000000
80006000 003A012B 003A00F2 003A012B 00000039
80006000 003A00D4 003A00D4 003A012B 00000057
80006004 003A0130 003A00D4 003A0130 0000005C
80006004 003A00CA 003A00CA 003A0130 00000066
80006004 003A0147 003A00CA 003A0147 0000007D
80006008 003A00BF 003A00BF 003A0147 00000088
8000600C 003E010D 003A00BF 003E010D 0004004E
8000600C 003E019B 003A00BF 003E019B 000400DC
80006018 003A00BC 003A00BC 003E019B 000400DF
8000601C 0062017F 003A00BC 0062017F 002800C3
8000601C 0062022A 003A00BC 0062022A 0028016E
80006038 003A00AE 003A00AE 0062022A 0028017C
00050078
00050878
80006000 002E00FB 002E00FB 002E00FB 00000000
80006000 002E0145 002E00FB 002E0145 0000004A
80006000 002E00EA 002E00EA 002E0145 0000005B
80006004 00340113 002E00EA 00340113 00060029
80006004 00340132 002E00EA 00340132 00060048
80006008 002E00E4 002E00E4 00340132 0006004E
80006010 002E00BE 002E00BE 00340132 00060074
80006014 00340145 002E00BE 00340145 00060087
8000601C 005C018D 002E00BE 005C018D 002E00CF
8000601C 005C01E4 002E00BE 005C01E4 002E0126
8000603C 005C0217 002E00BE 005C0217 002E0159
80006040 002E00BD 002E00BD 005C0217 002E015A
80006068 002E00BC 002E00BC 005C0217 002E015B
800060DC 005C022A 002E00BC 005C022A 002E016E
00050878
00050078
80006000 00100060 00100060 00100060 00000000
80006000 0010002B 0010002B 00100060 00000035
80006004 00100061 0010002B 00100061 00000036
80006008 0010002A 0010002A 00100061 00000037
8000600C 00100062 0010002A 00100062 00000038
80006018 00100076 0010002A 00100076 0000004C
8000601C 00100078 0010002A 00100078 0000004E
80006058 0010007D 0010002A 0010007D 00000053
80006058 00100029 00100029 0010007D 00000054
80006098 00100080 00100029 00100080 00000057
00051078
00051878
80006000 0004008D 0004008D 0004008D 00000000
80006000 00040047 00040047 0004008D 00000046
80006000 00040046 00040046 0004008D 00000047
80006004 0006008B 00040046 0006008B 00020045
8000601C 0006009D 00040046 0006009D 00020057
00051878
00051078

40000000 40000000 40000000 40000000
00050078
00050078
40006000 003A0102 003A0102 003A0102 00000000
40006000 003A0143 003A0102 003A0143 00000041
40006000 003A00C5 003A00C5 003A0143 0000007E
40006004 003A0168 003A00C5 003A0168 000000A3
40006008 003A00C1 003A00C1 003A0168 000000A7
4000600C 003E010F 003A00C1 003E010F 0004004E
4000600C 003E0137 003A00C1 003E0137 00040076
4000601C 00620118 003A00C1 00620118 00280057
4000601C 00620199 003A00C1 00620199 002800D8
4000601C 0062019E 003A00C1 0062019E 002800DD
40006028 003A00B2 003A00B2 0062019E 002800EC
4000603C 00620216 003A00B2 00620216 00280164
40006078 003A00B1 003A00B1 00620216 00280165
40006098 003A00AD 003A00AD 00620216 00280169
00050078
00050878
40006000 002E0108 002E0108 002E0108 00000000
40006000 002E0149 002E0108 002E0149 00000041
40006000 002E00D7 002E00D7 002E0149 00000072
40006004 003400EC 002E00D7 003400EC 00060015
40006004 00340160 002E00D7 00340160 00060089
4000600C 00340186 002E00D7 00340186 000600AF
40006010 002E00D1 002E00D1 00340186 000600B5
40006018 002E00C8 002E00C8 00340186 000600BE
4000601C 005C014D 002E00C8 005C014D 002E0085
4000601C 005C01AA 002E00C8 005C01AA 002E00E2
4000601C 005C0209 002E00C8 005C0209 002E0141
4000603C 005C0219 002E00C8 005C0219 002E0151
400060A0 002E00C7 002E00C7 005C0219 002E0152
00050878
00050078
40006000 00100061 00100061 00100061 00000000
40006000 0010002B 0010002B 00100061 00000036
40006008 00100062 0010002B 00100062 00000037
40006008 0010002A 0010002A 00100062 00000038
40006018 0010007A 0010002A 0010007A 00000050
40006078 00100081 0010002A 00100081 00000057
00051078
00051878
40006000 0004008D 0004008D 0004008D 00000000
40006000 00040047 00040047 0004008D 00000046
40006000 00040046 00040046 0004008D 00000047
40006004 0006008B 00040046 0006008B 00020045
4000601C 0006009D 00040046 0006009D 00020057
00051878
00051078

00040046 0062022A 005E01E4
12345678

So with the same hardware and the same machine code (well, arguably
the timer reads surrounding the HOP instruction could vary, but that
is in the noise, and is probably the overhead behind the 46 in a time
like 00040046), and a test loop of the exact same two instructions,
the large number of different results we get is fascinating.

Think about that and then think about compiler variations for
the same source code:

extern unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
    return(more_fun(a+1,b+2)+3);
}

this

00000000 <fun>:
   0:   e92d4800    push    {fp, lr}
   4:   e28db004    add     fp, sp, #4
   8:   e24dd008    sub     sp, sp, #8
   c:   e50b0008    str     r0, [fp, #-8]
  10:   e50b100c    str     r1, [fp, #-12]
  14:   e51b3008    ldr     r3, [fp, #-8]
  18:   e2832001    add     r2, r3, #1
  1c:   e51b300c    ldr     r3, [fp, #-12]
  20:   e2833002    add     r3, r3, #2
  24:   e1a01003    mov     r1, r3
  28:   e1a00002    mov     r0, r2
  2c:   ebfffffe    bl      0 <more_fun>
  30:   e1a03000    mov     r3, r0
  34:   e2833003    add     r3, r3, #3
  38:   e1a00003    mov     r0, r3
  3c:   e24bd004    sub     sp, fp, #4
  40:   e8bd4800    pop     {fp, lr}
  44:   e12fff1e    bx      lr

or this

00000000 <fun>:
   0:   e92d4010    push    {r4, lr}
   4:   e2811002    add     r1, r1, #2
   8:   e2800001    add     r0, r0, #1
   c:   ebfffffe    bl      0 <more_fun>
  10:   e8bd4010    pop     {r4, lr}
  14:   e2800003    add     r0, r0, #3
  18:   e12fff1e    bx      lr

or this

00000000 <fun>:
   0:   e92d4010    push    {r4, lr}
   4:   e2811002    add     r1, r1, #2
   8:   e2800001    add     r0, r0, #1
   c:   ebfffffe    bl      0 <more_fun>
  10:   e2800003    add     r0, r0, #3
  14:   e8bd8010    pop     {r4, pc}

or this

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   3102        adds    r1, #2
   4:   3001        adds    r0, #1
   6:   f7ff fffe   bl      0 <more_fun>
   a:   3003        adds    r0, #3
   c:   bc10        pop     {r4}
   e:   bc02        pop     {r1}
  10:   4708        bx      r1
  12:   46c0        nop     ; (mov r8, r8)

or this

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   3102        adds    r1, #2
   4:   3001        adds    r0, #1
   6:   f7ff fffe   bl      0 <more_fun>
   a:   3003        adds    r0, #3
   c:   bd10        pop     {r4, pc}

or this using a different compiler

00000000 <fun>:
   0:   e92d4800    push    {fp, lr}
   4:   e1a0b00d    mov     fp, sp
   8:   e2800001    add     r0, r0, #1
   c:   e2811002    add     r1, r1, #2
  10:   ebfffffe    bl      0 <more_fun>
  14:   e2800003    add     r0, r0, #3
  18:   e8bd4800    pop     {fp, lr}
  1c:   e1a0f00e    mov     pc, lr

or this

00000000 <fun>:
   0:   e92d4800    push    {fp, lr}
   4:   e1a0b00d    mov     fp, sp
   8:   e2800001    add     r0, r0, #1
   c:   e2811002    add     r1, r1, #2
  10:   ebfffffe    bl      0 <more_fun>
  14:   e2800003    add     r0, r0, #3
  18:   e8bd8800    pop     {fp, pc}

So we saw how vastly different the execution times could be for the
same two instructions in machine code. Now take essentially one
line of C and look at how many machine code variations came from
it, and try to ponder how many different execution times we could
get from those variations. Then ponder how it is possible to
actually come up with a benchmark, not only for one machine with
one test written in C or higher, but when comparing machines to each
other. Even the same binary will perform differently across them
depending on the system settings, cache sizes, clock speeds, ram,
or motherboard nuances.

I'll just leave it at that.