raspberrypi

See the top level README file for more information on documentation
and how to run these programs.

This example demonstrates MMU basics.

So what an MMU does, or at least what an MMU does for us, is
translate virtual addresses into physical addresses, check access
permissions, and give us control over cacheable regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core, and part of
that boundary is the memory interface for the ARM, for lack of a better
term how the ARM accesses the world.  Nothing special here, all
processors have some sort of address and data based interface, and
your peripherals or edge of the chip or whatever are address and data
based.  That boundary uses physical addresses, it is on the "chip side"
or "world side" of the ARM's mmu.  Within the ARM core there is the
"processor side" of the mmu, and all accesses to the world go through
the mmu.  That is everything that is address based, all flavors of
load and store.

When the ARM powers up the mmu is disabled, which means all accesses
pass through unmodified making the "processor side" or virtual address
space equal to the world side physical address space.  All of the
examples thus far, blinkers and such are based on physical addresses.
We already know that elsewhere in the chip is another address translation
of some sort, because the manual is written for 0x7Exxxxxx based
addresses, but the ARM's physical addresses for those same things are
0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the raspi 2.  For this
discussion we only care about the ARM mmu processor side and the far
side (world side, physical address side).

So when I say the mmu translates virtual addresses into physical
addresses, what that means is on the processor side you may have
one address you are accessing, but that does not have to be equal to
the physical address.  Lets say for example I am running a program on
an operating system, Linux lets say.  I need to compile that program
before I can use it and I need to link it for an address space, so lets
say that I link it to enter at address 0x8000 and use memory from
0x00000000 to whatever I need and/or whatever is available.  So that
is all fine, except what if I have two programs and I want both running
"at the same time", how can both use the same address space without
clobbering each other?  The answer is that neither is physically at
that address.  WHEN RUNNING, each of them sees the virtual address
space 0x00000000 to some number, but in reality program 1 might have
that mapped to the physical address 0x01000000, and program 2 might have
its 0x00000000 to some number mapped to 0x02000000.  So when program 1
thinks it is writing to address 0xABCDE it is really writing to
0x010ABCDE and when program 2 thinks it is writing to address 0xABCDE
it is really writing to 0x020ABCDE.

It is technically possible that some mmu out there might be able to
translate any address into any address, but certainly not the ARM mmus,
you cannot have virtual 0x12345678 = physical 0xAAAABCDE.  From a
hardware perspective and hopefully a programmers perspective it makes
most sense to draw a line in the address and the upper side gets
translated and the lower stays the same.  For example there is one
mmu block size in the arm that is on one megabyte boundaries.  One
megabyte is 20 bits of address, so with a 32 bit address space the
lower 20 bits dont change between virtual and physical but the upper
12 can/do.  So address 0x12345678 virtual could be mapped to
0xCDE45678 using a one megabyte mmu table entry.  The ARM mmu also
allows for 4Kbyte pages for example, which means the lower 12 bits of
the virtual and physical are the same but the upper 20 bits can be
changed when going from virtual to physical.
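
To make the bit splitting concrete, here is a small sketch in C.  It is
only the arithmetic, not how the hardware is built, and the function
name translate_block and its parameters are made up for this
illustration.

```c
#include <stdint.h>

/* Replace the upper bits of a virtual address with the upper bits of
   a physical block address, keeping the low offset_bits unchanged.
   offset_bits is 20 for a 1MByte section and 12 for a 4KByte page. */
uint32_t translate_block ( uint32_t vaddr, uint32_t pblock, unsigned int offset_bits )
{
    uint32_t offset_mask;

    offset_mask=(1u<<offset_bits)-1;
    return((pblock&~offset_mask)|(vaddr&offset_mask));
}
```

With 20 offset bits, translate_block(0x12345678,0xCDE00000,20) keeps
the 0x45678 and swaps the 0x123 for 0xCDE.  With 12 offset bits only
the 0x678 is kept.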

What does access permission mean?  Lets think about program 1 and
program 2 above, we dont want program 1 to be able to invade program
2s memory space, that would make hacking a computer super easy if any
program could access the ram used by any other program (the operating
system can sure, but we have to trust the operating system but not
trust any rogue program).  So when a program running at the application
level is accessing something there has to be a mechanism to check the
permissions of each access to make sure that that application is
allowed, if not allowed the mmu has to abort the access and somehow
call the operating system to handle this.  Different processor families
handle this differently.  Initially we dont care as we are still
running as the super user, which is also bound by the mmu, we just need
to make sure we set the permissions so that we can access everything
we care to access.

What does cacheable regions mean?  We know from polling the uart to
see if there is a spot in the tx buffer for the next character that
reads of the uart status need to actually go to the uart register.
But this is a memory mapped design, hardware registers like the uart
status are accessed in the same way as some ram that contains a
variable used in a program, using load and store instructions with
some address.

We can use the instruction cache without the mmu, first because arm
allows us to, second because the arms internal bus has a signal (or
set of signals) that differentiates fetch read cycles from data read
cycles.  The mmu when disabled passes that through and it hits the
cache, which has separate controls for the instruction or i cache and
the data or d cache.  So without the mmu we can enable instruction
caching, and only instruction fetches get cached.

I hope you know what that means: the cache is fast ram closer to the
processor.  When you do a read from slow dram on the far side, a copy
is kept in the cache (if the cache for that access type and address
space is enabled) so that if you read that address a second time
before that prior read is evicted, the second and subsequent reads
come from the closer, faster ram and return an answer much faster.
Because fast ram is expensive you have a relatively small amount, so
only the last small number of answers is stored there, make too many
reads at different addresses and some answers have to be evicted to
make room for new answers.

If the mmu is disabled then all accesses are marked as "cacheable" or
able to be cached, if the cache for that type (i or d) is enabled.  So
you see the uart problem.  If we were to enable the d cache with the
mmu off then all data accesses would be cached.  In a tight loop
polling the uart to wait for a spot in the tx buffer, the first time
through the loop we read the uart status and it actually goes to the
uart to get that status.  If the tx buffer does not have a spot then
we continue to loop, but the second read gets the copy of the first
read from the cache, which says no room yet, and the third read gets
the same copy from the cache, which says there is no room yet.  This
continues forever even after the uart has space for a character, as we
have stopped actually talking to the uart, we are reading a stale copy
of the status register.  This is true for any hardware peripheral
register or ram.

We cannot cache some or all of the peripheral address space.  We want
data accesses to be cached for all or most of ram but not for
peripherals.  In order to do that you usually use the mmu, and for
each of the chunks of address space controlled by an mmu entry there
are bits in that entry that control whether or not that address space
is cacheable.  So with the mmu we could make the general purpose
memory cacheable but the hardware peripherals not.  This example will
show that.

Now something not mentioned above is the notion of virtual memory, do
not confuse that with virtual address space.  We now know that you can
allow the application some virtual address space to operate in and if
it goes outside that space the operating system is alerted and takes
over.  What if we wanted to do that on purpose?  Two very simple
examples of this are, what if we wanted to pretend we have more memory
than we really have.  Doesnt make too much sense on the raspberry pi
but makes a lot of sense on your desktop/laptop.  You might have
4GB of ram, but one or more TB of disk space.  Wouldnt it be cool if
a program that is using some ram but is not running just this moment
could have its ram saved to disk to free up that ram for another program
that is running, and then later when that other program needs its ram
then we swap the ram back from disk to memory so it can use it as
memory?  That is exactly how swap or virtual memory works.  We let the
program run off the end of its space and crash into a protection fault
but instead of issuing an error and stopping the program the operating
system instead knows how much ram this program thinks it has, if it is
within that range, then it looks for more ram for this program if there
is some free it simply maps it in using the mmu, if not then it
hopefully swaps some ram from some other application to disk, freeing
some ram for this application.  The second simplest use case would be
a virtual machine, when I have say vmware running a virtual computer
on a computer.  What if I want to have the virtual machine access the
network?  I could make a range of address space that the virtual
machine thinks is the network peripheral and let the virtual machine
free run in some space, when it tries to access the network peripheral
the operating system is alerted to the protection fault, but instead
of stopping the program and issuing an error, it fakes the peripheral
access and lets the program keep running.

All very cool stuff but it requires first and foremost that all memory
accesses are funneled through a memory management unit or mmu of some
flavor.

As with all baremetal programming, wading through documentation is
the bulk of the job.  Definitely true here, with the unfortunate
problem that ARM's docs dont all look the same from one Architectural
Reference Manual to another.  We have this other problem that we
are technically using an ARMv6 (architecture version 6) but when
you go to http://infocenter.arm.com and look at the Reference Manuals
there is an ARMv5 and then ARMv7 and ARMv8, but no ARMv6.  Well
the ARMv5 manual is actually the original ARM ARM, and I assume they
realized they couldnt maintain all the architecture variations forever
in one document, so they perhaps wisely went to one ARM ARM per rev.  With
respect to the MMU, that started in ARMv5 and with ARMv6 there were
some changes made but it still has a backwards compatible mode such
that programs that use the MMU (linux for example) dont necessarily
need an overhaul every version (or need a lot of if-then-else code
to cover all the supported architectures in one binary).  So you can
look at the various architectural reference manuals or sometimes
technical reference manuals for specific cores and see descriptions
of the MMU tables and addressing but the part I mentioned as
unfortunate is that the drawings and descriptions dont have the same
look and feel.  They have the same basic content though.

I am mostly using the ARMv5 Architectural Reference Manual.  Possibly
an older one than the one on ARMs page.  ARM DDI0100I.  Where the I is
the rev of that ARM ARM.  The ARMv5 ARM does show ARMv6 stuff in
particular with respect to the MMU, so it is probably the right
manual for this processor, although you could use the ARMv7 and be
careful to ignore features added in v7.

So there are blocks they call sections and blocks they call pages.
If we were to simply take every possible address and make a look up
table and the contents of the table are the physical address, we could
then translate any virtual address to any physical address, but it
would take up to 4Giga-entries for that table for a 32 bit address
space and each entry of the table would need to be more than 4 bytes,
32 bits for the new address then some others for permissions and
enables, so that would make no sense to have an mmu table larger than
everything we would ever access.  (And we would have to access
everything as bytes since a scheme like that would allow the four
bytes in an instruction or other word sized access to be in up to
four different physical places.)  That is not exactly what happens
but it is along the same path.  Instead of taking the entire address
and having a look up table, we take the top bits of the address and
that goes into the first level translation table.  Basically bits
31:20 (bits 31 down to 20 or perhaps think of it as address>>20) are
used as a word index, added (orred) to the base address for this table
that we have to prepare.  The contents of the table are not necessarily
the replacement bits, but the way we are using it they are.

The ARM documentation talks about sections and pages, perhaps this is
not the intended distinction, but with sections the first level
translation table contains both the replacement bits (will describe
what that means in a second) and the permission and other control bits.
For a page, the first level translation table contains an offset to
a second level translation table, a second table.  The combination of
bits in that first table and second table serve to describe the
access permissions, and replacement bits.

So with what I am telling you so far with the addition of saying that
we will mostly be talking about 1MByte sections, that means that
I can have a virtual address of 0x1230ABCD, virtual being the address
that I write my software to use, and have that get converted by the
MMU to the address 0x4560ABCD.  Basically the address bits 31:20 I can
change in the MMU using a 1MByte section.  Further those upper address
bits which are 0x123 in this example are used to look up an entry
in the first level descriptor table, and that entry contains the bits
0x456 as well as some other bits for permissions and cache control.
Assuming the permissions and such are okay the MMU then simply replaces
the 0x123 with 0x456 causing our 0x1230ABCD address to actually
access 0x4560ABCD.  The lower 20 bits, for a 1MByte section have
to be the same in the virtual and physical address.  So only some
of the upper bits are replaced.

Now maybe you can see why there are blocks or chunks of memory that
are virtualized, the lower address bits are not modified between
the virtual and physical, basically a whole block of memory space
aligned on some power of 2.  The other thing to understand now
is that the translation table ultimately contains the replacement
bits for the bits used to look up into the table, so depending
on how many permission and other control bits we want, the number
of replacement bits left over in a 32 bit word is limited.  But if
we were to have a second table, then between the first and second
tables we have 64 bits, so when we have a bunch of bits to replace,
meaning we have a smaller block of memory being virtualized somewhere
else, we will need the secondary table.

So you may be thinking that we have a chicken and egg problem, but we
dont.  We want to access something at some address, that act causes
the MMU to access the translation tables which are at some address
in memory.  Now if the MMU had to go through the MMU, you would have
that chicken and egg problem.  You dont, the MMU does not use virtual
addresses for this, it is all physical addresses, it doesnt send itself
through itself.  But this does mean that we have to carve out some
amount of memory for the MMU translation tables.  The pictures imply
this can vary but as far as we are concerned the first level table has
to fit within 16Kbytes.

So we can be looking at the same picture I took a couple of pages
out of the ARM manual and put them in this repo as a postscript, if
on linux then no big deal your pdf reader will/should also read
postscript (postscript is like assembly and pdf is simply the machine
code for that assembly, assuming unencrypted, with free tools you can
generally go back and forth between pdf and ps).  Atril, evince, etc
can display this, gsview and others like it will work on both windows
and Linux.  section_translation.ps is the name of the file.

The picture on the second page is where we want to start, and a
picture is worth a thousand words, and although this is verbose already
hopefully I wont have to spend too many more words on this picture.

The first thing the picture is telling us is that there is a
base address somewhere that we tell the MMU about, that is the base
address for our translation table memory, where our primary and
secondary translation tables live.  This is important, SBZ means should
be zero, the lower 14 bits, assuming X is zero, must be zero so we
must choose an address that has the lower 14 bits zero.  I have chosen
0x00004000 which just barely meets that requirement.  I assume
that my program is loaded into the ARM address 0x8000, I will need
to have some exception handlers at 0x0000, but 0x4000 to 0x8000 is
not being used (I have my stack elsewhere).

So we have a base address for our translation table.  So lets do the
conversion mentioned above of virtual 0x1230ABCD to physical 0x4560ABCD.
What they are calling a modified virtual address is our...virtual
address the address we write in our program on the processor side
of the MMU.  So that is the 0x1230ABCD address.  We break that address
up into its two parts, the Table Index which is 0x123 and the section
index which is the 0x0ABCD part.  The next thing down is the address
of the first level descriptor.  So they take the 12 bits of index
shift those left two so it makes a word address and add that to the
translation tables base address.  In this case 0x123<<2 = 0x48C and
our base address of 0x00004000 gives us 0x0000448C.  Now the descriptors
are all at physical addresses, the MMU doesnt use the MMU to access the
MMU tables.  So we read the 32 bit entry at the address we computed
and we get the first level descriptor.  The first thing we look at
in the first level descriptor are the lower 2 bits.  If those bits are
0b10 then this is a section, the other bit patterns are documented
not far below these pages in the manual.  The first of the two pages
I have here shows the 0b10 in those lower bits and also says that
to be a 1MB descriptor we need bit 18 to be a zero, and so it will be.
The MMU, now knowing this is a 1MB first level descriptor, checks
the other bits for access permissions (not shown on either of these
pages but we will cover them).  If we have not violated any permissions
then it takes the upper 12 bits of the descriptor and tacks those on top
of the lower 20 bits of our virtual address to make the physical address,
and then the MMU sends that down the pipe and we do our memory/peripheral
access.
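
The lookup just described can be modeled in a few lines of C.  This is
a sketch under assumptions, not what the hardware does internally: the
name walk_section is made up, the table is an ordinary array standing
in for the memory at the translation table base, and the permission
and domain checks are left out.

```c
#include <stdint.h>

/* Hypothetical software model of the first level lookup for sections.
   table stands in for the memory at the translation table base.
   Returns 0 and fills in *paddr on success, -1 if the entry is not a
   1MByte section descriptor. */
int walk_section ( const uint32_t *table, uint32_t vaddr, uint32_t *paddr )
{
    uint32_t index;
    uint32_t desc;

    index=vaddr>>20;               /* table index, bits 31:20 */
    desc=table[index];             /* same as reading base+(index<<2) */
    if((desc&3)!=2) return(-1);    /* low bits 0b10 mean section */
    if(desc&(1u<<18)) return(-1);  /* bit 18 set is not a 1MByte section */
    *paddr=(desc&0xFFF00000)|(vaddr&0x000FFFFF);
    return(0);
}
```

With table[0x123] holding 0x45600002, a lookup of 0x1230ABCD comes
back as 0x4560ABCD, and an entry of all zeros is rejected.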

These pictures in whatever form show the virtual to physical translation
but we as MMU programmers need to go from physical to virtual.  If after
we turn the MMU on we still want to be able to access the UART for
example, we will have to have an entry for it so that we can control and
allow the access using the access control permissions.  Hopefully you
have figured out that we can replace those 12 bits with whatever 12
bits we want, including the same 12 bits.  Why would we use the MMU
to replace some address bits with the same address bits?  Remember the
MMU is not only there to remap memory space, but it is also there to
allow for control over access permissions and to allow control over
caching.  Separate controls for each page or section.  So working
backward we want to have our uart, which is in the section 0x20200000,
be available to us after the MMU is enabled.  It really makes it so
much easier if we have the virtual match the physical for peripherals
and actually this example starts off with virtual matching physical
for all the sections we care about.  So we need virtual 0x202xxxxx to
result in physical 0x202xxxxx.  So our translation table entry is
0x202 based, or table_base + (0x202<<2).  And the data at that address
needs to be 0x202xxxxx with the lower two bits a 0b10, and the rest of
the bits such that it just works.
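
Working the uart numbers through as C, purely as arithmetic.  The two
helper names are made up for this sketch, MMUTABLEBASE is the
0x00004000 chosen above, and all the control bits other than the
section marker are left at zero.

```c
#include <stdint.h>

#define MMUTABLEBASE 0x00004000  /* table base chosen earlier in the text */

/* Address of the first level descriptor that covers vaddr. */
uint32_t descriptor_address ( uint32_t vaddr )
{
    return(MMUTABLEBASE|((vaddr>>20)<<2));
}

/* Section descriptor that maps onto the 1MByte block holding paddr,
   with all of the other control bits left at zero. */
uint32_t section_descriptor ( uint32_t paddr )
{
    return((paddr&0xFFF00000)|2);
}
```

For the uart section that gives a descriptor at 0x00004808 containing
0x20200002.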

So now we have to chat a bit about that.  The "other" bits are the
domain, the TEX bits and the C and B bits.  The C bit is the simplest
one to start with that means Cacheable.  For peripherals we absolutely
dont want them to be cached.  Lets say for example we are polling a
register in the uart to see if the tx buffer is empty so we can
send another character, so we read that register a bunch of times
until some control bit indicates tx buf is empty.  Well if the cache
were on, the first time we read that register its value gets cached,
then the next time we get the cached value not the real value.  If all
we are doing is polling and we dont evict that cached value then all
we will ever see is the stale, cached, register value, and if that
value did not show that tx buf was empty, then we will never see
the indication when it changes.  So never make a peripherals space
cacheable.  This is a good place to point out one purpose of an MMU
again, cache control.  Right now we can see that the MMU, even with
virtual = physical, allows us to turn on the data cache, but gives
us control so that we can mark peripheral address spaces as not
cacheable.

The b bit, means bufferable, as in write buffer.  Something you may
not have heard about or thought about ever.  It is kind of like a cache
on the write end of things instead of read end.  It is a thing somewhere
between the processor and the memory that tells the processor, let me
take that write information and deliver it for you, you can keep
doing other stuff.  Now writes in general are "fire and forget".  When
you perform a write both the address and data are known, in general
the memory controller can and depending on the design, will, take the
address and data and tell the processor, I will go and do that for you
you keep processing.  Well that works fine as an optimization for the
first write, but eventually the write has to end up in the slow
main memory.  So if you do two or a bunch of writes in a row the
processor gets the optimization on the first one but the second one
has to wait for the first and the processor ends up waiting.  Well
further down if you were to have a small buffer that could hold more
than one write in flight at a time, and allow the processor to get
this optimization for more than just one write cycle but maybe many
or several then for situations where the processor is doing random
writes, you probably can gain some speed.  A good place to use this
is when you have the cache on, as a cache line is not just one
word or whatever wide, it can be several words of data, so when you
have a cache miss, need to read a cache line, but you dont have an
open spot and need to evict someone from the cache that multi-word
eviction can go into the write buffer, allowing the cache to do
the cache line read.  But if the write buffer is not there or not
enabled then everyone has to wait for that cache line eviction
to make room for the cache line fill to then finally send the
read data back to the processor.

Now do we want to enable the write buffer for peripherals?  Well
probably not, even though the arm manual may show a combination with B
on that means device access.  Lets take the generic write buffer case
and not necessarily an ARM one.  The write buffer absorbs some number
of write accesses for the processor so the processor can continue
executing and not have to wait for a slow memory transaction to
complete.  So the processor is operating ahead of the writes the
program thinks have completed.  So maybe we poll the uart status
register, it says the tx buf is empty, we write a byte, which lands in
the buffer behind some other writes.  We then have another byte to
send, so we read the status register.  If the reads and writes are not
serialized, meaning if the reads take a separate path from the writes,
then it is possible that the write of our first byte is stuck in the
write buffer waiting on other writes.  The write has not hit the uart,
the tx buf still shows empty, the next read of the status register
shows empty so we send another byte, and eventually the two writes hit
but there is only room for one.  So we probably dont want to use write
buffering in general with peripherals unless we are sure we know how
the hardware works and we dont have these race conditions.

Now the TEX bits you just have to look up, and there is the rub, there
is likely more than one set of tables for TEX, C and B.  I am going
to stick with a TEX of 0b000 and not mess with any fancy features
there.  Now depending on whether this is considered an older arm
(ARMv5) or an ARMv6 or newer, the combinations of TEX, C and B have
some subtle differences.  The cache bit in particular does enable
or disable this space as cacheable.  You still independently need
to turn on the instruction and data caches, it is only if the section
is cacheable and the cache is on for that access type that the access
will be cached...So we set tex to zeros to just keep it out of the way.

Lastly the domain bits.  Now you will see a 4 bit domain thing and
a 2 bit domain thing.  These are related.  There is a register in
the MMU right next to the translation table base address register, this
one is a 32 bit register that contains 16 different two bit domain
definitions.

The two bit domain controls are defined as such.

0b00 No access  Any access generates a domain fault
0b01 Client     Accesses are checked against the access permission
                bits in the TLB entry
0b10 Reserved   Using this value has UNPREDICTABLE results
0b11 Manager    Accesses are not checked against the access permission
                bits in the TLB entry, so a permission fault cannot be
                generated

For starters we are going to set all of the domains to 0b11, dont check,
cant fault.  What are these 16 domains though?  Notice it takes 4 bits
to describe one of 16 things.  The different domains have no specific
meaning other than that we can have 16 different definitions that we
control for whatever reason.  You might allow for 16 different
threads running at once in your operating system, or 16 different
types of software running (kernel, application, ...) you can mark
a bunch of sections as belonging to one particular domain, and with a
simple change to that domain control register, a whole domain might
go from one type of permission to another, from no checking to
no access for example.

Since I usually use the MMU in bare metal to enable data caching on ram
I set my domain controls to 0b11, no checking and I simply make all
the MMU sections domain number 0.
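
As a sketch of how the 16 two bit fields pack into that 32 bit domain
register, here is a small helper in C.  The function and macro names
are made up for illustration, the field values come from the table
above.

```c
#include <stdint.h>

#define DOMAIN_NO_ACCESS 0u
#define DOMAIN_CLIENT    1u
#define DOMAIN_MANAGER   3u

/* Return a copy of the 32 bit domain register value with the two bit
   field for one domain (0 to 15) replaced by access. */
uint32_t set_domain ( uint32_t dacr, unsigned int domain, uint32_t access )
{
    unsigned int shift;

    shift=(domain&15)*2;
    dacr&=~(3u<<shift);
    dacr|=(access&3)<<shift;
    return(dacr);
}
```

All 16 domains as manager is 0xFFFFFFFF.  Taking that and setting
domain 1 to no access gives 0xFFFFFFF3, which is the same value the
mvn and bic pair in start_MMU further down produces.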

So we end up with this simple function that allows us to add first level
descriptors in the MMU translation table.

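unsigned int MMU_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    //table index from the upper 12 bits of the virtual address
    ra=vadd>>20;
    //address of the first level descriptor, 4 bytes per entry
    rb=MMUTABLEBASE|(ra<<2);
    //upper 12 bits of the physical address are the replacement bits
    ra=padd>>20;
    //0b10 in the lower two bits marks this as a section
    rc=(ra<<20)|flags|2;
    PUT32(rb,rc);
    return(0);
}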
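
The flags argument is just the descriptor control bits orred together.
A helper like the following could build it.  The name section_flags is
made up for this sketch, and the bit positions are the ones I read
from the section descriptor format: B is bit 2, C is bit 3, and the
domain number is in bits 8:5.  TEX and everything else stay zero.

```c
#include <stdint.h>

/* Build the flags argument for MMU_section.  Bit positions are the
   ones for a 1MByte section descriptor: B is bit 2, C is bit 3 and
   the domain number sits in bits 8:5.  TEX and the rest stay zero. */
uint32_t section_flags ( unsigned int domain, int cacheable, int bufferable )
{
    uint32_t flags;

    flags=(uint32_t)(domain&15)<<5;
    if(cacheable) flags|=8;
    if(bufferable) flags|=4;
    return(flags);
}
```

Domain 0 with C and B set comes out as 0x000C, and an uncached section
in domain 1 comes out as 0x0020.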

So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries
for that.  Now if you do the math, 12 bits off the top are the
first level index, that is 4096 things, times 4 bytes per entry that is
16KBytes, thus the reason for an alignment on 16K.  Now one solution you
might simply do is fill the whole 16K with 1MByte sections that allow
full uncached access...Basically completely map the virtual to physical
one to one.  I didnt do that, I was a little more conservative on the
clock cycles, not that that really matters here...For this example I
wanted to have the memory we are really using around 0x00000000 and
then some entries I can play with to show you the MMU is working and
then the entries for the peripherals I am using.
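
For reference, the fill everything approach that I did not use would
look roughly like this.  It is a sketch that writes into an ordinary
array so it can be tried on a host machine, and the name
identity_map_all is made up.  On the target the loop body would
instead be a call along the lines of MMU_section(ra<<20,ra<<20,0x0000).

```c
#include <stdint.h>

#define SECTIONS 4096  /* 4GBytes of address space in 1MByte pieces */

/* Fill a whole first level table with uncached 1MByte sections where
   virtual equals physical.  table must have room for SECTIONS words,
   which is the 16KBytes worked out above. */
void identity_map_all ( uint32_t *table )
{
    uint32_t ra;

    for(ra=0;ra<SECTIONS;ra++)
    {
        table[ra]=(ra<<20)|2;
    }
}
```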

    MMU_section(0x00000000,0x00000000,0x0000|8|4); //C and B bits set
    MMU_section(0x00100000,0x00100000,0x0000);
    MMU_section(0x00200000,0x00200000,0x0000);
    MMU_section(0x00300000,0x00300000,0x0000);
    //peripherals
    MMU_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    MMU_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

I didnt need to cache that first section, but did, will leave it up
to you to do a read performance test of some sort to determine if the
cache when enabled does make it faster.

So once our tables are setup then we need to actually turn the
MMU on.  Now I cant figure out where I got this from, and I have
modified it in this repo.  According to this manual it was with the
ARMv6 that we got the DSB feature which says wait for either cache
or MMU to finish something before continuing.  In particular when
initializing a cache to start it up you want to clean out all the
entries in a safe way, you dont want to evict them and hose memory,
you want to invalidate everything, mark it such that the cache lines
are empty/available.  Not mentioned yet, but the MMU has a mini cache
that it uses for things it has looked up.  Think about every access we
do through the MMU, imagine if it had to walk the descriptor tables
every time, every single read or write could require two more reads
from the table.  So there is this TLB which caches up the last N
descriptor table lookups.  Well like cache memory on power up, the
tlb might be full of random bits as well, so we need to invalidate
that too.  Then this dsb thing comes in, we do the dsb instruction
to tell the processor to wait for the cache subsystem and MMU subsystem
to finish wiping their internal tables before we go forward and
turn them on and try to use them.

We invalidate both the cache and the tlb, and you may be asking why
we are messing with the cache at all.  Well, first the MMU gets us
access to the data cache, since we need the MMU to distinguish ram from
peripherals before generically turning on the data cache.  Second, in
the ARM the MMU enable bit and the cache enable bits are in the same
register so it makes sense to just do cache enabling and MMU enabling
in one function call.

So after the DSB we set our domain control bits.  Now in this example
I have done something different, 15 of the 16 domains have the 0b11
setting which is dont fault on anything, manager mode.  I set domain
1 such that it has no access, so in the example I will change one
of the descriptor table entries to use domain one, then I will access
it and then see the access violation.  There are two registers that
hold the translation table base address, I program them both, not
sure what the difference is, why there are two...

Understand I have been running on ARMv6 systems without the DSB for
some time and it just works, so maybe that is dumb luck...

Now I can start the MMU.  This code relies on the caller to set
the MMU enable and I and D cache enables.  This is because this
is derived from code where sometimes I turn things on or dont turn
things on and wanted it generic.


.globl start_MMU
start_MMU:
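    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB

    mvn r2,#0
    bic r2,#0xC ;@ domain 1 (bits 3:2) no access, the rest manager
    mcr p15,0,r2,c3,c0,0 ;@ domain

    mcr p15,0,r0,c2,c0,0 ;@ translation table base 0
    mcr p15,0,r0,c2,c0,1 ;@ translation table base 1

    mrc p15,0,r2,c1,c0,0 ;@ read control register
    orr r2,r2,r1 ;@ set the enable bits passed in r1
    mcr p15,0,r2,c1,c0,0 ;@ write control register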

    bx lr
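
Since the caller supplies the enable bits, here is a sketch of how
that second argument could be put together.  The macro and function
names are made up for illustration.  The bit positions are the ones I
know from the control register description: M is bit 0, C is bit 2 and
I is bit 12.

```c
#include <stdint.h>

#define CTRL_MMU    0x0001u  /* bit 0, M, MMU enable */
#define CTRL_DCACHE 0x0004u  /* bit 2, C, data cache enable */
#define CTRL_ICACHE 0x1000u  /* bit 12, I, instruction cache enable */

/* Value for the second argument of start_MMU (arrives in r1 and is
   orred into the control register). */
uint32_t control_enable_bits ( int mmu, int dcache, int icache )
{
    uint32_t bits;

    bits=0;
    if(mmu) bits|=CTRL_MMU;
    if(dcache) bits|=CTRL_DCACHE;
    if(icache) bits|=CTRL_ICACHE;
    return(bits);
}
```

A call would then look something like
start_MMU(MMUTABLEBASE,control_enable_bits(1,1,1)); with the table
base first, since the assembly writes r0 to the translation table base
registers.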

I am going to mess with the translation tables after the MMU is started
so I assume we have to invalidate when a table entry changes so that
just in case the old one is cached up in the tlb, we can force the
read of the new one by invalidating all the tlbs.


.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB
    bx lr

So the program starts by putting a few things in memory spaced
apart such that they will be in different sections when the
MMU is turned on.  We write then read those back.


DEADBEEF
00045678
00145678
00245678
00345678

Now the MMU is turned on with these sections mapped with virtual =
physical.

00045678
00145678
00245678
00345678

Nothing magical yet.  But now we start to swizzle things around, two
of the spaces are swapped, 0x001... addresses point at 0x003... and vice
versa, and 0x002... points at 0x000...  And the output confirms that, we
didnt write anything to memory, just played games with what physical
address comes from what virtual.

00045678
00345678
00045678
00145678
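
That output can be reproduced with a tiny model.  This is a sketch
that assumes the value stored at offset 0x45678 in each physical
section is its own address, which is what the first dump suggests.
The name read_marker is made up.

```c
#include <stdint.h>

/* Model: phys_section[n] is the physical section that virtual section
   n is mapped to, for the four sections used in this example. */
uint32_t read_marker ( const unsigned int *phys_section, unsigned int vsection )
{
    /* Each physical section n holds (n<<20)|0x45678 at offset 0x45678,
       so the value read back tells us which physical section we hit. */
    return(((uint32_t)phys_section[vsection&3]<<20)|0x45678);
}
```

With the mapping {0,3,0,1}, reading virtual sections 0 through 3 gives
the four values shown above.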

And then the icing on the cake, one section is marked as domain 1
instead of domain 0, domain 1 was set for 0b00 no access so when we
touch that domain we should get an access violation.
                                                                         
00045678
00000010

How do I know what that output means?  Well from my blinker07
example we touched on exceptions (interrupts).  I made a generic test
fixture such that anything other than a reset prints something out
and then hangs.  In no way shape or form is this a complete handler
but what it does show is that it is the exception that is at address
0x00000010 that gets hit, which is data abort.  So having figured out
it was a data abort (pretty much expected) I then had the handler read
the fault status registers.  Being a data access we expect the
data/combined one to show something and the instruction one to not.
Adding that instrumentation resulted in.
                                                                            
00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail and that shows the 0x01 was domain 1, the domain we used for
that access.  Then the 0x9 means Domain Section Fault.
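
Picking the 0x00000019 apart in C looks something like this.  It is a
sketch with made up function names.  The field positions and the three
status names are as I read the fault status register description in
the TRM, and only the few values relevant here are listed.

```c
#include <stdint.h>

/* Domain number from the data fault status register, bits 7:4. */
unsigned int dfsr_domain ( uint32_t dfsr )
{
    return((dfsr>>4)&15);
}

/* Fault status, bits 3:0 (bit 10 is a fifth status bit, folded in
   here as bit 4 of the result). */
unsigned int dfsr_status ( uint32_t dfsr )
{
    return((dfsr&15)|(((dfsr>>10)&1)<<4));
}

/* Names for the few status values relevant here. */
const char *dfsr_status_name ( unsigned int status )
{
    switch(status)
    {
        case 0x05: return("translation fault, section");
        case 0x09: return("domain fault, section");
        case 0x0D: return("permission fault, section");
        default:   return("other, see the manual");
    }
}
```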

The lr during the abort shows us the instruction, which you would need
to disassemble to figure out the address, or at least that is one
way to do it perhaps there is a status register for that.

The instruction and the address match our expectations for this fault.
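
To show the disassemble it yourself route, here is a sketch in C.  On
a data abort the link register holds the address of the aborted
instruction plus 8.  The function names are made up, and the decoder
only handles the one simple form of LDR that shows up here.

```c
#include <stdint.h>

/* On a data abort the link register holds the address of the
   aborted instruction plus 8. */
uint32_t aborted_instruction_address ( uint32_t lr_abt )
{
    return(lr_abt-8);
}

/* Returns 1 if inst is a word LDR with a plain immediate offset and
   no writeback (ldr rd,[rn,#+/-imm]), filling in the fields.
   Returns 0 for anything else. */
int decode_ldr_imm ( uint32_t inst, unsigned int *rd, unsigned int *rn, int32_t *offset )
{
    /* condition field 0b1111 is not an ordinary conditional LDR */
    if((inst>>28)==0xF) return(0);
    /* bits 27:25=0b010, P=1, B=0, W=0, L=1; U can be either */
    if((inst&0x0F700000)!=0x05100000) return(0);
    *rn=(inst>>16)&15;
    *rd=(inst>>12)&15;
    *offset=(int32_t)(inst&0xFFF);
    if((inst&(1u<<23))==0) *offset=-*offset;  /* U=0 means subtract */
    return(1);
}
```

With lr at 0x00008110 the aborted instruction is at 0x00008108, and
0xE5900000 decodes as ldr r0,[r0], so the address touched is whatever
r0 held at the time.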