raspberrypi/mmu/README


See the top level README file for more information on documentation
and how to run these programs.

This example demonstrates MMU basics.

(This ONLY works on the Raspi 1 for now, will get a Raspi 2 version
working at some point.)

-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES  --

So what an MMU does, or at least what an MMU does for us, is translate
virtual addresses into physical addresses, check access permissions,
and give us control over cacheable regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core, and part of
that boundary is the memory interface for the ARM, for lack of a better
term how the ARM accesses the world.  Nothing special here, all
processors have some sort of address and data based interface, and your
peripherals or edge of the chip or whatever is address and data based.
That boundary uses physical addresses, it is on the "chip side" or
"world side" of the ARM's mmu.  Within the ARM core there is the
"processor side" of the mmu, and all accesses to the world go through
the mmu.  That is everything that is address based: instruction
fetches and all flavors of load and store.

When the ARM powers up the mmu is disabled, which means all accesses
pass through unmodified making the "processor side" or virtual address
space equal to the world side physical address space.  All of the
examples thus far, blinkers and such are based on physical addresses.
We already know that elsewhere in the chip is another address
translation of some sort, because the manual is written for 0x7Exxxxxx
based addresses, but the ARM's physical addresses for those same things
are 0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the raspi 2.  For this
discussion we only care about the ARM mmu processor side and the far
side (world side, physical address side).

So when I say the mmu translates virtual addresses into physical
addresses, what that means is on the processor side you may have
one address you are accessing, but that does not have to be equal to
the physical address.  Lets say for example I am running a program on
an operating system, Linux lets say.  I need to compile that program
before I can use it and I need to link it for an address space, so lets
say that I link it to enter at address 0x8000 and use memory from
0x00000000 to whatever I need and/or whatever is available.  That
is all fine, except what if I have two programs and I want both running
"at the same time", how can both use the same address space without
clobbering each other?  The answer is neither is physically at that
address.  WHEN RUNNING, each of them sees the virtual address space
0x00000000 to some number, but in reality program 1 might have
that mapped to the physical address 0x01000000, and program 2 might
have its 0x00000000 to some number mapped to 0x02000000.  So when
program 1 thinks it is writing to address 0xABCDE it is really writing
to 0x010ABCDE and when program 2 thinks it is writing to address
0xABCDE it is really writing to 0x020ABCDE.
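
The two program example above boils down to adding a per-program
physical base to every address the program uses.  A minimal sketch in
C, assuming a plain base plus offset model (the function name is made
up for illustration, the real mmu works from table entries as described
further down):

```c
#include <stdint.h>

/* Hypothetical helper: models "program N really lives at phys_base". */
static uint32_t virt_to_phys(uint32_t phys_base, uint32_t vaddr)
{
    return phys_base + vaddr;
}
```

With a base of 0x01000000 for program 1 and 0x02000000 for program 2,
the same virtual 0xABCDE lands at two different physical addresses.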

It is technically possible that some mmu out there might be able to
translate any address into any address, but certainly not the ARM mmus,
you cannot have virtual 0x12345678 = physical 0xAAAABCDE.  From a
hardware perspective and hopefully a programmers perspective it makes
most sense to draw a line in the address and the upper side gets
translated and the lower stays the same.  For example there is one
mmu block size in the arm that is on one megabyte boundaries.  One
megabyte is 20 bits, so with a 32 bit address the lower 20 bits
dont change between virtual and physical but the upper 12 can/do.  So
address 0x12345678 virtual could be mapped to 0xCDE45678 using a
one megabyte mmu table entry.  The ARM mmu also allows for 4Kbyte
pages for example, which means the lower 12 bits of the virtual and
physical are the same but the upper 20 bits can be changed when going
from virtual to physical.
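
As a sketch of that "draw a line in the address" idea, here is the bit
manipulation in C for the two block sizes mentioned.  The function
names and the replacement values passed in are just for illustration,
the hardware gets the replacement bits from the mmu table.

```c
#include <stdint.h>

/* 1MB block: upper 12 bits replaced, lower 20 bits pass through. */
static uint32_t remap_1mb(uint32_t vaddr, uint32_t phys_top12)
{
    return (phys_top12 << 20) | (vaddr & 0x000FFFFFu);
}

/* 4KB page: upper 20 bits replaced, lower 12 bits pass through. */
static uint32_t remap_4kb(uint32_t vaddr, uint32_t phys_top20)
{
    return (phys_top20 << 12) | (vaddr & 0x00000FFFu);
}
```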

What does access permission mean?  Lets think about program 1 and
program 2 above, we dont want program 1 to be able to invade program
2s memory space, that would make hacking a computer super easy if any
program could access the ram used by any other program (the operating
system can sure, but we have to trust the operating system but not
trust any rogue program).  So when a program running at the application
level is accessing something there has to be a mechanism to check the
permissions of each access to make sure that that application is
allowed, if not allowed the mmu has to abort the access and somehow
call the operating system to handle this.  Different processor families
handle this differently.  Initially we dont care as we are still
running as the super user, which is also bound by the mmu, we just need
to make sure we set the permissions so that we can access everything
we care to access.

What does cacheable regions mean?  We know from polling the uart to
see if there is a spot in the tx buffer for the next character that
reads to the uart need to actually go to the uart register to read
that status.  But this is a memory mapped design, hardware registers
like the uart status are accessed in the same way as some ram that
contains a variable used in a program, using load and store
instructions with some address.

We can use the instruction cache without the mmu, first because arm
allows us to, second because the arms internal bus has a signal (or
set of signals) that differentiates fetch read cycles from data read
cycles.  The mmu when disabled passes that through and it hits the
cache, which has different controls for the instruction or i cache and
the data or d cache.  So without the mmu we can enable instruction
caching, and only instruction fetches get cached.

I hope you know what that means.  The cache is fast ram closer to the
processor.  When you do a read from slow dram on the far side, a copy
is kept in the cache (if the cache for that access type and address
space is enabled) so that if you read that address a second time
before that prior read is evicted, the second and subsequent reads
come from the closer, faster ram and return an answer much faster.
Because fast ram is expensive you have a relatively small amount, so
only the last small number of answers is stored there.  Make too many
reads at different addresses and some answers have to be evicted to
make room for new answers.

If the mmu is disabled then all accesses are marked as "cacheable" or
able to be cached, if the cache for that type (i or d) is enabled.  So
you see the uart problem.  If we were to enable the d cache with the
mmu off then all data accesses would be cached.  In a tight loop
polling the uart to wait for a spot in the tx buffer, the first time
through the loop we read the uart status and it actually goes to the
uart to get that status.  If the tx buffer does not have a spot, then
we continue to loop.  The second read though gets the copy of the
first read from the cache, which says no room yet, the third read gets
the copy of the first read from the cache, which says there is no room
yet.  This continues forever even after the uart has space for a
character, as we have stopped actually talking to the uart, we are
reading a stale copy of the status register.

This is true for any hardware peripheral register or ram.  We cannot
cache some or all of the peripheral address space.  We want data
accesses to be cached for all or most of ram but not for peripherals.
In order to do that usually you use the mmu, and for each of the chunks
of address space controlled by an mmu entry there are bits in that
entry that control whether or not that address space is cacheable.  So
with the mmu we could make the general purpose memory cacheable but the
hardware peripherals not.  This example will show that.

Now something not mentioned above is the notion of virtual memory, do
not confuse that with virtual address space.  We now know that you can
allow the application some virtual address space to operate in and if
it goes outside that space the operating system is alerted and takes
over.  What if we wanted to do that on purpose?  Two very simple
examples of this are, what if we wanted to pretend we have more memory
than we really have.  Doesnt make too much sense on the raspberry pi
but makes a lot of sense on your desktop/laptop.  You might have
4GB of ram, but one or more TB of disk space.  Wouldnt it be cool if
a program that is using some ram but is not running just this moment
could have its ram saved to disk to free up that ram for another program
that is running, and then later when that other program needs its ram
then we swap the ram back from disk to memory so it can use it as
memory?  That is exactly how swap or virtual memory works.  We let the
program run off the end of its space and crash into a protection fault
but instead of issuing an error and stopping the program the operating
system instead knows how much ram this program thinks it has, if it is
within that range, then it looks for more ram for this program if there
is some free it simply maps it in using the mmu, if not then it
hopefully swaps some ram from some other application to disk, freeing
some ram for this application.  The second simplest use case would be
a virtual machine, when I have say vmware running a virtual computer
on a computer.  What if I want to have the virtual machine access the
network?  I could make a range of address space that the virtual
machine thinks is the network peripheral and let the virtual machine
free run in some space, when it tries to access the network peripheral
the operating system is alerted to the protection fault, but instead
of stopping the program and issuing an error, it fakes the peripheral
access and lets the program keep running.

All very cool stuff but it requires first and foremost that all memory
accesses are funneled through a memory management unit or mmu of some
flavor.

As with all baremetal programming, wading through documentation is
the bulk of the job.  Definitely true here, with the unfortunate
problem that ARM's docs dont all look the same from one Architectural
Reference Manual to another.  We have this other problem that we
are technically using an ARMv6 (architecture version 6) but when
you go to http://infocenter.arm.com and look at the Reference Manuals
there is an ARMv5 and then ARMv7 and ARMv8, but no ARMv6.  Well
the ARMv5 manual is actually the original ARM ARM, and I assume they
realized they couldnt maintain all the architecture variations forever
in one document, so they perhaps wisely went to one ARM ARM per rev.
With respect to the MMU, that started in ARMv5 and with ARMv6 there were
some changes made but it still has a backwards compatible mode such
that programs that use the MMU (linux for example) dont necessarily
need an overhaul every version (or need a lot of if-then-else code
to cover all the supported architectures in one binary).  So you can
look at the various architectural reference manuals or sometimes
technical reference manuals for specific cores and see descriptions
of the MMU tables and addressing but the part I mentioned as
unfortunate is that the drawings and descriptions dont have the same
look and feel.  They have the same basic content though.

I am mostly using the ARMv5 Architectural Reference Manual.
ARM DDI0100I.  Where the I is the rev of that ARM ARM document.  The
ARMv5 ARM does show ARMv6 stuff in particular with respect to the MMU,
so it is probably the right manual for this processor.

So there are blocks they call sections and blocks they call pages.
If we were to simply take every possible address and make a look up
table and the contents of the table are the physical address, we could
then translate any virtual address to any physical address, but it
would take up to 4Giga-entries for that table for a 32 bit address
space and each entry of the table would need to be more than 4 bytes,
32 bits for the new address then some others for permissions and
enables, so that would make no sense to have an mmu table larger than
everything we would ever access, actually we couldnt even access that
whole table as it takes more address space than we would have much
less the physical 32 bit address space we are trying to map to.

Lets think about what arm did, we will get to the manual in a
second.  Lets start with a 1MByte section.  That means we take the
4GByte of possible addresses and divide by 1MByte, we get 4096.  That
is a manageable number.  1MByte is 20 bits, 32-20 is 12 (thus 4096).
So we would need to be able to replace the top 12 bits of virtual
address with 12 bits of physical address, plus have other bits in the
table to indicate permissions and cache control and ideally some to
indicate whether this is a 1MB section or not.  And ARM has fit all of
that into a 32 bit entry.  So if we wanted to map the whole 32 bit
virtual address space for the ARM we could do that with a 4096 entry
(4096*32 bits is 16KBytes) MMU table.
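
The arithmetic above can be checked with a few lines of C.  This is
only a sketch, the names are made up for illustration.

```c
#include <stdint.h>

#define SECTION_BITS 20u   /* 1MB = 2 to the power 20 bytes */

/* Number of first level entries needed to cover a 32 bit space. */
static uint32_t section_entries(void)
{
    return 1u << (32u - SECTION_BITS);
}

/* Size of that table in bytes, one 32 bit descriptor per entry. */
static uint32_t section_table_bytes(void)
{
    return section_entries() * (uint32_t)sizeof(uint32_t);
}
```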

So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what
we need now.  See the top level README for finding this document,
I have included a few pages in the form of postscript, any decent pdf
viewer should be able to handle these files.  Before the pictures
though, the section in question is titled Virtual Memory System
Architecture.  In the CP15 subsection register 2 is the translation
table base register.

First we read this comment

If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
table base is backwards compatible with earlier versions of the
architecture.

we will leave that as N = 0 and not touch it and use TTBR0

Now what the TTBR0 description is initially telling me is that bits 31
down to 14-N, or 14 in our case since N = 0, are the base address in
PHYSICAL address space (the mmu cant possibly go through the mmu to
figure out how to go through the mmu).  We basically need to align to
16384 bytes (2 to the power 14, the lower 14 bits of our TLB base
address need to be all zeros).
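
A quick way to sanity check a candidate table address before handing
it to the hardware, as a sketch in C (the function name is mine, not
something from the manual):

```c
#include <stdint.h>

/* With N = 0 the table base must sit on a 16KB (2^14) boundary,
   meaning the lower 14 bits are all zeros. */
static int table_base_is_aligned(uint32_t base)
{
    return (base & 0x00003FFFu) == 0;
}
```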

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base

TLB = Translation Lookaside Buffer.  Strictly speaking the TLB is the
small cache of recent lookups inside the mmu and the thing we build in
memory is the translation table, but this README uses TLB loosely for
both.  As far as we are concerned think of the table as an array of 32
bit integers, each integer being used to completely or partially
convert from virtual to physical and describe permissions and caching.
Thinking of it as an array we can talk about the 3rd thing in the
table, but being 32 bits wide its address offset is really the index
times 4 (with the usual off by one depending on if we are talking zero
based or one based).  This will hopefully make sense in a second.

My example is going to have a define called MMUTABLEBASE which will
be where we start our TLB table.

Now look at the second page of the section_translation.ps file I have
included in this repo directory.  This is hopefully not too complicated
but in order to do this kind of work you have to be able to compute
addresses.  What this is telling us is we start with the MMUTABLEBASE
at the top, this is some space in physical memory that we have decided
we are going to use to keep our mmu table, which means nobody else
can mess with it, if we were an operating system we would only allow
us permission to touch it, and block all applications from it, but since
we are bare metal supervisor we just have to not step on our own toes.

SBZ = should be zero.  Our MMUTABLEBASE as described above is 14 bits
of zeros at the bottom and 32-14 = 18 bits of whatever we choose within
our physical address space.  Using a 0 for the MMUTABLEBASE would
not be a wise idea as interrupts and other vectors are there and we
cant be having both vectors and the mmu table in the same place so
the first sane place we could put this is 0x00004000  upper 18
bits being a 1 the lower 14 being all zeros.  We will pick our address
in a bit.

So this picture says take the MMUTABLEBASE address at the top, then
take bits 31-20 or the top 12 bits of the VIRTUAL ADDRESS, multiply
by 4 (shift left two zeros) and add that to the MMUTABLEBASE.  This
is the address in PHYSICAL memory where the "First-level descriptor"
is found.  This is how the hardware works so when we in our software
place a descriptor in memory we need to compute the address the same
way to get the descriptor in the right place.

Now *IF* the lower two bits of the first level descriptor are 0b10 then
this is a 1MB section descriptor.  The picture then shows that we
create the physical address by taking the lower 20 bits of the virtual
address and placing the 12 bits from the first level descriptor on the
top (31:20) and that is how, for this section, we convert from
virtual to physical.  Part of the virtual being used to look up into
the mmu table, and that first lookup being a 1MB section, and the
physical being a combination of the descriptor and the virtual.
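
Here is what the hardware does with a section descriptor, written out
as a C sketch.  The names are made up for illustration, and anything
that is not a section is simply reported as "not handled here".

```c
#include <stdint.h>

#define DESC_TYPE_MASK    0x3u
#define DESC_TYPE_SECTION 0x2u   /* 0b10 in the lower two bits */

/* Returns 1 and fills *paddr if desc is a 1MB section descriptor,
   returns 0 otherwise. */
static int section_translate(uint32_t desc, uint32_t vaddr, uint32_t *paddr)
{
    if ((desc & DESC_TYPE_MASK) != DESC_TYPE_SECTION)
        return 0;
    /* bits 31:20 come from the descriptor, bits 19:0 from the virtual */
    *paddr = (desc & 0xFFF00000u) | (vaddr & 0x000FFFFFu);
    return 1;
}
```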

If the lower two bits of the first level descriptor, the first lookup,
are not 0b10 then we will get to that in a second.

You should be able to find the same picture in your ARM ARM that I have
stolen here, in the subsection titled "Hardware page table translation".

Now they have this optional thing called a supersection which is a 16MB
sized thing rather than 1MB and one might think that that would make
life easier, instead of 4096 entries we would only need 256 to describe
the whole world in the easiest way with the largest chunks.  But
the lookup works the same, bits 31:20 are used for the first lookup
no matter what (well we could play with that N register, but are not
going to here, N = 0 is the legacy mode and legacy works on the most
chips), so you basically have to write 16 entries for a
supersection, you dont save anything.  The supersection is broken into
16 1MB chunks and each 1MB chunk is a first level mmu table lookup.  So
it doesnt buy us anything for now.  Note how the hardware knows a
1MB section from a 16MB supersection is bit 18 in the first level entry.

Hopefully I have not lost you yet, we are doing address manipulation,
and maybe you are one step ahead of me.  Yes, EVERY load and store with
the mmu enabled requires at least one mmu table lookup.  The mmu when
it accesses this memory does not go through itself, but EVERY other
fetch and load and store does.  This does have a performance hit.  They
do have a bit of a cache in the mmu to store the last so many tlb
lookups to make walking through the same space much faster, but that
tlb cache is limited in size, if you jump around a lot in ram you will
have a penalty here.  Cant really avoid it too much.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678 then the hardware is going to take the top 12 bits of that
address 0x123, multiply by 4 and add that to the MMUTABLEBASE.
0x4000+(0x123<<2) = 0x448C.  and that is the address the mmu is going
to use for the first-level lookup.
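
The same computation as a C sketch, so you can try other addresses.
The function name is made up, the math is the one from the picture.

```c
#include <stdint.h>

/* Address of the first level descriptor for a virtual address:
   table base plus (top 12 bits of the virtual address times 4). */
static uint32_t first_level_desc_addr(uint32_t table_base, uint32_t vaddr)
{
    return table_base + ((vaddr >> 20) << 2);
}
```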

If you look in the ARM ARM at the first level descriptor format, the
lower two bits of the value read at that address tell the mmu hardware
if this is a fault, a coarse page table, a section, or reserved (a
fault?).  Above we talked about a section with those two bits being
0b10.  If the mmu finds a 0b01 instead then we look at the
coarse_translation.ps file that I have put in this directory.  Like
the section translation, we see the MMUTABLEBASE, we tack on the top 12
bits of the virtual address (times 4) and that is the first level fetch.
If that first level descriptor has 0b01 in the lower two bits, then the
mmu looks at the top 22 bits of the first level descriptor, tacks
on bits 19:12 of the virtual address (times 4) and uses that address to
find the second level descriptor.  The second level descriptor is not
shown in this picture, you have to look at the table in the arm arm for
the description.  Here again the lower 2 bits tell the hardware
something, large or small pages basically for a legacy/compatible
discussion, and that second level descriptor contains the bits that
convert the virtual address to a physical address plus the permissions
stuff.

So lets take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again.  The first level descriptor address is the top three
hex digits (12 bits) of the virtual address 0x123, times 4, added to
the MMUTABLEBASE, 0x448C.  But this time when we look it up we find a
value in the table that has the lower two bits being 0b01.  Just to be
crazy lets say that descriptor was 0xABCDE001 (ignoring the domain and
other bits, just talking address right now).  That means we take
0xABCDE000 as the coarse table base.  The picture shows bits 19:12
(0x45) of the virtual address (0x12345678) being used as the index, so
the address of the second level descriptor in this crazy case is
0xABCDE000+(0x45<<2) = 0xABCDE114.  Why is that crazy?  Because I
chose an address where we in theory dont have ram on the raspberry pi,
maybe a mirrored address space, but a sane address would have been
somewhere close to the MMUTABLEBASE so we can keep the whole of the
mmu tables in a confined area.
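
The second level address computation as a C sketch, assuming the
coarse table layout described above (base in bits 31:10 of the first
level descriptor, index from bits 19:12 of the virtual address).  The
function name is made up for illustration.

```c
#include <stdint.h>

/* Address of the second level descriptor, given a first level
   descriptor that points at a coarse page table. */
static uint32_t coarse_second_level_addr(uint32_t first_desc, uint32_t vaddr)
{
    uint32_t coarse_base = first_desc & 0xFFFFFC00u;  /* bits 31:10 */
    uint32_t index = (vaddr >> 12) & 0xFFu;           /* bits 19:12 */
    return coarse_base + (index << 2);
}
```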

The "other" bits in the descriptors are the domain, the TEX bits and
the C and B bits.

The C bit is the simplest one to start with that means Cacheable.  For
peripherals we absolutely dont want them to be cached.

The B bit means bufferable, as in write buffer.  Something you may
not have heard about or thought about ever.  It is kind of like a cache
on the write end of things instead of read end.   I digress, when
a processor writes something everything is known, the address and
data.  So the next level of logic, could, if so designed, accept
that address and data at that level and release the processor to
keep doing what it was doing (ideally fetch some more instructions
and keep running) in parallel that logic could then continue to perform
the write to the slower peripheral or really slow dram (or faster cache).
Giving us a small to large performance gain.  But, what happens if while
we are doing that first write another write happens.  Well if we only
have storage for one transaction in this little feature then the
processor has to wait for us to finish the first write however long
that takes, then we can grab the information for the second write and
then release the processor.  I call writes "fire and forget" because
ideally the processor hands off the info to the memory controller
and keeps going.  Well the kind of write buffer I know about and hopefully
this is the same kind, goes beyond that I can do one write for you at
a time type of fire and forget, it is a tiny cache like thing that
can store up some number of addresses and data and allow the processor
to continue while those addresses and data are delivered to their
destination in parallel.

The description from the ARM ARM is:

"A write buffer is a block of high-speed memory whose purpose is to
optimize stores to main memory. When a store occurs, its data, address
and other details, for example data size, are written to the write
buffer at high speed. The write buffer then completes the store at main
memory speed. This is typically much slower than the speed of the ARM
processor. In the meantime, the ARM processor can proceed to execute
further instructions at full speed."

Eventually the write has to go out, and since that far side is
generally slower the write buffer can fill up and the processor has to
wait for some space before continuing.  Like a cache helps the
processor by making many loads faster, the write buffer helps to make
many writes faster.
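
To make the "fills up and the processor has to wait" part concrete,
here is a toy software model of a write buffer in C.  Everything here
is made up for illustration, including the depth of 4, real hardware
differs and does all of this without software being involved.

```c
#include <stdint.h>

#define WB_SLOTS 4u   /* made-up depth */

struct wb_entry { uint32_t addr; uint32_t data; };

struct write_buffer {
    struct wb_entry slot[WB_SLOTS];
    unsigned head;    /* oldest pending write */
    unsigned count;   /* number of pending writes */
};

/* Processor side: returns 1 if the write was accepted (fire and
   forget), 0 if the buffer is full and the processor would stall. */
static int wb_store(struct write_buffer *wb, uint32_t addr, uint32_t data)
{
    unsigned tail;

    if (wb->count == WB_SLOTS)
        return 0;
    tail = (wb->head + wb->count) % WB_SLOTS;
    wb->slot[tail].addr = addr;
    wb->slot[tail].data = data;
    wb->count++;
    return 1;
}

/* Memory side: completes the oldest pending write, returns 0 if
   there was nothing to do. */
static int wb_drain_one(struct write_buffer *wb, struct wb_entry *out)
{
    if (wb->count == 0)
        return 0;
    *out = wb->slot[wb->head];
    wb->head = (wb->head + 1u) % WB_SLOTS;
    wb->count--;
    return 1;
}
```

Four stores in a row are accepted right away, the fifth has to wait
until the slow side has drained at least one entry.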

Now the TEX bits you just have to look up, and there is the rub, there
is likely more than one set of tables for TEX, C and B.  I am going
to stick with a TEX of 0b000 and not mess with any fancy features
there.  Now depending on whether this is considered an older arm
(ARMv5) or an ARMv6 or newer the combination of TEX, C and B has
some subtle differences.  The cache bit in particular does enable
or disable this space as cacheable.  You still independently need
to turn on the instruction and data caches.  If the section is marked
cacheable and the cache is on for the access type, then accesses
within that section get cached.  So we set TEX to zeros to just keep it
out of the way.

Lastly the domain bits.  Now you will see a 4 bit domain thing and
a 2 bit domain thing.  These are related.  There is a register in
the MMU right next to the translation table base address register this
one is a 32 bit register that contains 16 different domain definitions.

The two bit domain controls are defined as such.

0b00 No access Any access generates a domain fault
0b01 Client Accesses are checked against the access permission bits in the TLB entry
0b10 Reserved Using this value has UNPREDICTABLE results
0b11 Manager Accesses are not checked against the access permission bits in the TLB
entry, so a permission fault cannot be generated

For starters we are going to set all of the domains to 0b11 dont check
cant fault.  What are these 16 domains though?  Notice it takes 4 bits
to describe one of 16 things.  The different domains have no specific
meaning other than that we can have 16 different definitions that we
control for whatever reason.  You might allow for 16 different
threads running at once in your operating system, or 16 different
types of software running (kernel, application, ...).  You can mark
a bunch of sections as belonging to one particular domain, and with a
simple change to that domain control register, a whole domain might
go from one type of permission to another, from no checking to
no access for example.

Since I usually use the MMU in bare metal to enable data caching on ram
I set my domain controls to 0b11, no checking and I simply make all
the MMU sections domain number 0.
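
Since each domain gets two bits in that 32 bit register, building a
value for it is simple shifting and masking.  A sketch in C, the
function name is made up for illustration:

```c
#include <stdint.h>

/* Return dacr with the two bit field for one domain (0..15)
   replaced by access (0b00 none, 0b01 client, 0b11 manager). */
static uint32_t dacr_set(uint32_t dacr, unsigned domain, uint32_t access)
{
    unsigned shift = (domain & 15u) * 2u;

    dacr &= ~((uint32_t)3u << shift);
    dacr |= (access & 3u) << shift;
    return dacr;
}
```

Starting from 0xFFFFFFFF (all manager) and setting one domain to 0b00
takes that whole domain from no checking to no access.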

So we end up with this simple function that allows us to add first level
descriptors in the MMU translation table.

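unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;                //top 12 bits of the virtual address
    rb=MMUTABLEBASE|(ra<<2);    //address of the first level descriptor
    ra=padd>>20;                //top 12 bits of the physical address
    rc=(ra<<20)|flags|2;        //0b10 in the low bits marks a section
    PUT32(rb,rc);               //store the descriptor in the table
    return(0);
}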

So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries
for that.  This is important, if you forget something, and dont have
a valid entry there, then you fault.  Your fault handler, if you have
chosen to write one, may also fault if it isnt placed right or
something it accesses also faults...(I would assume the fault handler
is also behind the mmu but would have to read up on that).

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section 0x000xxxxx so we should make that section cacheable and
bufferable.

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B
bit.  tex, domain, etc are zeros.

If we want to use all 256MB we would need to do this for all the
sections from 0x000xxxxx to 0x0FFxxxxx.  Maybe do that later.
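
Covering all 256MB is just a loop over 256 sections.  The sketch below
is a hosted simulation so it can be run on a desktop: sim_table and
sim_mmu_section are stand-ins I made up for the table at MMUTABLEBASE
and for mmu_section, they are not part of the example code.

```c
#include <stdint.h>

#define SECTION_C 0x8u   /* cacheable */
#define SECTION_B 0x4u   /* bufferable */

static uint32_t sim_table[4096];   /* stands in for MMUTABLEBASE memory */

/* Same descriptor math as mmu_section, but writing to an array. */
static void sim_mmu_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    sim_table[vadd >> 20] = (padd & 0xFFF00000u) | flags | 2u;
}

/* Identity map the first 256MB, cached and write buffered. */
static void map_ram_256mb(void)
{
    uint32_t addr;

    for (addr = 0x00000000u; addr < 0x10000000u; addr += 0x00100000u)
        sim_mmu_section(addr, addr, SECTION_C | SECTION_B);
}
```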

We know that for the raspi1 the peripherals, uart and such are in
arm physical space at 0x20xxxxxx.  To allow for more ram on the raspi 2
they needed to move that and moved it to 0x3Fxxxxxx.  So we either need
16 1MB section sized entries to cover that whole range or we look at
specific sections for specific things we care to talk to and just add
those.  The uart and the gpio it is associated with is in the 0x202xxxxx
space.  There are a couple of timers in the 0x200xxxxx space so one
entry can cover those.

if we didnt want to allow those to be cached or write buffered then

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

but we may play with that to demonstrate what caching a peripheral
can do to you, why we need to turn on the mmu if for no other reason
than to get some bare metal performance by using the d cache.

Now you have to think on a system level here, there are a number
of things in play.  We need to plan our memory space, where are we
putting the mmu table, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d cache
then just map the whole world if you want with the peripherals not
cached and the rest cached.  or only the stuff you think you are going
to use.

If you are on the raspi 2 with multiple arm cores and are using
the multiple arm cores you need to do more reading if you want one
core to talk to another by sharing some of the memory between
them.  Same problem as peripherals basically, plus some other issues.
If you have the write buffer on then a write doesnt happen right away,
it depends on how full the write buffer is and basically that is not
usually deterministic.  But worse, when data caching a shared space you
dont know if you are reading from the actual shared ram or from the
cache for that core.  And further you need to read up on whether
or not each core has its own mmu, or where do their memory systems
come together?  You can and I will run this example on a raspi 2 but
only using one core, not messing with the other three.  Ideally I am
making a generic example that can be ported to other arm processors
from an mmu perspective.  From a peripheral perspective you have to use
different code for the different peripherals in that other arm you
might move this knowledge to.

So once our tables are setup then we need to actually turn the
MMU on.  Now I cant figure out where I got this from, and I have
modified it in this repo.  According to this manual it was with the
ARMv6 that we got the DSB feature which says wait for either cache
or MMU to finish something before continuing.  In particular when
initializing a cache to start it up you want to clean out all the
entries in a safe way you dont want to evict them and hose memory
you want to invalidate everything, mark it such that the cache lines
are empty/available.  Likewise that little bit of TLB caching the MMU
has, we want to invalidate that too so we dont start up the mmu
with entries in there that dont match our entries.

Why are we invalidating the cache in mmu code?  Because first we
need the mmu to use the d cache (to protect the peripherals from
being cached) and second the controls that enable the mmu are in the
same register as the i and d controls so makes sense to do both
mmu and cache stuff in one function.

So after the DSB we set our domain control bits, now in this example
I have done something different, 15 of the 16 domains have the 0b11
setting which is dont fault on anything, manager mode.  I set domain
1 such that it has no access, so in the example I will change one
of the descriptor table entries to use domain one, then I will access
it and then see the access violation.  I am also programming both
translation table base addresses even though we are using the N = 0
mode and only TTBR0 is needed.  Depends on which manual you read I
guess as to whether or not you see the N = 0 description.  (The reason
for two is that with N greater than 0 the hardware uses TTBR0 for the
low part of the virtual address space and TTBR1 for the rest, so an
operating system can keep per-process and kernel tables separate.)
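
To move one table entry into domain 1 the domain number has to go into
the descriptor.  In a section descriptor the domain field is bits 8:5
(check the first level descriptor format in your ARM ARM).  A sketch in
C, the function name is made up for illustration:

```c
#include <stdint.h>

/* Build a 1MB section descriptor with an explicit domain number. */
static uint32_t section_desc(uint32_t padd, unsigned domain, uint32_t flags)
{
    return (padd & 0xFFF00000u)
         | ((uint32_t)(domain & 15u) << 5)   /* domain, bits 8:5 */
         | flags
         | 2u;                               /* 0b10 = section */
}
```

Since mmu_section just ORs its flags argument into the descriptor,
passing a flags value with 1<<5 set has the same effect.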

Understand I have been running on ARMv6 systems without the DSB for
some time and it just works, so maybe that is dumb luck...

This code relies on the caller to set the MMU enable and I and D cache
enables.  This is because this is derived from code where sometimes I
turn things on or dont turn things on and wanted it generic.


.globl start_MMU
start_MMU:
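    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB (data synchronization barrier)

    mvn r2,#0            ;@ all 16 domains 0b11, manager
    bic r2,r2,#0xC       ;@ except domain 1, 0b00 no access
    mcr p15,0,r2,c3,c0,0 ;@ domain access control

    mcr p15,0,r0,c2,c0,0 ;@ tlb base, TTBR0
    mcr p15,0,r0,c2,c0,1 ;@ tlb base, TTBR1

    mrc p15,0,r2,c1,c0,0 ;@ read the control register
    orr r2,r2,r1         ;@ or in the enable bits from the caller
    mcr p15,0,r2,c1,c0,0 ;@ write it back

    bx lr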
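
The caller passes the table base in r0 and the enable bits in r1.  In
the CP15 control register the MMU enable is bit 0, the data cache
enable is bit 2 and the instruction cache enable is bit 12.  A sketch
of building that second argument in C, the macro and function names
are made up for illustration:

```c
#include <stdint.h>

#define CTRL_MMU_ENABLE    0x0001u   /* bit 0, M */
#define CTRL_DCACHE_ENABLE 0x0004u   /* bit 2, C */
#define CTRL_ICACHE_ENABLE 0x1000u   /* bit 12, I */

/* Enable bits for start_MMU: mmu always, caches optional. */
static uint32_t mmu_enable_bits(int dcache, int icache)
{
    uint32_t bits = CTRL_MMU_ENABLE;

    if (dcache) bits |= CTRL_DCACHE_ENABLE;
    if (icache) bits |= CTRL_ICACHE_ENABLE;
    return bits;
}
```

On the target the call would then look something like
start_MMU(MMUTABLEBASE,mmu_enable_bits(1,1)), assuming a C prototype
that takes two unsigned ints to match the register usage above.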

I am going to mess with the translation tables after the MMU is started,
so I assume we have to invalidate when a table entry changes.  Just in
case the old entry is cached in the tlb, we force a read of the new
one by invalidating all the tlbs.  Depending on the manual you read
there are cases where we don't have to invalidate.  I will just
invalidate anyway to be clean and generic; you can optimize later if
you want to dig into those features, if your core has them.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

Something to note here.  Debugging using JTAG makes life easier than
having to press reset and wait for a debugger, or even worse having
to remove some media or a prom and stick it in some programmer to change
the program.  Depending on your processor, though, you have to be super
careful when debugging programs using JTAG with the caches and/or mmu
on.  The openocd support for the cores used in the raspi2 implies that
when the openocd server halts the cores, it disables the I and D caches
(not sure about the mmu).  But for the raspi1 and quite a few other
ARMs out there, here is the problem you have using jtag.

Instructions are fetched and stored in the instruction cache, thus the
name, and data is read through and written through the data cache.
Say we have a program with the i and d caches on, and it runs for a
bit.  Instructions go into the i cache, and depending on the size of
the program and the addresses used, some percentage of the program is
in the i cache when we halt the processor.  Let's say that includes the
instruction at address 0x10000.  Now we want to write a new version of
the program to ram and test it.  Writing to ram uses data cycles, which
go to/through the data cache to ram.  And let's say one of the
instructions in the new program is at address 0x10000.  So ideally the
new instruction is in ram at address 0x10000, but the instruction at
that address from the prior experiment is still in the i cache.

Now we start the program again at the entry point, and before the
program goes out and cleans the caches and starts stuff, it hits
address 0x10000.  (The program doesn't know it is being run for a
second time from jtag; it is written to boot into this code from reset
or power up.)  If the old instruction in the cache for address 0x10000
is different from the new instruction at address 0x10000, the cache is
going to give the processor the old instruction, because we left the
caches on.  Much chaos happens when you do this.

Your processor core and your jtag software may disable the mmu and
caches automatically, or may have manual controls for it, or maybe
not.  You have to be very, very aware of this though.  You might try
several iterations of your program and they all seem to be progressing
fine, then strange things start to happen.  Sometimes your whole old
program is in cache and it is as if the new program wasn't being
loaded.  Or maybe you start to think you didn't compile it or save it
to the place where you pick up the binary; you repeat this many times
but the new program simply isn't being run.

I recommend, for the purposes of this example, that you use the reset
button which you soldered down on your board like I did.  If you
didn't, then power cycle the raspberry pi every time, or often, or do
the research to see if/how you can disable the mmu and caches between
runs and habitually perform that step.  I use openocd a lot on many
different cores, not all of which have caches and mmus, so I don't
have the habit of doing this; instead, if I get tripped up, I start
resetting between tests...

So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces, so that later we
can play with the section descriptors and demonstrate virtual to
physical address conversion.

So write some stuff and print it out on the uart.

    PUT32(0x00045678,0x00045678);
    PUT32(0x00145678,0x00145678);
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

then setup the mmu with at least those four sections and the peripherals

    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
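
As a rough sketch of what each of those calls boils down to, here is
the arithmetic for a first-level section entry, written so it can be
run on a host machine.  The two function names are invented for this
illustration, and the access permission bits (AP, bits 11:10) are left
to the caller through flags here; the real mmu_section may OR in its
own.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first-level table entry that covers virtual address vadd.
   Each 1MB section has one 32-bit entry, indexed by the top 12 address bits. */
static uint32_t section_entry_address(uint32_t table_base, uint32_t vadd)
{
    assert((table_base & 0x3FFFu) == 0);   /* 16KB aligned when N = 0 */
    return table_base | ((vadd >> 20) << 2);
}

/* Value of a first-level section descriptor: physical base in bits 31:20,
   caller supplied flags (domain, C, B, ...) and 0b10 in bits 1:0. */
static uint32_t section_descriptor(uint32_t padd, uint32_t flags)
{
    assert((flags & 0xFFF00003u) == 0);    /* flags must not touch base or type */
    return (padd & 0xFFF00000u) | flags | 0x2u;
}
```

Assuming a table base of 0x00004000 (an assumption for this example
only), the entry for virtual 0x20200000 lives at 0x00004808, and a
cached and buffered section at physical 0x00000000 (flags 8|4) has the
descriptor 0x0000000E before any access permission bits are added.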

and start the mmu with the I and D caches enabled

    start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);
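
The second argument is easier to read with the bits named.  A small
sketch, with the bit meanings taken from my reading of the ARM1176
control register; the CR_ names and control_bits() are made up for
illustration and are not in the example code.

```c
#include <assert.h>
#include <stdint.h>

/* CP15 control register bits used by this example. */
#define CR_M   (1u << 0)    /* MMU enable */
#define CR_C   (1u << 2)    /* data cache enable */
#define CR_I   (1u << 12)   /* instruction cache enable */
#define CR_XP  (1u << 23)   /* ARMv6 extended page table format */

/* Combine the always-on bits with optional cache enables, the way the
   second argument to start_mmu is built.  Hypothetical helper. */
static uint32_t control_bits(uint32_t cache_enables)
{
    /* only the cache enables are optional in this sketch */
    assert((cache_enables & ~(CR_C | CR_I)) == 0);
    return CR_XP | CR_M | cache_enables;
}
```

control_bits(CR_I | CR_C) gives 0x00801005, which is the same as
0x00800001|0x1000|0x0004 in the call above; control_bits(0) would
start the mmu with both caches left off.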

then if we read those four addresses again we get the same output
as before since we mapped virtual = physical.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

but what if we swizzle things around?  make virtual 0x001xxxxx =
physical 0x003xxxxx, have 0x002xxxxx look at 0x000xxxxx and 0x003xxxxx
look at 0x001xxxxx.

    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);

and maybe we dont need to do this but do it anyway just in case

    invalidate_tlbs();

read them again.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

the 0x000xxxxx entry was not modified so we get 00045678 as the output
but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we
get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx space
so that read gives 00045678 and the 0x003xxxxx is mapped to 0x001xxxxx
physical giving 00145678 as the output.
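
The same lookup can be modeled on a host machine.  This is only a toy
model of the four sections after the swizzle, not the hardware table
format; section_base and translate() are names invented for the
sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the four sections after the swizzle: index is the virtual
   megabyte, value is the physical section base it now points at. */
static const uint32_t section_base[4] = {
    0x00000000u,   /* virtual 0x000xxxxx, unchanged */
    0x00300000u,   /* virtual 0x001xxxxx -> physical 0x003xxxxx */
    0x00000000u,   /* virtual 0x002xxxxx -> physical 0x000xxxxx */
    0x00100000u,   /* virtual 0x003xxxxx -> physical 0x001xxxxx */
};

/* Translate a virtual address in the first 4MB: the top 12 bits pick the
   table entry, the low 20 bits pass through as the offset in the section. */
static uint32_t translate(uint32_t vadd)
{
    uint32_t index = vadd >> 20;
    assert(index < 4);
    return section_base[index] | (vadd & 0x000FFFFFu);
}
```

Since each physical location was written with its own address before
the mmu was turned on, translate(0x00145678) returning 0x00345678 is
exactly the value that read prints.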


    mmu_section(0x00100000,0x00100000,0x0020);

    invalidate_tlbs();
    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

So up to this point the output looks like this.

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

first blob is without the mmu enabled, second with the mmu but
virtual = physical, third we use the mmu to show virtual != physical
for some ranges.


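for the next experiment, there is a system timer in the 0x200xxxxx
range.  read it four times with that section not cached, then mark the
section as cached and read it four times again.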


    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

    mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
    invalidate_tlbs();

    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

your output may vary, I am using bootloader07, so the human is involved
in typing and clicking stuff and downloading the program and starting
it so the time at which after reset we hit this code may vary and
give different timer ticks.

006BBB1B
006BBEE1
006BC2A7
006BC66C

00000000
00000000
00000000
00000000

why are the cached values zeros and not the same timestamp four times
which is what I was expecting?  that is a very good question and worthy
of a research project.


--- REWRITE IN PROGRESS ---


And then the icing on the cake, one section is marked as domain 1
instead of domain 0, domain 1 was set for 0b00 no access so when we
touch that domain we should get an access violation.
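
A small sketch of how the two pieces fit together: the domain number
sits in bits 8:5 of the section descriptor (so flags of 0x0020 means
domain 1), and the domain control value has a two bit field per
domain.  The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Domain number stored in bits 8:5 of a section descriptor. */
static unsigned int descriptor_domain(uint32_t descriptor)
{
    return (descriptor >> 5) & 0xFu;
}

/* Two-bit access field for one domain in the domain access control value:
   0b00 no access, 0b01 client, 0b11 manager. */
static unsigned int domain_access(uint32_t dacr, unsigned int domain)
{
    assert(domain < 16);
    return (dacr >> (2 * domain)) & 0x3u;
}
```

With the 0xFFFFFFF3 domain control value from start_mmu,
domain_access(0xFFFFFFF3, descriptor_domain(0x0020)) is 0, no access,
while domain 0 still reads back as 3, manager.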

00045678
00000010

How do I know what that output means?  Well, in my blinker07 example
we touched on exceptions (interrupts).  I made a generic test fixture
such that anything other than a reset prints something out and then
hangs.  In no way shape or form is this a complete handler, but what
it does show is that the exception at address 0x00000010 is the one
that gets hit, which is data abort.  Having figured out it was a data
abort (pretty much expected), I then had the handler read the fault
status registers.  Being a data access, we expect the data/combined
one to show something and the instruction one not to.  Adding that
instrumentation resulted in:

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail.  In the 0x00000019 value the 0x1 in bits 7:4 is domain 1, the
domain we used for that access, and the 0x9 in bits 3:0 means Domain
Section Fault.
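
Pulling those two fields out of the fault status value can be sketched
like this.  The struct and decode_dfsr() are invented names, and the
sketch deliberately ignores the extra status bit at bit 10.

```c
#include <assert.h>
#include <stdint.h>

struct dfsr_fields {
    unsigned int status;   /* fault status, bits 3:0 */
    unsigned int domain;   /* domain of the faulting access, bits 7:4 */
};

/* Split a data fault status value into its status and domain fields. */
static struct dfsr_fields decode_dfsr(uint32_t dfsr)
{
    /* this sketch does not handle the extended status bit (bit 10) */
    assert((dfsr & (1u << 10)) == 0);
    struct dfsr_fields f;
    f.status = dfsr & 0xFu;
    f.domain = (dfsr >> 4) & 0xFu;
    return f;
}
```

decode_dfsr(0x00000019) gives status 0x9 and domain 1, matching the
reading above.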

The lr during the abort leads us to the instruction (for a data abort
the instruction is at lr minus 8), which you would need to disassemble
to figure out the address, or at least that is one way to do it.  There
is also a fault address register in cp15 for that.

The instruction and the address match our expectations for this fault.
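
For what it is worth, here is a host-side sketch of that check by
hand.  It only handles the simple LDR Rd,[Rn,#imm] form, the function
names are made up, and the value of r0 at the time of the fault is an
assumption taken from the last line of the output.

```c
#include <assert.h>
#include <stdint.h>

/* On a data abort in ARM state, lr points 8 bytes past the instruction
   that faulted. */
static uint32_t faulting_pc(uint32_t abort_lr)
{
    assert(abort_lr >= 8);
    return abort_lr - 8;
}

/* True if the word is an ARM single-register load with an immediate offset
   (LDR/LDRB Rd,[Rn,#imm]): bits 27:25 = 0b010 and the L bit (20) set. */
static int is_ldr_immediate(uint32_t insn)
{
    return ((insn >> 25) & 0x7u) == 0x2u && ((insn >> 20) & 1u) == 1u;
}

/* Base register number (Rn, bits 19:16) of a load/store instruction. */
static unsigned int base_register(uint32_t insn)
{
    return (insn >> 16) & 0xFu;
}

/* Address accessed by an offset-addressed LDR (P = 1), given Rn's value. */
static uint32_t ldr_immediate_address(uint32_t insn, uint32_t rn_value)
{
    assert(is_ldr_immediate(insn));
    assert((insn >> 24) & 1u);            /* P bit: offset applied before access */
    uint32_t offset = insn & 0xFFFu;
    int add = (insn >> 23) & 1u;          /* U bit: add or subtract the offset */
    return add ? rn_value + offset : rn_value - offset;
}
```

With lr of 0x00008110 the faulting instruction is at 0x00008108.  The
word 0xE5900000 decodes as a load with base register r0 and offset 0,
so if r0 held 0x00145678 the access went to 0x00145678, which is in the
section we put in domain 1.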