raspberrypi

See the top level README file for more information on documentation
and how to run these programs.

This example demonstrates ARM MMU basics.

You will need the ARM ARM (ARM Architectural Reference Manual) for
ARMv5.  I have a couple of pages included in this repo, but you still
will need the ARM ARM.

This code so far does not work on the Raspberry pi 2 yet, will get
that working at some point, the knowledge here still applies, I expect
the differences to be subtle between ARMv6 and 7 but will see.



-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES  --




So what an MMU does or at least what an MMU does for us is it
translates virtual addresses into physical addresses as well as
checking access permissions, and gives us control over cachable
regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core, part of that
boundary is the memory interface for the ARM for lack of a better term
how the ARM accesses the world.  Nothing special, all processors have
some sort of address and data based interface between the processor and
the ram and peripherals.  That boundary uses physical addresses, that
boundary is on the memory side or "world side" of the ARM's mmu.
Within the ARM core there is the "processor side" of the mmu, and all
load and store (and fetch) accesses to the world go through the mmu.

When the ARM powers up the mmu is disabled, which means all accesses
pass through unmodified making the "processor side" or virtual address
space equal to the world side physical address space.  All of my
examples thus far, blinkers and such are based on physical addresses.
We already know that elswhere in the chip is another address
translation of some sort, because the manual is written for 0x7Exxxxxx
based adresses, but the ARM's physical addresses for those same things
is 0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the raspi 2.  For this
discussion we only care about that other mystery address translation
we care about the ARM and the ARM mmu.

So when I say the mmu translates virtual addresses into physical
addresses.  What that means is on the processor side there is an address
you are accessing, but that does not have to be the same address on
the physical address side of the mmu.  Lets say for example I am
running a program on an operating system, Linux lets say, and I need
to compile that program before I can use it and I need to link it for
an address space so lets say that I link it to enter at address 0x8000
and use memory from 0x0000 to whatever I need and/or whatever is
available.  So that is all fine, except what if I have two programs
and I want both running "at the same time" how can both use the same
address space without clobbering each other?  The answer is neither is
at that address space the virtual address WHEN RUNNING one of them is
in the virtual address space 0x00000000 to some number, but in reality
program 1 might have that mapped to the physical address 0x01000000 and
program 2 might have its 0x00000000 to some number mapped to 0x02000000.
So when program 1 thinks it is writing to address 0xABCDE it is really
writing to 0x010ABCDE and when program 2 thinks it is writing to
address 0xABCDE it is really writing to 0x020ABCDE.

If you think about it it doesnt make any sense to allow any virtual
address to map to any physical address, for example from 0x12345678
to 0xAABBCCDD.  Think about it, we are talking about a 32 bit address
space or 4Giga addresses.  If we allowed any address to convert to
any other address we would need a 4Giga to 4Giga map, we would actually
need 16Gigabytes just to hold the 4Giga physical adresses worst case.
To cut to the chase ARM has one option where the top 12 bits of the
virtual get translated to 12 bits of physical, the lower 20 bits in
that case are the same between the virtual and physical.  This means
we can control 1MByte of address space with one definition, and have
4096 entries in some table somewhere to convert from virtual to
physical.  That is quite managable.  The minimum we would need to
store are the 12 replacement bits per table entry, but ARM uses a full
32 bit entry, which for this 1MB flavor, has the 12 physical bits plus
some other control bits.

What does cachable regions mean?  The mmu also gives you the feature
of being able to choose per descriptor whether or not you want to
enable caching on that block.  One obvious reason would be for the
peripherals.  Think about a timer, ideally you read the current timer
tick and each time you read it you get the current timer tick and
as it changes you see it change.  But what if when we turned on the
data cache it covered all addresses, all loads and stores?  Then you
read the timer once, get a value, read it again, now you get the
cached value over and over again you dont see the real timer value
in the peripheral.  That is not good, you cannot manage a peripheral
if you cannot read its status register or read the data coming out
of it, etc.  So at a minimum your peripherals need to be in non-cached
blocks.  Likewise, if you have some ram that is shared by more than
one resource, say the GPU and the ARM or for the raspberry pi 2 shared
between multiple ARM cores, you have a similar situation, another
resource may change the ram on the far side of your cache but your
cache assumes it has a copy of what is in ram.  Basically a cache
only helps you if whatever on the far side of it is only modified by
writes through the cache, if there are ways to change the data on
the far side you should not cache that area.   The mmu gives you
the ability to control cached and non-cahced spaces.

What is meant by access permissions?  Lets think about those two
programs running "at the same time" on some operating system (Linux
for example) you dont want to allow one program to gain access to
the operating systems data nor some other programs data.  Some
operating systems sure that are meant for only running trusted and
well mannered programs.  But you dont want some video game on your
home computer to have access to your banking account data in another
window/program?  The mechanisms vary across processor families but
an important job for the mmu is to provide a protection mechanism.
Such that when a particular program has a time slice on the processor
there is some mechanism to allow or restrict memory spaces.  If some
code accesses an address that it does not have permission for then
an abort happens and the processor is notified.  An interesting
side effect of this is that this doesnt have to be fatal, in fact it
could be by design.  Think of a virtual machine, you could let the
virtual machine software run on the processor, and when it accesses
one of its peripherals the real operating system gets an abort but
instead of killing the virtual machine it actually simulates the
peripheral and lets the virtual machine keep running.  Another one
that you have probably run into is when you run out of ram in your
computer, the notion of virtual memory which is differen than virtual
address space.  Virtual memory in this case is when your program
ventures off the end of its allowed address space into ram it thinks
it has.  The operating system gets an abort, finds some ram from
some other program, swaps that ram to disk for example, then allows
the program that was running to have a little more ram by mapping it
back in and allowing it to run.  Later when the program whose data
got swapped to disk needs it it swaps back and whatever was in the
ram it swaps with then goes to disk.  The term swap comes from the
idea that these blocks of ram are swapped back and forth to disk,
program A's ram goes to disk and is swapped with program T's, then
program T's is swapped with program K's and so on.  This is why
starting right after you venture off that edge from real ram to
virtual, your computers performance drops dramatically and disk
activity goes way up, the more things running the more swapping going
on and disk is significantly slower than ram.

As with all baremetal programming, wading through documentation is
the bulk of the job.  Definitely true here, with the unfortunate
problem that ARM's docs dont all look the same from one Archtectural
Reference Manual to an other.  We have this other problem that we
are techically using an ARMv6 (architecture version 6)(for the raspi 1)
but when you go to ARM's website there is an ARMv5 and then ARMv7 and
ARMv8, but no ARMv6.  Well the ARMv5 manual is actually the original
ARM ARM, that I assume they realized couldnt maintain all the
architecture variations forever in one document, so they perhaps
wisely went to one ARM ARM per rev.  With respect to the MMU, the ARMv5
reference manual covers the ARMv4 (I didnt know there was an mmu option
there) ARMv5 and ARMv6, and there is mode such that you can have the
same code/tables and it works on all three, meaning you dont have to
if-then-else your code based on whatever architecture you find.  This
raspi 1 example is based on subpages enabled which is this legacy or
compatibility mode across the three.

I am mostly using the ARMv5 Architectural Reference Manual.
ARM DDI0100I.

The 1MB sections mentioned above are called...sections...The ARM
mmu also has blobs that are smaller sizes 4096 byte pages for
example, will touch on those two sizes.  The 4096 byte one is called
a small page.

As mentioned above, 32 bit address space, 1MB is 20 bits so 32-20 is
12 bits or 4096 possible combinations or the address space is broken
up into 4096 1MB sections.  The top 12 bits of the virtual address
get translated to 12 bits of physical.  No rules on the translation
you can have virtual = physical or have any combination, or have
a bunch of virtual sections point at the same physical space, whatever
you want/need.

ARM uses the term Virtual Memory System Architecture or VMSA and
they say things like VMSAv6 to talk about the ARMv6 VMSA.  There
is a section in the ARM ARM titled Virtual Memory System Architecture.
In there we see the coprocessor registers, specifically CP15 register
2 is the translation table base register.


So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what
we need now.  See the top level README for finding this document,
I have included a few pages in the form of postscript, any decent pdf
viewer should be able to handle these files.  Before the pictures
though, the section in quesiton is titled Virtual Memory System
Architecture.  In the CP15 subsection register 2 is the the translation
table base register.  There are three opcodes which give us access to
three things, TTBR0, TTBR1 and the control register.  

First we read this comment

If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
table base is backwards compatible with earlier versions of the
architecture.

That is the one we want, we will leave that as N = 0 and not touch it
and use TTBR0

Now what the TTBR0 description initially is telling me that bit 31
down to 14-n or 14 in our case since n = 0 is the base address, in
PHYSICAL address space.  Note the mmu cannot possibly go through the
mmu to figure out how to go through the mmu, the mmu itself only
operates on physical space and has direct access to it.  In a second
we are going to see that we need the base address for the mmu table
to be aligned to 16384 bytes.  (2 to the power 14, the lower 14 bits
of our TLB base address needs to be all zeros).

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base

TLB = Translation Lookaside Buffer.  As far as we are concerned think
of it as an array of 32 bit integers, each integer (descriptor) being
used to completely or partially convert from virtual to physical and
describe permissions and caching.

My example is going to have a define called MMUTABLEBASE which will
be where we start our TLB table.

Here is the reality of the world.  Some folks struggle with bit
manipulation, orring and anding and shifting and such, some dont.  The
MMU is logic so it operates on these tables in the way that logic would,
meaning from a programmers perspective it is a lot of bit manipulation
but otherwise is relatively simple to something a program could do.  As
programmers we need to know how the logic uses portsion of the virtual
address to look into this descriptor table or TLB, and then extracts
from those bits the next thing it needs to do.  We have to know this so
that for a particular virtual address we can place the descriptor we
want in the place where the hardware is going to find it.  So we need
a few lines of code plus some basic understanding of what is going on.
Just like bit manipulation causes some folks to struggle, reading
a chapter like this mmu chapter is equally daunting.  It is nice to
have somehone hold your hand through it.  Hopefully I am doing more
good than bad in that respect.

There is a file, section_translation.ps in this repo, you should be
able to use a pdf viewer to open this file.  The figure on the
second page shows just the address translation from virtual to physical
for a 1MB section.  This picture uses X instead of N, we are using an
N = 0 so that means X = 0.   The translation table base at the top
of the diagram is our MMUTABLEBASE, the address in physical space
of the beginning of our first level TLB or descriptor table.  The
first thing we need to do is find the table entry for the virtual
address in question (the Modified virtual address in this diagram,
as far as we are concerned it is unmodified it is the virtual
address we intend to use).  The first thing we see is the lower
14 bits of the translation table base are SBZ = should be zero.
Basically we need to have the translation table base aligned on a
16Kbyte boundary (2 to the 14th is 16K).  It would not make sense
to use all zeros as the translation table base, we have our reset
and interrupt vectors at and near address zero in the arms address
space so the first sane address would be 0x00004000.  The first
level descriptor is based on the top 12 bits of the virtual address
or 4096 entries, that is 16KBytes (not a coincidence), 0x4000 + 0x4000
is 0x8000, where our arm programs entry point is, so we have space
there if we want to use it.  But any address with the lower 14 bits
being zero will work so long as you have enough memory at that address
and you are not clobbering anything else that is using that memory
space.

So what this picture is showing us is that we take the top 12 bits
of the virtual address, multiply by 4 or shift left 2, and add tat
to the translation table base, this gives the address for the first
level descriptor for that virtual address.  The diagram shows the
first level fetch which returns a 32 bit value that we have placed
in the table.  If the lower 2 bits of that first level descriptor are
0b10 then this is a 1MB Section.  If a 1MB section then the top 12
bits of the first level descriptor replace the top 12 bits of the
virtual address to convert it into a physical address.  Understand
here first and foremost so long as we do the N = 0 thing, the first
level descriptor or the first thing the mmu does is look at the top
12 bits of the virtual address, always.  If the lower two bits of
the first level descriptor are not 0b10 then we get into
a second level descriptor and more virtual bits come into play, but
for now if we start by learning just 1MB sections, the conversion
from virtual to physical only cares about the top 12 bits of the
address.  So for 1MB sections we dont have to concentrate on every
actual address we are going to access we only need to think about
the 1MB aligned ranges.  The uart for example on the raspi 1 has
a number of registers that start with 0x202150xx, if we use a 1MB
section for those we only care about the 0x202xxxxx part of the
address.  To not have to change our code we would want to have
the virtual = physical for that and do not mark it as cacheable.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678 then the hardware is going to take the top 12 bits of that
address 0x123, multiply by 4 and add that to the MMUTABLEBASE.
0x4000+(0x123<<2) = 0x448C.  and that is the address the mmu is going
to use for the first-level lookup.  Ignoring the other bits in the
descriptor for now, if the first-level descriptor has the value
0xABC00002, the lower two bits are 0x10, a 1MB section, so the top
12 bits replace the virtual addresses top 12 bits and our 0x12345678
is converted to the physical address 0xABC45678.


Now they have this optional thing called a supersection which is a 16MB
sized thing rather than 1MB and one might think that that would make
life easier, right?  Wrong.  No matter what, assuming the N = 0 thing
the first level descriptor is found using the top 12 bits of the
virtual address, so in order to do some 16MB thing you need 16 entries
one for each of the possible 1MB sections.  If you are already
generating 16 descriptors might as well just make them 1MB sections,
you can read up on the differences between super sections and sections
and try them if you want.  For what I am doing here dont need them,
just wanted to point out you still need 16 entries per super section.

Hopefully I have not lost you yet with this address manipulation,
and maybe you are one step ahead of me, yes EVERY load and store with
the mmu enabled requires at least one mmu table lookup, the mmu when it
accesses this memory does not go through itself, but EVERY other fetch
and load and store.  Which does have a performance hit, they do have
a bit of a cache in the mmu to store the last so many tlb lookups.
That helps, but you cannot avoid the mmu having to do the conversion
on every address.

In the ARM ARM I am looking at the subsection on first-level descriptors
has a table:
Table B4-1 First-level descriptor format (VMSAv6, subpages enabled)
What this is telling us is that if the first-level descriptor, the
32 bit number we place in the right place in the TLB, has the lower
two bits 0b10 then that entry is a 1MB section and the mmu can get
everything it needs from that first level descriptor.  But if the
lower two bits are 0b01 then this is a coarse page table entry and
we have to go to a second level descriptor to complete the
conversion from virtual to physical.  Not every address will need
this only the address ranges we want to be more coarsely divided than
1MB.  Or the other way of saying it is of we want to control an
address range in chunks smaller than 1MB then we need to use pages
not sections.  You can certainly use pages for the whole world, but
if you do the math, 4096Byte pages would mean your mmu table needs
to be 4MB+16K worst case.  And you have to do more work to set that
all up.

The coarse_translation.ps file I have included in t




--  REWRITE IN PROGRESS HERE ---





If you look in the ARM ARM at the first level descriptor format.  The
lower two bits of the value read at that address tells the mmu hardware
if this is a page fault a coarse page table, or section or reserved (a
fault?).  Above we talked about a section with those two bits being
0b10.  If the mmu finds a 0b01 instead then we look at the
coarse_translation.ps file that I have put in this directory.   Like
the section translation, we see the MMUTABLEBASE we tack on the top 20
bits of the virtual address (times 4) and that is the first level fetch.
If that first level descriptor has 0b01 in the lower two bits, then the
mmu looks at the top 200 bits of the first level descriptor, tacks
on some more bits from the virtual address and uses that address to find
the second level descriptor.  the second level descriptor is not shown
in this picture you have to look at the table in the arm arm for the
description.  Here again the lower 2 bits tell the hardware something
large or small pages basically for a legacy/compatible discussion.
and that second level descriptor contains the bits that convert the
virtual address to a physical address plus the permissions stuff.

So lets take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again.  The first level descriptor address is the top three
bits of the virtual address 0x123, times 4, added to the MMUTABLEBASE
0x448C.  But this time when we look it up we find a value in the
table that has the lower two bits being 0b01.  Just to be crazy lets
say that descriptor was 0xABCDE001  (ignornign the domain and other
bits just talking address right now).  That means we take 0xABCDE000
the picture shows bits 19:12 (0x45) of the virtual address (0x12345678)
so the address to the second level descriptor in this crazy case is
0xABCDE000+(0x45<<2) = 0xABCDE114  why is that crazy?  because I
chose an address where we in theory dont have ram on the raspberry pi
maybe a mirrored address space, but a sane address would have been
somewhere close to the MMUTABLEBASE so we can keep the whole of the
mmu tables in a confined area.

The "other" bits in the descriptors are the domain, the TEX bits and
the C and B bits.

The C bit is the simplest one to start with that means Cacheable.  For
peripherals we absolutely dont want them to be cached.

The b bit, means bufferable, as in write buffer.  Something you may
not have heard about or thought about ever.  It is kind of like a cache
on the write end of things instead of read end.   I digress, when
a processor writes something everything is known, the address and
data.  So the next level of logic, could, if so designed, accept
that address and data at that level and release the processor to
keep doing what it was doing (ideally fetch some more instructions
and keep running) in parallel that logic could then continue to perform
the write to the slower peripheral or really slow dram (or faster cache).
Giving us a small to large performance gain.  But, what happens if while
we are doing that first write another write happens.  Well if we only
have storage for one transaction in this little feature then the
processor has to wait for us to finish the first write however long
that takes, then we can grab the information for the second write and
then release the processor.  I call writes "fire and forget" because
ideally the processor hands off the info to the memory controller
and keeps going.  Well the kind of write buffer I know about and hopefully
this is the same kind, goes beyond that I can do one write for you at
a time type of fire and forget, it is a tiny cache like thing that
can store up some number of addresses and data and allow the processor
to continue while those addresses and data are delivered to their
destination in parallel.

The description from the ARM ARM is:

"A write buffer is a block of high-speed memory whose purpose is to
optimize stores to main memory. When a store occurs, its data, address
and other details, for example data size, are written to the write
buffer at high speed. The write buffer then completes the store at main
memory speed. This is typically much slower than the speed of the ARM
processor. In the meantime, the ARM processor can proceed to execute
further instructions at full speed."

Eventually the write has to go out, and that far side is generally
slower the write buffer can fill up and the processor has to wait for
some space before continuing.  Like a cache helps the processor with
making many loads faster, the write buffer helps to make many writes
faster.

Now the TEX bits you just have to look up and there is the rub there
are likely more than one set of tables for TEX C and B, I am going
to stick with a TEX of 0b000 and not mess with any fancy features
there.  Now depending on whether this is considered an older arm
(ARMv5) or an ARMv6 or newer the combination of TEX, C and B have
some subtle differences.  The cache bit in particular does enable
or disable this space as cacheable.  You still independently need
to turn on the instruction and data caches and need an if cacheable
and the cache is on for the access type within that section, then it
will cache it...So we set tex to zeros to just keep it out of the way.

Lastly the domain bits.  Now you will see a 4 bit domain thing and
a 2 bit domain thing.  These are related.  There is a register in
the MMU right next to the translation table base address register this
one is a 32 bit register that contains 16 different domain definitions.

The two bit domain controls are defined as such.

0b00 No access Any access generates a domain fault
0b01 Client Accesses are checked against the access permission bits in the TLB entry
0b10 Reserved Using this value has UNPREDICTABLE results
0b11 Manager Accesses are not checked against the access permission bits in the TLB
entry, so a permission fault cannot be generated

For starters we are going to set all of the domains to 0b11 dont check
cant fault.  What are these 16 domains though?  Notice it takes 4 bits
to describe one of 16 things.  The different domains have no specific
meaning other than that we can have 16 different definitions that we
control for whatever reason.  You might allow for 16 different
threads running at once in your operating system, or 16 different
types of software running (kernel, application, ...) you can mark
a bunch of sections as belonging to one parituclar domain, and with a
simple change to that domain control register, a whole domain might
go from one type of permission to another, from no checking to
no access for example.

Since I usually use the MMU in bare metal to enable data caching on ram
I set my domain controls to 0b11, no checking and I simply make all
the MMU sections domain number 0.

So we end up with this simple function that allows us to add first level
descriptors in the MMU translation table.

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);
    ra=padd>>20;
    rc=(ra<<20)|flags|2;
    PUT32(rb,rc);
    return(0);
}

So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries
for that.  This is important, if you forget something, and dont have
a valid entry there, then you fault, your fault handler, if you have
chosen to write it, may also fault if it isnt placed write or something
it accesses also faults...(I would assume the fault handler is also
behind the mmu but would have to read up on that).

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section 0x000xxxxx so we should make that section cacheable and
bufferable.

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B
bit.  tex, domain, etc are zeros.

if we want to use all 256mb we would need to do this for all the
sections from 0x000xxxxx to 0x100xxxxx.  Maybe do that later.

We know that for the raspi1 the peripherals, uart and such are in
arm physical space at 0x20xxxxxx.  To allow for more ram on the raspi 2
they needed to move that and moved it to 0x3Fxxxxxx.  So we either need
16 1MB section sized entries to cover that whole range or we look at
specific sections for specific things we care to talk to and just add
those.  The uart and the gpio it is associated with is in the 0x202xxxxx
space.  There are a couple of timers in the 0x200xxxxx space so one
entry can cover those.

if we didnt want to allow those to be cached or write buffered then

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

but we may play with that to demonstrate what caching a peripheral
can do to you, why we need to turn on the mmu if for no other reason
than to get some bare metal performance by using the d cache.

Now you have to think on a system level here, there are a number
of things in play.  We need to plan our memory space, where are we
putting the cache, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d cache
then just map the whole world if you want with the peripherals not
cached and the rest cached.  or only the stuff you think you are going
to use.

if you are on the raspi 2 with multiple arm cores and are using
the multiple arm cores you need to do more reading if you want one
core to talk to another by sharing some of the memory between
them.  same problem as peripherals basically plus some other issues
if you have the write buffer on then a write doesnt happen right away
it depends on how full the write buffer is and basically that is not
usually deterministic.  But worse data caching a shared space you
dont know if you are reading from the actual shared ram or from the
the cache for that core.  And further you need to read up on whether
or not each core has its own mmu or where do their memory systems
come together?  You can and I will run this example on a raspi 2 but
only using one core not messing with the other three.  Ideally making
a generic example that can be ported to other arm processors from
an mmu perspective, from a peripheral perspective you have to use
different code for the different peripherals in that other arm you
might move this knowledge to.

So once our tables are setup then we need to actually turn the
MMU on.  Now I cant figure out where I got this from, and I have
modified it in this repo.  According to this manual it was with the
ARMv6 that we got the DSB feature which says wait for either cache
or MMU to finish something before continuing.  In particular when
initializing a cache to start it up you want to clean out all the
entries in a safe way you dont want to evict them and hose memory
you want to invalidate everything, mark it such that the cache lines
are empty/available.  Likewise that little bit of TLB caching the MMU
has, we want to invalidate that too so we dont start up the mmu
with entries in there that dont match our entries.

Why are we invalidating the cache in mmu code?  Because first we
need the mmu to use the d cache (to protect the peripherals from
being cached) and second the controls that enable the mmu are in the
same register as the i and d controls so makes sense to do both
mmu and cache stuff in one function.

So after the DSB we set our domain control bits, now in this example
I have done something different, 15 of the 16 domains have the 0b11
setting which is dont fault on anything, manager mode.  I set domain
1 such that it has no access, so in the example I will change one
of the descriptor table entries to use domain one, then I will access
it and then see the access violation.  I am also programming both
translation table base addresses even though we are using the N = 0
mode and only one is needed.  Depends on which manual you read I guess
as to whether or not you see the N = 0 and the separate or shared
i and d mmu tables.  (the reason for two is if you want your i and
d address spaces to be managed separately).

Understand I have been running on ARMv6 systems without the DSB for
some time and it just works, so maybe that is dumb luck...

This code relies on the caller to set the MMU enable and I and D cache
enables.  This is because this is derived from code where sometimes I
turn things on or dont turn things on and wanted it generic.


.globl start_MMU
start_MMU:
    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??

    mvn r2,#0
    bic r2,#0xC
    mcr p15,0,r2,c3,c0,0 ;@ domain

    mcr p15,0,r0,c2,c0,0 ;@ tlb base
    mcr p15,0,r0,c2,c0,1 ;@ tlb base

    mrc p15,0,r2,c1,c0,0
    orr r2,r2,r1
    mcr p15,0,r2,c1,c0,0

    bx lr

I am going to mess with the translation tables after the MMU is started
so I assume we have to invalidate when a table entry changes so that
just in case the old one is cached up in the tlb, we can force the
read of the new one by invalidating all the tlbs.  Depending on the
manual you read there are cases where we dont have to invalidate, will
just invalidate anyway to be clean and generic, you can optimize later
if you want to dig into those features if your core has them.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

Something to note here.  Debugging using JTAG makes life easier than
having to press reset and wait for a debugger, or even worse having
to remove some media or a prom and stick it in some programmer to change
the program.  Depending on your processor though you have to be super
careful when debugging programs using JTAG and the caches and/or mmu.
The openocd support for the cores used in the raspi2 imply that when
the openocd server halts the cores, it disables I and D caches (not
sure about the mmu).  But, for the raspi1 and quite a few other
ARMs out there, here is the problem you have using jtag.  Instructions
are fetched and stored in the instruction cache yes?  Thus the name
and data is read through and written through the data cache yes?  Say
we have a program we have the i and d cache on so it runs for a bit
instructions go into the i cache and depending on the size of the
program and the addresses used some percentage of the program is in
i cache when we halt the processor.  Lets say the instruction at address
0x10000.  Now we want to write a new version of the program to ram
and test it, so writing to ram uses data cycles, which go to/through
the data cache to ram.  And lets say one of those instructions in
the new program is at address 0x10000.  So ideally the new instruction
is in ram at addres 0x10000, but the instruction at that address from
the prior experiment is in i cache.  If we start the program again
at the entry point, and before the program goes out and cleans the
caches and starts stuff (assuming it doesnt know it is being run for
a second time from jtag it is written to boot into this code from
reset or power up) it hits address 0x10000.  if the old instruction
that is in cache is at address 0x10000 is different from the new
instruction in the new program at address 0x10000 the cache is going
to give the processor the old instruction because we left the caches
on.  Much chaos happens when you do this.  Now your processor core and
your jtag software may automatically or may have manual controls
for disabling the mmu and cache, or maybe not.  You have to be very
very aware of this though as you might try several iterations of your
program and they all seem to be progressing fine, then strange things
start to happen, sometimes your whole old program is in cache and it
is as if the new program wasnt being loaded.  Or maybe you start to think
you didnt compile it or save it to the space where you pick up the
binary, you repeat this many times but the new program simply isnt
being run.  I recommend for the purposes of this example, you use
the reset button which you soldered down on your board like I did or
if you didnt, then power cycle the raspberry pi every time or often
or do the research to see if/how you can disable the mmu and caches
between runs and habitally perform that step.  I use openocd a lot
on many different cores that not all have caches and mmus so I dont
have the habit of doing this, instead if I get tripped up I start
resetting between tests...

So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces.  So that later we
can play with the section descriptors and demonstrate virtual to
physical address conversion.

So write some stuff and print it out on the uart.

    PUT32(0x00045678,0x00045678);
    PUT32(0x00145678,0x00145678);
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

then setup the mmu with at least those four sections and the peripherals

    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

and start the mmu with the I and D caches enabled

    start_mmu(MMUTABLEBASE,0x00800001|0x1000|0x0004);

then if we read those four addresses again we get the same output
as before since we maped virtual = physical.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

but what if we swizzle things around.  make virtual 0x001xxxxx =
physical 0x003xxxxx.  0x002 looks at 0x000 and 0x003 looks at 0x001

    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);

and maybe we dont need to do this but do it anyway just in case

    invalidate_tlbs();

read them again.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

the 0x000xxxxx entry was not modifed so we get 000045678 as the output
but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we
get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx space
so that read gives 00045678 and the 0x003xxxxx is mapped to 0x001xxxxx
physical giving 00145678 as the output.


    mmu_section(0x00100000,0x00100000,0x0020);

    invalidate_tlbs();
    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

So up to this point the output looks like this.

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

first blob is without the mmu enabled, second with the mmu but
virtual = physical, third we use the mmu to show virtual != physical
for some ranges.


the next experiment there is a system timer in the 0x200xxxxx range


    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

    mmu_section(0x20000000,0x20000000,0x0000|8); //CACHED
    invalidate_tlbs();

    for(ra=0;ra<4;ra++)
    {
        hexstring(system_timer_low());
    }
    uart_send(0x0D); uart_send(0x0A);

your output may vary, I am using bootloader07, so the human is involved
in typing and clicking stuff and downloading the program and starting
it so the time at which after reset we hit this code may vary and
give different timer ticks.

006BBB1B
006BBEE1
006BC2A7
006BC66C

00000000
00000000
00000000
00000000

why are the cached values zeros and not the same timestamp four times
which is what I was expecting?  that is a very good question and worthy
of a research project.



--- REWRITE IN PROGRESS ---




And then the icing on the cake, one section is marked as domain 1
instead of domain 0, domain 1 was set for 0b00 no access so when we
touch that domain we should get an access violation.

00045678
00000010

How do I know what that means with that output.  Well from my blinker07
example we touched on exceptions (interrupts).  I made a generic test
fixture such that anything other than a reset prints something out
and then hangs.   In no way shape or form is this a complete handler
but what it does show is that it is the exception that is at address
0x00000010 that gets hit which is data abort.  So figuring out it was
a data abort (pretty much expected) have that then read the data fault
status registers, being a data access we expect the data/combined one
to show somthing and the instruction one to not.  Adding that
instrumentation resulted in.

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail and that shows the 0x01 was domain 1, the domain we used for
that access. then the 0x9 means Domain Section Fault.

The lr during the abort shows us the instruction, which you would need
to disassemble to figure out the address, or at least that is one
way to do it perhaps there is a status register for that.

The instruction and the address match our expectations for this fault.