raspberrypi/mmu/README


See the top level README file for more information on documentation
and how to run these programs.

This example demonstrates ARM MMU basics.

You will need the ARM ARM (ARM Architectural Reference Manual) for
ARMv5.  I have a couple of pages included in this repo, but you still
will need the ARM ARM.

This code does not work on the Raspberry Pi 2 yet.  I will get that
working at some point.  The knowledge here still applies, I expect the
differences between ARMv6 and ARMv7 to be subtle, but we will see.


-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES  --


What an MMU does, or at least what an MMU does for us, is translate
virtual addresses into physical addresses, check access permissions,
and give us control over cacheable regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core.  Part of that
boundary is the memory interface for the ARM, for lack of a better term
how the ARM accesses the world.  Nothing special, all processors have
some sort of address and data based interface between the processor and
the ram and peripherals.  That boundary uses physical addresses, it
is on the memory side or "world side" of the ARM's mmu.  Within the
ARM core there is the "processor side" of the mmu, and all load and
store (and fetch) accesses to the world go through the mmu.

When the ARM powers up the mmu is disabled, which means all accesses
pass through unmodified, making the "processor side" or virtual address
space equal to the world side physical address space.  All of my
examples thus far, blinkers and such, are based on physical addresses.
We already know that elsewhere in the chip is another address
translation of some sort, because the manual is written for 0x7Exxxxxx
based addresses, but the ARM's physical addresses for those same things
are 0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the raspi 2.  For this
discussion we do not care about that other mystery address translation,
we only care about the ARM and the ARM mmu.

So when I say the mmu translates virtual addresses into physical
addresses, what that means is on the processor side there is an address
you are accessing, but that does not have to be the same address on
the physical address side of the mmu.  Let's say for example I am
running a program on an operating system, Linux let's say.  I need
to compile that program before I can use it and I need to link it for
an address space, so let's say that I link it to enter at address 0x8000
and use memory from 0x0000 to whatever I need and/or whatever is
available.  That is all fine, except what if I have two programs
and I want both running "at the same time", how can both use the same
address space without clobbering each other?  The answer is that
neither is really at that address.  WHEN RUNNING, each of them sees a
virtual address space from 0x00000000 to some number, but in reality
program 1 might have that mapped to the physical address 0x01000000 and
program 2 might have its 0x00000000 to some number mapped to 0x02000000.
So when program 1 thinks it is writing to address 0xABCDE it is really
writing to 0x010ABCDE and when program 2 thinks it is writing to
address 0xABCDE it is really writing to 0x020ABCDE.
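To put the two program example into code, here is a minimal sketch.
The struct and function names are made up for illustration, and a real
mmu does not add a base, it replaces address bits through a table as
described further down.  With these numbers the result is the same.

```c
#include <stdint.h>

/* Hypothetical per-program mapping: each program sees virtual addresses
   starting at zero, the operating system picks where that really lands. */
struct program_map {
    uint32_t phys_base;   /* physical address of virtual 0x00000000 */
};

/* Convert what the program thinks it is accessing into the real address. */
static uint32_t program_virt_to_phys(const struct program_map *p, uint32_t vaddr)
{
    return p->phys_base + vaddr;
}
```

With phys_base set to 0x01000000 for program 1 and 0x02000000 for
program 2, the same virtual 0xABCDE lands on 0x010ABCDE and 0x020ABCDE.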

If you think about it, it doesn't make any sense to allow any virtual
address to map to any physical address, for example from 0x12345678
to 0xAABBCCDD.  We are talking about a 32 bit address space or 4Giga
addresses.  If we allowed any address to convert to any other address
we would need a 4Giga entry map, and at 32 bits per entry that is
16Gigabytes just to hold the 4Giga physical addresses worst case.
To cut to the chase ARM has one option where the top 12 bits of the
virtual get translated to 12 bits of physical, the lower 20 bits in
that case are the same between the virtual and physical.  This means
we can control 1MByte of address space with one definition, and have
4096 entries in some table somewhere to convert from virtual to
physical.  That is quite manageable.  The minimum we would need to
store are the 12 replacement bits per table entry, but ARM uses a full
32 bit entry, which for this 1MB flavor, has the 12 physical bits plus
some other control bits.
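The table size arithmetic can be checked with a few lines of C.  This
is only a sketch, flat_table_bytes is a name I made up, and it assumes
one flat table with a 4 byte entry per block of address space.

```c
#include <stdint.h>

/* Bytes of table needed to cover a 32 bit address space when every
   4 byte entry translates one block of 2^granule_bits bytes. */
static uint64_t flat_table_bytes(unsigned granule_bits)
{
    uint64_t entries = 1ull << (32 - granule_bits);
    return entries * 4u;
}
```

A granule of 0 bits (any address to any address) gives the 16Gigabytes,
a granule of 20 bits (1MB) gives 4096 entries or 16KBytes.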

What do cacheable regions mean?  The mmu also gives you the feature
of being able to choose per descriptor whether or not you want to
enable caching on that block.  One obvious reason would be for the
peripherals.  Think about a timer, ideally you read the current timer
tick and each time you read it you get the current timer tick and
as it changes you see it change.  But what if when we turned on the
data cache it covered all addresses, all loads and stores?  Then you
read the timer once, get a value, read it again, and now you get the
cached value over and over again, you don't see the real timer value
in the peripheral.  That is not good, you cannot manage a peripheral
if you cannot read its status register or read the data coming out
of it, etc.  So at a minimum your peripherals need to be in non-cached
blocks.  Likewise, if you have some ram that is shared by more than
one resource, say the GPU and the ARM or for the raspberry pi 2 shared
between multiple ARM cores, you have a similar situation.  Another
resource may change the ram on the far side of your cache but your
cache assumes it has a copy of what is in ram.  Basically a cache
only helps you if whatever is on the far side of it is only modified
by writes through the cache.  If there are other ways to change the
data on the far side you should not cache that area.  The mmu gives
you the ability to control cached and non-cached spaces.

What is meant by access permissions?  Let's think about those two
programs running "at the same time" on some operating system (Linux
for example).  You don't want to allow one program to gain access to
the operating system's data nor some other program's data.  Sure, some
operating systems are meant for only running trusted and well mannered
programs.  But you don't want some video game on your home computer
to have access to your banking account data in another
window/program.  The mechanisms vary across processor families but
an important job for the mmu is to provide a protection mechanism.
Such that when a particular program has a time slice on the processor
there is some mechanism to allow or restrict memory spaces.  If some
code accesses an address that it does not have permission for then
an abort happens and the processor is notified.  An interesting
side effect of this is that this doesnt have to be fatal, in fact it
could be by design.  Think of a virtual machine, you could let the
virtual machine software run on the processor, and when it accesses
one of its peripherals the real operating system gets an abort but
instead of killing the virtual machine it actually simulates the
peripheral and lets the virtual machine keep running.  Another one
that you have probably run into is when you run out of ram in your
computer, the notion of virtual memory, which is different than virtual
address space.  Virtual memory in this case is when your program
ventures off the end of its allowed address space into ram it thinks
it has.  The operating system gets an abort, finds some ram from
some other program, swaps that ram to disk for example, then allows
the program that was running to have a little more ram by mapping it
back in and allowing it to run.  Later when the program whose data
got swapped to disk needs it it swaps back and whatever was in the
ram it swaps with then goes to disk.  The term swap comes from the
idea that these blocks of ram are swapped back and forth to disk,
program A's ram goes to disk and is swapped with program T's, then
program T's is swapped with program K's and so on.  This is why
starting right after you venture off that edge from real ram to
virtual, your computer's performance drops dramatically and disk
activity goes way up.  The more things running the more swapping going
on, and disk is significantly slower than ram.

As with all baremetal programming, wading through documentation is
the bulk of the job.  Definitely true here, with the unfortunate
problem that ARM's docs don't all look the same from one Architectural
Reference Manual to another.  We have this other problem that we
are technically using an ARMv6 (architecture version 6) (for the
raspi 1) but when you go to ARM's website there is an ARMv5 and then
ARMv7 and ARMv8, but no ARMv6.  Well the ARMv5 manual is actually the
original ARM ARM.  I assume they realized they couldn't maintain all
the architecture variations forever in one document, so they perhaps
wisely went to one ARM ARM per rev.  With respect to the MMU, the ARMv5
reference manual covers the ARMv4 (I didn't know there was an mmu
option there), ARMv5 and ARMv6, and there is a mode such that you can
have the same code/tables and it works on all three, meaning you don't
have to if-then-else your code based on whatever architecture you find.
This raspi 1 example is based on subpages enabled, which is this legacy
or compatibility mode across the three.

I am mostly using the ARMv5 Architectural Reference Manual.
ARM DDI0100I.

The 1MB sections mentioned above are called...sections...The ARM
mmu also has smaller sized blobs, 4096 byte pages for example.  I will
touch on those two sizes.  The 4096 byte one is called a small page.

As mentioned above, 32 bit address space, 1MB is 20 bits so 32-20 is
12 bits or 4096 possible combinations or the address space is broken
up into 4096 1MB sections.  The top 12 bits of the virtual address
get translated to 12 bits of physical.  No rules on the translation
you can have virtual = physical or have any combination, or have
a bunch of virtual sections point at the same physical space, whatever
you want/need.

ARM uses the term Virtual Memory System Architecture or VMSA and
they say things like VMSAv6 to talk about the ARMv6 VMSA.  There
is a section in the ARM ARM titled Virtual Memory System Architecture.
In there we see the coprocessor registers, specifically CP15 register
2 is the translation table base register.


So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what
we need now.  See the top level README for finding this document.
I have included a few pages in the form of postscript, any decent pdf
viewer should be able to handle these files.  Before the pictures
though, the section in question is titled Virtual Memory System
Architecture.  In the CP15 subsection register 2 is the translation
table base register.  There are three opcodes which give us access to
three things, TTBR0, TTBR1 and the control register.

First we read this comment

If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
table base is backwards compatible with earlier versions of the
architecture.

That is the one we want, we will leave that as N = 0 and not touch it
and use TTBR0

Now what the TTBR0 description initially is telling me is that bits 31
down to 14-n, or 14 in our case since n = 0, are the base address, in
PHYSICAL address space.  Note the mmu cannot possibly go through the
mmu to figure out how to go through the mmu, the mmu itself only
operates on physical space and has direct access to it.  In a second
we are going to see that we need the base address for the mmu table
to be aligned to 16384 bytes (2 to the power 14, the lower 14 bits
of our table base address need to be all zeros).

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base

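TLB = Translation Lookaside Buffer.  Strictly speaking that is the name
of the small cache inside the mmu that remembers recent lookups, the
thing in memory is the translation table, but I use TLB loosely for
the table as well.  As far as we are concerned think of the table as
an array of 32 bit integers, each integer (descriptor) being used to
completely or partially convert from virtual to physical and describe
permissions and caching.

My example is going to have a define called MMUTABLEBASE which will
be where we start our TLB table.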

Here is the reality of the world.  Some folks struggle with bit
manipulation, orring and anding and shifting and such, some don't.  The
MMU is logic so it operates on these tables in the way that logic would,
meaning from a programmer's perspective it is a lot of bit manipulation
but otherwise is relatively simple, something a program could do.  As
programmers we need to know how the logic uses portions of the virtual
address to look into this descriptor table or TLB, and then extracts
from those bits the next thing it needs to do.  We have to know this so
that for a particular virtual address we can place the descriptor we
want in the place where the hardware is going to find it.  So we need
a few lines of code plus some basic understanding of what is going on.
Just like bit manipulation causes some folks to struggle, reading
a chapter like this mmu chapter is equally daunting.  It is nice to
have someone hold your hand through it.  Hopefully I am doing more
good than bad in that respect.

There is a file, section_translation.ps in this repo, you should be
able to use a pdf viewer to open this file.  The figure on the
second page shows just the address translation from virtual to physical
for a 1MB section.  This picture uses X instead of N, we are using an
N = 0 so that means X = 0.   The translation table base at the top
of the diagram is our MMUTABLEBASE, the address in physical space
of the beginning of our first level TLB or descriptor table.  The
first thing we need to do is find the table entry for the virtual
address in question (the Modified virtual address in this diagram,
as far as we are concerned it is unmodified it is the virtual
address we intend to use).  The first thing we see is the lower
14 bits of the translation table base are SBZ = should be zero.
Basically we need to have the translation table base aligned on a
16Kbyte boundary (2 to the 14th is 16K).  It would not make sense
to use all zeros as the translation table base, we have our reset
and interrupt vectors at and near address zero in the ARM's address
space, so the first sane address would be 0x00004000.  The first
level descriptor is based on the top 12 bits of the virtual address
or 4096 entries, at 4 bytes each that is 16KBytes (not a coincidence).
0x4000 + 0x4000 is 0x8000, where our ARM program's entry point is, so
we have space there if we want to use it.  But any address with the
lower 14 bits being zero will work so long as you have enough memory
at that address and you are not clobbering anything else that is using
that memory space.
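The alignment rule and the table size are easy to sanity check in C.
A small sketch, the helper names are mine and not from the example
code, and it assumes the N = 0 case with 4096 four byte entries.

```c
#include <stdint.h>

#define FIRST_LEVEL_ENTRIES 4096u   /* one per 1MB of virtual address space */

/* With N = 0 the lower 14 bits of the table base should be zero. */
static int table_base_is_aligned(uint32_t base)
{
    return (base & 0x3FFFu) == 0;
}

/* First address past the end of a first level table starting at base. */
static uint32_t first_level_table_end(uint32_t base)
{
    return base + FIRST_LEVEL_ENTRIES * 4u;
}
```

For a base of 0x4000 the check passes and the table ends at 0x8000,
right where the program starts.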

So what this picture is showing us is that we take the top 12 bits
of the virtual address, multiply by 4 or shift left 2, and add that
to the translation table base, this gives the address for the first
level descriptor for that virtual address.  The diagram shows the
first level fetch which returns a 32 bit value that we have placed
in the table.  If the lower 2 bits of that first level descriptor are
0b10 then this is a 1MB Section.  If a 1MB section then the top 12
bits of the first level descriptor replace the top 12 bits of the
virtual address to convert it into a physical address.  Understand
here first and foremost so long as we do the N = 0 thing, the first
level descriptor or the first thing the mmu does is look at the top
12 bits of the virtual address, always.  If the lower two bits of
the first level descriptor are not 0b10 then we get into
a second level descriptor and more virtual bits come into play, but
for now if we start by learning just 1MB sections, the conversion
from virtual to physical only cares about the top 12 bits of the
address.  So for 1MB sections we dont have to concentrate on every
actual address we are going to access we only need to think about
the 1MB aligned ranges.  The uart for example on the raspi 1 has
a number of registers that start with 0x202150xx, if we use a 1MB
section for those we only care about the 0x202xxxxx part of the
address.  To not have to change our code we would want to have
the virtual = physical for that and do not mark it as cacheable.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678 then the hardware is going to take the top 12 bits of that
address 0x123, multiply by 4 and add that to the MMUTABLEBASE.
0x4000+(0x123<<2) = 0x448C, and that is the address the mmu is going
to use for the first-level lookup.  Ignoring the other bits in the
descriptor for now, if the first-level descriptor has the value
0xABC00002, the lower two bits are 0b10, a 1MB section, so the top
12 bits replace the virtual address's top 12 bits and our 0x12345678
is converted to the physical address 0xABC45678.
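Here is that same arithmetic as C you can run on any machine, not on
the pi.  The function names are made up for this sketch, MMUTABLEBASE
is assumed to be 0x00004000, and supersections (which also have 0b10
in the lower two bits) are ignored.

```c
#include <stdint.h>

#define MMUTABLEBASE 0x00004000u

/* Address the hardware reads for the first-level descriptor (N = 0). */
static uint32_t first_level_addr(uint32_t vaddr)
{
    return MMUTABLEBASE + ((vaddr >> 20) << 2);
}

/* Returns 1 and fills *paddr if desc is a 1MB section, otherwise 0. */
static int section_to_phys(uint32_t desc, uint32_t vaddr, uint32_t *paddr)
{
    if ((desc & 3u) != 2u)
        return 0;                              /* not 0b10, not a section */
    *paddr = (desc & 0xFFF00000u)              /* top 12 bits from descriptor */
           | (vaddr & 0x000FFFFFu);            /* low 20 bits pass through */
    return 1;
}
```

Feeding it 0x12345678 gives a descriptor address of 0x448C, and with a
descriptor of 0xABC00002 the physical address comes out as 0xABC45678.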


Now they have this optional thing called a supersection which is a 16MB
sized thing rather than 1MB and one might think that that would make
life easier, right?  Wrong.  No matter what, assuming the N = 0 thing
the first level descriptor is found using the top 12 bits of the
virtual address, so in order to do some 16MB thing you need 16 entries
one for each of the possible 1MB sections.  If you are already
generating 16 descriptors might as well just make them 1MB sections,
you can read up on the differences between super sections and sections
and try them if you want.  For what I am doing here dont need them,
just wanted to point out you still need 16 entries per super section.

Hopefully I have not lost you yet with this address manipulation,
and maybe you are one step ahead of me.  Yes, EVERY fetch, load and
store with the mmu enabled requires at least one mmu table lookup.  The
only exception is the mmu itself, when it accesses this memory it does
not go through itself.  This does have a performance hit, so they do
have a bit of a cache in the mmu to store the last so many tlb lookups.
That helps, but you cannot avoid the mmu having to do the conversion
on every address.

In the ARM ARM I am looking at, the subsection on first-level
descriptors has a table:
Table B4-1 First-level descriptor format (VMSAv6, subpages enabled)
What this is telling us is that if the first-level descriptor, the
32 bit number we place in the right place in the TLB, has the lower
two bits 0b10 then that entry defines a 1MB section and the mmu can get
everything it needs from that first level descriptor.  But if the
lower two bits are 0b01 then this is a coarse page table entry and
we have to go to a second level descriptor to complete the
conversion from virtual to physical.  Not every address will need
this, only the address ranges we want to be more finely divided than
1MB.  Or the other way of saying it is if we want to control an
address range in chunks smaller than 1MB then we need to use pages
not sections.  You can certainly use pages for the whole world, but
if you do the math, 4096Byte pages would mean your mmu tables need
to be 4MB+16K worst case (4096 coarse tables of 1KB each plus the 16K
first level table).  And you have to do more work to set that all up.
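The math in question, as a sketch.  page_tables_bytes is a made up
name, it assumes a 16K first level table plus 1KB for every 1MB
section you choose to break up into small pages.

```c
#include <stdint.h>

/* Total bytes of mmu tables: the 16K first level table plus one 1KB
   coarse table for each section that is split into pages. */
static uint32_t page_tables_bytes(uint32_t coarse_tables)
{
    return 0x4000u + coarse_tables * 0x400u;
}
```

Splitting all 4096 sections gives 0x404000 bytes, the 4MB+16K worst
case, while splitting only two sections costs 16K plus 2K.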

The coarse_translation.ps file I have included in this repo starts
off the same way as a section.  It has to, the logic doesn't know what
you want until it sees the first level descriptor.  If it sees a
0b01 as the lower 2 bits of the first level descriptor then this is
a coarse page table entry and it needs to do a second level fetch.
The second level fetch does not use the mmu table base address.
Instead bits 31:10 of the first level descriptor (the coarse page
table base address) plus bits 19:12 of the virtual address (times 4)
are where the second level descriptor lives.  Note that is 8 more bits
so the section is divided into 256 parts.  This page table address is
similar to the mmu table address, but it needs to be aligned on a 1K
boundary (lower 10 bits zeros) and the table can be worst case 1KBytes
in size.

The second level descriptor format defined in the ARM ARM (small pages
are most interesting here, subpages enabled) is a little different
than a first level section.  We had a domain in the first level
descriptor to get here, but now have direct access to four sets of
AP bits.  You/I would have to read more to know what the difference
is between the domain defined AP and these additional four.  For now
I don't care, this is bare metal, set them to full access (0b11) and
move on (see below about domain and ap bits).

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again.  The first level descriptor address is the top 12
bits of the virtual address 0x123, times 4, added to the MMUTABLEBASE,
0x448C.  But this time when we look it up we find a value in the
table that has the lower two bits being 0b01.  Just to be crazy let's
say that descriptor was 0xABCDE001 (ignoring the domain and other
bits, just talking address right now).  That means we take 0xABCDE000
and, as the picture shows, add bits 19:12 (0x45) of the virtual address
(0x12345678) times 4, so the address of the second level descriptor in
this crazy case is 0xABCDE000+(0x45<<2) = 0xABCDE114.  Why is that
crazy?  Because I chose an address where we in theory don't have ram
on the raspberry pi, maybe a mirrored address space.  A sane address
would have been somewhere close to the MMUTABLEBASE so we can keep
the whole of the mmu tables in a confined area.  I used this address
simply for demonstration purposes, not based on a workable solution.
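The second level arithmetic can be sketched the same way.  The function
names are hypothetical.  The last step, a small page taking bits 31:12
from the second level descriptor and the low 12 bits from the virtual
address, is not spelled out above, it is how I read the coarse
translation picture.

```c
#include <stdint.h>

/* Returns 1 and fills *addr with the second level descriptor address
   if first_desc is a coarse page table entry (0b01), otherwise 0. */
static int second_level_addr(uint32_t first_desc, uint32_t vaddr, uint32_t *addr)
{
    uint32_t coarse_base;
    uint32_t index;

    if ((first_desc & 3u) != 1u)
        return 0;
    coarse_base = first_desc & 0xFFFFFC00u;   /* bits 31:10 */
    index = (vaddr >> 12) & 0xFFu;            /* bits 19:12 */
    *addr = coarse_base + (index << 2);
    return 1;
}

/* Small page: 4096 bytes, so only the low 12 bits pass through. */
static uint32_t small_page_to_phys(uint32_t second_desc, uint32_t vaddr)
{
    return (second_desc & 0xFFFFF000u) | (vaddr & 0x00000FFFu);
}
```

For the crazy descriptor 0xABCDE001 and virtual address 0x12345678 the
first function produces 0xABCDE114.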

The "other" bits in the descriptors are the TEX bits, the C and B
bits, the domain and the AP bits.

The C bit is the simplest one to start with that means Cacheable.  For
peripherals we absolutely dont want them to be cached.  For ram, maybe.

The b bit, means bufferable, as in write buffer.  Something you may
not have heard about or thought about ever.  It is kind of like a cache
on the write end of things instead of read end.   I digress, when
a processor writes something everything is known, the address and
data.  So the next level of logic, could, if so designed, accept
that address and data at that level and release the processor to
keep doing what it was doing (ideally fetch some more instructions
and keep running) in parallel that logic could then continue to perform
the write to the slower peripheral or really slow dram (or faster cache).
Giving us a small to large performance gain.  But, what happens if while
we are doing that first write another write happens.  Well if we only
have storage for one transaction in this little feature then the
processor has to wait for us to finish the first write however long
that takes, then we can grab the information for the second write and
then release the processor.  I call writes "fire and forget" because
ideally the processor hands off the info to the memory controller
and keeps going, the memory controller has all the info it needs to
complete the task.  For a read the processor needs that data back so
basically has to wait.  Well a write buffer can store up to some number
of addresses and data.  It can still fill up and have to hold the
processor off.  But it is to writing what a cache is to reading, it has
some faster ram that stages writes so the processor, sometimes, can
keep on going.
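A toy model of that behavior, nothing to do with how the real logic is
built, just counting slots.  All of the names here are invented for
the sketch.

```c
/* Toy write buffer: depth slots, used of them currently holding writes
   that have not reached the slow side yet. */
struct write_buffer {
    unsigned depth;
    unsigned used;
};

/* Processor issues a write.  Returns 0 if the buffer took it and the
   processor can keep going, 1 if the buffer is full and it must wait. */
static int wb_write(struct write_buffer *wb)
{
    if (wb->used == wb->depth)
        return 1;
    wb->used++;
    return 0;
}

/* The slow side finished one write, freeing a slot. */
static void wb_drain_one(struct write_buffer *wb)
{
    if (wb->used)
        wb->used--;
}
```

With a depth of two, the first two writes are fire and forget, the
third one stalls until something drains.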

Now the TEX bits you just have to look up, and there is the rub, there
is likely more than one set of tables for TEX, C and B.  I am going
to stick with a TEX of 0b000 and not mess with any fancy features
there.  Now depending on whether this is considered an older arm
(ARMv5) or an ARMv6 or newer the combination of TEX, C and B has
some subtle differences.  The cache bit in particular does enable
or disable this space as cacheable.  That simply asserts bits on
the AMBA/AXI (memory) bus that mark the transaction as cacheable,
you still need a cache and need it set up and enabled for the
transaction to actually get cached.  If you don't have the cache for
that transaction type enabled then it just does a normal memory (or
peripheral) operation.  So we set TEX to zeros to keep it out of the
way.

Lastly the domain and AP bits.  Now you will see a 4 bit domain thing
and a 2 bit domain thing.  These are related.  There is a register in
the MMU right next to the translation table base address register, the
domain access control register.  It is a 32 bit register that contains
16 different 2 bit domain definitions.

The two bit domain controls are defined as such (they decide whether
the AP bits get checked)

0b00 No access  Any access generates a domain fault
0b01 Client     Accesses are checked against the access permission
                bits in the TLB entry
0b10 Reserved   Using this value has UNPREDICTABLE results
0b11 Manager    Accesses are not checked against the access permission
                bits in the TLB entry, so a permission fault cannot be
                generated

For starters we are going to set all of the domains to 0b11, don't
check, can't fault.  What are these 16 domains though?  Notice it takes
4 bits to describe one of 16 things.  The different domains have no
specific meaning other than that we can have 16 different definitions
that we control for whatever reason.  You might allow for 16 different
threads running at once in your operating system, or 16 different
types of software running (kernel, application, ...).  You can mark
a bunch of sections as belonging to one particular domain, and with a
simple change to that domain control register, a whole domain might
go from one type of permission to another, from no checking to
no access for example.  By just writing this domain register you can
quickly change which address spaces have permission and which ones
don't without necessarily changing the mmu table.

Since I usually use the MMU in bare metal to enable data caching on ram
I set my domain controls to 0b11, no checking and I simply make all
the MMU sections domain number 0.
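Building a value for that domain register is just packing 16 two bit
fields, domain n sits in bits 2n+1:2n.  A sketch, the enum and
function names are mine.

```c
#include <stdint.h>

enum {
    DOMAIN_NO_ACCESS = 0,   /* 0b00 */
    DOMAIN_CLIENT    = 1,   /* 0b01 */
    DOMAIN_MANAGER   = 3    /* 0b11 */
};

/* Return dacr with the 2 bit field for one domain (0..15) replaced. */
static uint32_t dacr_set(uint32_t dacr, unsigned domain, unsigned access)
{
    unsigned shift = (domain & 15u) * 2u;

    dacr &= ~((uint32_t)3u << shift);
    dacr |= (uint32_t)(access & 3u) << shift;
    return dacr;
}
```

Starting from all ones (every domain manager) and setting domain 1 to
no access gives 0xFFFFFFF3.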

So we end up with this simple function that allows us to add first level
descriptors in the MMU translation table.

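unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;                //top 12 bits of the virtual address
    rb=MMUTABLEBASE|(ra<<2);    //address of that first level descriptor
    ra=padd>>20;                //top 12 bits of the physical address
    rc=(ra<<20)|flags|2;        //section base, flags, 0b10 = 1MB section
    PUT32(rb,rc);               //write the descriptor into the table
    return(0);
}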

So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries
for that.  This is important, if you forget something, and don't have
a valid entry there, then you fault.  Your fault handler, if you have
chosen to write one, may also fault if it isn't placed right or
something it accesses also faults...(I would assume the fault handler
is also behind the mmu but would have to read up on that).

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section 0x000xxxxx so we should make that section cacheable and
bufferable.

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B
bit.  tex, domain, etc are zeros.
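If the magic numbers bother you, the flags can be given names.  This is
a sketch with macro names I made up, the bit positions are the section
descriptor ones from the ARM ARM table (B is bit 2, C is bit 3, domain
is bits 8:5, AP is bits 11:10).

```c
#include <stdint.h>

#define SECTION_B          0x4u                              /* bufferable */
#define SECTION_C          0x8u                              /* cacheable */
#define SECTION_DOMAIN(d)  (((uint32_t)(d) & 15u) << 5)      /* bits 8:5 */
#define SECTION_AP(ap)     (((uint32_t)(ap) & 3u) << 10)     /* bits 11:10 */

/* Same value mmu_section writes into the table for this padd/flags. */
static uint32_t section_descriptor(uint32_t padd, uint32_t flags)
{
    return (padd & 0xFFF00000u) | flags | 2u;
}
```

So section_descriptor(0x00000000,SECTION_C|SECTION_B) is 0x0000000E,
the same descriptor the mmu_section call above puts in the table.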

If we want to use all 256MB we would need to do this for all 256
sections from 0x000xxxxx to 0x0FFxxxxx.  Maybe do that later.
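Doing that for the whole 256MB is a short loop.  To keep this sketch
runnable on any machine it writes into a plain array standing in for
the 16KB at MMUTABLEBASE, and map_section is a stand-in for
mmu_section with the same arithmetic.  Both names are assumptions of
the sketch, not code from this repo.

```c
#include <stdint.h>

static uint32_t table[4096];   /* stand-in for the table at MMUTABLEBASE */

/* Same arithmetic as mmu_section, but storing into the array above. */
static void map_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    table[vadd >> 20] = (padd & 0xFFF00000u) | flags | 2u;
}

/* Map the first 256MB virtual = physical, cached and write buffered. */
static void map_ram_256mb(void)
{
    uint32_t ra;

    for (ra = 0; ra < 0x10000000u; ra += 0x00100000u)
        map_section(ra, ra, 8u | 4u);
}
```

After the loop entry 0x0FF holds 0x0FF0000E and entry 0x100 is still
untouched.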

We know that for the raspi 1 the peripherals, uart and such, are in
arm physical space at 0x20xxxxxx.  To allow for more ram on the raspi 2
they needed to move that and moved it to 0x3Fxxxxxx.  So we either need
16 1MB section sized entries to cover that whole range or we look at
specific sections for specific things we care to talk to and just add
those.  The uart and the gpio it is associated with are in the
0x202xxxxx space.  There are a couple of timers in the 0x200xxxxx space
so one entry can cover those.

If we don't want those to be cached or write buffered then

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
    mmu_section(0x3F000000,0x3F000000,0x0000); //NOT CACHED!
    mmu_section(0x3F200000,0x3F200000,0x0000); //NOT CACHED!

We may play with that later to demonstrate what caching a peripheral
can do to you.  It also shows why we need to turn on the mmu, if for no
other reason than to be able to use the d cache for some bare metal
performance without caching the peripherals.

Now you have to think on a system level here, there are a number
of things in play.  We need to plan our memory space, where are we
putting the MMU table, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d cache
then just map the whole world virtual = physical if you want with the
peripherals not cached and the rest cached.
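A sketch of that whole world flat map.  It assumes the raspi 1 layout
with peripherals at 0x20xxxxxx, uses an array in place of the real
table so it runs anywhere, and the names are invented for this example.

```c
#include <stdint.h>

static uint32_t world[4096];   /* stand-in for the first level table */

/* raspi 1 layout assumed: the 16MB at 0x20xxxxxx are peripherals. */
static int is_peripheral(uint32_t addr)
{
    return (addr & 0xFF000000u) == 0x20000000u;
}

/* Every 1MB section virtual = physical, peripherals not cached and not
   buffered, everything else cached (0x8) and buffered (0x4). */
static void map_world_flat(void)
{
    uint32_t n;

    for (n = 0; n < 4096u; n++) {
        uint32_t addr = n << 20;
        uint32_t flags = is_peripheral(addr) ? 0u : (8u | 4u);

        world[n] = addr | flags | 2u;
    }
}
```

For the raspi 2 the test in is_peripheral would need to match
0x3Fxxxxxx instead.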

If you are on the raspi 2 with multiple arm cores and are using
the multiple arm cores you need to do more reading if you want one
core to talk to another by sharing some of the memory between
them.  Same problem as peripherals basically, with multiple masters
of the ram/peripheral on the far side of my cache, how do I ensure
what is in my cache matches the far side?  Easiest way is to not
cache that space.  You need to read up on whether the cores share a
cache or have their own (or if l2 if present is shared but l1 is not).
ldrex/strex were implemented specifically for multi core, but you
need to understand the cache effects on these instructions (<grin>
not documented well, I have an example on just this one topic).

So once our tables are set up we need to actually turn the MMU on.
Now I can't figure out where I got this from, and I have modified it
in this repo.  According to this manual it was with the ARMv6 that we
got the DSB feature, which says wait for either cache or MMU to finish
something before continuing.  In particular when initializing a cache
to start it up you want to clean out all the entries in a safe way.
You don't want to evict them and hose memory, you want to invalidate
everything, mark it such that the cache lines are empty/available.
Likewise that little bit of TLB caching the MMU has, we want to
invalidate that too so we don't start up the mmu with entries in there
that don't match our entries.

Why are we invalidating the cache in mmu init code?  Because first we
need the mmu to use the d cache (to protect the peripherals from
being cached) and second the controls that enable the mmu are in the
same register as the i and d controls so it made sense to do both
mmu and cache stuff in one function.

So after the DSB we set our domain control bits.  Now in this example
I have done something different, 15 of the 16 domains have the 0b11
setting which is don't fault on anything, manager mode.  I set domain
1 such that it has no access, so in the example I will change one
of the descriptor table entries to use domain one, then I will access
it and then see the access violation.  I am also programming both
translation table base addresses even though we are using the N = 0
mode and only TTBR0 is needed.  Depends on which manual you read I
guess as to whether or not you see the N = 0 option.  (As I read the
ARM ARM the reason for two is that with N other than 0 the virtual
address space is split, TTBR0 covers the low addresses, typically the
per process mappings that change on a context switch, and TTBR1 covers
the rest, typically operating system and I/O mappings that stay put).

Understand I have been running on ARMv6 systems without the DSB and it
just works, so maybe that is dumb luck...

This code relies on the caller to pass in the MMU enable and I and D
cache enables.  This is because this is derived from code where
sometimes I turn things on or dont turn things on and wanted it
generic.


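.globl start_MMU
start_MMU:
    ;@ r0 = translation table base, r1 = control register bits to set
    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB (drain write buffer before ARMv6)

    mvn r2,#0
    bic r2,#0xC          ;@ bits 3:2 = 0b00, domain 1 no access
    mcr p15,0,r2,c3,c0,0 ;@ domain

    mcr p15,0,r0,c2,c0,0 ;@ tlb base, TTBR0
    mcr p15,0,r0,c2,c0,1 ;@ tlb base, TTBR1

    mrc p15,0,r2,c1,c0,0 ;@ read control register
    orr r2,r2,r1         ;@ set the enable bits the caller passed in
    mcr p15,0,r2,c1,c0,0 ;@ write it back

    bx lr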
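The enable bits passed in r1 live in the CP15 control register: bit 0
is the mmu enable, bit 2 the data cache enable and bit 12 the
instruction cache enable.  A sketch of building that argument, the
macro and function names are mine.

```c
#include <stdint.h>

#define CONTROL_M  0x0001u   /* bit 0:  mmu enable */
#define CONTROL_C  0x0004u   /* bit 2:  data cache enable */
#define CONTROL_I  0x1000u   /* bit 12: instruction cache enable */

/* Build the second argument for start_MMU, which ORs it into the
   control register. */
static uint32_t mmu_enable_flags(int mmu, int dcache, int icache)
{
    uint32_t flags = 0;

    if (mmu)    flags |= CONTROL_M;
    if (dcache) flags |= CONTROL_C;
    if (icache) flags |= CONTROL_I;
    return flags;
}
```

On the pi the call would then be something like
start_MMU(MMUTABLEBASE,mmu_enable_flags(1,1,1)), assuming a C
prototype of void start_MMU ( unsigned int, unsigned int ), which is
my guess from how the assembly uses r0 and r1.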

I am going to mess with the translation tables after the MMU is started,
so the easiest way to deal with the TLB cache is to invalidate it.  I
don't need to mess with the main L1 cache.  ARMv6 introduces a feature
to help with this, but I am going with this solution.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0  ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB
    bx lr

Something to note here.  Debugging using the JTAG based on chip debugger
makes life easier than removing sd cards, or in the old days pulling an
eeprom out and putting it in an eraser then a programmer.  BUT,
it is not completely without issue.  When and where and if you hit this
depends heavily on the core you are using and the jtag tools and the
commands you remember/prefer.  The basic problem is caches can and
often do separate instruction I fetches from data D reads and writes.
Say you have test run A of a program that has executed the instruction
at address 0xD000, so that instruction is in the I cache.  You have
also executed the instruction at 0xC000 but it has been evicted.  You
don't actually know what is in the I cache or not, and shouldn't even
try to assume.  You stop the processor, you write a new program to
memory, now these are data D writes, and go through the D cache.  Then
you set the start address and run again.  Now there are a number of
combinations here and only one of them works, the rest can lead to
failure.

For each instruction/address in the program, if the prior instruction
at that address was in the i cache, then since data writes do not go
through the i cache the new instruction for that address is either
in the d cache or in main ram.  When you run the new program and fetch
that address you will get the stale instruction from the prior run
(unless an invalidate happens; a flush would mean a write back, but why
would an I cache flush?), and if the new instruction at that address is
not the same as the old one unpredictable results will occur.  You can
start to see the combinations: did the data write go to the d cache or
to ram, will it flush to ram, is the i cache invalid for that address,
etc.
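
To make the stale fetch concrete, here is a toy model in C.  It is
only a sketch: one word of ram, a one-entry instruction cache, no d
cache at all, and every name in it is made up for illustration.  It is
not how any real core or debugger is implemented.

```c
#include <stdint.h>

/* Toy model: one word of ram and a one-entry instruction cache. */
struct toy {
    uint32_t ram;          /* what the debugger's data write changes */
    uint32_t icache_data;  /* copy left behind by the previous run */
    int      icache_valid;
};

/* Instruction fetch: a valid cache entry wins over ram. */
static uint32_t toy_fetch ( struct toy *t )
{
    if (!t->icache_valid) {
        t->icache_data = t->ram;  /* miss, fill from ram */
        t->icache_valid = 1;
    }
    return t->icache_data;
}

/* Data write from the debugger, it never touches the i cache. */
static void toy_data_write ( struct toy *t, uint32_t value )
{
    t->ram = value;
}

/* Invalidate, so the next fetch has to go back to ram. */
static void toy_icache_invalidate ( struct toy *t )
{
    t->icache_valid = 0;
}
```

Fetch once, do a toy_data_write with a new instruction, and the next
toy_fetch still returns the old one.  Only after
toy_icache_invalidate does the fetch see the new instruction.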

There is also the question of whether the I and D caches are shared;
they can be, but that is specific to both the core and your setup.
Also, does the jtag debugger have the ability to disable the caches,
has it done it for you, can you do it manually?

Any time you are using the i or d caches you need to be careful using
a jtag debugger or even a bootloader type approach, depending on its
design, as you might end up doing data writes of instructions and going
around the i cache or worse.  So for this kind of work using a chip
reset and a non-volatile rom/flash based bootloader can/will save you
a lot of headaches.  If you know your debugger is solving this for you,
great, but always make sure; as you change from the raspi 2 back to
a raspi 1, for example, it might not be doing it, and it will drive you
nuts when you keep downloading a new program and it either crashes
in a strange way or simply keeps running the old program and does not
appear to take your new changes.

So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces, so that later we
can play with the section descriptors and demonstrate virtual to
physical address conversion.

So write some stuff and print it out on the uart.

    PUT32(0x00045678,0x00045678);
    PUT32(0x00145678,0x00145678);
    PUT32(0x00245678,0x00245678);
    PUT32(0x00345678,0x00345678);

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

then setup the mmu with at least those four sections and the peripherals

    mmu_section(0x00000000,0x00000000,0x0000|8|4);
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    //peripherals
    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
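
For reference, here is a host-side sketch of what one of these calls
works out to.  It is not the mmu_section from this example.  It
assumes the table base is 0x00004000 (SKETCH_TABLE_BASE stands in for
MMUTABLEBASE) and that the access permission bits are set to 0b11
(0xC00); both are assumptions, check them against the real code.  The
flags are the descriptor bits used in the calls: 8 is the cacheable
bit, 4 is the bufferable bit, and 0x0020 puts a 1 in the domain field.

```c
#include <stdint.h>

#define SKETCH_TABLE_BASE 0x00004000u /* assumed, stands in for MMUTABLEBASE */

/* Address of the first level entry for a virtual address:
   the top 12 bits index 4-byte entries. */
static uint32_t section_entry_address ( uint32_t vadd )
{
    return SKETCH_TABLE_BASE | ((vadd >> 20) << 2);
}

/* Section descriptor: physical base in bits 31:20, AP = 0b11 in
   bits 11:10 (assumed), caller supplied flags, 0b10 marks a section. */
static uint32_t section_descriptor ( uint32_t padd, uint32_t flags )
{
    return (padd & 0xFFF00000u) | 0xC00u | flags | 2u;
}
```

Under those assumptions the peripheral section at 0x20200000 has its
entry at 0x00004808, and the cached section at 0x00000000 gets the
descriptor 0x00000C0E.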

and start the mmu with the I and D caches enabled

    start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004);
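
The three values being ORed together in that call are control
register bits.  A small sketch with names for them (the names and the
helper are mine, not from the example):

```c
#include <stdint.h>

#define CTRL_MMU_ENABLE 0x00000001u /* bit 0,  M */
#define CTRL_DCACHE     0x00000004u /* bit 2,  C */
#define CTRL_ICACHE     0x00001000u /* bit 12, I */

/* Build the enable bits the caller passes in as the second argument. */
static uint32_t control_bits ( int mmu, int dcache, int icache )
{
    uint32_t bits = 0;
    if (mmu)    bits |= CTRL_MMU_ENABLE;
    if (dcache) bits |= CTRL_DCACHE;
    if (icache) bits |= CTRL_ICACHE;
    return bits;
}
```

control_bits(1,1,1) is 0x1005, the same as 0x00000001|0x1000|0x0004.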

then if we read those four addresses again we get the same output
as before since we mapped virtual = physical.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

But what if we swizzle things around?  Make virtual 0x001xxxxx =
physical 0x003xxxxx, 0x002xxxxx look at 0x000xxxxx, and 0x003xxxxx look
at 0x001xxxxx (don't mess with the 0x00000000 section, that is where
our program is running).

    mmu_section(0x00100000,0x00300000,0x0000);
    mmu_section(0x00200000,0x00000000,0x0000);
    mmu_section(0x00300000,0x00100000,0x0000);

and maybe we don't need to do this but do it anyway just in case

    invalidate_tlbs();

read them again.

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified so we get 00045678 as the output,
but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we
get the 00345678 output.  0x002xxxxx comes from the 0x000xxxxx space
so that read gives 00045678, and 0x003xxxxx is mapped to 0x001xxxxx
physical giving 00145678 as the output.

So up to this point the output looks like this.

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

The first blob is without the mmu enabled, the second is with the mmu
but virtual = physical, and in the third we use the mmu to show
virtual != physical for some ranges.
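
The third blob can be reproduced on a host machine with a little
model of the section lookup.  This is only a sketch with made up
names: a four entry table instead of the real 4096, no permissions,
no caches.

```c
#include <stdint.h>

/* One entry per 1MB of virtual space, only the first four used here. */
static uint32_t sketch_table[4];

/* Point one virtual 1MB section at a physical 1MB section. */
static void sketch_section ( uint32_t vadd, uint32_t padd )
{
    sketch_table[(vadd >> 20) & 3] = (padd & 0xFFF00000u) | 2u;
}

/* Top 12 bits select the entry, low 20 bits pass through. */
static uint32_t sketch_translate ( uint32_t vadd )
{
    uint32_t entry = sketch_table[(vadd >> 20) & 3];
    return (entry & 0xFFF00000u) | (vadd & 0x000FFFFFu);
}

/* The remapping from the example, section 0 is left alone. */
static void sketch_swizzle ( void )
{
    sketch_section(0x00000000u, 0x00000000u);
    sketch_section(0x00100000u, 0x00300000u);
    sketch_section(0x00200000u, 0x00000000u);
    sketch_section(0x00300000u, 0x00100000u);
}
```

Since each physical word was written with its own address, the
translated address is also the value the read returns: after
sketch_swizzle, virtual 0x00145678 translates to 0x00345678 and so on,
matching the third blob.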

Now for some small pages, I made this function to help out.

unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20; //first level index, top 12 bits of the virtual address
    rb=MMUTABLEBASE|(ra<<2); //address of the first level entry
    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1; //pointer to the coarse table
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //first level descriptor
    ra=(vadd>>12)&0xFF; //second level index, next 8 bits
    rb=(mmubase&0xFFFFFC00)|(ra<<2); //address of the second level entry
    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2; //small page, all four AP fields 0b11
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //second level descriptor
    return(0);
}

So before turning on the mmu some physical addresses were written
with some data.  The function takes the virtual, physical, flags and
where you want the secondary table to be.  Remember secondary tables
can be up to 1K in size and are aligned on a 1K boundary.
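
To see the numbers for one call, here is a host-side sketch that does
the same arithmetic as mmu_small but returns the two writes instead of
doing them.  It assumes the table base is 0x00004000
(SKETCH_TABLE_BASE stands in for MMUTABLEBASE, an assumption); the
struct and function names are made up for illustration.

```c
#include <stdint.h>

#define SKETCH_TABLE_BASE 0x00004000u /* assumed, stands in for MMUTABLEBASE */

struct small_page_writes {
    uint32_t first_addr,  first_value;   /* first level entry */
    uint32_t second_addr, second_value;  /* second level entry */
};

/* Same arithmetic as mmu_small, but returns what would be written. */
static struct small_page_writes small_page_plan ( uint32_t vadd, uint32_t padd,
                                                  uint32_t flags, uint32_t mmubase )
{
    struct small_page_writes w;
    w.first_addr   = SKETCH_TABLE_BASE | ((vadd >> 20) << 2);
    w.first_value  = (mmubase & 0xFFFFFC00u) | 1u;
    w.second_addr  = (mmubase & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2);
    w.second_value = (padd & 0xFFFFF000u) | 0xFF0u | flags | 2u;
    return w;
}
```

With that base, small_page_plan(0x0AA45000,0x00145000,0,0x00000400)
writes 0x00000401 to 0x000042A8 and 0x00145FF2 to 0x00000514.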


    mmu_small(0x0AA45000,0x00145000,0,0x00000400);
    mmu_small(0x0BB45000,0x00245000,0,0x00000800);
    mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
    mmu_small(0x0DD46000,0x00146000,0,0x00001000);
    //put these back
    mmu_section(0x00100000,0x00100000,0x0000);
    mmu_section(0x00200000,0x00200000,0x0000);
    mmu_section(0x00300000,0x00300000,0x0000);
    invalidate_tlbs();

Now why did I use different secondary table addresses most of the
time but not all of the time?  All addresses with the same top 12 bits
go through the same first level descriptor and so the same secondary
table; if the top 12 bits of the address are different it is a
different secondary table.  So to demonstrate that we actually have
separation within a section I have two small pages within a 1MB
section that I point at two different physical address spaces.  In
short, if the top 12 bits of the virtual address are the same then
they share the same coarse page table.  The way the function works it
writes both first and second level descriptors, so if you were to do
this

    mmu_small(0x0DD45000,0x00345000,0,0x00001000);
    mmu_small(0x0DD46000,0x00146000,0,0x00001400);

Then both of those virtual addresses would go to the 0x1400 table, and
the first virtual address would not have a secondary entry there.  Its
secondary entry would be in the table at 0x1000, but the first level
descriptor no longer points to 0x1000, so the mmu would get whatever it
finds in the 0x1400 table.
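
Here is a host-side sketch of that situation.  It keeps the tables in
a pretend ram array, assumes the first level table is at 0x00004000
(SKETCH_TABLE_BASE stands in for MMUTABLEBASE), and only understands
coarse tables and small pages.  All of the names are made up for
illustration.  The pretend ram starts out as zeros, so "whatever it
finds" shows up here as a fault; on real hardware it is whatever
happens to be in ram.

```c
#include <stdint.h>

#define SKETCH_TABLE_BASE 0x00004000u /* assumed, stands in for MMUTABLEBASE */
#define SKETCH_FAULT      0xFFFFFFFFu /* sketch-only marker for "no mapping" */

/* Pretend ram holding the tables, all table addresses must be below 0x8000. */
static uint32_t sketch_mem[0x8000 / 4];

static void sketch_put32 ( uint32_t addr, uint32_t value ) { sketch_mem[addr >> 2] = value; }
static uint32_t sketch_get32 ( uint32_t addr ) { return sketch_mem[addr >> 2]; }

/* Same two writes as mmu_small (flags left at zero), into the pretend ram. */
static void sketch_small ( uint32_t vadd, uint32_t padd, uint32_t mmubase )
{
    sketch_put32(SKETCH_TABLE_BASE | ((vadd >> 20) << 2), (mmubase & 0xFFFFFC00u) | 1u);
    sketch_put32((mmubase & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2),
                 (padd & 0xFFFFF000u) | 0xFF0u | 2u);
}

/* Walk the first then the second level, small pages only. */
static uint32_t sketch_walk ( uint32_t vadd )
{
    uint32_t first = sketch_get32(SKETCH_TABLE_BASE | ((vadd >> 20) << 2));
    if ((first & 3u) != 1u) return SKETCH_FAULT;   /* not a coarse table */
    uint32_t second = sketch_get32((first & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2));
    if ((second & 3u) != 2u) return SKETCH_FAULT;  /* not a small page */
    return (second & 0xFFFFF000u) | (vadd & 0xFFFu);
}
```

With both pages in the 0x1000 table both addresses translate.  Redo
the second page with 0x1400 and the first level entry moves, so
0x0DD46xxx still translates but 0x0DD45xxx no longer does.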


The last example is just demonstrating an access violation.  Changing
the domain to that one domain we did not set full access to

    //access violation.

    mmu_section(0x00100000,0x00100000,0x0020);
    invalidate_tlbs();

    hexstring(GET32(0x00045678));
    hexstring(GET32(0x00145678));
    hexstring(GET32(0x00245678));
    hexstring(GET32(0x00345678));
    uart_send(0x0D); uart_send(0x0A);

The 0x00045678 read is not affected, but the 0x00145678 read goes
through the first level descriptor we just changed to domain 1, so the
output is

00045678
00000010

How do I know what that output means?  Well, in my blinker07 example we
touched on exceptions (interrupts).  I made a generic test fixture such
that anything other than a reset prints something out and then hangs.
In no way shape or form is this a complete handler, but what it does
show is that it is the exception at address 0x00000010 that gets hit,
which is data abort.  Having figured out it was a data abort (pretty
much expected), I then had the handler read the fault status
registers; this being a data access we expect the data/combined one to
show something and the instruction one not to.  Adding that
instrumentation resulted in

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail, and that shows the 0x1 in 0x19 is domain 1, the domain we used
for that access, and the 0x9 means Domain Section Fault.
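
Pulling those two fields out of the status value is just shifting and
masking.  A minimal sketch, with function names of my own choosing,
that only looks at the low eight bits and ignores anything above
them:

```c
#include <stdint.h>

/* Domain number, bits 7:4 of the data fault status value. */
static unsigned int dfsr_domain ( uint32_t dfsr ) { return (dfsr >> 4) & 0xFu; }

/* Low four status bits, bits 3:0. */
static unsigned int dfsr_status ( uint32_t dfsr ) { return dfsr & 0xFu; }
```

For 0x00000019 that gives domain 1 and status 0x9.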

The lr during the abort shows us the instruction, which you would need
to disassemble to figure out the address, or at least that is one
way to do it; perhaps there is a status register for that.
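
As a sketch of the disassemble-it-yourself route: assuming the
00008110 line is the abort mode lr and the E5900000 line is the
instruction that faulted, the fields of a load/store with an immediate
offset can be picked apart as below.  In ARM state the lr on a data
abort is 8 past the faulting instruction.  The struct and function
names are made up for illustration, and the decoder only handles this
one instruction form.

```c
#include <stdint.h>

struct ldr_fields {
    int is_load;           /* L bit, 1 for ldr and 0 for str */
    unsigned int rn, rd;   /* base and destination register numbers */
    uint32_t offset;       /* 12-bit immediate offset */
};

/* Decode a word/byte load or store with an immediate offset.
   Returns 0 if the word is not that kind of instruction. */
static int decode_ldr_str_imm ( uint32_t inst, struct ldr_fields *f )
{
    if ((inst & 0x0E000000u) != 0x04000000u) return 0; /* bits 27:25 must be 0b010 */
    f->is_load = (int)((inst >> 20) & 1u);
    f->rn      = (inst >> 16) & 0xFu;
    f->rd      = (inst >> 12) & 0xFu;
    f->offset  = inst & 0xFFFu;
    return 1;
}

/* In ARM state lr on a data abort is 8 past the faulting instruction. */
static uint32_t faulting_pc ( uint32_t abort_lr ) { return abort_lr - 8u; }
```

E5900000 decodes as a load with rn = 0, rd = 0 and offset 0, that is
ldr r0,[r0], so the address is whatever r0 held at the time, and under
the assumption above the instruction itself sits at 0x00008108.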

The instruction and the address match our expectations for this fault.

This is simply a basic intro.  Just enough to be dangerous.  The MMU
is one of the simplest peripherals to program so long as bit
manipulation is not something that causes you to lose sleep.  What makes
it hard is that if you mess up even one bit, or forget even one thing
you can crash in spectacular ways (often silently without any way of
knowing what happened).  Debugging can be hard at best.

The ARM ARM indicates that ARMv6 adds the feature of separating the I
and D from an mmu perspective, which is an interesting thought (see the
jtag debugging comments, and think about how this can affect you
re-loading a program into ram and running); you have enough ammo to try
that.  The ARMv7 doesn't seem to have a legacy mode yet, still reading;
the descriptors and how they are addressed look basically the same, but
this code doesn't yet work on the raspi 2, so I will continue to work
on that and update this repo when I figure it out.