See the top level README for information on where to find documentation
for the raspberry pi and the ARM processor inside. Also find information
on how to load and run these programs.

This example is for the pi A+, see other directories for other flavors
of raspberry pi.

This example demonstrates ARM MMU basics.

You will need the ARM ARM (ARM Architectural Reference Manual) for
ARMv5. I have a couple of pages included in this repo, but you still
will need the ARM ARM.

So what an MMU does, or at least what an MMU does for us, is it
translates virtual addresses into physical addresses, checks access
permissions, and gives us control over cacheable regions.

What does all of that mean?

Well let's go back a little. If you are old enough to have had a desktop
computer then a CPU to you may or may not have meant the big box that
you plugged the monitor, keyboard, and mouse into. And that isn't
all that incorrect. But when we get into understanding things at
this level, bare metal, we have to dig way deeper.

I currently use processor core or ARM core or some such terms. You
have to separate the notion of the system and break it into smaller
parts. There is a processor core that somehow magically gets our
instructions and executes them, which means from time to time it does
memory bus accesses to talk to the things that our instructions have
told it to do. We the programmers know the addresses for things, the
processor is very stupid in that respect, it knows basically nothing.

Now the processor has a bus (or busses sometimes), a bunch of signals,
address, data in, data out, and control signals to distinguish reads
from writes and so on. That bus, for this discussion, is connected to
the mmu, and there is a similar if not identical one on the other side,
but everything we want to say to the outside world we say through
the mmu. When the mmu is not doing its thing, it just passes those
requests right on through unmodified. This example has to do with
what happens when you enable the mmu.

So for this discussion let's say the processor side of the mmu addresses
are called virtual addresses and the world side (memory, peripherals
(uart, gpio, etc), and almost everything else) are physical addresses.
One job of the mmu is to translate from virtual to physical.

You may have used tools in your toolchain other than the compiler and
may have realized that programs you compile to run on top of the
operating system you use on your computer are all compiled to run at
the same address. How is that possible and have them run "at the same
time"? Well the reality is that none of them are running at that
address. You might have two programs both compiled to run at address
0x8000, but the reality is, thanks to the mmu and the operating system
managing resources, program A may actually be running at 0x10008000 and
program B at 0x20008000, no conflict at all. When program A accesses
what it thinks is address 0xABCDE it is really talking to 0x100ABCDE,
likewise if program B accesses 0xABCDE it is really 0x200ABCDE.
The 0x8000 or 0xABCDE addresses are virtual, that is what the program
thinks it is talking to, the 0x10008000 or 0x20008000 addresses are
physical, that is what we are really talking to, or at least that is
what the MMU thinks it is talking to <grin>. We already know by this
point that there is another magic address translation in the raspberry
pi. The Broadcom documents talk about peripherals being at some address
0x7Exxxxxx, but depending on which pi we have we have to access 0x20xxxxxx
or 0x3Fxxxxxx from the ARM's perspective. And that is not atypical, just
not always as obvious. Take any of the peripherals for example, we may
have to use some 0x20ABCDEF address for something, but when we push
down into the logic of that peripheral many of those address bits
go away and we may be left with 0xEF or 0xF or 0x3, no reason to carry
around extra address bits in the logic if you only have a few registers.

So for this discussion the processor and our programs operate using
virtual addresses. The mmu turns those into physical addresses. When
the mmu is disabled then physical = virtual. And when it is on there
is no reason we cannot make physical = virtual if we want, and we will
for most of this. We are not making an operating system here, just
demonstrating some basics.

Checking access permissions, what does that mean? Well remember our
two programs, one at 0x10008000 and the other at 0x20008000. If
one program is clever enough, what is to keep it from accessing the
other program's memory? Let us start by thinking about single core
processors, which the ARM11 on this chip is. We now live in a world
where even our phones have 4 or 8 processor cores working together,
but the idea translates from single to multiple. With any one of these
single cores, the operating system gives each program a little slice
of time. Then usually an interrupt happens, either based on time or
based on some other event, and the operating system says it is time
for someone else to use the processor for a while. The operating
system has to do a little mmu swizzling to, say, switch 0x8000 to point
at 0x10008000 instead of 0x20008000, but it also changes the virtual
id (or whatever term your processor uses) for the code it is about to
allow to run (remember the operating system is code itself and runs
in an address space with permissions as well). The mmu tables not
only convert virtual addresses to physical, they also are or can be
set to allow or dis-allow particular virtual ids. How exactly
varies widely from one processor family to another, one mmu to another
(ARM vs x86 vs mips, etc). But if you want to have a computer that
is not trivial to hack by having one program run around where it isn't
supposed to, you have to have this layer of protection. We will
see that here; initially we will just allow everyone, or at least us,
full access.

Control over cacheable regions. That gets into what a cache is in this
context. Well memory is expensive, it takes a lot of transistors. We
have two basic volatile types, SRAM and DRAM. With SRAM, when you set
one bit to a value, a one or a zero, it remembers that value so long
as the power stays on. DRAM is more like a rechargeable battery, it
drains over time. If you want it to remember a zero, no problem (just
run with this simplification if you actually know how they work), but
if you want it to remember a one you have to keep reminding it
that it is a one by charging it back up, and if you forget to charge it
back up it will drain to a zero. We don't actually have to do this,
there is logic that does this refresh for us. But... SRAM takes twice
as many transistors per bit as DRAM, so that right there makes it
more expensive, and the speed of the memory drives up the price
in crazy ways as well. You may think that the DRAM in your computer
is 1000 or 2000MHz, but it is really much much slower, they are just
playing parallel games to allow the bus to be that fast. So what
does this have to do with caches? Well the state of the world today
is we have gobs of relatively slow DRAM. And programs tend to do
a couple of things. First off, programs obviously run sequentially,
you run one instruction after another until you hit a branch, so
if you had a way to read ahead a little bit of the code you are running
you would not have to wait so long for that slow memory. Another thing
that we/programs do with data other than instructions is we tend
to re-use a variable for some period of time. We re-use the same
memory address for a while, then go on to somewhere else and maybe come
back and maybe not.

So the state of the world is gobs of slow DRAM, and then we put one or
more layers of caches in front made of faster SRAM. Because of the cost
of SRAM they are relatively small, but still big enough to store some
instructions and some data that we are actively using. Just like the
MMU, these caches are inline between us and the rest of the world.
Whenever we perform a read with the cache enabled, the cache will see
if it has a copy of our data, if so that is a hit and it returns its
copy of our data. If it is a miss then it will go get our data plus
some more data after or around our data, just in case we are sequentially
working through some memory or accessing various portions of a struct,
etc (or are executing code linearly before hitting a branch). Now the
cache knows what copies of things it has, and it is very limited in
size relative to the address space. So obviously it is going to run
out of space. So before it can go get the thing we are asking for, it
has to make room by evicting something it has. Before going into that,
understand that when we write the cache looks at that as well. Sometimes
a write to something causes the cache to go get a copy of that area of
memory, and sometimes only reads cause the cache to make a copy. But
either way, if the cache has a copy of that thing, it will
complete that write by writing to the cache's copy, and now the cache
has a copy that is newer and different than the outside world. So now
we have this situation where the cache needs to make room by evicting
somebody. Caches are designed by different people and they don't all
use the same logic to make this decision, some keep track of the oldest
stuff, some just use a randomizer and the unlucky data gets evicted.
The cache knows if the data it has a copy of has been written to,
meaning that its copy is the fresh copy with new data and the copy out
in the world is stale/old and must be updated before we free up that
portion of the cache. If there have been no modifications then we
really don't have to write that data out, but if there are we do. Now
we have a hole, and can read the data from the world and return the one
thing the processor asked for.

Am I ever going to get to the point about control over cacheable
regions? We understand that the cache keeps a copy of stuff we read so
that if we read it, or something right next to it, we don't have to go
out to slow memory. We get an answer for those second and third reads
much faster, hoping that overall the one long read of extra data at a
slow speed is balanced by several reads that take very little time,
making it overall faster. But what if the address we are reading is the
status of something? An address that is managed by maybe us
but also by someone (logic or program) else too? Like the uart status
that tells us there is room to send another character? If we read
the uart status, and the cache reads the uart status one time and keeps
a copy (that says the uart is busy) in the cache, then so long as that
doesn't get evicted, every time we read that status we get the copy that
says the uart is busy, possibly forever. Well that won't work. This
is cache coherence, and it has to do with more than one owner of a
resource that is on the far side of one or more caches. In the case of
the uart that other owner is the uart logic itself, but it can also be
another processor (the arm and the gpu, or in multi-core systems
one core and another). So we as the manager of the mmu need to be able
to specify whether a region that we map can be cached or not. There
are signals on the bus on the world side of the mmu, which runs into
the processor/mmu side of the cache, that tell the cache if a particular
access is cacheable. Only the ones marked cacheable go through all
of that rambling above, ones marked as not cacheable essentially pass
on through.

And one last cache comment before moving into real stuff. Instruction
vs data. When the processor needs to fetch more instructions to
execute, it knows those reads are instruction fetches. Likewise when
our program tells the processor to do a read, the processor knows those
are data reads. Instruction fetches are always reads, and if we assume
no self modifying code, then the copy in the cache always matches
the copy out in the world. So we don't have to have an mmu to help
us isolate regions for purposes of cache coherency with respect to
instruction fetches. The problem comes with data reads and writes.
So we often have separate instruction cache controls and data cache
controls in the mmu, and perhaps in the L1 cache as it can sometimes
treat the two separately. Here again caches and mmus vary from one
architecture to another (ARM, x86, MIPS, etc). So we can actually
turn on instruction caching without the mmu and hope for a performance
improvement. But we cannot in general turn on a data cache and not
have cache coherency problems with our peripherals, so we need the
mmu for that. Some designs, some microcontrollers for example, will
be laid out such that memory is below some address and peripherals
above, and will only cache data accesses below that line, removing the
need for an MMU for that reason, and being a microcontroller we don't
need the mmu for the other reasons either.

As with all baremetal programming, wading through documentation is
the bulk of the job. Definitely true here, with the unfortunate
problem that ARM's docs don't all look the same from one Architectural
Reference Manual to another. We have this other problem that we
are technically using an ARMv6 (architecture version 6) (for the raspi 1)
but when you go to ARM's website there is an ARMv5 and then ARMv7 and
ARMv8, but no ARMv6. Well the ARMv5 manual is actually the original
ARM ARM; I assume they realized they couldn't maintain all the
architecture variations forever in one document, so they perhaps
wisely went to one ARM ARM per rev. With respect to the MMU, the ARMv5
reference manual covers the ARMv4 (I didn't know there was an mmu option
there), ARMv5 and ARMv6, and there is a mode such that you can have the
same code/tables and it works on all three, meaning you don't have to
if-then-else your code based on whatever architecture you find. This
raspi 1 example is based on subpages enabled, which is this legacy or
compatibility mode across the three.

I am mostly using the ARMv5 Architectural Reference Manual,
ARM DDI0100I.

It should be obvious that we cannot translate ANY virtual address into
ANY physical address, 0x12345678 into 0xAABBCCDD for example. Why not?
Well there are 32 bits, so 4GigaAddresses; if it were possible to map
every one of those to any arbitrary other 32 bit address we would need
a 4 GigaWord table or 16 Gigabytes. First off, how would we access
those 16 Gigabytes, which is more than we can address on this system,
and then also have the memory that those entries translate for on this
system? It just doesn't fit. So obviously we have to reduce the problem,
and the way you do that is you only modify the top address bits and
leave the lower ones the same between virtual and physical. How many
upper bits gets into the design of the mmu and a balancing game of how
many different things we want to map. If we were to only take
the top 4 bits we could re-map 1/16th of the address space; that would
make for a pretty small table to look up the translation, but would
that make any sense? You couldn't even have 16 different programs
unless you had ram in each of those areas, which certainly on the
raspberry pi we don't. All the ram we have is in the lower 16th.
And we know we can't translate every address to every address, so we
have to find some middle ground. ARM, or at least this legacy mode,
initially divides the world up into 1MB sections. 32 bit address space,
1MB is 20 bits, 32-20 is 12, or 4096 possible combinations. To support
1MB sections we would need an mmu table with 4096 entries. That is
manageable. But maybe there are times when we need to divide one or
more of those 1MB sections up into smaller parts. And they allow for
that. We will also look at what they call a small page, which is in
units of 4096 bytes.

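The arithmetic above in C form, just to pin down the sizes (these
defines are mine, not something the example code declares):

#define SECTION_SHIFT 20                            /* 1MB = 2 to the 20 */
#define FIRST_LEVEL_ENTRIES (1<<(32-SECTION_SHIFT)) /* 4096 entries */
#define FIRST_LEVEL_BYTES (FIRST_LEVEL_ENTRIES*4)   /* 16384 bytes of table */
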
ARM uses the term Virtual Memory System Architecture or VMSA, and
they say things like VMSAv6 to talk about the ARMv6 VMSA. There
is a section in the ARM ARM titled Virtual Memory System Architecture.
In there we see the coprocessor registers, specifically CP15 register
2 is the translation table base register.

So the ARMv5 ARM ARM (ARM Architectural Reference Manual) is what
we need now. See the top level README for finding this document.
I have included a few pages in the form of postscript, any decent pdf
viewer should be able to handle these files. Before the pictures
though, the section in question is titled Virtual Memory System
Architecture. In the CP15 subsection, register 2 is the translation
table base register. There are three opcodes which give us access to
three things, TTBR0, TTBR1 and the translation table base control
register.

First we read this comment:

If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
table base is backwards compatible with earlier versions of the
architecture.

That is the one we want, we will leave N = 0, not touch it,
and use TTBR0.

Now what the TTBR0 description initially is telling me that bit 31
|
||||
down to 14-n or 14 in our case since n = 0 is the base address, in
|
||||
PHYSICAL address space. Note the mmu cannot possibly go through the
|
||||
mmu to figure out how to go through the mmu, the mmu itself only
|
||||
operates on physical space and has direct access to it. In a second
|
||||
we are going to see that we need the base address for the mmu table
|
||||
to be aligned to 16384 bytes (when n=0). (2 to the power 14, the
|
||||
lower 14 bits of our TLB base address needs to be all zeros).
|
||||
|
||||
We write that register using
|
||||
|
||||
mcr p15,0,r0,c2,c0,0 ;@ tlb base
|
||||
|
||||
TLB = Translation Lookaside Buffer. As far as we are concerned think
|
||||
of it as an array of 32 bit integers, each integer (descriptor) being
|
||||
used to completely or partially convert from virtual to physical and
|
||||
describe permissions and caching.
|
||||
|
||||
My example is going to have a define called MMUTABLEBASE which will
|
||||
be where we start our TLB table.
|
||||
|
||||
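For example (my pick; any 16K aligned address with ram behind it that
nothing else is using works, see the 0x00004000 discussion below):

#define MMUTABLEBASE 0x00004000 /* 16K aligned, the table runs 0x4000..0x7FFF */
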
Here is the reality of the world. Some folks struggle with bit
manipulation, orring and anding and shifting and such, some don't. The
MMU is logic, so it operates on these tables the way logic would,
meaning from a programmer's perspective it is a lot of bit manipulation
but otherwise relatively simple, something a program could do. As
programmers we need to know how the logic uses portions of the virtual
address to look into this descriptor table or TLB, and then what it
extracts from those bits for the next thing it needs to do. We have to
know this so that for a particular virtual address we can place the
descriptor we want in the place where the hardware is going to find it.
So we need a few lines of code plus some basic understanding of what is
going on. Just like bit manipulation causes some folks to struggle,
reading a chapter like this mmu chapter is equally daunting. It is nice
to have someone hold your hand through it. Hopefully I am doing more
good than bad in that respect.

There is a file, section_translation.ps, in this repo, you should be
able to use a pdf viewer to open it. The figure on the
second page shows just the address translation from virtual to physical
for a 1MB section. This picture uses X instead of N, we are using
N = 0 so that means X = 0. The translation table base at the top
of the diagram is our MMUTABLEBASE, the address in physical space
of the beginning of our first level TLB or descriptor table. The
first thing we need to do is find the table entry for the virtual
address in question (the Modified virtual address in this diagram;
as far as we are concerned it is unmodified, it is the virtual
address we intend to use). The first thing we see is the lower
14 bits of the translation table base are SBZ = should be zero.
Basically we need to have the translation table base aligned on a
16Kbyte boundary (2 to the 14th is 16K). It would not make sense
to use all zeros as the translation table base, we have our reset
and interrupt vectors at and near address zero in the arm's address
space, so the first sane address would be 0x00004000. The first
level descriptor table is indexed by the top 12 bits of the virtual
address, or 4096 entries, that is 16KBytes (not a coincidence), and
0x4000 + 0x4000 is 0x8000, where our arm program's entry point is, so we
have space there if we want to use it. But any address with the lower
14 bits being zero will work, so long as you have enough memory at that
address and you are not clobbering anything else that is using that
memory space.

So what this picture is showing us is that we take the top 12 bits
of the virtual address, multiply by 4 (shift left 2), and add that
to the translation table base; this gives the address of the first
level descriptor for that virtual address. The diagram shows the
first level fetch, which returns the 32 bit value that we have placed
in the table. We have to place a descriptor there that tells the
mmu to do what we want. If the lower 2 bits of that first level
descriptor are 0b10 then this is a 1MB Section. If it is a 1MB section
then the top 12 bits of the first level descriptor replace the top
12 bits of the virtual address to convert it into a physical address.
Understand here, first and foremost, that so long as we do the N = 0
thing, the first thing the mmu does is look at
the top 12 bits of the virtual address, always. If the lower two bits
of the first level descriptor are not 0b10 then we get into
a second level descriptor and more virtual address bits come into play,
but for now, if we start by learning just 1MB sections, the conversion
from virtual to physical only cares about the top 12 bits of the
address. So for 1MB sections we don't have to concentrate on every
actual address we are going to access, we only need to think about
the 1MB aligned ranges. The uart for example on the raspi 1 has
a number of registers that start with 0x202150xx; if we use a 1MB
section for those we only care about the 0x202xxxxx part of the
address. To not have to change our code we would want to have
virtual = physical for that section and mark it as not cacheable.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678, then the hardware is going to take the top 12 bits of that
address, 0x123, multiply by 4 and add that to the MMUTABLEBASE:
0x4000+(0x123<<2) = 0x0000448C. And that is the address the mmu is
going to use for the first-level lookup. Ignoring the other bits in
the descriptor for now, if the first-level descriptor has the value
0xABC00002, the lower two bits are 0b10, a 1MB section, so the top
12 bits replace the virtual address's top 12 bits and our 0x12345678
is converted to the physical address 0xABC45678.

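If it helps, here is that walk written out in C, a sketch of what the
hardware does for a 1MB section (this function is mine, not part of the
example, and it assumes the GET32 helper used throughout these examples):

extern unsigned int GET32 ( unsigned int );

unsigned int section_translate ( unsigned int vadd )
{
    unsigned int desc;

    //first level descriptor address: table base plus top 12 bits times 4
    desc=GET32(MMUTABLEBASE+((vadd>>20)<<2));    //0x4000+(0x123<<2) = 0x448C
    if((desc&3)!=2) return(0xFFFFFFFF);          //not a 1MB section
    //top 12 bits from the descriptor, lower 20 from the virtual address
    return((desc&0xFFF00000)|(vadd&0x000FFFFF)); //0xABC00002 -> 0xABC45678
}
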
Now they have this optional thing called a supersection, which is a 16MB
sized thing rather than 1MB, and one might think that that would make
life easier, right? Wrong. No matter what, assuming the N = 0 thing,
the first level descriptor is found using the top 12 bits of the
virtual address, so in order to do some 16MB thing you need 16 entries,
one for each of the possible 1MB sections it covers. If you are already
generating 16 descriptors anyway, you might as well just make them 1MB
sections; you can read up on the differences between supersections and
sections and try them if you want. For what I am doing here I don't need
them, I just wanted to point out that you still need 16 entries per
supersection.

Hopefully I have not lost you yet with this address manipulation,
and maybe you are one step ahead of me: yes, EVERY fetch, load or store
with the mmu enabled requires at least one mmu table lookup. The mmu,
when it accesses this memory, does not go through itself, but EVERY
other fetch and load and store does. That has a performance cost,
so they do have a bit of a cache in the mmu to store the last so many
tlb lookups. That helps, but you cannot avoid the mmu having to do the
conversion on every address.

In the ARM ARM I am looking at, the subsection on first-level
descriptors has a table:

Table B4-1 First-level descriptor format (VMSAv6, subpages enabled)

What this is telling us is that if the first-level descriptor, the
32 bit number we place in the right place in the TLB, has the lower
two bits 0b10, then that entry defines a 1MB section and the mmu can get
everything it needs from that first level descriptor. But if the
lower two bits are 0b01 then this is a coarse page table entry and
we have to go to a second level descriptor to complete the
conversion from virtual to physical. Not every address will need
this, only the address ranges we want to divide more finely than
1MB. Or the other way of saying it: if we want to control an
address range in chunks smaller than 1MB then we need to use pages,
not sections. You can certainly use pages for the whole world, but
if you do the math, 4096 byte pages would mean your mmu tables need
to be 4MB+16K worst case. And you have to do more work to set that
all up.

The coarse_translation.ps file I have included in this repo starts
off the same way as a section, it has to, the logic doesn't know what
you want until it sees the first level descriptor. If it sees
0b01 as the lower 2 bits of the first level descriptor then this is
a coarse page table entry and it needs to do a second level fetch.
The second level fetch does not use the mmu tlb table base address;
bits 31:10 of the first level descriptor (the coarse page table base)
plus bits 19:12 of the virtual address (times 4) are where the second
level descriptor lives.
Note that is 8 more bits, so the section is divided into 256 parts. This
page table address is similar to the mmu table address, but it needs
to be aligned on a 1K boundary (lower 10 bits zero) and is worst
case 1KBytes in size.

The second level descriptor format defined in the ARM ARM (small pages
are most interesting here, subpages enabled) is a little different
than a first level section descriptor. We had a domain in the first
level descriptor to get here, but now we have direct access to four
sets of AP bits. You/I would have to read more to know what the
difference is between the domain defined access and these additional
four; for now I don't care, this is bare metal, set them to full access
(0b11) and move on (see below about domain and ap bits).

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again. The first level descriptor address is the top 12
bits of the virtual address, 0x123, times 4, added to the MMUTABLEBASE:
0x448C. But this time when we look it up we find a value in the
table that has the lower two bits being 0b01. Just to be crazy let's
say that descriptor was 0xABCDE001 (ignoring the domain and other
bits, just talking address right now). That means we take 0xABCDE000,
the picture shows bits 19:12 (0x45) of the virtual address (0x12345678),
so the address of the second level descriptor in this crazy case is
0xABCDE000+(0x45<<2) = 0xABCDE114. Why is that crazy? Because I
chose an address where we in theory don't have ram on the raspberry pi,
maybe a mirrored address space, but a sane address would have been
somewhere close to the MMUTABLEBASE so we can keep the whole of the
mmu tables in a confined area. I used this address simply for
demonstration purposes, it is not based on a workable solution.

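And a C flavored sketch of the two level walk just described (again
mine, not the example's code; small pages, subpages enabled, GET32
assumed as before):

unsigned int small_translate ( unsigned int vadd )
{
    unsigned int first;
    unsigned int second;

    first=GET32(MMUTABLEBASE+((vadd>>20)<<2));   //0x448C -> 0xABCDE001
    if((first&3)!=1) return(0xFFFFFFFF);         //not a coarse page table
    //coarse table base is bits 31:10, index is VA bits 19:12 times 4
    second=GET32((first&0xFFFFFC00)+(((vadd>>12)&0xFF)<<2)); //0xABCDE114
    if((second&3)!=2) return(0xFFFFFFFF);        //not a small page descriptor
    //top 20 bits from the descriptor, lower 12 from the virtual address
    return((second&0xFFFFF000)|(vadd&0x00000FFF));
}
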
The "other" bits in the descriptors are the domain, the TEX bits,
|
||||
the C and B bits, domain and AP.
|
||||
|
||||
The C bit is the simplest one to start with that means Cacheable. For
|
||||
peripherals we absolutely dont want them to be cached. For ram, maybe.
|
||||
|
||||
The b bit, means bufferable, as in write buffer. Something you may
|
||||
not have heard about or thought about ever. It is kind of like a cache
|
||||
on the write end of things instead of read end. I digress, when
|
||||
a processor writes something everything is known, the address and
|
||||
data. So just like when you give a letter to the post(wo)man as
|
||||
far as you are concerned you are done, you dont need to wait for it
|
||||
to actually make it all the way to its destination. You can go on with
|
||||
your day. Likewise if you have 10 letters to send, if you keep going
|
||||
with this though you could fill up the mail truck then you would have
|
||||
to wait for another and then you could go on with your day. A write
|
||||
buffer is the same deal. For reads we have to wait for an answer so it
|
||||
doesnt work the same way but writes we have this option. Why not use
|
||||
it all the time? Well we dont have control over it, the writes happen
|
||||
at some unknown to us time in the future, we can possibly get into a
|
||||
cache coherency like problem of assuming something was written when
|
||||
it wasnt.
|
||||
|
||||
Now the TEX bits you just have to look up, and there is the rub, there
is likely more than one set of tables for TEX, C and B. I am going
to stick with a TEX of 0b000 and not mess with any fancy features
there. Depending on whether this is considered an older arm
(ARMv5) or an ARMv6 or newer, the combinations of TEX, C and B have
some subtle differences. The cache bit in particular does enable
or disable this space as cacheable. That simply asserts bits on
the AMBA/AXI (memory) bus that mark the transaction as cacheable;
you still need a cache, and need it set up and enabled, for the
transaction to actually get cached. If you don't have the cache for
that transaction type enabled then it just does a normal memory (or
peripheral) operation. So we set TEX to zeros to keep it out of the
way.

Lastly the domain and AP bits. Now you will see a 4 bit domain thing
and a 2 bit domain thing. These are related. There is a register in
the MMU right next to the translation table base address register, a
32 bit register that contains 16 different domain definitions, two
bits per domain.

The two bit domain controls are defined as such:

0b00 No access   Any access generates a domain fault
0b01 Client      Accesses are checked against the access permission bits in the TLB entry
0b10 Reserved    Using this value has UNPREDICTABLE results
0b11 Manager     Accesses are not checked against the access permission bits in the TLB
                 entry, so a permission fault cannot be generated

For starters we are going to set all of the domains to 0b11, don't
check, can't fault. What are these 16 domains though? Notice it takes 4
bits to select one of 16 things. The different domains have no specific
meaning other than that we can have 16 different definitions that we
control for whatever reason. You might allow for 16 different
threads running at once in your operating system, or 16 different
types of software running (kernel, application, ...). You can mark
a bunch of sections as belonging to one particular domain, and with a
simple change to that domain control register a whole domain might
go from one type of permission to another, from no checking to
no access for example. By just writing this domain register you can
quickly change which address spaces have permission and which ones don't
without necessarily changing the mmu table.

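A sketch of how that 32 bit domain value gets built (the variable and
arithmetic here are mine; the real register write happens in the
start_MMU assembly shown later):

unsigned int dacr;

dacr=0xFFFFFFFF;      //all 16 domains 0b11 manager, don't check, can't fault
dacr&=~(3u<<(1*2));   //example: flip domain 1 to 0b00 no access (0xFFFFFFF3),
                      //which is what the mvn/bic in start_MMU below does
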
Since I usually use the MMU in bare metal just to enable data caching on
ram, I set my domain controls to 0b11, no checking, and I simply make
all the MMU sections domain number 0.

So we end up with this simple function that allows us to add first level
section descriptors to the MMU translation table.

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);  //address of the first level descriptor
    ra=padd>>20;
    rc=(ra<<20)|flags|2;      //top 12 bits of physical, flags, 0b10 = section
    PUT32(rb,rc);
    return(0);
}

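As a sanity check, here is one of the calls used further down worked
through this function by hand (my arithmetic, assuming MMUTABLEBASE is
0x00004000):

mmu_section(0x20200000,0x20200000,0x0000);
//ra = 0x20200000>>20       = 0x202
//rb = 0x4000|(0x202<<2)    = 0x00004808
//rc = (0x202<<20)|0x0000|2 = 0x20200002
//so it does PUT32(0x00004808,0x20200002);
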
So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries
for all of it. This is important: if you forget something and don't have
a valid entry there, then you fault, and your fault handler, if you have
chosen to write one, may itself fault for the same reason.

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section, 0x000xxxxx, so we should make that section cacheable and
bufferable.

mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx and
enable the cache and write buffer. 0x8 is the C bit and 0x4 is the B
bit. tex, domain, etc are zeros.

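If the raw numbers bother you, you could give those bits names; these
defines are mine, not something the example declares:

#define MMU_SECTION_C 0x8  //bit 3 of a section descriptor, cacheable
#define MMU_SECTION_B 0x4  //bit 2 of a section descriptor, bufferable

mmu_section(0x00000000,0x00000000,MMU_SECTION_C|MMU_SECTION_B);
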
If we wanted to use all 256MB we would need to do this for all the
sections from 0x000xxxxx to 0x0FFxxxxx. Actually I changed the code so
that the first thing it does is map everything virtual = physical with
no caching.

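That loop looks something like this (a sketch, the real code in the
example may differ in detail):

{
    unsigned int ra;

    for(ra=0;;ra+=0x00100000)
    {
        mmu_section(ra,ra,0x0000);  //virtual = physical, not cached
        if(ra==0xFFF00000) break;   //last of the 4096 sections
    }
}
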
We know that for the pi1 the peripherals, uart and such, are in ARM
physical space at 0x20xxxxxx. So we either need 16 1MB section sized
entries to cover that whole range, or we look at the specific sections
for specific things we care to talk to and just add those. The uart and
the gpio it is associated with are in the 0x202xxxxx space. There are
a couple of timers in the 0x200xxxxx space, so one entry can cover those.

If we don't want to allow those to be cached or write buffered then:

mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

(yes we already did this when we had a loop map the whole world)

Now you have to think on a system level here, there are a number
of things in play. We need to plan our memory space: where are we
putting the MMU table, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d cache,
then just map the whole world virtual = physical if you want, with the
peripherals not cached and the rest cached.

So once our tables are set up we need to actually turn the
MMU on. Now I can't figure out where I got this sequence from, and I
have modified it in this repo. According to this manual it was with the
ARMv6 that we got the DSB feature, which says wait for either the cache
or the MMU to finish something before continuing. In particular, when
initializing a cache to start it up you want to clean out all the
entries in a safe way. You don't want to evict them and hose memory,
you want to invalidate everything, mark it such that the cache lines
are empty/available by throwing away what was there, not saving it.
Likewise that little bit of TLB caching the MMU has, we want to
invalidate that too so we don't start up the mmu with entries in there
that don't match our table.

Why are we invalidating the cache in mmu init code? Because first we
need the mmu to use the d cache (to protect the peripherals from
being cached) and second the controls that enable the mmu are in the
same register as the i and d cache controls, so it made sense to do both
mmu and cache stuff in one function.

So after the DSB we set our domain control bits. Now in this example
I have done something different: 15 of the 16 domains have the 0b11
setting, which is don't fault on anything, manager mode. I set domain
1 such that it has no access, so later in the example I will change one
of the descriptor table entries to use domain 1, then access
it and see the access violation. I am also programming both
translation table base addresses even though we are using the N = 0
mode and only one is needed. It depends on which manual you read, I
guess, as to whether or not you see the N = 0 case and separate or
shared i and d mmu tables (the reason for two registers is if you want
your i and d address spaces to be managed separately).

Understand I have been running on ARMv6 systems without the DSB and it
just works, so maybe that was dumb luck...

This code relies on the caller to pass in the MMU enable and the I and D
cache enables. This is because it is derived from code where
sometimes I turn things on or don't turn things on, and I wanted it
generic.

.globl start_MMU
start_MMU:
    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??

    mvn r2,#0
    bic r2,#0xC
    mcr p15,0,r2,c3,c0,0 ;@ domain

    mcr p15,0,r0,c2,c0,0 ;@ tlb base
    mcr p15,0,r0,c2,c0,1 ;@ tlb base

    mrc p15,0,r2,c1,c0,0 ;@ read the control register
    orr r2,r2,r1         ;@ or in the caller's enable bits
    mcr p15,0,r2,c1,c0,0 ;@ write it back, mmu/caches on

    bx lr

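From the C side I would declare it something like this (a sketch; r0 is
the physical table base and r1 is OR'd into the CP15 control register,
where bit 0 is the MMU enable, bit 2 the D cache enable and bit 12 the
I cache enable, which is where the 0x00000001|0x1000|0x0004 used later
comes from):

extern void start_MMU ( unsigned int tlbbase, unsigned int enables );
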
I am going to mess with the translation tables after the MMU is started,
so the easiest way to deal with the TLB cache is to invalidate it, but I
don't need to mess with the main L1 cache. ARMv6 introduces a feature to
help with this, but I am going with this solution.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr

Something to note here. Debugging using a JTAG based on-chip debugger
makes life easier than removing sd cards, or in the old days pulling an
eeprom out and putting it in an eraser then a programmer. BUT,
it is not completely without issue. When and where and if you hit this
depends heavily on the core you are using and the jtag tools and the
commands you remember/prefer. This is a basic cache coherency problem
in a self modifying code kind of way. When we use the jtag debugger
to write instructions to memory, the debugger uses the ARM bus and does
data writes, which do not go through the instruction cache. So if
there is an instruction at address 0xD000 in the instruction cache when
we stopped the ARM, and we write a new instruction from our new program
to address 0xD000, when we start the ARM again, if that 0xD000 entry
doesn't get invalidated to make room for other instructions by the time
we get to it, it will execute the old stale instruction from one or more
programs we ran in the past. Randomly mixing instructions from
different programs just doesn't work. Again, some of the debuggers
and/or cores will disable caching when you use jtag, but some, like this
ARM11, may not, and this becomes a very real problem if you don't deal
with it in some way (never enable the I cache, never use the jtag
debugger if using the I cache, see if your tools can disable the I cache
before running the next program, etc). You also have to be aware of
whether the I and D caches are shared, and if so whether that helps you
or not. Read your docs.

So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces, so that later we
can play with the section descriptors and demonstrate virtual to
physical address conversion.

So write some stuff and print it out on the uart.

PUT32(0x00045678,0x00045678);
PUT32(0x00145678,0x00145678);
PUT32(0x00245678,0x00245678);
PUT32(0x00345678,0x00345678);

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

then setup the mmu with at least those four sections and the peripherals

mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

actually the example now loops through the whole address space then
does the two peripheral lines even though they are redundant.

and start the mmu with the I and D caches enabled

start_MMU(MMUTABLEBASE,0x00000001|0x1000|0x0004);

then if we read those four addresses again we get the same output
as before since we mapped virtual = physical.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

but what if we swizzle things around: make virtual 0x001xxxxx =
physical 0x003xxxxx, 0x002 look at 0x000 and 0x003 look at 0x001
(don't mess with the 0x00000000 section, that is where our program is
running).

mmu_section(0x00100000,0x00300000,0x0000);
mmu_section(0x00200000,0x00000000,0x0000);
mmu_section(0x00300000,0x00100000,0x0000);

and maybe we don't need to do this, but do it anyway just in case

invalidate_tlbs();

read them again.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

the 0x000xxxxx entry was not modified, so we get 00045678 as the output,
but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we
get the 00345678 output, 0x002xxxxx comes from the 0x000xxxxx space
so that read gives 00045678, and 0x003xxxxx is mapped to 0x001xxxxx
physical, giving 00145678 as the output.

So up to this point the output looks like this.

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

first blob is without the mmu enabled, second with the mmu but
virtual = physical, third we use the mmu to show virtual != physical
for some ranges.

Now for some small pages. I made this function to help out, note that
it sets up both the first and second level descriptors.

unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;
    rb=MMUTABLEBASE|(ra<<2);
    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1;  //0b01 = coarse page table
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //first level descriptor
    ra=(vadd>>12)&0xFF;
    rb=(mmubase&0xFFFFFC00)|(ra<<2);
    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2;       //AP bits full access, 0b10 = small page
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //second level descriptor
    return(0);
}

So before turning on the mmu, some physical addresses were written
with some data. The function takes the virtual address, the physical
address, flags, and where you want the secondary table to be. Remember
secondary tables can be up to 1K in size and are aligned on a 1K
boundary.

mmu_small(0x0AA45000,0x00145000,0,0x00000400);
mmu_small(0x0BB45000,0x00245000,0,0x00000800);
mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
mmu_small(0x0DD45000,0x00345000,0,0x00001000);
mmu_small(0x0DD46000,0x00146000,0,0x00001000);
//put these back
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
invalidate_tlbs();

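Working the first of those calls through mmu_small by hand (my
arithmetic, MMUTABLEBASE of 0x00004000 assumed):

mmu_small(0x0AA45000,0x00145000,0,0x00000400);
//first level:  rb = 0x4000|(0x0AA<<2)    = 0x000042A8
//              rc = 0x00000400|1         = 0x00000401  (coarse table at 0x400)
//second level: rb = 0x400|(0x45<<2)      = 0x00000514
//              rc = 0x00145000|0xFF0|0|2 = 0x00145FF2  (small page -> 0x00145xxx)
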
Now why did I use different secondary table addresses most of the
time, but not all of the time? All accesses go through the first level
descriptor before determining if they need a second. In order for
two small page entries in the same 1MB range to work, they have to share
the same first level descriptor, and thus have to live in the same
secondary table, so if you use this function with addresses whose top 12
bits match, their secondary table addresses have to match. And unless
you think through a safe way to do it, if the upper 12 bits don't match
then just use a different secondary table address.

If you were to do this instead

mmu_small(0x0DD45000,0x00345000,0,0x00001000);
mmu_small(0x0DD46000,0x00146000,0,0x00001400);

that would be a bug, because the first line would build its second level
entry based on 0x1000, then the second line would rewrite the first
level descriptor to point both of them at 0x1400 and set its second
level entry based on 0x1400, so the first line's entry is never used,
that address gets whatever it finds in the 0x1400 table.

So this basically points some small pages at the memory we set up
in the beginning. Those last two small page entries demonstrate
that we have separated from a section and now see small pages.

The last example is just demonstrating an access violation, changing
the domain of a section to the one domain we did not set full access to
(the 0x0020 flag value puts a 1 in the domain field, bits 8:5, of the
section descriptor).

//access violation.

mmu_section(0x00100000,0x00100000,0x0020);
invalidate_tlbs();

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

The first 0x00045678 read still works, but the 0x00145678 read goes
through the first level descriptor we just switched to domain 1, which
has no access, so we take a data abort:

00045678
00000010

How do I know what that means from that output? Well, from my blinker05
example we touched on exceptions (interrupts). I made a generic test
fixture such that anything other than a reset prints something out
and then hangs. In no way shape or form is this a complete handler,
but what it does show is that it is the exception vector at address
0x00000010 that gets hit, which is the data abort. So having figured out
it was a data abort (pretty much expected), have the handler then read
the fault status registers; being a data access we expect the
data/combined one to show something and the instruction one not to.
Adding that instrumentation resulted in:

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail, and the fault status value 0x19 breaks down as: the 0x1 is
domain 1, the domain we used for that access, and the 0x9 means Domain
Section Fault.

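Pulling those fields out of the fault status value in C looks something
like this (a sketch; 0x19 is the value printed above):

unsigned int dfsr;
unsigned int domain;
unsigned int status;

dfsr=0x00000019;       //read via mrc p15,0,rX,c5,c0,0 in the abort handler
domain=(dfsr>>4)&0xF;  //0x1, the domain we moved that section to
status=dfsr&0xF;       //0x9, domain fault on a section
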
The lr during the abort shows us the instruction, which you would need
to disassemble to figure out the address, or at least that is one
way to do it, perhaps there is a status register for that.

The instruction and the address match our expectations for this fault.

This is simply a basic intro, just enough to be dangerous. The MMU
is one of the simplest peripherals to program, so long as bit
manipulation is not something that causes you to lose sleep. What makes
it hard is that if you mess up even one bit, or forget even one thing,
you can crash in spectacular ways (often silently, without any way of
knowing what happened). Debugging can be hard at best.

The ARM ARM indicates that ARMv6 adds the feature of separating
the I and D sides from an mmu perspective, which is an interesting
thought (see the jtag debugging comments, and think about how this can
affect you re-loading a program into ram and running it); you have
enough ammo to try that.