See the top level README file for more information on documentation
and how to run these programs.

This example demonstrates ARM MMU basics.

You will need the ARM ARM (ARM Architecture Reference Manual) for
ARMv5. I have included a couple of pages in this repo, but you will
still need the ARM ARM.

This code does not work on the Raspberry Pi 2 yet. I will get that
working at some point. The knowledge here still applies; I expect the
differences between ARMv6 and ARMv7 to be subtle, but we will see.

-- NEED TO RE-WRITE THIS AGAIN, SUBPAGES ENABLED, COARSE 1KB TABLES --

So what an MMU does, or at least what an MMU does for us, is translate
virtual addresses into physical addresses, check access permissions,
and give us control over cacheable regions.

So what does all of that mean?

There is a boundary inside the chip around the ARM core. Part of that
boundary is the memory interface for the ARM, for lack of a better
term how the ARM accesses the world. Nothing special, all processors
have some sort of address and data based interface between the
processor and the ram and peripherals. That boundary uses physical
addresses; it is on the memory side or "world side" of the ARM's mmu.
Within the ARM core there is the "processor side" of the mmu, and all
load and store (and fetch) accesses to the world go through the mmu.

When the ARM powers up the mmu is disabled, which means all accesses
pass through unmodified, making the "processor side" or virtual
address space equal to the world side physical address space. All of
my examples thus far, blinkers and such, are based on physical
addresses. We already know that elsewhere in the chip there is another
address translation of some sort, because the manual is written for
0x7Exxxxxx based addresses, but the ARM's physical addresses for those
same things are 0x20xxxxxx for the raspi 1 and 0x3Fxxxxxx for the
raspi 2. For this discussion we do not care about that other mystery
address translation, we only care about the ARM and the ARM mmu.

So when I say the mmu translates virtual addresses into physical
addresses, what that means is that on the processor side there is an
address you are accessing, but that does not have to be the same
address on the physical side of the mmu. Let's say for example I am
running a program on an operating system, Linux let's say. I need to
compile that program before I can use it and I need to link it for an
address space, so let's say that I link it to enter at address 0x8000
and use memory from 0x0000 to whatever I need and/or whatever is
available. That is all fine, except what if I have two programs and I
want both running "at the same time"? How can both use the same
address space without clobbering each other? The answer is that
neither is really at that address. WHEN RUNNING, each of them sees the
virtual address space 0x00000000 to some number, but in reality
program 1 might have that mapped to the physical address 0x01000000
and program 2 might have its 0x00000000 to some number mapped to
0x02000000. So when program 1 thinks it is writing to address 0xABCDE
it is really writing to 0x010ABCDE, and when program 2 thinks it is
writing to address 0xABCDE it is really writing to 0x020ABCDE.

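A minimal sketch of the two program example in C. The helper name
program_virt_to_phys and the two base defines are made up for this
illustration, and a real mmu does this with table lookups rather than
an add, so treat it only as the arithmetic behind the paragraph above.

```c
#include <stdint.h>

/* Hypothetical physical bases for the two programs in the text. */
#define PROGRAM1_PHYS_BASE 0x01000000u
#define PROGRAM2_PHYS_BASE 0x02000000u

/* Illustration only: each program sees virtual addresses starting at
   zero, and the physical address is its base plus that offset. */
uint32_t program_virt_to_phys(uint32_t phys_base, uint32_t vaddr)
{
    return phys_base + vaddr;
}
```

With these assumed bases, virtual 0xABCDE lands at 0x010ABCDE for
program 1 and 0x020ABCDE for program 2, same as the text.
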
If you think about it, it doesn't make any sense to allow any virtual
address to map to any physical address, for example from 0x12345678 to
0xAABBCCDD. We are talking about a 32 bit address space, or 4Giga
addresses. If we allowed any address to convert to any other address
we would need a 4Giga entry map, and with a 32 bit physical address in
each entry that is 16Gigabytes just to hold the table, worst case. To
cut to the chase, ARM has one option where the top 12 bits of the
virtual address get translated to 12 bits of physical address, and the
lower 20 bits are the same between the virtual and physical. This
means we can control 1MByte of address space with one definition, and
have 4096 entries in some table somewhere to convert from virtual to
physical. That is quite manageable. The minimum we would need to store
is the 12 replacement bits per table entry, but ARM uses a full 32 bit
entry, which for this 1MB flavor has the 12 physical bits plus some
other control bits.

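If you want to check the numbers in that paragraph, here is a small
sketch in C. The function and define names are mine, not from the ARM
ARM or this repo; it only redoes the arithmetic.

```c
#include <stdint.h>

#define ADDRESS_BITS        32u
#define SECTION_OFFSET_BITS 20u  /* 1MB = 2 to the power 20 bytes */
#define DESCRIPTOR_BYTES     4u  /* one 32 bit entry */

/* Number of 1MB sections in a 32 bit address space. */
uint32_t section_count(void)
{
    return 1u << (ADDRESS_BITS - SECTION_OFFSET_BITS);
}

/* Size of a table with one 32 bit entry per 1MB section. */
uint32_t section_table_bytes(void)
{
    return section_count() * DESCRIPTOR_BYTES;
}

/* Size of a table with one 32 bit entry per byte address. */
uint64_t per_address_table_bytes(void)
{
    return ((uint64_t)1 << ADDRESS_BITS) * DESCRIPTOR_BYTES;
}
```

That works out to 4096 sections and a 16KByte table, versus
16Gigabytes for the any address to any address map.
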
What does cacheable regions mean? The mmu also gives you the ability
to choose, per descriptor, whether or not you want to enable caching
on that block. One obvious reason would be for the peripherals. Think
about a timer: ideally each time you read the current timer tick you
get the current value, and as it changes you see it change. But what
if, when we turned on the data cache, it covered all addresses, all
loads and stores? Then you read the timer once, get a value, read it
again, and now you get the cached value over and over again. You don't
see the real timer value in the peripheral. That is not good, you
cannot manage a peripheral if you cannot read its status register or
read the data coming out of it, etc. So at a minimum your peripherals
need to be in non-cached blocks. Likewise, if you have some ram that
is shared by more than one resource, say the GPU and the ARM, or for
the raspberry pi 2 shared between multiple ARM cores, you have a
similar situation: another resource may change the ram on the far side
of your cache, but your cache assumes it has a copy of what is in ram.
Basically a cache only helps you if whatever is on the far side of it
is only modified by writes through the cache. If there are other ways
to change the data on the far side you should not cache that area. The
mmu gives you the ability to control cached and non-cached spaces.

What is meant by access permissions? Let's think about those two
programs running "at the same time" on some operating system (Linux
for example). You don't want to allow one program to gain access to
the operating system's data nor some other program's data. Sure, some
operating systems are meant only for running trusted and well mannered
programs. But you don't want some video game on your home computer to
have access to your banking account data in another window/program, do
you? The mechanisms vary across processor families, but an important
job for the mmu is to provide a protection mechanism, such that when a
particular program has a time slice on the processor there is some
mechanism to allow or restrict memory spaces. If some code accesses an
address that it does not have permission for then an abort happens and
the processor is notified. An interesting side effect of this is that
it doesn't have to be fatal, in fact it could be by design. Think of a
virtual machine: you could let the virtual machine software run on the
processor, and when it accesses one of its peripherals the real
operating system gets an abort, but instead of killing the virtual
machine it simulates the peripheral and lets the virtual machine keep
running. Another one that you have probably run into is when you run
out of ram in your computer: the notion of virtual memory, which is
different from virtual address space. Virtual memory in this case is
when your program ventures off the end of its allowed address space
into ram it thinks it has. The operating system gets an abort, finds
some ram belonging to some other program, swaps that ram to disk for
example, then gives the program that was running a little more ram by
mapping it in and allowing it to run. Later, when the program whose
data got swapped to disk needs it, that data is swapped back in and
whatever was in the ram it replaces goes to disk. The term swap comes
from the idea that these blocks of ram are swapped back and forth to
disk: program A's ram goes to disk and is swapped with program T's,
then program T's is swapped with program K's and so on. This is why,
right after you venture off that edge from real ram to virtual, your
computer's performance drops dramatically and disk activity goes way
up. The more things running the more swapping going on, and disk is
significantly slower than ram.

As with all baremetal programming, wading through documentation is
the bulk of the job. Definitely true here, with the unfortunate
problem that ARM's docs don't all look the same from one Architecture
Reference Manual to another. We have this other problem that we are
technically using an ARMv6 (architecture version 6) for the raspi 1,
but when you go to ARM's website there is an ARMv5 and then ARMv7 and
ARMv8, but no ARMv6. Well, the ARMv5 manual is actually the original
ARM ARM. I assume they realized they couldn't maintain all the
architecture variations forever in one document, so they, perhaps
wisely, went to one ARM ARM per rev. With respect to the MMU, the
ARMv5 reference manual covers the ARMv4 (I didn't know there was an
mmu option there), ARMv5 and ARMv6, and there is a mode such that you
can have the same code/tables and it works on all three, meaning you
don't have to if-then-else your code based on whatever architecture
you find. This raspi 1 example is based on subpages enabled, which is
this legacy or compatibility mode across the three.

I am mostly using the ARMv5 Architecture Reference Manual,
ARM DDI0100I.

The 1MB sections mentioned above are called...sections... The ARM mmu
also has blobs of smaller sizes, 4096 byte pages for example. I will
touch on those two sizes. The 4096 byte one is called a small page.

As mentioned above, with a 32 bit address space, 1MB is 20 bits, so
32-20 is 12 bits or 4096 possible combinations. In other words the
address space is broken up into 4096 1MB sections. The top 12 bits of
the virtual address get translated to 12 bits of physical. There are
no rules on the translation: you can have virtual = physical, or any
combination, or have a bunch of virtual sections point at the same
physical space, whatever you want/need.

ARM uses the term Virtual Memory System Architecture or VMSA, and they
say things like VMSAv6 to talk about the ARMv6 VMSA. There is a
section in the ARM ARM titled Virtual Memory System Architecture. In
there we see the coprocessor registers, specifically CP15 register 2
is the translation table base register.

So the ARMv5 ARM ARM (ARM Architecture Reference Manual) is what we
need now. See the top level README for finding this document. I have
included a few pages in the form of postscript, any decent pdf viewer
should be able to handle these files. Before the pictures though, the
section in question is titled Virtual Memory System Architecture. In
the CP15 subsection, register 2 is the translation table base
register. There are three opcodes which give us access to three
things: TTBR0, TTBR1 and the translation table base control register.

First we read this comment

If N = 0 always use TTBR0. When N = 0 (the reset case), the translation
table base is backwards compatible with earlier versions of the
architecture.

That is the one we want, we will leave that as N = 0, not touch it,
and use TTBR0.

Now what the TTBR0 description is initially telling me is that bits 31
down to 14-N, or 14 in our case since N = 0, are the base address in
PHYSICAL address space. Note the mmu cannot possibly go through the
mmu to figure out how to go through the mmu; the mmu itself only
operates on physical space and has direct access to it. In a second we
are going to see that we need the base address for the mmu table to be
aligned to 16384 bytes (2 to the power 14, so the lower 14 bits of our
TLB base address need to be all zeros).

We write that register using

    mcr p15,0,r0,c2,c0,0 ;@ tlb base

TLB = Translation Lookaside Buffer. Strictly speaking the TLB is the
small cache inside the mmu that remembers recent translations, and the
array in memory is the translation table, but I use TLB loosely for
both. As far as we are concerned think of the table as an array of 32
bit integers, each integer (descriptor) being used to completely or
partially convert from virtual to physical and to describe permissions
and caching.

My example is going to have a define called MMUTABLEBASE which will
be where we start our TLB table.

Here is the reality of the world. Some folks struggle with bit
manipulation, orring and anding and shifting and such, some don't. The
MMU is logic, so it operates on these tables the way that logic would,
meaning from a programmer's perspective it is a lot of bit
manipulation, but otherwise it is relatively simple, similar to
something a program could do. As programmers we need to know how the
logic uses portions of the virtual address to look into this
descriptor table or TLB, and then extracts from those bits the next
thing it needs to do. We have to know this so that for a particular
virtual address we can place the descriptor we want in the place where
the hardware is going to find it. So we need a few lines of code plus
some basic understanding of what is going on. Just like bit
manipulation causes some folks to struggle, reading a chapter like
this mmu chapter can be equally daunting. It is nice to have someone
hold your hand through it. Hopefully I am doing more good than bad in
that respect.

There is a file, section_translation.ps, in this repo; you should be
able to use a pdf viewer to open it. The figure on the second page
shows just the address translation from virtual to physical for a 1MB
section. This picture uses X instead of N; we are using N = 0 so that
means X = 0. The translation table base at the top of the diagram is
our MMUTABLEBASE, the address in physical space of the beginning of
our first level TLB or descriptor table. The first thing we need to do
is find the table entry for the virtual address in question (the
Modified virtual address in this diagram; as far as we are concerned
it is unmodified, it is the virtual address we intend to use). The
first thing we see is that the lower 14 bits of the translation table
base are SBZ = should be zero. Basically we need to have the
translation table base aligned on a 16Kbyte boundary (2 to the 14th is
16K). It would not make sense to use all zeros as the translation
table base, since we have our reset and interrupt vectors at and near
address zero in the ARM's address space, so the first sane address
would be 0x00004000. The first level table is indexed by the top 12
bits of the virtual address, so 4096 entries of 4 bytes each, which is
16KBytes (not a coincidence). 0x4000 + 0x4000 is 0x8000, where our
arm program's entry point is, so we have space there if we want to use
it. But any address with the lower 14 bits being zero will work so
long as you have enough memory at that address and you are not
clobbering anything else that is using that memory space.

So what this picture is showing us is that we take the top 12 bits of
the virtual address, multiply by 4 (shift left 2), and add that to the
translation table base. This gives the address of the first level
descriptor for that virtual address. The diagram shows the first level
fetch, which returns a 32 bit value that we have placed in the table.
If the lower 2 bits of that first level descriptor are 0b10 then this
is a 1MB section, and the top 12 bits of the first level descriptor
replace the top 12 bits of the virtual address to convert it into a
physical address. Understand here, first and foremost, that so long as
we do the N = 0 thing, the first thing the mmu does is look at the top
12 bits of the virtual address, always. If the lower two bits of the
first level descriptor are not 0b10 then we get into a second level
descriptor and more virtual bits come into play. But for now, if we
start by learning just 1MB sections, the conversion from virtual to
physical only cares about the top 12 bits of the address. So for 1MB
sections we don't have to concentrate on every actual address we are
going to access, we only need to think about the 1MB aligned ranges.
The uart for example on the raspi 1 has a number of registers that
start with 0x202150xx. If we use a 1MB section for those we only care
about the 0x202xxxxx part of the address. To not have to change our
code we would want to have virtual = physical for that section and not
mark it as cacheable.

So if my MMUTABLEBASE was 0x00004000 and I had a virtual address of
0x12345678 then the hardware is going to take the top 12 bits of that
address, 0x123, multiply by 4 and add that to the MMUTABLEBASE.
0x4000+(0x123<<2) = 0x448C, and that is the address the mmu is going
to use for the first-level lookup. Ignoring the other bits in the
descriptor for now, if the first-level descriptor has the value
0xABC00002, the lower two bits are 0b10, a 1MB section, so the top 12
bits replace the virtual address's top 12 bits and our 0x12345678 is
converted to the physical address 0xABC45678.

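Here is the same walk written out as C, as a sketch of what the
hardware does for a 1MB section with N = 0. The function names are
made up for this illustration, and it ignores supersections and all of
the control bits.

```c
#include <stdint.h>

/* Address of the first-level descriptor: table base plus the top 12
   bits of the virtual address times 4. */
uint32_t first_level_descriptor_address(uint32_t table_base, uint32_t vaddr)
{
    return table_base + ((vaddr >> 20) << 2);
}

/* Nonzero if the lower two bits of the descriptor are 0b10. */
int is_section_descriptor(uint32_t descriptor)
{
    return (descriptor & 0x3u) == 0x2u;
}

/* Top 12 bits come from the descriptor, lower 20 bits from the
   virtual address. */
uint32_t section_physical_address(uint32_t descriptor, uint32_t vaddr)
{
    return (descriptor & 0xFFF00000u) | (vaddr & 0x000FFFFFu);
}
```

Feeding in a table base of 0x4000, a virtual address of 0x12345678 and
a descriptor of 0xABC00002 gives the descriptor address 0x448C and the
physical address 0xABC45678 from the text.
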
Now they have this optional thing called a supersection, which is a
16MB sized thing rather than 1MB, and one might think that would make
life easier, right? Wrong. No matter what, assuming the N = 0 thing,
the first level descriptor is found using the top 12 bits of the
virtual address, so in order to do some 16MB thing you need 16
entries, one for each of the possible 1MB sections. If you are already
generating 16 descriptors you might as well just make them 1MB
sections. You can read up on the differences between supersections and
sections and try them if you want. For what I am doing here I don't
need them, I just wanted to point out you still need 16 entries per
supersection.

Hopefully I have not lost you yet with this address manipulation, and
maybe you are one step ahead of me: yes, EVERY load and store with the
mmu enabled needs a translation, which means at least one mmu table
lookup unless the result is already cached. The mmu does not go
through itself when it accesses this table memory, but EVERY other
fetch and load and store does go through it. That does have a
performance hit, so they have a bit of a cache in the mmu to store the
last so many tlb lookups. That helps, but you cannot avoid the mmu
having to do the conversion on every address.

In the ARM ARM I am looking at, the subsection on first-level
descriptors has a table:

Table B4-1 First-level descriptor format (VMSAv6, subpages enabled)

What this is telling us is that if the first-level descriptor, the 32
bit number we place in the right place in the TLB, has the lower two
bits 0b10, then that entry defines a 1MB section and the mmu can get
everything it needs from that first level descriptor. But if the lower
two bits are 0b01 then this is a coarse page table entry and we have
to go to a second level descriptor to complete the conversion from
virtual to physical. Not every address will need this, only the
address ranges we want to be more finely divided than 1MB. Or the
other way of saying it is, if we want to control an address range in
chunks smaller than 1MB then we need to use pages, not sections. You
can certainly use pages for the whole world, but if you do the math,
4096 byte pages would mean your mmu tables need to be 4MB+16K worst
case (4096 coarse tables of 1KB each plus the 16K first level table).
And you have to do more work to set all that up.

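The 4MB+16K number can be checked with a few lines of C. This is just
the math, with names I made up for the sketch; it assumes every one of
the 4096 first level entries points at its own 1KB coarse table.

```c
#include <stdint.h>

#define FIRST_LEVEL_ENTRIES 4096u  /* one per 1MB of virtual space */
#define SMALL_PAGES_PER_MB   256u  /* 1MB divided by 4096 bytes */
#define DESCRIPTOR_BYTES       4u

/* One coarse table covers 1MB in 4096 byte pages. */
uint32_t coarse_table_bytes(void)
{
    return SMALL_PAGES_PER_MB * DESCRIPTOR_BYTES;
}

/* First level table plus a coarse table behind every entry. */
uint32_t worst_case_table_bytes(void)
{
    return (FIRST_LEVEL_ENTRIES * DESCRIPTOR_BYTES)
         + (FIRST_LEVEL_ENTRIES * coarse_table_bytes());
}
```
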
The coarse_translation.ps file I have included in this repo starts
off the same way as a section. It has to, the logic doesn't know what
you want until it sees the first level descriptor. If it sees 0b01 as
the lower 2 bits of the first level descriptor then this is a coarse
page table entry and it needs to do a second level fetch. The second
level fetch does not use the mmu translation table base address.
Instead, bits 31:10 of the first level descriptor (the coarse page
table base address) plus bits 19:12 of the virtual address (times 4)
are where the second level descriptor lives. Note that is 8 more bits,
so the section is divided into 256 parts. This page table address is
similar to the mmu table address, but it needs to be aligned on a 1K
boundary (lower 10 bits zeros) and the table can be worst case 1KByte
in size.

The second level descriptor format defined in the ARM ARM (small pages
are most interesting here, subpages enabled) is a little different
than a first level section. We had a domain in the first level
descriptor to get here, but now have direct access to four sets of AP
bits. You/I would have to read more to know what the difference is
between the domain defined AP and these additional four. For now I
don't care, this is bare metal, set them to full access (0b11) and
move on (see below about domain and ap bits).

So let's take the virtual address 0x12345678 and the MMUTABLEBASE of
0x4000 again. The first level descriptor address is the top three hex
digits (12 bits) of the virtual address, 0x123, times 4, added to the
MMUTABLEBASE: 0x448C. But this time when we look it up we find a value
in the table that has the lower two bits being 0b01. Just to be crazy
let's say that descriptor was 0xABCDE001 (ignoring the domain and
other bits, just talking address right now). That means we take
0xABCDE000 as the coarse page table base, and the picture shows bits
19:12 (0x45) of the virtual address (0x12345678) as the index, so the
address of the second level descriptor in this crazy case is
0xABCDE000+(0x45<<2) = 0xABCDE114. Why is that crazy? Because I chose
an address where we in theory don't have ram on the raspberry pi,
maybe a mirrored address space. A sane address would have been
somewhere close to the MMUTABLEBASE so we can keep the whole of the
mmu tables in a confined area. I used this address simply for
demonstration purposes, not as a workable solution.

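The second level fetch can be sketched in C the same way. The function
names are hypothetical, and the small page descriptor value in the
usage note below is invented for the example, not taken from this
repo.

```c
#include <stdint.h>

/* Nonzero if the first level descriptor is a coarse page table entry
   (lower two bits 0b01). */
int is_coarse_descriptor(uint32_t first_level_descriptor)
{
    return (first_level_descriptor & 0x3u) == 0x1u;
}

/* Bits 31:10 of the first level descriptor are the coarse table base,
   bits 19:12 of the virtual address (times 4) are the offset. */
uint32_t second_level_descriptor_address(uint32_t first_level_descriptor,
                                         uint32_t vaddr)
{
    uint32_t coarse_base = first_level_descriptor & 0xFFFFFC00u;
    uint32_t index = (vaddr >> 12) & 0xFFu;
    return coarse_base + (index << 2);
}

/* For a 4096 byte small page: bits 31:12 come from the second level
   descriptor, the lower 12 bits from the virtual address. */
uint32_t small_page_physical_address(uint32_t second_level_descriptor,
                                     uint32_t vaddr)
{
    return (second_level_descriptor & 0xFFFFF000u) | (vaddr & 0x00000FFFu);
}
```

With the first level descriptor 0xABCDE001 and the virtual address
0x12345678 this gives 0xABCDE114. If the descriptor found there were,
say, 0x00055002, the physical address would come out as 0x00055678.
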
The "other" bits in the descriptors are the domain, the TEX bits, the
C and B bits, and AP.

The C bit is the simplest one to start with, it means Cacheable. For
peripherals we absolutely don't want them to be cached. For ram,
maybe.

The B bit means bufferable, as in write buffer. Something you may not
have heard about or thought about ever. It is kind of like a cache on
the write end of things instead of the read end. When a processor
writes something everything is known, the address and the data. So the
next level of logic could, if so designed, accept that address and
data and release the processor to keep doing what it was doing
(ideally fetch some more instructions and keep running), while in
parallel that logic continues to perform the write to the slower
peripheral or really slow dram (or faster cache), giving us a small to
large performance gain. But what happens if, while we are doing that
first write, another write happens? Well, if we only have storage for
one transaction in this little feature then the processor has to wait
for us to finish the first write, however long that takes, then we can
grab the information for the second write and release the processor. I
call writes "fire and forget" because ideally the processor hands off
the info to the memory controller and keeps going; the memory
controller has all the info it needs to complete the task. For a read
the processor needs that data back so it basically has to wait. A
write buffer can store up to some number of addresses and data. It can
still fill up and have to hold the processor off. But it is to writing
what a cache is to reading: it has some faster ram that stages writes
so the processor, sometimes, can keep on going.

Now the TEX bits you just have to look up, and there is the rub: there
is likely more than one set of tables for TEX, C and B. I am going to
stick with a TEX of 0b000 and not mess with any fancy features there.
Depending on whether this is considered an older arm (ARMv5) or an
ARMv6 or newer, the combinations of TEX, C and B have some subtle
differences. The cache bit in particular does enable or disable this
space as cacheable. That simply asserts bits on the AMBA/AXI (memory)
bus that mark the transaction as cacheable; you still need a cache,
and need it set up and enabled, for the transaction to actually get
cached. If you don't have the cache for that transaction type enabled
then it just does a normal memory (or peripheral) operation. So we set
TEX to zeros to keep it out of the way.

Lastly the domain and AP bits. You will see a 4 bit domain thing and a
2 bit domain thing. These are related. There is a register in the MMU
right next to the translation table base address register. This one is
a 32 bit register that contains 16 different domain definitions, 2
bits each.

The two bit domain controls are defined as such (these are the domain
access settings, which decide whether the AP bits get checked at all)

0b00  No access  Any access generates a domain fault
0b01  Client     Accesses are checked against the access permission
                 bits in the TLB entry
0b10  Reserved   Using this value has UNPREDICTABLE results
0b11  Manager    Accesses are not checked against the access permission
                 bits in the TLB entry, so a permission fault cannot be
                 generated

For starters we are going to set all of the domains to 0b11: don't
check, can't fault. What are these 16 domains though? Notice it takes
4 bits to describe one of 16 things, which is the 4 bit domain number
in the descriptor. The different domains have no specific meaning
other than that we can have 16 different definitions that we control
for whatever reason. You might allow for 16 different threads running
at once in your operating system, or 16 different types of software
running (kernel, application, ...). You can mark a bunch of sections
as belonging to one particular domain, and with a simple change to
that domain control register a whole domain might go from one type of
permission to another, from no checking to no access for example. By
just writing this domain register you can quickly change which address
spaces have permission and which ones don't, without necessarily
changing the mmu table.

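A sketch of how that 32 bit domain register value could be built in
C, two bits per domain. The enum and function names are my own
invention; only the two bit encodings come from the table above.

```c
#include <stdint.h>

/* Two bit domain access values from the table above. */
enum {
    DOMAIN_NO_ACCESS = 0x0,
    DOMAIN_CLIENT    = 0x1,
    DOMAIN_MANAGER   = 0x3
};

/* Register value with all 16 domains set to the same access. */
uint32_t all_domains(uint32_t access)
{
    return (access & 0x3u) * 0x55555555u;
}

/* Return value with the field for one domain (0 to 15) replaced. */
uint32_t set_domain_access(uint32_t value, unsigned int domain,
                           uint32_t access)
{
    unsigned int shift = (domain & 0xFu) * 2u;
    value &= ~((uint32_t)0x3u << shift);
    value |= (access & 0x3u) << shift;
    return value;
}
```

All domains as manager is 0xFFFFFFFF. Taking domain 1 down to no
access gives 0xFFFFFFF3, which is the same value the mvn/bic pair in
start_MMU produces.
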
Since I usually use the MMU in bare metal to enable data caching on
ram, I set my domain controls to 0b11, no checking, and I simply make
all the MMU sections domain number 0.

So we end up with this simple function that allows us to add first
level descriptors to the MMU translation table.

unsigned int mmu_section ( unsigned int vadd, unsigned int padd, unsigned int flags )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20;                /* top 12 bits of the virtual address */
    rb=MMUTABLEBASE|(ra<<2);    /* address of the first level descriptor */
    ra=padd>>20;                /* top 12 bits of the physical address */
    rc=(ra<<20)|flags|2;        /* section base, flags, 0b10 = section */
    PUT32(rb,rc);
    return(0);
}

So what you have to do to turn on the MMU is to first figure out all
the memory you are going to access, and make sure you have entries for
all of it. This is important: if you forget something and don't have a
valid entry there, then you fault. Your fault handler, if you have
chosen to write one, may also fault if it isn't placed right or
something it accesses also faults... (I would assume the fault handler
is also behind the mmu but would have to read up on that).

So the smallest amount of ram on a raspi is 256MB or 0x10000000 bytes.

Our program enters at address 0x8000, so that is within the first
section, 0x000xxxxx, so we should make that section cacheable and
bufferable.

    mmu_section(0x00000000,0x00000000,0x0000|8|4);

This is saying map the virtual 0x000xxxxx to the physical 0x000xxxxx
and enable the cache and write buffer. 0x8 is the C bit and 0x4 is the
B bit. tex, domain, etc are zeros.

If we want to use all 256MB we would need to do this for all the
sections from 0x000xxxxx to 0x0FFxxxxx. Maybe do that later.

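Covering all 256MB is just a loop over the 1MB sections. Here is a
sketch that can be run on a host machine: instead of PUT32 into
MMUTABLEBASE it fills a plain array, and the names table_section,
map_identity and first_level_table are made up for this example. The
descriptor layout is the same one mmu_section builds.

```c
#include <stdint.h>

#define SECTION_SIZE  0x00100000u  /* 1MB */
#define SECTION_C_BIT 0x8u
#define SECTION_B_BIT 0x4u

/* Host-side stand-in for the 4096 entry first level table. */
static uint32_t first_level_table[4096];

/* Same descriptor as mmu_section(), written into the array. */
void table_section(uint32_t vadd, uint32_t padd, uint32_t flags)
{
    first_level_table[vadd >> 20] = (padd & 0xFFF00000u) | flags | 2u;
}

/* Map length bytes starting at start, virtual = physical, one 1MB
   section at a time. */
void map_identity(uint32_t start, uint32_t length, uint32_t flags)
{
    uint32_t count = length / SECTION_SIZE;
    uint32_t i;

    for (i = 0; i < count; i++) {
        uint32_t addr = start + (i * SECTION_SIZE);
        table_section(addr, addr, flags);
    }
}
```

Calling map_identity(0x00000000, 0x10000000, SECTION_C_BIT|SECTION_B_BIT)
fills entries 0x000 through 0x0FF and leaves entry 0x100 alone. On the
real hardware the loop body would call mmu_section instead.
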
We know that for the raspi 1 the peripherals, uart and such, are in
arm physical space at 0x20xxxxxx. To allow for more ram on the raspi 2
they needed to move that, and moved it to 0x3Fxxxxxx. So we either
need 16 1MB section sized entries to cover that whole range, or we
look at specific sections for the specific things we care to talk to
and just add those. The uart and the gpio it is associated with are in
the 0x202xxxxx space. There are a couple of timers in the 0x200xxxxx
space, so one entry can cover those.

If we didn't want to allow those to be cached or write buffered then
(the first two are for the raspi 1, the last two for the raspi 2)

    mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
    mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!
    mmu_section(0x3F000000,0x3F000000,0x0000); //NOT CACHED!
    mmu_section(0x3F200000,0x3F200000,0x0000); //NOT CACHED!

but we may play with that to demonstrate what caching a peripheral can
do to you, and why we need to turn on the mmu if for no other reason
than to get some bare metal performance by using the d cache.

Now you have to think on a system level here, there are a number of
things in play. We need to plan our memory space: where are we putting
the MMU table, where are our peripherals, where is our program.

If the only reason for using the mmu is to allow the use of the d
cache then just map the whole world virtual = physical if you want,
with the peripherals not cached and the rest cached.

If you are on the raspi 2 with multiple arm cores and are using the
multiple arm cores, you need to do more reading if you want one core
to talk to another by sharing some of the memory between them. It is
basically the same problem as peripherals: with multiple masters of
the ram/peripheral on the far side of my cache, how do I ensure what
is in my cache matches the far side? The easiest way is to not cache
that space. You need to read up on whether the cores share a cache or
have their own (or if l2, if present, is shared but l1 is not).
ldrex/strex were implemented specifically for multi core, but you need
to understand the cache effects on these instructions (<grin> not
documented well, I have an example on just this one topic).

So once our tables are setup then we need to actually turn the
|
|
MMU on. Now I cant figure out where I got this from, and I have
|
|
modified it in this repo. According to this manual it was with the
|
|
ARMv6 that we got the DSB feature which says wait for either cache
|
|
or MMU to finish something before continuing. In particular when
|
|
initializing a cache to start it up you want to clean out all the
|
|
entries in a safe way you dont want to evict them and hose memory
|
|
you want to invalidate everything, mark it such that the cache lines
|
|
are empty/available. Likewise that little bit of TLB caching the MMU
|
|
has, we want to invalidate that too so we dont start up the mmu
|
|
with entries in there that dont match our entries.
|
|
|
|
Why are we invalidating the cache in mmu init code? Because first we
|
|
need the mmu to use the d cache (to protect the peripherals from
|
|
being cached) and second the controls that enable the mmu are in the
|
|
same register as the i and d controls so it made sense to do both
|
|
mmu and cache stuff in one function.
|
|
|
|
So after the DSB we set our domain control bits. In this example I
have done something different: 15 of the 16 domains have the 0b11
setting, which is manager mode, dont fault on anything. I set domain 1
to 0b00 so that it has no access. Later in the example I will change
one of the descriptor table entries to use domain 1, then access it
and see the access violation. I am also programming both translation
table base addresses even though we are using the N = 0 mode and only
one is needed. Depends on which manual you read I guess as to whether
or not you see the N = 0 and the separate or shared i and d mmu tables.
(the reason for two is if you want your i and d address spaces to be
managed separately).
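To see where the domain register value comes from, here is a small
host side sketch in C (it runs on your desktop, not on the pi) that
builds the value described above and reads a field back out. The
function and enum names are made up for this illustration, they are
not part of this repo.

```c
#include <assert.h>
#include <stdint.h>

/* Access field values for one domain in the domain access control register. */
enum { DOMAIN_NO_ACCESS = 0x0, DOMAIN_CLIENT = 0x1, DOMAIN_MANAGER = 0x3 };

/* Build a register value with every domain set to manager except one,
   which gets no access. Domain n occupies bits 2n+1:2n. */
static uint32_t dacr_manager_except ( unsigned int domain )
{
    assert(domain < 16);
    uint32_t value = 0xFFFFFFFFu;               /* 0b11 in all 16 fields */
    value &= ~((uint32_t)0x3 << (domain * 2));  /* clear the field for this domain */
    return value;
}

/* Read back the 2 bit access field for a domain. */
static unsigned int dacr_field ( uint32_t dacr, unsigned int domain )
{
    assert(domain < 16);
    return (dacr >> (domain * 2)) & 0x3;
}
```

With domain 1 as the odd one out, dacr_manager_except(1) clears bits
3:2, which is the same thing an mvn of 0 followed by a bic of 0xC does.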
|
|
|
|
Understand I have been running on ARMv6 systems without the DSB and it
just works, so maybe that is dumb luck...

This code relies on the caller to pass in the MMU enable and the I and
D cache enables. This is because it is derived from code where
sometimes I turn things on and sometimes I dont, and I wanted it
generic.
|
|
|
|
|
|
.globl start_mmu
start_mmu:
    mov r2,#0
    mcr p15,0,r2,c7,c7,0 ;@ invalidate caches
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??

    mvn r2,#0 ;@ 0xFFFFFFFF, manager for all 16 domains
    bic r2,#0xC ;@ clear bits 3:2, no access for domain 1
    mcr p15,0,r2,c3,c0,0 ;@ domain

    mcr p15,0,r0,c2,c0,0 ;@ translation table base 0
    mcr p15,0,r0,c2,c0,1 ;@ translation table base 1

    mrc p15,0,r2,c1,c0,0 ;@ read the control register
    orr r2,r2,r1 ;@ set the enables the caller passed in
    mcr p15,0,r2,c1,c0,0 ;@ write it back

    bx lr
|
|
|
|
I am going to mess with the translation tables after the MMU is
started, so the easiest way to deal with the TLB is to invalidate it.
There is no need to mess with the main L1 cache. ARMv6 introduces a
feature to help with this, but I am going with this solution.

.globl invalidate_tlbs
invalidate_tlbs:
    mov r2,#0
    mcr p15,0,r2,c8,c7,0 ;@ invalidate tlb
    mcr p15,0,r2,c7,c10,4 ;@ DSB ??
    bx lr
|
|
|
|
Something to note here. Debugging using the JTAG based on chip
debugger makes life easier than removing sd cards or, in the old days,
pulling an eeprom out and putting it in an eraser then a programmer.
BUT, it is not completely without issue. When and where and if you hit
this depends heavily on the core you are using, the jtag tools, and the
commands you remember/prefer. The basic problem is that caches can and
often do separate instruction (I) fetches from data (D) reads and
writes. Say test run A of a program has executed the instruction at
address 0xD000, so that instruction is in the I cache. You have also
executed the instruction at 0xC000 but it has been evicted. You dont
actually know what is in the I cache or not, and shouldnt even try to
assume. You stop the processor and write a new program to memory. These
are data (D) writes and go through the D cache. Then you set the start
address and run again. Now there are a number of combinations here and
only one of them works, the rest can lead to failure.

For each instruction/address in the program, if the prior instruction
at that address was in the i cache, then since data writes do not go
through the i cache, the new instruction for that address is either in
the d cache or in main ram. When you run the new program you will get
the stale/old instruction from the prior run when you fetch that
address (unless an invalidate happens; a flush would mean a write
back, but why would an I cache flush?), and if the new instruction at
that address is not the same as the old one, unpredictable results
will occur. You can start to see the combinations: did the data write
stop in the d cache or go through to ram, will it flush to ram, is the
i cache invalid for that address, etc.
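A toy model can make the stale instruction problem concrete. This is
only a sketch that runs on a desktop. It assumes the simplest case,
where data writes land straight in ram and the i cache never snoops
them. The struct and function names are invented for the illustration
and a real cache is far more involved.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS 4

/* Toy model: main ram, plus an i cache that remembers what it fetched. */
struct toy {
    uint32_t ram[WORDS];
    uint32_t icache[WORDS];
    int      icache_valid[WORDS];
};

/* Instruction fetch: a hit returns the cached word, a miss fills from ram. */
static uint32_t toy_fetch ( struct toy *t, unsigned int index )
{
    assert(index < WORDS);
    if(!t->icache_valid[index])
    {
        t->icache[index] = t->ram[index];
        t->icache_valid[index] = 1;
    }
    return t->icache[index];
}

/* Data write, as a debugger download would do: never touches the i cache. */
static void toy_data_write ( struct toy *t, unsigned int index, uint32_t word )
{
    assert(index < WORDS);
    t->ram[index] = word;
}

/* Invalidate the whole i cache. */
static void toy_invalidate_icache ( struct toy *t )
{
    unsigned int i;
    for(i = 0; i < WORDS; i++) t->icache_valid[i] = 0;
}
```

Write program A, fetch it, write program B to the same index, and
toy_fetch still hands back the word from A until
toy_invalidate_icache is called.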
|
|
|
|
There is also the question of whether the I and D caches are shared.
They can be, but that is specific to both the core and your setup.
Also, does the jtag debugger have the ability to disable the caches,
has it done it for you, can you do it manually?

Any time you are using the i or d caches you need to be careful using
a jtag debugger, or even a bootloader type approach depending on its
design, as you might end up doing data writes of instructions and going
around the i cache, or worse. So for this kind of work a chip reset and
a non volatile rom/flash based bootloader can/will save you a lot of
headaches. If you know your debugger is solving this for you, great,
but always make sure. As you change from the raspi 2 back to a raspi 1,
for example, it might not be doing it, and it will drive you nuts when
you keep downloading a new program and it either crashes in a strange
way or simply keeps running the old program and does not appear to
take your new changes.
|
|
|
|
So the example is going to start with the mmu off and write to
addresses in four different 1MB address spaces, so that later we can
play with the section descriptors and demonstrate virtual to physical
address conversion.

So write some stuff and print it out on the uart.

PUT32(0x00045678,0x00045678);
PUT32(0x00145678,0x00145678);
PUT32(0x00245678,0x00245678);
PUT32(0x00345678,0x00345678);

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
|
|
|
|
Then set up the mmu with at least those four sections and the
peripherals.

mmu_section(0x00000000,0x00000000,0x0000|8|4);
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
//peripherals
mmu_section(0x20000000,0x20000000,0x0000); //NOT CACHED!
mmu_section(0x20200000,0x20200000,0x0000); //NOT CACHED!

And start the mmu with the I and D caches enabled.

start_mmu(MMUTABLEBASE,0x00000001|0x1000|0x0004);
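The three numbers or'ed together in that call are individual bits of
the cp15 control register. Here is a small sketch that names them. The
macro and function names are mine, not something defined in this repo,
and it only composes the value, it does not touch any hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of the cp15 control register (c1) used by the example. */
#define CTRL_MMU_ENABLE    0x00000001u  /* bit 0  */
#define CTRL_DCACHE_ENABLE 0x00000004u  /* bit 2  */
#define CTRL_ICACHE_ENABLE 0x00001000u  /* bit 12 */

/* Compose the value handed to start_mmu as its second argument. */
static uint32_t control_bits ( int mmu, int dcache, int icache )
{
    uint32_t bits = 0;
    assert(mmu || !dcache);  /* as noted earlier, the d cache needs the mmu */
    if(mmu)    bits |= CTRL_MMU_ENABLE;
    if(dcache) bits |= CTRL_DCACHE_ENABLE;
    if(icache) bits |= CTRL_ICACHE_ENABLE;
    return bits;
}
```

control_bits(1,1,1) gives the same value as 0x00000001|0x1000|0x0004,
and control_bits(1,0,0) would turn on the mmu with both caches left
off.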
|
|
|
|
Then if we read those four addresses again we get the same output
as before, since we mapped virtual = physical.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);
|
|
|
|
But what if we swizzle things around? Make virtual 0x001xxxxx =
physical 0x003xxxxx, 0x002 looks at 0x000, and 0x003 looks at 0x001
(dont mess with the 0x00000000 section, that is where our program is
running).

mmu_section(0x00100000,0x00300000,0x0000);
mmu_section(0x00200000,0x00000000,0x0000);
mmu_section(0x00300000,0x00100000,0x0000);

We just read through the old mappings, so they may still be sitting in
the TLB. Invalidate it before reading again.

invalidate_tlbs();

Read them again.

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

The 0x000xxxxx entry was not modified so we get 00045678 as the output,
but the 0x001xxxxx read is now coming from physical 0x003xxxxx so we
get the 00345678 output. 0x002xxxxx comes from the 0x000xxxxx space so
that read gives 00045678, and 0x003xxxxx is mapped to physical
0x001xxxxx giving 00145678 as the output.
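The swizzle can be checked on a desktop with a small model of the
first level table. This is a sketch under assumptions: it assumes
mmu_section builds a standard ARMv6 section descriptor (physical base
in bits 31:20, full access AP bits, type 0b10), and the names
model_section and model_translate are made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Host side stand-in for the first level table: one entry per 1MB section. */
static uint32_t l1_table[4096];

/* Section descriptor: physical base in bits 31:20, AP bits set to full
   access (0xC00), the caller's flags, and 0b10 for "section". */
static void model_section ( uint32_t vadd, uint32_t padd, uint32_t flags )
{
    l1_table[vadd >> 20] = (padd & 0xFFF00000u) | 0xC00u | flags | 2u;
}

/* Translate the way the mmu does for a section: the top 12 bits come
   from the descriptor, the low 20 bits pass through unchanged. */
static uint32_t model_translate ( uint32_t vadd )
{
    uint32_t desc = l1_table[vadd >> 20];
    assert((desc & 3u) == 2u);  /* this sketch only handles section entries */
    return (desc & 0xFFF00000u) | (vadd & 0x000FFFFFu);
}
```

Filling in the four swizzled sections and translating 0x00145678 lands
on physical 0x00345678, which is why that read prints the value that
was stored there before the mmu was turned on.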
|
|
|
|
So up to this point the output looks like this.

DEADBEEF
00045678
00145678
00245678
00345678

00045678
00145678
00245678
00345678

00045678
00345678
00045678
00145678

The first blob is without the mmu enabled, the second is with the mmu
but virtual = physical, and in the third we use the mmu to show
virtual != physical for some ranges.
|
|
|
|
Now for some small pages, I made this function to help out.

unsigned int mmu_small ( unsigned int vadd, unsigned int padd, unsigned int flags, unsigned int mmubase )
{
    unsigned int ra;
    unsigned int rb;
    unsigned int rc;

    ra=vadd>>20; //top 12 bits index the first level table
    rb=MMUTABLEBASE|(ra<<2);
    rc=(mmubase&0xFFFFFC00)/*|(domain<<5)*/|1; //coarse table descriptor
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //first level descriptor
    ra=(vadd>>12)&0xFF; //next 8 bits index the coarse table
    rb=(mmubase&0xFFFFFC00)|(ra<<2);
    rc=(padd&0xFFFFF000)|(0xFF0)|flags|2; //small page, AP bits all 0b11
    //hexstrings(rb); hexstring(rc);
    PUT32(rb,rc); //second level descriptor
    return(0);
}

So before turning on the mmu some physical addresses were written
with some data. The function takes the virtual address, the physical
address, the flags, and where you want the secondary table to be.
Remember a coarse secondary table is 256 entries of 4 bytes each, so
1K in size, and is aligned on a 1K boundary.
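It can help to work one call through by hand before trusting it on
hardware. The sketch below does the same arithmetic as mmu_small but
returns the two writes instead of performing them. MMUTABLEBASE is
assumed to be 0x00004000 here, use whatever your build defines, and
the struct and function names are only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MMUTABLEBASE 0x00004000u  /* assumed value for this sketch */

struct small_page_writes {
    uint32_t l1_addr, l1_desc;  /* first level: where and what */
    uint32_t l2_addr, l2_desc;  /* second level: where and what */
};

/* Same arithmetic as mmu_small, but returns the two writes instead of
   doing them with PUT32. */
static struct small_page_writes small_page_plan ( uint32_t vadd, uint32_t padd,
    uint32_t flags, uint32_t mmubase )
{
    struct small_page_writes w;
    assert((mmubase & 0x3FFu) == 0);  /* coarse tables sit on a 1K boundary */
    w.l1_addr = MMUTABLEBASE | ((vadd >> 20) << 2);
    w.l1_desc = (mmubase & 0xFFFFFC00u) | 1u;
    w.l2_addr = (mmubase & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2);
    w.l2_desc = (padd & 0xFFFFF000u) | 0xFF0u | flags | 2u;
    return w;
}
```

For vadd 0x0AA45000, padd 0x00145000, no flags, and a table at 0x400,
the first level index is 0x0AA and the second level index is 0x45.
Under the assumed table base that puts 0x00000401 at 0x000042A8 and
0x00145FF2 at 0x00000514.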
|
|
|
|
|
|
mmu_small(0x0AA45000,0x00145000,0,0x00000400);
mmu_small(0x0BB45000,0x00245000,0,0x00000800);
mmu_small(0x0CC45000,0x00345000,0,0x00000C00);
mmu_small(0x0DD45000,0x00345000,0,0x00001000);
mmu_small(0x0DD46000,0x00146000,0,0x00001000);
//put these back
mmu_section(0x00100000,0x00100000,0x0000);
mmu_section(0x00200000,0x00200000,0x0000);
mmu_section(0x00300000,0x00300000,0x0000);
invalidate_tlbs();
|
|
|
|
Now why did I use different secondary table addresses most of the
time but not all of the time? The top 12 bits of the virtual address
select the first level descriptor, and that descriptor points at one
secondary table. If the top 12 bits of the address are different, it
is a different first level descriptor and so can be a different
secondary table. To demonstrate that we actually have separation
within a section, I have two small pages within one 1MB section that I
point at two different physical address spaces. So in short, if the top
12 bits of the virtual address are the same then they share the same
coarse page table. The way the function works, it writes both the
first and second level descriptors, so if you were to do this

mmu_small(0x0DD45000,0x00345000,0,0x00001000);
mmu_small(0x0DD46000,0x00146000,0,0x00001400);

then both of those virtual addresses would go to the 0x1400 table. The
first virtual address would not have a secondary entry there. Its
secondary entry would be in the table at 0x1000, but the first level
descriptor no longer points to 0x1000, so the mmu would get whatever
it finds in the 0x1400 table.
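That pitfall is easy to reproduce in a desktop model of the two level
walk. This is a rough sketch: MMUTABLEBASE at 0x00004000 and the 32KB
of fake ram are assumptions, and the model_ names are invented here.
The fake ram starts out zeroed, so the orphaned lookup shows up as a
fault. Real ram holds whatever was there before, which is worse.

```c
#include <assert.h>
#include <stdint.h>

#define MMUTABLEBASE 0x00004000u   /* assumed, as is the size of fake_ram */
static uint32_t fake_ram[0x8000 / 4];

static void put32 ( uint32_t addr, uint32_t data )
{
    assert(addr < 0x8000u);
    fake_ram[addr >> 2] = data;
}

static uint32_t get32 ( uint32_t addr )
{
    assert(addr < 0x8000u);
    return fake_ram[addr >> 2];
}

/* The mmu_small arithmetic from the text, writing into fake_ram. */
static void model_small ( uint32_t vadd, uint32_t padd, uint32_t flags, uint32_t mmubase )
{
    put32(MMUTABLEBASE | ((vadd >> 20) << 2), (mmubase & 0xFFFFFC00u) | 1u);
    put32((mmubase & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2),
          (padd & 0xFFFFF000u) | 0xFF0u | flags | 2u);
}

/* Walk both levels. Returns 1 and fills *padd on success, 0 where the
   real mmu would raise a translation fault. */
static int model_walk ( uint32_t vadd, uint32_t *padd )
{
    uint32_t l1 = get32(MMUTABLEBASE | ((vadd >> 20) << 2));
    if((l1 & 3u) != 1u) return 0;   /* not a coarse table entry */
    uint32_t l2 = get32((l1 & 0xFFFFFC00u) | (((vadd >> 12) & 0xFFu) << 2));
    if((l2 & 3u) != 2u) return 0;   /* not a small page entry */
    *padd = (l2 & 0xFFFFF000u) | (vadd & 0xFFFu);
    return 1;
}
```

Map both 0x0DD45000 and 0x0DD46000 through the table at 0x1000 and
both walks succeed. Re-map only 0x0DD46000 through 0x1400 and the walk
for 0x0DD45000 now reads an entry in the 0x1400 table that nobody
wrote.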
|
|
|
|
|
|
The last example is just demonstrating an access violation, changing
the domain of one section to that one domain we did not give full
access to.

//access violation.

mmu_section(0x00100000,0x00100000,0x0020);
invalidate_tlbs();

hexstring(GET32(0x00045678));
hexstring(GET32(0x00145678));
hexstring(GET32(0x00245678));
hexstring(GET32(0x00345678));
uart_send(0x0D); uart_send(0x0A);

The 0x00045678 read still works since its section was not touched. The
0x00145678 read goes through the first level descriptor we just
changed, with that domain, and this is all we get.

00045678
00000010
|
|
|
|
How do I know what that output means? Well, in my blinker07 example we
touched on exceptions (interrupts). I made a generic test fixture such
that anything other than a reset prints something out and then hangs.
In no way shape or form is this a complete handler, but what it does
show is that it is the exception at address 0x00000010 that gets hit,
which is data abort. Having figured out it was a data abort (pretty
much expected), I then had the handler read the fault status
registers. Being a data access, we expect the data/combined one to
show something and the instruction one to not. Adding that
instrumentation resulted in

00045678
00000010
00000019
00000000
00008110
E5900000
00145678

Now I switched to the ARM1176JZF-S Technical Reference Manual for more
detail. In the 00000019 value the 0x1 in bits 7:4 is domain 1, the
domain we used for that access, and the 0x9 in bits 3:0 means Domain
Section Fault.
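Picking the fault status value apart is just shifts and masks. Here is
a sketch of a decoder for the fields used above. The names are
invented for this illustration and the table of status codes is
deliberately short, check the TRM for the full list.

```c
#include <assert.h>
#include <stdint.h>

struct fault_status {
    unsigned int domain;  /* bits 7:4 */
    unsigned int status;  /* bits 3:0, plus bit 10 as the fifth status bit */
};

static struct fault_status decode_dfsr ( uint32_t dfsr )
{
    struct fault_status fs;
    fs.domain = (dfsr >> 4) & 0xFu;
    fs.status = (dfsr & 0xFu) | (((dfsr >> 10) & 1u) << 4);
    return fs;
}

/* Only the handful of status codes relevant to this example. */
static const char *status_name ( unsigned int status )
{
    assert(status < 0x20u);  /* five status bits */
    switch(status)
    {
        case 0x5: return "translation fault, section";
        case 0x7: return "translation fault, page";
        case 0x9: return "domain fault, section";
        case 0xB: return "domain fault, page";
        case 0xD: return "permission fault, section";
        case 0xF: return "permission fault, page";
        default:  return "see the TRM";
    }
}
```

Feeding it the 0x00000019 from the output gives domain 1 and status
0x9, the domain fault on a section.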
|
|
|
|
The lr during the abort leads us to the instruction, which you would
need to disassemble to figure out the address, or at least that is
one way to do it (there is also a fault address register in cp15 that
holds the address of the faulting access).

The instruction and the address match our expectations for this fault.
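As a cross-check, the lr and instruction values from the output can be
decoded on a desktop. This sketch only understands the plain word load
with an immediate offset, which is all this example needs. The struct
and function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* On a data abort in ARM state lr holds the address of the aborted
   instruction plus 8. */
static uint32_t aborted_instruction_address ( uint32_t abort_lr )
{
    return abort_lr - 8u;
}

struct ldr_imm {
    unsigned int rd;      /* destination register */
    unsigned int rn;      /* base register */
    int          offset;  /* signed immediate offset */
};

/* Decode the plain "ldr rd,[rn,#imm]" form (word load, immediate
   offset, no writeback). Returns 0 for anything else. */
static int decode_ldr_imm ( uint32_t inst, struct ldr_imm *out )
{
    assert(out != 0);
    /* bits 27:25 = 010, P=1 (bit 24), B=0 (bit 22), W=0 (bit 21), L=1 (bit 20) */
    if((inst & 0x0F700000u) != 0x05100000u) return 0;
    out->rn = (inst >> 16) & 0xFu;
    out->rd = (inst >> 12) & 0xFu;
    out->offset = (int)(inst & 0xFFFu);
    if(!(inst & 0x00800000u)) out->offset = -out->offset;  /* U bit clear: subtract */
    return 1;
}
```

With an lr of 0x00008110 the aborted instruction sits at 0x00008108,
and 0xE5900000 decodes as ldr r0,[r0] with a zero offset, so the
address that faulted is whatever r0 held going in.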
|
|
|
|
This is simply a basic intro. Just enough to be dangerous. The MMU
is one of the simplest peripherals to program so long as bit
manipulation is not something that causes you to lose sleep. What makes
it hard is that if you mess up even one bit, or forget even one thing,
you can crash in spectacular ways (often silently, without any way of
knowing what happened). Debugging can be hard at best.

The ARM ARM indicates that the ARMv6 adds the feature of separating
the I and D from an mmu perspective, which is an interesting thought
(see the jtag debugging comments, and think about how this can affect
you re-loading a program into ram and running). You have enough ammo
to try that. The ARMv7 doesnt seem to have a legacy mode, but I am
still reading. The descriptors and how they are addressed look
basically the same, but this code doesnt yet work on the raspi 2, so I
will continue to work on that and update this repo when I figure it
out.
|
|
|
|
|
|
|
|
|
|
|