cs357 Lecture Notes
Spring 2000
Week 3, Thursday

For today you should have read Chapter 3, and also
picked up the HW2 code, compiled and run it, and
started to think about what to do with the data.

For next time you should work on the practice exam,
finish hw2, and get ready for the exam.

Also, the first exam is next Tuesday!  It will cover Chapters
1-3 and the first homework.  It may not take the whole class
period.


Interprocess protection (continued)
-----------------------

I/O protection
--------------

Special instructions?  Make them priviledged.

Memory-mapped I/O?  Protect that part of memory.

Trap/interrupt vector?  Protect the instructions that write it.


Memory protection
-----------------

Book gives an example using a memory scheme with special
registers for base and bounds.

This is not a common scheme compared to virtual memory,
but it makes for a simple example of how to use dual-mode
operation to build memory protection.

Before starting a user process, OS

1) sets the base and bound registers

   (remember that registers are on chip.  setting registers
   is usually an instruction in the ISA.  It would have to
   be protected.

2) sets the mode bit to USER

3) jumps to a known location in the user code

Is it ok for the user process to _read_ the base and bound
registers (assuming there is an operation that does it)?

Every time the user code makes a memory access, the hardware
checks the base and bounds, causing a trap if the address is
out of bounds.


CPU protection
--------------

How do we make sure the user code eventually returns control
to the OS?

One way is by making a system call.

Since the address of the system call is in system memory,
trying to jump there causes a trap.

Another way is if an interrupt occurs while the user code
is running.

But what if neither one happens for a really long time?

What we need is the ability to generate an interrupt at
a given future time.

The timer does that.  The OS can write a value into the
timer that is decremented in real time.

When it gets to zero, the hardware causes an interrupt,
which returns control of the CPU to the OS.

What does the OS do if there are no other processes to run?


Memory caches reviewed
----------------------

Direct-mapped:

For every memory location, there is exactly one place
in the cache it can go.

For every cache location, there are many possible values
that might be there.

The tag indicates which of the possible values is there.

If two memory locations collide (map to the same cache
location) only one of them can be resident at a time.


Fully-associative:

Any memory location can go anywhere in the cache.

The tags have to be bigger, because there are more possible
values that might be in any location.

If the cache is full, we have to make a decision about
which value to remove (usually random or a coarse approximation
of LRU)

Collisions are less deterministic, pathological behavior
rarer, overall performance better.

Price: bigger tags, more complicated circuitry, smaller
for a given number of gates.


Set associative:

Multiple small fully-associative caches.

Direct mapping from memory locations to caches, but
multiple locations within each cache where the value
can go.


Experimental design
-------------------

What we want to know:

1) how big are the L1 and L2 caches?

2) what is the block size for transferring things from
   memory to L2, L2 to L1, memory to L1?

3) what is the associativity of the caches?


Eventually, I will ask you to design the experiment for
yourself, but this time I am giving you the experimental
apparatus and asking you to interpret the data.

Here's how you should go about it.

1) Start with a simple experiment and figure out what
   behavior you expect in theory.

2) Look at the real data and compare it with your
   theoretical expectations.

3) Gradually make the experiment more complicated.


Simple experiment
-----------------

Create a big array and traverse it many times, reading
and writing each word (4 bytes) once per traversal.

Calculate the average time per read/write

	  (How to measure something small.
	  Repeat many times and divide
	  by the number of repetitions.)

	  (How to handle loop overhead)

What performance do you expect when the array is smaller
than the L1 cache?

What about when it is bigger than the L2 cache?

What about in between?

How would you plot the data in order to test this theory?


Next experiment
---------------

What happens when 

1) the size of the array is bigger than the L1 cache,
   but the stride is two words?  four words?

What happens when the stride is the same as the block size?


And the process repeats from there.  Add something to the
experiment, look at results, check for anomalies and see
if you can explain them, repeat.


This is, by the way, a good way to present the results,
as a series of experiments like this.