Advanced Operating Systems Notes

These are some notes I took while taking Advanced Operating Systems (COMP9242, 19T2) at UNSW. Since the lectures cover a variety of somewhat obscure topics, I'm posting them here so I don't forget about them later in my career!

Disclaimer: These notes are my own, and they may not fully capture the intentions of the lecturers or the content of the course.

The lecture slides are Creative Commons, and they are available here: https://www.cse.unsw.edu.au/~cs9242/19/lectures.shtml


Lecture Week 1 Monday

“Introduction”

Capabilities

  • Invoke a method on a capability (like an object)

CSpace and CNodes

  • A CSpace represents Mapping[OID, AccessRight]
  • Each CNode is an array of capability “slots”
  • Capabilities: Read, Write, Execute, Grant
  • Operations on Capabilities: Invoke, Copy/Mint/Grant (Duplicate super/lesser), Move/Mutate (Transfer), Delete, Revoke

IPC (Endpoints)

  • More like “Cross-Address-Space Invocation”
  • Endpoints: Strictly synchronous
    • Receiver must be ready
    • This means data is only copied once
    • There is no buffering
    • No asynchronous kernel buffering
  • Do not use IPC for shipping bulk data (copying is not free, so don't use it as a copy mechanism)
  • It is:
    • Protected procedure call
    • User-initiated context switch (because it’s synchronous in lock-step)
  • If the recipient is not free (and highest priority), the endpoint queues senders in FIFO order
  • Think of Endpoints as a Producer-Consumer Pipe, One-Way
  • Need two endpoints for a sane two-way communication
  • How to reply:
    • Call automatically generates a Reply capability for the server
    • Single-use
    • The reply travels via a separate channel created specially for this Call
    • Hence, effectively two endpoints for one Call
  • Call is atomic and cannot be reproduced by a Send followed by a Receive
    • No automatic creation of the reply channel
    • The client could be preempted between the Send and the Receive (e.g. by a higher-priority server), breaking the protocol
    • See the client/server sketch below
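
A minimal sketch of the Call/ReplyRecv pattern in C, assuming an endpoint capability ep already sits in both CSpaces (function and slot names are illustrative, not from the course):

    #include <sel4/sel4.h>

    /* Client: Call sends the message, mints a one-shot reply cap for the
       server, and blocks until the reply arrives. */
    seL4_Word client_add(seL4_CPtr ep, seL4_Word a, seL4_Word b) {
        seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 2);
        seL4_SetMR(0, a);              /* arguments travel in virtual registers */
        seL4_SetMR(1, b);
        seL4_Call(ep, info);
        return seL4_GetMR(0);          /* result comes back the same way */
    }

    /* Server: reply to the previous client and wait for the next request
       in one system call. */
    void server_loop(seL4_CPtr ep) {
        seL4_Word badge;
        seL4_Recv(ep, &badge);
        for (;;) {
            seL4_SetMR(0, seL4_GetMR(0) + seL4_GetMR(1));
            seL4_ReplyRecv(ep, seL4_MessageInfo_new(0, 0, 0, 1), &badge);
        }
    }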

Badges

  • Maintaining session state for Endpoints without having every Server implement their own session mechanism
  • A Badged Capability is Minted from the original Capability and then given to a Client
  • This Badged Capability identifies a consumer of the Server (see the minting sketch below)
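
A sketch of minting in C; slot names and the badge value are placeholders:

    #include <sel4/sel4.h>

    /* Mint a badged copy of the server's endpoint cap into a free slot;
       the badged cap is what gets handed to the client. */
    void mint_badged_ep(seL4_CNode cspace_root, seL4_Word ep_slot,
                        seL4_Word badged_slot, seL4_Word badge) {
        seL4_CNode_Mint(cspace_root, badged_slot, seL4_WordBits,   /* dest */
                        cspace_root, ep_slot, seL4_WordBits,       /* src  */
                        seL4_AllRights, badge);
    }

    /* Server side: the badge pops out of Recv and identifies the client. */
    void serve_once(seL4_CPtr ep) {
        seL4_Word badge;
        seL4_Recv(ep, &badge);   /* badge == whatever was minted above */
    }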

Virtual Registers

  • “Argument” storage for Endpoints
  • Included in thread state

Notifications

  • Equivalent to an array of binary semaphores packed into a bit field
  • Can be polled (multiple, like select)
  • Maps nicely to interrupts
  • To avoid having to wait on an Endpoint and a Notification separately, a Notification can be bound to a TCB with seL4_TCB_BindNotification(), after which it is received like a message on the Endpoint (see the sketch below)
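
A binding sketch in C (cap names are placeholders):

    #include <sel4/sel4.h>

    /* After binding, a single Recv on the endpoint also returns
       notification signals, distinguishable by badge. */
    void bind_and_wait(seL4_TCB tcb, seL4_CPtr ntfn, seL4_CPtr ep) {
        seL4_TCB_BindNotification(tcb, ntfn);
        seL4_Word badge;
        seL4_Recv(ep, &badge);   /* message or signal: one wait point */
    }

    /* Elsewhere (e.g. another thread or an interrupt path): */
    void poke(seL4_CPtr ntfn) {
        seL4_Signal(ntfn);       /* like a V on the selected semaphore bits */
    }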

Lecture Week 1 Tuesday

  • Recap of similar lecture content from yesterday

Lecture Week 2 Tuesday

“Operating System Execution Models”

Note: Week 2 Monday was a public holiday

Interrupt Handling

  • IRQHandler and IRQControl
  • IRQControl can create IRQHandler
  • Applications are given an IRQHandler capability, which makes them device drivers (see the driver-loop sketch below)
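
A driver-loop sketch under this model; caps, slots, and the IRQ number are assumed to be set up elsewhere, and all names are illustrative:

    #include <sel4/sel4.h>

    void driver_loop(seL4_IRQControl irq_ctrl, seL4_CNode cspace_root,
                     seL4_Word handler_slot, seL4_IRQHandler irq_handler,
                     seL4_CPtr ntfn, int irq_num) {
        /* One-time setup: create an IRQHandler cap in handler_slot
           (irq_handler is the CPtr naming that slot), then attach a
           notification to it. */
        seL4_IRQControl_Get(irq_ctrl, irq_num, cspace_root,
                            handler_slot, seL4_WordBits);
        seL4_IRQHandler_SetNotification(irq_handler, ntfn);

        for (;;) {
            seL4_Word badge;
            seL4_Wait(ntfn, &badge);          /* interrupt arrives as a signal */
            /* ... service the device here ... */
            seL4_IRQHandler_Ack(irq_handler); /* allow the next interrupt */
        }
    }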

Execution Models, Threads, Coroutines

  • Events
    • Event loop and event handlers (see the sketch after this list)
    • Waits for events
    • Only requires a single stack (no yields)
    • How to handle user level code?
      • A kernel timer can be one of the events (this is the scheduler)
    • No blocking in kernel code
    • Only works on a uniprocessor
  • Coroutines
    • A stack per routine
    • Requires explicit yields
  • Threads
    • Preemptive multithreading
  • Continuations
    • Used in functional languages
    • Much higher-level than the above
    • Basically an explicit context switch, but in functional languages
    • Done by being given a “switch context back to caller” function and invoking it
    • Basically just multiple returns, in functional languages
    • Basically just Python generators
    • Can be used to build yield where it doesn't exist
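
A toy single-stack event loop in C, to make the events model concrete (everything here is illustrative):

    /* Handlers run to completion on the one stack: they may never block,
       and there is no yield. */
    typedef void (*handler_t)(void *arg);

    struct event {
        handler_t handler;
        void     *arg;
    };

    /* Would block waiting on timers, interrupts, etc.; a kernel timer event
       is how scheduling of user code gets driven. */
    extern struct event next_event(void);

    void event_loop(void) {
        for (;;) {
            struct event e = next_event();
            e.handler(e.arg);
        }
    }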

Per-Thread Kernel Stack

  • The UNIX model
  • The state (e.g. per syscall) is implicitly stored on the stack
  • Can postpone execution by yielding (regardless of execution model)
  • Preemption is easy – just nest on the stack

Single Kernel Stack

  • How is blocking handled?
    • Either continuations
    • Or stateless kernels
      • “All it does is it makes it hard to write code” – Kevin

Lecture Week 3 Monday

“Events and Threads”

Two competing papers guide us in choosing which model to use for our own OS project.

Why Threads Are A Bad Idea

(for most purposes)

  • 1995
  • Threads hard to program in
  • For most purposes events are better
  • Only use threads when true CPU concurrency needed
  • Threads require synchronisation
    • Which brings the risk of deadlock
  • Hard to debug (data dependencies, timing dependencies)

Why Events Are A Bad Idea

(for high-concurrency servers)

  • 2003
  • Proposes Thread-Event duality
  • No inherent problem with threads
  • Benchmarked a user-level threads package; still fast

Lecture Week 3 Tuesday

“Caches”

Cache Indexing

  • Cache lines
  • Set associative lookup
    • Tag | Set | Byte
  • 2-way associative
  • Fully associative
    • No indexing, just compare tag
    • Obviously very expensive to compare in parallel
  • Cache associativity vs Paging
    • Traditionally, the index bits used for cache indexing overlap both the page-number bits and the page-offset bits, so the index comes from both
    • Implications:
      • e.g. odd/even page number determines the top/bottom half of a 2-way associative cache
      • Pages fall into equivalence classes as far as caching is concerned
      • Gives rise to an implicit notion of page colouring – pages of different colours cannot compete for cache (see the sketch below)
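
A colour-computation sketch, assuming (illustrative numbers) a 32 KiB 2-way set-associative cache with 64-byte lines and 4 KiB pages:

    #define LINE_SIZE   64
    #define NUM_SETS    (32 * 1024 / (LINE_SIZE * 2))        /* 256 sets  */
    #define PAGE_SIZE   4096
    #define NUM_COLOURS ((NUM_SETS * LINE_SIZE) / PAGE_SIZE) /* 4 colours */

    /* Pages of different colours can never evict each other's cache lines. */
    static inline unsigned page_colour(unsigned long paddr) {
        return (paddr / PAGE_SIZE) % NUM_COLOURS;
    }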

Cache Misses

  • “The Four Cs”
    • Compulsory miss
      • First access
    • Capacity miss
      • Would not miss on infinite-size cache
    • Conflict miss
      • Would not miss on fully-associative cache
    • Coherence miss
      • Hardware coherence protocol
  • Cache replacement policy
    • Upon replacement, dirty cache lines must be flushed
    • Algorithm must be simple for speed
  • Cache write policy
    • Write-back / write-through
    • A store that misses cache
      • Write-allocate / no allocate
      • (Bypass cache on write cache-miss?)

Cache Addressing Schemes

  • Four possible schemes:
    • Virtually-indexed, virtually-tagged (VV)
    • Virtually-indexed, physically-tagged (VP)
    • Physically-indexed, virtually-tagged (PV) <= Usually nonsensical
    • Physically-indexed, physically-tagged (PP)
  • Virtually-indexed, virtually-tagged
    • Virtually-addressed cache
    • Various incorrect names
      • Virtual cache
      • Virtual address cache
    • Can operate concurrently with MMU
    • Pitfalls
      • Must be flushed on context switch, because lookups use virtual addresses
      • Permissions must still be checked via the MMU, so translation is fast but the MMU remains on the critical path
      • Rarely used these days
  • Virtually-indexed, physically-tagged
    • Indexing concurrent with MMU
    • Used for on-core L1
  • Physically-indexed, physically-tagged
    • Intel claims a physically-indexed L1 cache
    • Really, the cache size is picked so that a virtually-indexed cache behaves like one
    • If the index bits all fall within the page-offset bits, the index is invariant under translation

Cache Issues

  • Structure software for cache performance
  • If virtual addresses are used for indexing or tagging, the following issues arise:
  • Homonyms
    • Same address, different data
    • May access incorrect data!
    • Prevention: tag VA with ASID
  • Synonyms (aliases)
    • Different address, same data
    • May access stale data!
    • Not a problem for R/O data or I-caches
    • Late write-back of a stale alias can clobber newer data (a “cache bomb”)
    • Also DMA inconsistency issues kind of like synonym issues
    • Fixes
      • Flush cache – doesn't fix aliasing within an address space
      • Detect synonyms in either hardware or software
      • Or restrict VM mappings so that synonyms map to the same cache set
      • Rely on hardware 😛

Write Buffer

  • Stores take time to complete
  • Avoid stalling the CPU by buffering writes in a FIFO before they hit the cache
  • Gives rise to weak memory ordering
    • Local CPU can grab things out of local write-buffer, remote CPUs may not see this write-buffer

Translation Lookaside Buffer

  • TLB sizes have remained roughly constant historically

Lecture Week 4 Monday

“Virtualisation”

Virtual Machine

  • “Efficient, isolated duplicate of a real machine”
  • Duplicate
    • Should behave almost identically, except for resources or timing differences
  • Isolated
    • Multiple VMs co-hosted without interference
  • Efficient
    • Run on bare metal
  • Hypervisor / Virtual Machine Monitor
    • Interchangeable terms
    • Software layer implementing the virtual machine

Types of Virtualisation

  • Platform (Type 1) VM
  • System (Type 2) VM
  • OS-level VM
  • Programming Language VM
  • We’ll mostly look at Type 1, and some of Type 2

Why VMs?

  • First designed to multiplex the hardware before true time-sharing OS
  • Then reborn to solve much the same problem: poor resource isolation in mainstream OSes
  • Use case: Cloud computing
    • Many benefits
  • Type 2 has very high exception handling cost (e.g. signal handling)

Virtualisation Mechanics

  • Instruction Emulation
    • Trap-and-Emulate approach
      • Expensive (e.g. building a trap frame; see the dispatch sketch after this list)
      • But few instructions trap, so the cost is amortised
    • Hypercalls
      • Hypervisor-specific ISA?
    • Trap-and-Emulate requires specific host ISA support
      • Privileged and Sensitive Instructions
      • T&E virtualisable = All sensitive instructions are privileged
    • “Impure” virtualisation if T&E not supported
      • Binary translation of privileged instructions
        • Replace with trap “hypercall”
        • Insert in-line emulation for some others
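
A dispatch sketch of trap-and-emulate in C; every type and helper here is hypothetical:

    #include <stdint.h>

    struct vcpu {                  /* hypothetical virtual-CPU state */
        uint64_t guest_pc;
        uint64_t guest_pt_base;
        int      virt_irq_enabled;
    };

    enum insn_kind { INSN_DISABLE_IRQ, INSN_SET_PT_BASE, INSN_OTHER };

    extern uint32_t fetch_insn(struct vcpu *v, uint64_t pc);
    extern enum insn_kind decode(uint32_t insn);
    extern uint64_t operand(struct vcpu *v, uint32_t insn);
    extern void shadow_pt_rebuild(struct vcpu *v);
    extern void inject_fault(struct vcpu *v);

    /* On a privileged-instruction trap, emulate the instruction against
       the virtual CPU state instead of the real hardware. */
    void handle_priv_trap(struct vcpu *v) {
        uint32_t insn = fetch_insn(v, v->guest_pc);
        switch (decode(insn)) {
        case INSN_DISABLE_IRQ:
            v->virt_irq_enabled = 0;   /* touch virtual state, not real state */
            break;
        case INSN_SET_PT_BASE:
            v->guest_pt_base = operand(v, insn);
            shadow_pt_rebuild(v);      /* see address translation below */
            break;
        default:
            inject_fault(v);           /* illegal in the guest too */
            return;
        }
        v->guest_pc += 4;              /* skip the emulated instruction */
    }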

Virtualisation Address Translation

  • Two levels of address translation
  • Introduce a notion of “Guest Physical Memory”
  • But, must be implemented with single MMU translation
  • 1. Shadow Page Table
    • Hypervisor shadows (keeps a copy of) real guest page table
    • VMM resolves two page table layers at once
    • Only works on hardware-loaded page tables so the VMM can find out where the guest’s page table base pointer is
    • (Of course, this instruction is privileged so it’s trapped)
    • We write-protect the guest page table, so we trap every write to it
    • Shadow PT has TLB semantics (weak consistency)
      • Only gets synced at synchronisation points like a TLB
        • Page fault
        • TLB flush
  • 2. Real Page Table
    • Translate on every guest read/write of its page table, so the real PT holds actual physical mappings while the guest sees only guest-level mappings (see the lookup sketch after this list)
    • Instead of trapping on every read/write, do inline binary translation so we don’t trap
    • Or para-virtualise, use a hypercall
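
The two-level translation that these schemes collapse into one MMU walk, sketched with hypothetical one-call-per-level helpers:

    #include <stdint.h>

    extern uint64_t guest_pt_lookup(uint64_t gva);  /* guest's own page table */
    extern uint64_t host_pt_lookup(uint64_t gpa);   /* VMM's guest-PA -> host-PA map */

    /* The shadow PT caches the composition of these two mappings, so the
       hardware only ever walks one table. */
    uint64_t guest_va_to_host_pa(uint64_t gva) {
        uint64_t gpa = guest_pt_lookup(gva);
        return host_pt_lookup(gpa);
    }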

Guest Self-Virtualisation

  • Minimise traps by holding some virtual state inside guest
  • For example, interrupt handling virtual interrupt flag
  • Only if guest and VMM agree on using virtual interrupt flag
  • No need to trap interrupt flag edits

Device Models

  • Emulated, Split (Para-Virtualised), Passthrough
  • Emulated
    • Emulate device register accesses
    • Very slow, because requires trapping on every device access
    • If emulation is too slow and the real device has timing constraints, the abstraction breaks
    • For example, Linux does boot time speed checks
  • Split Driver
    • Must port drivers to hypervisor
    • Low overhead
    • Basically a driver RPC
    • Example: Xen Dom0
      • Trusted Computing Base runs “Driver OS”
  • Pass Through
    • Can have hardware support for I/O MMU for hardware guest separation
    • Can also share hardware itself if hardware itself supports it

ARM Virtualisation Extensions

  • Hyp Mode
  • Configurable Traps
    • Can configure the hardware to trap directly into the guest OS, instead of the hypervisor having to forward the trap to the guest OS
    • Similar for x86
  • Emulation support
    • Because of separation of I-Cache and D-Cache, we get a compulsory cache miss upon any instruction trap
    • HW decodes the trapped instruction for us, so we don't have to fetch it through the cache
  • 2-stage Translation
    • Hardware walks both guest and VMM page tables
    • Eliminates “Virtual TLB”
    • A bit slower, even though VMM PT is shallow
    • Worst case is massively slower, i.e. a page fault at every level of both tables
    • But generally a good idea
  • Virtual Interrupts
  • System MMU (I/O MMU)
    • Like an MMU for DMA

World Switch

  • Intel
    • VM state is <=4KiB
    • VMentry / VMexit
    • Hardware saves and restores
  • ARM
    • VM state is 488B
    • Save and restore done by hypervisor

KVM

  • KVM is not Type-2!
  • A “Hybrid Hypervisor”
  • Fundamentally a Type-1
  • Works by using the fact that virtualisation privilege is orthogonal to normal ring privilege
  • “Really bloody HUGE” Trusted Computing Base
  • Reuse Linux drivers

Fun Things with VMs

  • See Slide

Lecture Week 4 Tuesday

“Measuring and Analysing Performance”

Benchmarking in R&D

  • Conservative: No regression
  • Progressive: Actual improvement in important cases
  • Must analyse and explain results
    • Discuss model
    • Hypothesis for behaviour
    • Test and confirm hypothesis
  • Microbenchmarks (component) and Macrobenchmarks (full system)
  • Obtaining an overall score
    • Absolute score rarely matters
    • Compare systems relative to each other
    • BUT: the arithmetic mean is meaningless for relative measures
    • Use the geometric mean – invariant under normalisation (see the sketch below)
  • Benchmarking crime: reporting throughput degradation as “overhead”
    • Use a proper figure of merit: processing cost per unit of data
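
A sketch of the aggregate in C; normalising the scores to a different baseline just scales the result, so comparisons between systems are unaffected:

    #include <math.h>

    double geo_mean(const double *score, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(score[i]);   /* mean in log space ... */
        return exp(log_sum / n);        /* ... then back */
    }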

Profiling

  • Tool: gprof, but it can't be used inside the OS kernel
  • Performance Monitoring Unit (PMU)
    • Hardware-based profiling unit
    • Collects certain events
    • A counter that counts events
    • Linux: OProfile
    • Mainly CPU based events

Performance Analysis

Understanding Performance

  • Some benchmarking tables and how to interpret them

Lecture Week 5 Monday

“Real-Time Systems”

Real-Time Tasks

  • Event, Jitter, Release, Processing Time, Completion, Deadline
  • Deadlines are frequently implicit – process things without missing an event
  • Safety-critical: failure ⇒ death or serious injury
  • Mission-critical: failure ⇒ massive financial damage

Weakly-Hard Real-Time

  • Where occasional deadline misses can be compensated for
  • The cost of a miss doesn't go to infinity (i.e. it is recoverable, or probabilistically OK)
  • But what does it mean to have a probability here? Is it safe to base it on observed execution times?
  • To Gernot, probabilistic real-time is bogus, because you usually can't assign real probabilities

Firm Real-Time Systems

  • Typically represented by a gain function (lower latency is advantageous), not a cost function (upon failure)
  • For example, trading systems

Soft Real-Time

  • Tail latencies
  • For example, cloud providers, web services, video streaming
  • Characterised by Quality of Service
  • “Bounded tardiness” – just some term

Best-Effort

  • In practice, duration is rarely totally irrelevant

What does it mean for OS?

  • Designed to support real-time
    • Fast and predictable interrupt handling
  • Main duty is scheduling tasks to meet their deadline
  • RT vs OS terminology
    • Task vs Thread
    • Job vs Event
  • The problem is harder than bin-packing, because time is not fungible
  • Feasible – there is a known algorithm that will schedule the task set
  • Optimal – can schedule any task set that is feasible

On-Line RT Scheduling

  • Execution order is not predetermined
  • Fixed priority scheduling – L4
    • Scheduler always runs highest-priority thread
    • Will preempt to run a higher-priority thread
  • Rate Monotonic Priority Assignment (RMPA)
    • Period, Rate, Priority, Utilisation
    • Schedulability test guarantee
      • Guaranteed if utilisation U ≤ n(2^(1/n) − 1), which tends to ln 2 ≈ 69.3% as the number of tasks n → ∞
      • Sufficient but not necessary (a task set can get “lucky”)
    • RMPA is optimal for FPS
  • Dynamic priorities: Earliest Deadline First
    • Also a schedulability-test guarantee: U ≤ 100%
    • Assumption: All the tasks are completely independent of each other
    • Not recommended by Gernot

Resource Sharing

  • Locking vs Delegation
    • RT Terminology: Resource Server
  • How to implement delegation for seL4
    • seL4 IPC for single-core (Hoare-style monitor), notifications for multi-core (semaphores)
  • Problem: Priority Inversion
    • May have multiple layers of dependencies, if high-priority thread depends on low-priority
    • How do you minimise it?
  • Solution 1: Priority inheritance (“Helping”)
    • Do what’s most beneficial for highest-priority task
    • Bump priority of subtasks to the priority of the waiting thread
    • Transitive inheritance => Possibility for long blocking chains
    • Potential deadlocks
    • Messy, easy to get wrong for safety-critical systems
  • Solution 2: (Immediate) Priority Ceiling Protocol (IPCP)
    • Predefined priority associated with a shared resource (lock)
    • Make thread exit critical section as soon as possible (use “ceiling” priority)
    • Each task must declare resources at admission time
      • In a real time system, this is okay to require
      • Easy to enforce with capabilities
    • FPS: bump priority to the ceiling; EDF: use the floor of the users' deadlines (see the sketch below)
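
A sketch of IPCP under FPS, with hypothetical priority-management helpers:

    extern int  my_priority(void);
    extern void set_my_priority(int prio);

    typedef struct {
        int ceiling;      /* highest priority of any task using this resource */
        int saved_prio;
    } ipcp_lock_t;

    void ipcp_enter(ipcp_lock_t *l) {
        l->saved_prio = my_priority();
        set_my_priority(l->ceiling);  /* no other user of the resource can
                                         preempt us now, so on a uniprocessor
                                         no further locking is needed */
    }

    void ipcp_exit(ipcp_lock_t *l) {
        set_my_priority(l->saved_prio);
    }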

Scheduling Overloaded RT Systems

  • Cannot trust everything to obey declared worst-case execution time
  • Which job will miss the deadline?
  • Fixed Priority Scheduling: the loser is well-defined
  • Priority and criticality: in FPS the two are conflated
  • In EDF, priority is implicit (derived from deadlines), so there is no notion of criticality, and the loser is not well-defined

Mixed-Criticality Systems

  • High priority but low criticality tasks (device drivers)
  • Low priority but high criticality tasks (control loops)

Lecture Week 5 Tuesday

“Microkernel Design & Implementation”

“If you find yourself hand waving a lot this is a sure sign you’re missing a diagram” – Gernot

“Security is no excuse for bad performance” – Gernot

Traditional L4 and seL4 Improvements

  • Recursive address spaces
    • If you have a mapping, you can just give it to any other process through IPC
    • Some issues exist with this model apparently
  • Thread IDs as IPC addressing
    • seL4: Mailbox model with badges; communication partners are anonymous
    • Badges are also more useful, as they have more flexible semantics
    • Traditional: Forces you to have a load balancer thread, which complicates programming
    • seL4: Transfer endpoint capability upfront for load balancing, get them back later, like a kernel-managed queue for workers
    • Traditional: no interposing allowed, lest it violate the authenticity of thread-ID-based authentication
  • Long IPC
    • Kernel-managed zero-copy IPC
    • Page faults mid-copy? Handling them takes ages in a microkernel…
    • Very complicated
    • Don't need the kernel for this: a trusted third party with access to both processes' mappings can provide the same service
  • Timeouts
    • IPC calls with timeouts
    • How does a user determine the timeout values? Only really reasonable in networking
    • seL4: New semantics for call/reply reduces this problem for server/client cases
  • IPC Fastpath
    • 1. Prologue
    • 2. Identify destination
      • Check caps
    • 3. Get receiver TCB
      • Check priority
    • 4. Mark sender blocked and enqueue
      • Create reply cap
    • 5. Switch to receiver
      • Leave Message Registers untouched
    • 6. Epilogue
    • Notice:
      • No scheduler invocation, just direct lookup
      • No time-slice donation
  • Kernel runs with interrupts disabled
    • No concurrency control means simpler kernel
      • Easier reasoning about correctness
      • Better average case
    • Long-running system calls? Bad.
    • Interrupt latency? Bad.
    • Most protected-mode RTOS are fully preemptible
      • Just save registers (trapframe) and re-enable interrupts
      • According to Gernot, preemption during IPC makes the code very hard to guarantee correctness, not a good idea for high assurance environments
      • It’s also very slow
    • How do we ensure liveness if we don't enable interrupts in the kernel?
    • Incremental Consistency Paradigm
      • Break long operations into many O(1) suboperations, re-establish kernel consistency after each, and poll for pending interrupts in between
      • Consistency, restartable progress
    • Worst-Case Execution Time Analysis
      • 1. Program Binary
      • 2. Control-flow graph, Micro-architecture model, Loop bounds
      • 3. Analysis tool
      • 4. Integer linear equations, Infeasible path (mutually exclusive paths) info
      • 5. ILP solver
      • 6. WCET
    • ARM stopped providing WCET information…
    • RISC-V implementations may supply them!
  • Scheduler optimisation – Lazy scheduling
    • Frequent blocking and unblocking in microkernel model
    • Don’t want to modify scheduler queue at every IPC
    • Leave blocked threads in ready queue, leave for scheduler to clean up
    • Worst case unbounded scheduler operation! O(n) for threads in system
    • Not the way to do real time
    • Now uses Benno Scheduling
  • Scheduler optimisation – Direct Process Switch
    • Idea: don't invoke the scheduler if you know who'll be chosen, i.e. if the receiver has higher priority
    • Implication: because the scheduler doesn't run, this implies time-slice donation – the receiver runs on the sender's time slice
    • Solution: Time represented by capabilities
      • Scheduling context capability
      • Represents a right to use time
        • T: Period
        • C: Budget <= T
      • Represents basic unit for mixed criticality
    • Problem has been bothering Gernot for 20 years
    • Notion of a passive server
      • No scheduling context
      • Runs on donated scheduling context (uses up client’s scheduling context)
      • So-called Migrating Thread Model
      • Problem: Timeslice runs out while server is running, what is the next behaviour?
      • A passive server may block because a low-criticality client's budget ran out, which then blocks a high-criticality client
      • Policy free method for solving this in seL4: raise an exception and let the user handle it
      • Timeout exceptions – policy-free mechanism for dealing with budget depletion
        • Possible policies
        • Emergency budget to leave critical section
        • Rollback, it’s client’s problem
  • SOSP ‘93 and SOSP ‘95: two papers that describe tricks and design decisions in the original L4 kernel
    • Some were kept, others have since been dropped

Reflecting on Changes

  • Original L4 shortcomings
    • Insufficient and impractical resource control
    • Over-optimised and needlessly complex IPC abstraction
  • seL4 completes a 30-year dream of OS verification

Lecture Week 6 Monday

“Security Fundamentals”

Security Design

  • Saltzer & Schroeder [SOSP ‘73, CACM ‘74]
  • Economy of mechanism
  • Fail-safe defaults
  • Complete mediation – check everything
  • Open design – not security by obscurity
  • Separation of privilege
  • Least privilege – principle of least authority
  • Least common mechanism
  • Psychological acceptability – if it’s hard to use it won’t be

Mechanisms & Policies

  • Access control is the central domain of operating systems
  • Confidentiality – prevent read of Hi
  • Integrity – prevent tampering of Hi
  • Availability

Risk Management

  • Mitigate (security mechanisms) or Transfer (insurance)

Covert Channels

  • Intentional leakage
  • Storage channel
    • Use an attribute of a shared resource, e.g. a filename that encodes secrets
  • Timing channel
    • Temporal order of use of shared resource
  • Other physical channels
    • Power draw etc.
  • Covert channels vs side channels
    • Covert channels are intentional
    • Side channels are unintentional

Access-Control Principles

  • Representing protection state
    • Think of a matrix
    • (Vertical) Access control list – list for everyone, gets big
      • Compacted – UNIX stores entries only for user, group, and everyone
    • (Horizontal) Capabilities
  • Capability
    • (Reference + Access Token) Combined
    • Implies a naming system to reference it
  • Capabilities must be unforgeable
    • Either protected by hardware (tagged memory protection)
    • Or protected by sparseness (a large number of bits)
    • Or held as privileged kernel data (e.g. the kernel can verify a cap through a tree)
  • ACLs & Capabilities – Duals of the access control matrix

Differences of ACLs and Capabilities

  • Naming and name spaces
    • ACLs: Objects referenced by name; access is orthogonal to reference
    • Capabilities: the cap is the name; reference and access are combined
    • Problem: Confused deputy
      • Imagine the compiler log-file problem (a privileged compiler is tricked into overwriting a file its invoker couldn't write)
      • Problem arises due to ambient authority
      • Fundamentally unsolvable with ACLs
    • Problem: Confused deputy with caps
      • No ambient authority
      • Alice can’t even pass the logfile
  • Meta-permissions
    • ACLs require meta-permissions to modify ACLs
    • Caps use delegation and revoking, built-in fine-grained meta-access
    • Caps: Right to delegate vs right to use
  • Process creation
    • ACLs: Child inherits all rights – opposite of least privilege
    • Caps: Child has no caps by default – defaults to least privilege
  • Auditing of protection state
    • “Who has rights to this thing?”
      • Easy with ACLs
      • Hard to solve with sparse caps
      • seL4: CSpace makes this possible
    • “What rights does this thing have?”
      • Caps: Just look at caps
      • ACLs: Close to impossible, need full scan
  • Interposing access
    • Caps are opaque, can masquerade
    • Transparent virtualisation of access (e.g. containers)
    • “Security Monitor”
    • Example: Lazy object creation
      • Objects can be lazy references
      • Server can create the corresponding state on first use

Mandatory vs Discretionary Access Control

  • Discretionary: As owner of file, I can change access
  • Mandatory: System administrator defines policy, owner cannot change
  • Bell & LaPadula (BLP) Model [1973]
    • Modelled on national security classification
    • Classification – clearances
    • Orthogonal compartments (need-to-know)
    • Rules:
      • No read up (obviously)
      • No write down (writing down would be equivalent to someone low reading up)
      • Example: military – WTF, you can't give orders down?!
      • In practice, need exceptions for de-classification
      • Attestation for permissions to send down
      • Trend to over-classify, not good
  • Biba model – similar, but enforces integrity
    • Now, don’t think of secrets but think of data corruption
    • No read down
    • No write up
    • Don’t want data flowing up, this time
  • Put them together and…
    • Now no information can flow in any direction :/
    • Strong *-Property (Star)
      • Allow read down, trusting the high-integrity side to take responsibility for not corrupting itself
  • Boebert’s Attack
    • Famous argument against capabilities
    • “On the inability of an unmodified capability machine to enforce the Star-property”
    • Doesn’t work on seL4 because caps are orthogonal to data
    • Needs the Grant right both on the cap and on the IPC endpoint (and the high-integrity side will deny this)
    • Need mechanism to limit cap propagation: take-grant model on seL4
  • Decidability
    • Given initial state, can it ever reach an unsafe state?

Lecture Week 6 Tuesday

“Security: Information Leakage”

Causes of Timing Channels

  • Algorithmic (e.g. hashing)
  • Resource contention (e.g. caches)
    • Stateless: e.g. interconnect bus speed
      • Limitations:
      • Only during concurrent access
      • Generally reveals no data
      • Requires intentional actor to be used as a covert channel
      • Not really possible to use as side channel
    • Stateful:
      • CPU Caches
      • Branch predictor
      • These are all capacity-limited resources

Timing Channels

  • Prime and Probe
    • Challenge: Access times
    • L1 is very high bandwidth
    • Higher-level (larger) caches take longer to traverse
    • Approach:
      • Probe a few cache lines at a time, find “interesting sets”
      • E.g. look for `square` code from GnuPG
      • Can read out bits of the exponent! [Liu S&P’15] (see the probe sketch after this list)
  • Prevention:
    • Spatial & temporal partitioning
    • Spatial: Cache colouring!
    • Temporal: Flush cache!
  • However: Cannot spatially partition on-core cache (L1 cache)
    • L1 usually has 1 colour
    • (if virtually-indexed)
  • Cache flushing for L1 has little effect cross-domain because the cache would be trashed anyway (for the other domain)
  • However: Flushing useless for multicore / SMT
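
A prime-and-probe sketch for a single cache set (x86, timed with rdtsc). Building the eviction set – addresses that all map to the same set – is the hard part and is assumed already done:

    #include <stdint.h>
    #include <x86intrin.h>

    /* addrs: one address per way, all mapping to the same cache set. */
    uint64_t probe_set(volatile char **addrs, int ways) {
        for (int i = 0; i < ways; i++)      /* prime: own the whole set */
            (void)*addrs[i];

        /* ... victim runs here ... */

        uint64_t start = __rdtsc();
        for (int i = 0; i < ways; i++)      /* probe: time our re-reads */
            (void)*addrs[i];
        return __rdtsc() - start;           /* slow => victim touched the set */
    }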

Evaluating Intra-Core Channels

  • Methodology: Channel Matrix
    • If the timing distribution differs as you move horizontally (varying the input, e.g. set size), there is a channel
  • Reference: https://ts.data61.csiro.au/publications/csiro_full_text//Ge_YCH_19.pdf
  • I-Cache Channel (even with flush) demonstration with channel matrix
  • Branch-History-Buffer (BHB)
    • 1-bit channel
  • This means an unclosable channel – temporal protection is impossible, even with flushing

Intel Spectre Defences

  • “Indirect Branch Control”
    • Tries to close BHB channel by flushing

Speculating Disaster

  • Instruction Pipelining
    • Problem: Dependencies
    • Out-of-Order Execution
      • Includes speculative branching
  • Problem: Speculation pollutes cache

Meltdown

  • Secrets leak because the kernel's address space maps all physical memory, so every userspace program can be leaked
  • Fix: trapping now switches to a separate kernel page table, which is expensive

Spectre: Branch Prediction (Variant 1)

  • The bounds check guarding an array reference is speculatively ignored, and the dependent access pollutes the cache (see the gadget below)
  • Invalid array index can leak any address in program
  • ISA has no contract on microarchitectural behaviour
  • We need a new hardware-software contract!
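
The published Variant 1 gadget shape, for reference:

    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    uint8_t array2[256 * 512];
    size_t  array1_size = 16;

    void victim(size_t x) {
        if (x < array1_size)               /* predicted taken even when x is
                                              out of bounds */
            (void)array2[array1[x] * 512]; /* which cache set gets loaded now
                                              encodes the secret byte array1[x] */
    }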

Lecture Week 7 Monday

“Linux, Locking, Many Processors”

(No lecture Week 7 Tuesday)

Some History

  • John Lions, taught UNIX at UNSW
  • A lot of people got their introduction to UNIX with John Lions annotated source code
  • BSD, Paged Virtual Memory 1979, etc.
  • UUCP, Unix-to-Unix Copy, MHSnet using phone calls (lol)
  • Process model, fork exec, memory sharing
  • dup, open file descriptors, inodes, close-on-exec
  • Inode model for directories
  • Namei
  • Linux extra keywords: __init, __exit, __percpu, __user, __rcu, __kernel, __iomem, __acquires, __releases, __bitwise (see the __user sketch after this list)
  • Other Linux C dialect / abstractions
  • Linux memory consistency model
  • Linux has its own memory model that abstracts the hardware memory model (i.e. loads, fences)
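
A sketch of what the __user annotation buys, in a hypothetical read() handler:

    #include <linux/fs.h>
    #include <linux/uaccess.h>

    /* buf is tagged __user: sparse flags any direct dereference, so access
       has to go through the checked copy helpers. */
    static ssize_t my_read(struct file *f, char __user *buf,
                           size_t len, loff_t *off) {
        static const char msg[] = "hello\n";
        if (len > sizeof(msg))
            len = sizeof(msg);
        if (copy_to_user(buf, msg, len))  /* validates the user pointer */
            return -EFAULT;
        return len;
    }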

Scheduling

  • Dispatch O(1)
  • Pretty fair
  • Good interactive response
  • Topology-aware (NUMA)
  • O(log n) scheduling
  • Currently CFS – Completely Fair Scheduler
  • Dual Entitlement Scheduler
    • Each thread given entitlement (CPU time)
    • Deserving and Undeserving queue

Memory Management

  • Struct page
  • Other various structs
  • How Linux handles page faults

Driver interface

  • Enumerable and non-enumerable devices
    • Enumerable: can ask the device “who are you?”; non-enumerable: can't find out
  • Device tree, exposed to userspace, describes non-enumerable devices

Containers

  • Namespace isolation
  • The lecturer did research on this in the 70s; a fair-share scheduler developed here was sold to Cray, then to Sun
  • Hierarchical scheduling competition
  • (Sounds like carving out a scheduling capability…?)

Scalability

  • Performance cliff at high loads due to coherence delay
  • Amdahl’s law and Gunther’s law
  • Doing without locks
    • Sequence locks – lots of readers, few writers (see the reader sketch below)
  • Read Copy Update
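
A reader-side sketch using Linux's seqlock API:

    #include <linux/seqlock.h>

    static DEFINE_SEQLOCK(data_lock);
    static struct { int x, y; } shared, snapshot;

    /* Readers never block the writer; they just retry if a write overlapped. */
    static void read_consistent(void) {
        unsigned seq;
        do {
            seq = read_seqbegin(&data_lock);
            snapshot = shared;
        } while (read_seqretry(&data_lock, seq));
    }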

Lecture Week 8 Monday

“Multicore and Memory Ordering”

Types of Multiprocessors

  • Uniform Memory Access
    • Processors with local caches
    • Cache coherency!
    • Unit of consistency is the cache line
    • Sequential Consistency
      • Result should be as if all the processors were executed in some sequential order
      • In other words, all reads see the latest write
    • Early solution: Write-through caches
      • CPU snoops bus activity to invalidate stale lines
      • Downside:
        • All writes go to bus
  • Non-Uniform Memory Access
    • Provides high local bandwidth and reduces bus contention
    • Other options like MESI (Modified, Exclusive, Shared, Invalid)
    • Directory-based coherence
      • For really big multiprocessing where broadcast bandwidth for MESI is not adequate

Memory Ordering

  • Example: the tail of a critical section
  • The stores must not be reordered past the lock release, or they'd effectively be outside the critical section!
  • Strong Ordering
  • Note: https://preshing.com/20120930/weak-vs-strong-memory-models/
  • Each CPU keeps itself consistent, but how is it viewed by others?
  • Example: Write buffers
    • x86: Total Store Ordering, stores are FIFO
    • Problem: CPU 0 writes to write buffer, CPU 1 reads from cache, inconsistency! Apparent reordering of instructions!
    • Memory fences effectively drain the write buffer
    • ARM: Partial Store Order
  • Existing primitives will have appropriate fences in place
    • In practice, correctly synchronised code can ignore memory model
  • But if you write racy code (e.g. lock-free algorithms) you must use fences and barriers (see the sketch below)
  • x86 is nice: only stores are reordered after loads
  • ARM is not so nice
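
The classic message-passing idiom that goes wrong without fences, sketched with C11 atomics:

    #include <stdatomic.h>

    int data;
    atomic_int flag;

    void producer(void) {
        data = 42;
        /* release: the data store cannot be reordered past this */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    int consumer(void) {
        /* acquire: later reads cannot be reordered before this */
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;
        return data;   /* guaranteed to see 42, even on ARM */
    }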

Kernel Locking

  • Trylock; prone to livelock
  • Different types of locks
  • Spinlocks (see the test-and-test-and-set sketch below)
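
A test-and-test-and-set spinlock sketch in C11 atomics (next lecture looks at where this falls over):

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        for (;;) {
            /* spin on plain reads: stays in the local cache, no bus traffic */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* only the atomic exchange contends for the cache line */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }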

Lecture Week 8 Tuesday

Continued…

  • Common spin lock
    • Beware of bus arbitration!
    • Bad because requires disabling interrupts for critical paths
  • Test and test and set is not a panacea
  • Test-and-set kinda falls off beyond about 7 processors
  • Implementation of MCS Locks
    • Buuut… MCS locks on ARM MPCore?
  • TT&S: low overhead under low contention
  • MCS: degrades less under high contention
  • Dynamically switch between TT&S and MCS depending on contention level
  • Case study: Linux performance collapse
    • Collapse for short sections
    • Non-scalable lock
    • Cache coherence protocol
    • MCS locks are the best! 🙂

Lecture Week 9 Monday

“Formal Verification and seL4”

Assurance and Verification

  • Common Criteria
  • Semi-Formal = use some form of formal language, not just English, not necessarily mathematical
  • Formal = mathematical
  • Protection Profiles
  • Which OS’s have been evaluated?
    • Mac OS X – EAL 3
    • Windows and RHEL – EAL 4
    • Just about everything else tops out at EAL 4, i.e. not a lot of assurance
    • 2008 Green Hills INTEGRITY – EAL 6
      • “While the system does not certify common criteria at any evaluation level” says the certification paper, WTF?!
    • 2019 Prove & Run PROVENCORE – EAL 7
  • Green Hills tried the One Box One Wire project for commodity-hardware VM isolation
    • March 2019: the NSA declared SKPP validation for commodity hardware platforms infeasible due to complexity
    • The NSA subsequently discontinued certifications at EAL 5 and above
  • Common Criteria – a very expensive way to prove nothing – $1000 per line of code
  • Too much focus on development process
  • Common Criteria – Effectively dead in 5-Eyes defence

Formal Verification

  • Started a long time ago, over-promised and under-delivered
  • Two approaches
  • Abstract Interpretation / Model Checking
    • Check a model for certain properties
    • Unsoundness – has false positives and false negatives
    • Suffers state-space explosion
    • Scales to large code bases
    • Doesn’t actually prove anything
  • Theorem Proving
    • Can deal with large state spaces
    • Requires refinement property (can be modelled in mathematics)
    • Can prove functional correctness against a formal specification
    • Very labour intensive
  • Note, recent work automatically proved functional correctness of simple systems using SMT solvers (model checking) [Hyperkernel, SOSP’17]

Model Checking and Linux: A Sad Story

  • [Chou & al, 2001] Applied static analysis to Linux
  • Device drivers are very buggy
  • Linux did nothing about it lol; even with purely open-source static checkers available, the faults persist

Story of seL4

  • August 2009, birth of seL4
  • Abstract Model < (Proof) > C Implementation
  • “Under the semantics of the C language, this program has behaviour that is allowed under the functional specification”
  • Means we can reason about the abstract model to verify properties
  • C Implementation < (Proof) > Binary Code
  • “Translation correctness”
  • Isolation Properties < (Proof) > Abstract Model
  • “Isolation Properties”
  • Also, “Worst-Case Execution Time” analysis – never before done for a protected-mode OS
  • Initially, had Haskell between AM and C
  • Kernel never fails, behaviour is always well-defined:
    • Assertions never fail
    • Will never de-ref null pointer
    • Will never access array out of bounds
    • Cannot be subverted by mis-formed input
    • All syscalls terminate, objects don't overlap, access control is decidable
  • CompCert – done in France; we can't use it because it uses a different mathematical framework, and we can't easily show that ours and theirs are equivalent (axioms, lemmas)
    • Being able to use gcc is also a real asset, different ISAs etc

Security Properties of seL4

  • To prove confidentiality:
    • Non-interference proof
    • Evolution of Low does not depend on High state
    • Infoflow is a very strong property. To prove it, remove non-determinism wherever it affects confidentiality, e.g. make the scheduler strictly round-robin.
    • May get really hard to work with.

Verification Assumptions

  • Hardware behaves as expected
  • Spec matches expectations
  • Proof checker is correct

Lecture Week 9 Tuesday

“Local OS Research”

“Use of a monolithic OS in security- or safety-critical scenarios is professional malpractice!” – Gernot

Quantifying OS Design Security Impact

  • Analyse Linux CVEs
  • If it were a microkernel, is it in the TCB?
    • Not in TCB, attack defeated
    • Else, try verification, attack defeated
    • DoS, strongly mitigated
    • Integrity or confidentiality violation: weakly mitigated (i.e. no longer a full-system compromise)
    • Full system compromise: e.g. firmware reflash, I2C bus hijack

Cogent

  • Code and proof co-generation to C
  • Not Turing-complete

Time Protection

  • Equivalent of memory protection but for time
  • i.e. timing channels
  • Channel through kernel code (i.e. shared kernel code I-cache)
  • Performance impact of cache flushing and colouring

Hardware Redundancy / Lockstep

  • Hardware lockstep vs Software lockstep
  • RCoE (Redundant Co-Execution) – work by PhD student Yanyan
  • Instead of replicating only user software, the kernel itself is replicated too (similar to what the Resource Manager needs for time protection), and the replicas vote among each other
  • Two variants:
    • Loosely-coupled
      • Don't synchronise too aggressively; only at kernel entry/exit
      • Low overhead, but can’t deal with racy code, threads
    • Closely-coupled
      • Sync on instruction
      • Precise preemptions
    • How closely-coupled works:
      • Leading replica? Wait for barrier
      • Trailing replica? Set breakpoint where leading replica is waiting, wait for barrier
      • Synced

Real World Use

  • DARPA HACMS
  • CAmkES
    • Component middleware
    • Higher level abstractions of low-level seL4 constructs
  • Cross Domain Desktop Compositor

Lecture Week 10 Monday

“Multiprocessor OS”

Scalability

  • Serialisation
  • By the OS
  • Locking introduces serialisation
  • Also atomic operations and cache coherency
  • Memory access can stall
  • Cache-related serialisation
  • False sharing – unrelated data items share a cache line and are accessed from different processors, causing cache-coherence traffic and delay
  • Cache line bouncing

Multiprocessor Hardware

  • Contemporary
  • ARM, big.LITTLE, Nehalem
  • Interconnects
  • Cluster on Die
  • Tile architecture
  • Rings

Optimisation for Scalability

  • Reduce critical sections
    • Fine grained locking or lock free data structures
    • (Lock data not code)
  • Reduce false sharing (pad data to cache-line size; see the sketch after this list)
  • Reduce cache line bouncing
  • Reduce cache misses; affinity scheduling
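
A padding sketch for per-CPU counters, assuming 64-byte cache lines:

    #define CACHE_LINE 64
    #define NCPUS      8   /* illustrative */

    /* One counter per cache line: updates from different CPUs no longer
       bounce the same line between caches. */
    struct padded_counter {
        unsigned long value;
        char pad[CACHE_LINE - sizeof(unsigned long)];
    } __attribute__((aligned(CACHE_LINE)));

    struct padded_counter counters[NCPUS];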

OS Design Guidelines for MP

  • Avoid shared data
  • Explicit communication (regain control over communication costs)
  • Reduce synchronisation
  • Allocate and schedule for locality
  • Uniprocessor performance vs scalability

Approaches for MP

  • Divide and conquer
    • Just use them as independent bits
    • Virtualisation
    • Exokernel
  • Reduced sharing
    • Brute force and heroic effort (fixing them one by one)
    • By design
  • No sharing
    • Distributed system, do extra work to share

Space-Time Partitioning

  • Tessellation OS
  • Gang Scheduling
    • Schedule all threads of a process in one scheduling interval

K42: Clustered Objects

  • Globally valid object reference, resolves to processor local representative
  • Controlled introduction of locality
  • Controlled trade-off between synchronise at write or synchronise at read

Corey / Linux Brute Force

  • Profiled Linux and achieved speedups
  • Applied lessons from previous research
    • Sloppy counters
    • Per-core data structs
    • Fine grained lock, lock free
    • Cache lines
  • How do we know if a system is scalable? Is there some fundamental blocker?
    • How do you really know without infinite benchmarks?
    • Is the OS API a potential bottleneck?
    • Scalable Commutativity Rule
      • If you can't tell in which order two operations ran (they commute), they can be implemented without sharing, i.e. scalably
      • Counterexample: open() must return the lowest available file descriptor (see below)
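
The counterexample in code:

    #include <fcntl.h>

    void demo(void) {
        /* POSIX requires the lowest free descriptor, so the results reveal
           the call order: swap the two calls and the fds swap. The fd
           allocator therefore has to be shared state. */
        int a = open("x", O_RDONLY);   /* say, fd 3 */
        int b = open("y", O_RDONLY);   /* fd 4 */
        (void)a; (void)b;
    }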

FlexSC

  • Asynchronous, batched system calls (exception-less syscalls)

Barrelfish

  • No sharing
  • Message passing

 
