Advanced Operating Systems Notes

These are some notes I took while taking Advanced Operating Systems (COMP9242, 19T2) at UNSW. Since the lectures cover a variety of somewhat obscure topics, I'm posting them here so I don't forget about them later in my career!

Disclaimer: These notes are my own, and they may not fully capture the intentions of the lecturers or the content of the course.

The lecture slides are Creative Commons, and they are available here: https://www.cse.unsw.edu.au/~cs9242/19/lectures.shtml


Lecture Week 1 Monday

“Introduction”

Capabilities

  • Invoke a method on a capability (like an object)

CSpace and CNodes

  • A CSpace represents Mapping[OID, AccessRight]
  • Each CNode is an array of capability “slots”
  • Capabilities: Read, Write, Execute, Grant
  • Operations on Capabilities: Invoke, Copy/Mint/Grant (Duplicate super/lesser), Move/Mutate (Transfer), Delete, Revoke

IPC (Endpoints)

  • More like “Cross-Address-Space Invocation”
  • Endpoints: Strictly synchronous
    • Receiver must be ready
    • This means data is only copied once
    • There is no buffering
    • No asynchronous kernel buffering
  • Do not use IPC for shipping bulk data (copying is not free, so don't use it as a copy mechanism)
  • It is:
    • Protected procedure call
    • User-initiated context switch (because it’s synchronous in lock-step)
  • If the recipient is not free (and highest priority), the endpoint queues senders in FIFO order
  • Think of Endpoints as a Producer-Consumer Pipe, One-Way
  • Need two endpoints for a sane two-way communication
  • How to reply:
    • Call automatically generates a Reply capability for the server
    • Single-use
    • The reply travels via a separate channel created specially for this Call
    • Hence, effectively two endpoints for one Call
  • Call is atomic and cannot be reproduced by a Send followed by a Receive
    • No automatic creation of the reply channel
    • The client could be preempted between the Send and the Receive (e.g. by a higher-priority server), breaking the protocol
    • See the client/server sketch below
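
A minimal sketch of the Call/ReplyRecv pattern in C, assuming an endpoint capability ep already sits in both CSpaces (function and slot names are illustrative, not from the course):

    #include <sel4/sel4.h>

    /* Client: Call sends the message, mints a one-shot reply cap for the
       server, and blocks until the reply arrives. */
    seL4_Word client_add(seL4_CPtr ep, seL4_Word a, seL4_Word b) {
        seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 2);
        seL4_SetMR(0, a);              /* arguments travel in virtual registers */
        seL4_SetMR(1, b);
        seL4_Call(ep, info);
        return seL4_GetMR(0);          /* result comes back the same way */
    }

    /* Server: reply to the previous client and wait for the next request
       in one system call. */
    void server_loop(seL4_CPtr ep) {
        seL4_Word badge;
        seL4_Recv(ep, &badge);
        for (;;) {
            seL4_SetMR(0, seL4_GetMR(0) + seL4_GetMR(1));
            seL4_ReplyRecv(ep, seL4_MessageInfo_new(0, 0, 0, 1), &badge);
        }
    }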

Badges

  • Maintaining session state for Endpoints without having every Server implement their own session mechanism
  • A Badged Capability is Minted from the original Capability and then given to a Client
  • This Badged Capability identifies a consumer of the Server (see the minting sketch below)
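
A sketch of minting in C; slot names and the badge value are placeholders:

    #include <sel4/sel4.h>

    /* Mint a badged copy of the server's endpoint cap into a free slot;
       the badged cap is what gets handed to the client. */
    void mint_badged_ep(seL4_CNode cspace_root, seL4_Word ep_slot,
                        seL4_Word badged_slot, seL4_Word badge) {
        seL4_CNode_Mint(cspace_root, badged_slot, seL4_WordBits,   /* dest */
                        cspace_root, ep_slot, seL4_WordBits,       /* src  */
                        seL4_AllRights, badge);
    }

    /* Server side: the badge pops out of Recv and identifies the client. */
    void serve_once(seL4_CPtr ep) {
        seL4_Word badge;
        seL4_Recv(ep, &badge);   /* badge == whatever was minted above */
    }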

Virtual Registers

  • “Argument” storage for Endpoints
  • Included in thread state

Notifications

  • Equivalent to an array of binary semaphores packed into a bit field
  • Can be polled (multiple, like select)
  • Maps nicely to interrupts
  • To avoid having to wait on an Endpoint and a Notification separately, a Notification can be bound to a TCB with seL4_TCB_BindNotification(), after which it is received like a message on the Endpoint (see the sketch below)
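
A binding sketch in C (cap names are placeholders):

    #include <sel4/sel4.h>

    /* After binding, a single Recv on the endpoint also returns
       notification signals, distinguishable by badge. */
    void bind_and_wait(seL4_TCB tcb, seL4_CPtr ntfn, seL4_CPtr ep) {
        seL4_TCB_BindNotification(tcb, ntfn);
        seL4_Word badge;
        seL4_Recv(ep, &badge);   /* message or signal: one wait point */
    }

    /* Elsewhere (e.g. another thread or an interrupt path): */
    void poke(seL4_CPtr ntfn) {
        seL4_Signal(ntfn);       /* like a V on the selected semaphore bits */
    }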

Lecture Week 1 Tuesday

  • Recap of similar lecture content from yesterday

Lecture Week 2 Tuesday

“Operating System Execution Models”

Note: Week 2 Monday was a public holiday

Interrupt Handling

  • IRQHandler and IRQControl
  • IRQControl can create IRQHandler
  • Applications are given an IRQHandler capability, which makes them device drivers (see the driver-loop sketch below)
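
A driver-loop sketch under this model; caps, slots, and the IRQ number are assumed to be set up elsewhere, and all names are illustrative:

    #include <sel4/sel4.h>

    void driver_loop(seL4_IRQControl irq_ctrl, seL4_CNode cspace_root,
                     seL4_Word handler_slot, seL4_IRQHandler irq_handler,
                     seL4_CPtr ntfn, int irq_num) {
        /* One-time setup: create an IRQHandler cap in handler_slot
           (irq_handler is the CPtr naming that slot), then attach a
           notification to it. */
        seL4_IRQControl_Get(irq_ctrl, irq_num, cspace_root,
                            handler_slot, seL4_WordBits);
        seL4_IRQHandler_SetNotification(irq_handler, ntfn);

        for (;;) {
            seL4_Word badge;
            seL4_Wait(ntfn, &badge);          /* interrupt arrives as a signal */
            /* ... service the device here ... */
            seL4_IRQHandler_Ack(irq_handler); /* allow the next interrupt */
        }
    }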

Execution Models, Threads, Coroutines

  • Events
    • Event loop and event handlers (see the sketch after this list)
    • Waits for events
    • Only requires a single stack (no yields)
    • How to handle user level code?
      • A kernel timer can be one of the events (this is the scheduler)
    • No blocking in kernel code
    • Only works on a uniprocessor
  • Coroutines
    • A stack per routine
    • Requires explicit yields
  • Threads
    • Preemptive multithreading
  • Continuations
    • Used in functional languages
    • Much higher-level than the above
    • Basically an explicit context switch, but in functional languages
    • Done by being given a “switch context back to caller” function and invoking it
    • Basically just multiple returns, in functional languages
    • Basically just Python generators
    • Can be used to build yield where it doesn't exist
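
A toy single-stack event loop in C, to make the events model concrete (everything here is illustrative):

    /* Handlers run to completion on the one stack: they may never block,
       and there is no yield. */
    typedef void (*handler_t)(void *arg);

    struct event {
        handler_t handler;
        void     *arg;
    };

    /* Would block waiting on timers, interrupts, etc.; a kernel timer event
       is how scheduling of user code gets driven. */
    extern struct event next_event(void);

    void event_loop(void) {
        for (;;) {
            struct event e = next_event();
            e.handler(e.arg);
        }
    }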

Per-Thread Kernel Stack

  • The UNIX model
  • The state (e.g. per syscall) is implicitly stored on the stack
  • Can postpone execution by yielding (regardless of execution model)
  • Preemption is easy – just nest on the stack

Single Kernel Stack

  • How is blocking handled?
    • Either continuations
    • Or stateless kernels
      • “All it does is it makes it hard to write code” – Kevin

Lecture Week 3 Monday

“Events and Threads”

Two competing papers guide us in choosing which model to use for our own OS project.

Why Threads Are A Bad Idea

(for most purposes)

  • 1995
  • Threads hard to program in
  • For most purposes events are better
  • Only use threads when true CPU concurrency needed
  • Threads require synchronisation
    • Which brings the risk of deadlock
  • Hard to debug (data dependencies, timing dependencies)

Why Events Are A Bad Idea

(for high-concurrency servers)

  • 2003
  • Proposes Thread-Event duality
  • No inherent problem with threads
  • Benchmarked a user-level threads package; still fast

Lecture Week 3 Tuesday

“Caches”

Cache Indexing

  • Cache lines
  • Set associative lookup
    • Tag | Set | Byte
  • 2-way associative
  • Fully associative
    • No indexing, just compare tag
    • Obviously very expensive to compare in parallel
  • Cache associativity vs Paging
    • Traditionally, the index bits used for cache indexing overlap both the page-number bits and the page-offset bits, so the index comes from both
    • Implications:
      • e.g. odd/even page number determines the top/bottom half of a 2-way associative cache
      • Pages fall into equivalence classes as far as caching is concerned
      • Gives rise to an implicit notion of page colouring – pages of different colours cannot compete for cache (see the sketch below)
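
A colour-computation sketch, assuming (illustrative numbers) a 32 KiB 2-way set-associative cache with 64-byte lines and 4 KiB pages:

    #define LINE_SIZE   64
    #define NUM_SETS    (32 * 1024 / (LINE_SIZE * 2))        /* 256 sets  */
    #define PAGE_SIZE   4096
    #define NUM_COLOURS ((NUM_SETS * LINE_SIZE) / PAGE_SIZE) /* 4 colours */

    /* Pages of different colours can never evict each other's cache lines. */
    static inline unsigned page_colour(unsigned long paddr) {
        return (paddr / PAGE_SIZE) % NUM_COLOURS;
    }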

Cache Misses

  • “The Four Cs”
    • Compulsory miss
      • First access
    • Capacity miss
      • Would not miss on infinite-size cache
    • Conflict miss
      • Would not miss on fully-associative cache
    • Coherence miss
      • Hardware coherence protocol
  • Cache replacement policy
    • Upon replacement, dirty cache lines must be flushed
    • Algorithm must be simple for speed
  • Cache write policy
    • Write-back / write-through
    • A store that misses cache
      • Write-allocate / no allocate
      • (Bypass cache on write cache-miss?)

Cache Addressing Schemes

  • Four possible schemes:
    • Virtually-indexed, virtually-tagged (VV)
    • Virtually-indexed, physically-tagged (VP)
    • Physically-indexed, virtually-tagged (PV) <= Usually nonsensical
    • Physically-indexed, physically-tagged (PP)
  • Virtually-indexed, virtually-tagged
    • Virtually-addressed cache
    • Various incorrect names
      • Virtual cache
      • Virtual address cache
    • Can operate concurrently with MMU
    • Pitfalls
      • Must be flushed on context switch, because lookups use virtual addresses
      • Permissions must still be checked via the MMU, so translation is fast but the MMU remains on the critical path
      • Rarely used these days
  • Virtually-indexed, physically-tagged
    • Indexing concurrent with MMU
    • Used for on-core L1
  • Physically-indexed, physically-tagged
    • Intel claims a physically-indexed L1 cache
    • Really, the cache size is picked so that a virtually-indexed cache behaves like one
    • If the index bits all fall within the page-offset bits, the index is invariant under translation

Cache Issues

  • Structure software for cache performance
  • If virtual addresses are used for indexing or tagging, the following issues arise:
  • Homonyms
    • Same address, different data
    • May access incorrect data!
    • Prevention: tag VA with ASID
  • Synonyms (aliases)
    • Different address, same data
    • May access stale data!
    • Not a problem for R/O data or I-caches
    • Late write-back of a stale alias can clobber newer data (a “cache bomb”)
    • Also DMA inconsistency issues kind of like synonym issues
    • Fixes
      • Flush cache – doesn't fix aliasing within an address space
      • Detect synonyms in either hardware or software
      • Or restrict VM mappings so that synonyms map to the same cache set
      • Rely on hardware 😛

Write Buffer

  • Stores take time to complete
  • Avoid stalling the CPU by buffering writes in a FIFO before they hit the cache
  • Gives rise to weak memory ordering
    • Local CPU can grab things out of local write-buffer, remote CPUs may not see this write-buffer

Translation Lookaside Buffer

  • TLB sizes have remained roughly constant historically

Lecture Week 4 Monday

“Virtualisation”

Virtual Machine

  • “Efficient, isolated duplicate of a real machine”
  • Duplicate
    • Should behave almost identically, except for resources or timing differences
  • Isolated
    • Multiple VMs co-hosted without interference
  • Efficient
    • Run on bare metal
  • Hypervisor / Virtual Machine Monitor
    • Interchangeable terms
    • Software layer implementing the virtual machine

Types of Virtualisation

  • Platform (Type 1) VM
  • System (Type 2) VM
  • OS-level VM
  • Programming Language VM
  • We’ll mostly look at Type 1, and some of Type 2

Why VMs?

  • First designed to multiplex the hardware before true time-sharing OS
  • Then reborn to solve much the same problem: poor resource isolation in mainstream OSes
  • Use case: Cloud computing
    • Many benefits
  • Type 2 has very high exception handling cost (e.g. signal handling)

Virtualisation Mechanics

  • Instruction Emulation
    • Trap-and-Emulate approach
      • Expensive (e.g. building a trap frame; see the dispatch sketch after this list)
      • But few instructions trap, so the cost is amortised
    • Hypercalls
      • Hypervisor-specific ISA?
    • Trap-and-Emulate requires specific host ISA support
      • Privileged and Sensitive Instructions
      • T&E virtualisable = All sensitive instructions are privileged
    • “Impure” virtualisation if T&E not supported
      • Binary translation of privileged instructions
        • Replace with trap “hypercall”
        • Insert in-line emulation for some others
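
A dispatch sketch of trap-and-emulate in C; every type and helper here is hypothetical:

    #include <stdint.h>

    struct vcpu {                  /* hypothetical virtual-CPU state */
        uint64_t guest_pc;
        uint64_t guest_pt_base;
        int      virt_irq_enabled;
    };

    enum insn_kind { INSN_DISABLE_IRQ, INSN_SET_PT_BASE, INSN_OTHER };

    extern uint32_t fetch_insn(struct vcpu *v, uint64_t pc);
    extern enum insn_kind decode(uint32_t insn);
    extern uint64_t operand(struct vcpu *v, uint32_t insn);
    extern void shadow_pt_rebuild(struct vcpu *v);
    extern void inject_fault(struct vcpu *v);

    /* On a privileged-instruction trap, emulate the instruction against
       the virtual CPU state instead of the real hardware. */
    void handle_priv_trap(struct vcpu *v) {
        uint32_t insn = fetch_insn(v, v->guest_pc);
        switch (decode(insn)) {
        case INSN_DISABLE_IRQ:
            v->virt_irq_enabled = 0;   /* touch virtual state, not real state */
            break;
        case INSN_SET_PT_BASE:
            v->guest_pt_base = operand(v, insn);
            shadow_pt_rebuild(v);      /* see address translation below */
            break;
        default:
            inject_fault(v);           /* illegal in the guest too */
            return;
        }
        v->guest_pc += 4;              /* skip the emulated instruction */
    }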

Virtualisation Address Translation

  • Two levels of address translation
  • Introduce a notion of “Guest Physical Memory”
  • But, must be implemented with single MMU translation
  • 1. Shadow Page Table
    • Hypervisor shadows (keeps a copy of) real guest page table
    • VMM resolves two page table layers at once
    • Only works on hardware-loaded page tables so the VMM can find out where the guest’s page table base pointer is
    • (Of course, this instruction is privileged so it’s trapped)
    • We write-protect the guest page table, so we trap every write to it
    • Shadow PT has TLB semantics (weak consistency)
      • Only gets synced at synchronisation points like a TLB
        • Page fault
        • TLB flush
  • 2. Real Page Table
    • Translate on every guest read/write of its page table, so the real PT holds actual physical mappings while the guest sees only guest-level mappings (see the lookup sketch after this list)
    • Instead of trapping on every read/write, do inline binary translation so we don’t trap
    • Or para-virtualise, use a hypercall
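
The two-level translation that these schemes collapse into one MMU walk, sketched with hypothetical one-call-per-level helpers:

    #include <stdint.h>

    extern uint64_t guest_pt_lookup(uint64_t gva);  /* guest's own page table */
    extern uint64_t host_pt_lookup(uint64_t gpa);   /* VMM's guest-PA -> host-PA map */

    /* The shadow PT caches the composition of these two mappings, so the
       hardware only ever walks one table. */
    uint64_t guest_va_to_host_pa(uint64_t gva) {
        uint64_t gpa = guest_pt_lookup(gva);
        return host_pt_lookup(gpa);
    }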

Guest Self-Virtualisation

  • Minimise traps by holding some virtual state inside guest
  • For example, interrupt handling virtual interrupt flag
  • Only if guest and VMM agree on using virtual interrupt flag
  • No need to trap interrupt flag edits

Device Models

  • Emulated, Split (Para-Virtualised), Passthrough
  • Emulated
    • Emulate device register accesses
    • Very slow, because requires trapping on every device access
    • If emulation is too slow and the real device has timing constraints, the abstraction breaks
    • For example, Linux does boot time speed checks
  • Split Driver
    • Must port drivers to hypervisor
    • Low overhead
    • Basically a driver RPC
    • Example: Xen Dom0
      • Trusted Computing Base runs “Driver OS”
  • Pass Through
    • Can have hardware support for I/O MMU for hardware guest separation
    • Can also share hardware itself if hardware itself supports it

ARM Virtualisation Extensions

  • Hyp Mode
  • Configurable Traps
    • Can configure the hardware to trap directly into the guest OS, instead of the hypervisor having to forward the trap to the guest OS
    • Similar for x86
  • Emulation support
    • Because of separation of I-Cache and D-Cache, we get a compulsory cache miss upon any instruction trap
    • HW decodes the trapped instruction for us, so we don't have to fetch it through the cache
  • 2-stage Translation
    • Hardware walks both guest and VMM page tables
    • Eliminates “Virtual TLB”
    • A bit slower, even though VMM PT is shallow
    • Worst case is massively slower, i.e. a page fault at every level of both tables
    • But generally a good idea
  • Virtual Interrupts
  • System MMU (I/O MMU)
    • Like an MMU for DMA

World Switch

  • Intel
    • VM state is <=4KiB
    • VMentry / VMexit
    • Hardware saves and restores
  • ARM
    • VM state is 488B
    • Save and restore done by hypervisor

KVM

  • KVM is not Type-2!
  • A “Hybrid Hypervisor”
  • Fundamentally a Type-1
  • Works by using the fact that virtualisation privilege is orthogonal to normal ring privilege
  • “Really bloody HUGE” Trusted Computing Base
  • Reuse Linux drivers

Fun Things with VMs

  • See Slide

Lecture Week 4 Tuesday

“Measuring and Analysing Performance”

Benchmarking in R&D

  • Conservative: No regression
  • Progressive: Actual improvement in important cases
  • Must analyse and explain results
    • Discuss model
    • Hypothesis for behaviour
    • Test and confirm hypothesis
  • Microbenchmarks (component) and Macrobenchmarks (full system)
  • Obtaining an overall score
    • Absolute score rarely matters
    • Compare systems relative to each other
    • BUT: the arithmetic mean is meaningless for relative measures
    • Use the geometric mean – invariant under normalisation (see the sketch below)
  • Benchmarking crime: reporting throughput degradation as “overhead”
    • Use a proper figure of merit: processing cost per unit of data
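
A sketch of the aggregate in C; normalising the scores to a different baseline just scales the result, so comparisons between systems are unaffected:

    #include <math.h>

    double geo_mean(const double *score, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(score[i]);   /* mean in log space ... */
        return exp(log_sum / n);        /* ... then back */
    }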

Profiling

  • Tool: gprof, but it can't be used inside the OS kernel
  • Performance Monitoring Unit (PMU)
    • Hardware-based profiling unit
    • Collects certain events
    • A counter that counts events
    • Linux: OProfile
    • Mainly CPU based events

Performance Analysis

Understanding Performance

  • Some benchmarking tables and how to interpret them

Lecture Week 5 Monday

“Real-Time Systems”

Real-Time Tasks

  • Event, Jitter, Release, Processing Time, Completion, Deadline
  • Deadlines are frequently implicit – process things without missing an event
  • Safety-critical: failure ⇒ death or serious injury
  • Mission-critical: failure ⇒ massive financial damage

Weakly-Hard Real-Time

  • Where occasional deadline misses can be compensated for
  • The cost of a miss doesn't go to infinity (i.e. it is recoverable, or probabilistically OK)
  • But what does it mean to have a probability here? Is it safe to base it on observed execution times?
  • To Gernot, probabilistic real-time is bogus, because you usually can't assign real probabilities

Firm Real-Time Systems

  • Typically represented by a gain function (lower latency is advantageous), not a cost function (upon failure)
  • For example, trading systems

Soft Real-Time

  • Tail latencies
  • For example, cloud providers, web services, video streaming
  • Characterised by Quality of Service
  • “Bounded tardiness” – just some term

Best-Effort

  • In practice, duration is rarely totally irrelevant

What does it mean for OS?

  • Designed to support real-time
    • Fast and predictable interrupt handling
  • Main duty is scheduling tasks to meet their deadline
  • RT vs OS terminology
    • Task vs Thread
    • Job vs Event
  • The problem is harder than bin-packing, because time is not fungible
  • Feasible – there is a known algorithm that will schedule the task set
  • Optimal – can schedule any task set that is feasible

On-Line RT Scheduling

  • Execution order is not predetermined
  • Fixed priority scheduling – L4
    • Scheduler always runs highest-priority thread
    • Will preempt to run a higher-priority thread
  • Rate Monotonic Priority Assignment (RMPA)
    • Period, Rate, Priority, Utilisation
    • Schedulability test guarantee
      • Guaranteed if utilisation U ≤ n(2^(1/n) − 1), which tends to ln 2 ≈ 69.3% as the number of tasks n → ∞
      • Sufficient but not necessary (a task set can get “lucky”)
    • RMPA is optimal for FPS
  • Dynamic priorities: Earliest Deadline First
    • Also a schedulability-test guarantee: U ≤ 100%
    • Assumption: All the tasks are completely independent of each other
    • Not recommended by Gernot

Resource Sharing

  • Locking vs Delegation
    • RT Terminology: Resource Server
  • How to implement delegation for seL4
    • seL4 IPC for single-core (Hoare-style monitor), notifications for multi-core (semaphores)
  • Problem: Priority Inversion
    • May have multiple layers of dependencies, if high-priority thread depends on low-priority
    • How do you minimise it?
  • Solution 1: Priority inheritance (“Helping”)
    • Do what’s most beneficial for highest-priority task
    • Bump priority of subtasks to the priority of the waiting thread
    • Transitive inheritance => Possibility for long blocking chains
    • Potential deadlocks
    • Messy, easy to get wrong for safety-critical systems
  • Solution 2: (Immediate) Priority Ceiling Protocol (IPCP)
    • Predefined priority associated with a shared resource (lock)
    • Make thread exit critical section as soon as possible (use “ceiling” priority)
    • Each task must declare resources at admission time
      • In a real time system, this is okay to require
      • Easy to enforce with capabilities
    • FPS: bump priority to the ceiling; EDF: use the floor of the users' deadlines (see the sketch below)
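
A sketch of IPCP under FPS, with hypothetical priority-management helpers:

    extern int  my_priority(void);
    extern void set_my_priority(int prio);

    typedef struct {
        int ceiling;      /* highest priority of any task using this resource */
        int saved_prio;
    } ipcp_lock_t;

    void ipcp_enter(ipcp_lock_t *l) {
        l->saved_prio = my_priority();
        set_my_priority(l->ceiling);  /* no other user of the resource can
                                         preempt us now, so on a uniprocessor
                                         no further locking is needed */
    }

    void ipcp_exit(ipcp_lock_t *l) {
        set_my_priority(l->saved_prio);
    }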

Scheduling Overloaded RT Systems

  • Cannot trust everything to obey declared worst-case execution time
  • Which job will miss the deadline?
  • Fixed Priority Scheduling: the loser is well-defined
  • Priority and criticality: in FPS the two are conflated
  • In EDF, priority is implicit (derived from deadlines), so there is no notion of criticality, and the loser is not well-defined

Mixed-Criticality Systems

  • High priority but low criticality tasks (device drivers)
  • Low priority but high criticality tasks (control loops)

Lecture Week 5 Tuesday

“Microkernel Design & Implementation”

“If you find yourself hand waving a lot this is a sure sign you’re missing a diagram” – Gernot

“Security is no excuse for bad performance” – Gernot

Traditional L4 and seL4 Improvements

  • Recursive address spaces
    • If you have a mapping, you can just give it to any other process through IPC
    • Some issues exist with this model apparently
  • Thread IDs as IPC addressing
    • seL4: Mailbox model with badges; communication partners are anonymous
    • Badges are also more useful, as they have more flexible semantics
    • Traditional: Forces you to have a load balancer thread, which complicates programming
    • seL4: Transfer endpoint capability upfront for load balancing, get them back later, like a kernel-managed queue for workers
    • Traditional: no interposing allowed, lest it violate the authenticity of thread-ID-based authentication
  • Long IPC
    • Kernel-managed zero-copy IPC
    • Page faults mid-copy? Handling them takes ages in a microkernel…
    • Very complicated
    • Don't need the kernel for this: a trusted third party with access to both processes' mappings can provide the same service
  • Timeouts
    • IPC calls with timeouts
    • How does a user determine the timeout values? Only really reasonable in networking
    • seL4: New semantics for call/reply reduces this problem for server/client cases
  • IPC Fastpath
    • 1. Prologue
    • 2. Identify destination
      • Check caps
    • 3. Get receiver TCB
      • Check priority
    • 4. Mark sender blocked and enqueue
      • Create reply cap
    • 5. Switch to receiver
      • Leave Message Registers untouched
    • 6. Epilogue
    • Notice:
      • No scheduler invocation, just direct lookup
      • No time-slice donation
  • Kernel runs with interrupts disabled
    • No concurrency control means simpler kernel
      • Easier reasoning about correctness
      • Better average case
    • Long-running system calls? Bad.
    • Interrupt latency? Bad.
    • Most protected-mode RTOS are fully preemptible
      • Just save registers (trapframe) and re-enable interrupts
      • According to Gernot, preemption during IPC makes the code very hard to guarantee correctness, not a good idea for high assurance environments
      • It’s also very slow
    • How do we ensure liveness if we don't enable interrupts in the kernel?
    • Incremental Consistency Paradigm
      • Break long operations into many O(1) suboperations, re-establish kernel consistency after each, and poll for pending interrupts in between
      • Consistency, restartable progress
    • Worst-Case Execution Time Analysis
      • 1. Program Binary
      • 2. Control-flow graph, Micro-architecture model, Loop bounds
      • 3. Analysis tool
      • 4. Integer linear equations, Infeasible path (mutually exclusive paths) info
      • 5. ILP solver
      • 6. WCET
    • ARM stopped providing WCET information…
    • RISC-V implementations may supply them!
  • Scheduler optimisation – Lazy scheduling
    • Frequent blocking and unblocking in microkernel model
    • Don’t want to modify scheduler queue at every IPC
    • Leave blocked threads in ready queue, leave for scheduler to clean up
    • Worst case unbounded scheduler operation! O(n) for threads in system
    • Not the way to do real time
    • Now uses Benno Scheduling
  • Scheduler optimisation – Direct Process Switch
    • Idea: don't invoke the scheduler if you know who'll be chosen, i.e. if the receiver has higher priority
    • Implication: because the scheduler doesn't run, this implies time-slice donation – the receiver runs on the sender's time slice
    • Solution: Time represented by capabilities
      • Scheduling context capability
      • Represents a right to use time
        • T: Period
        • C: Budget <= T
      • Represents basic unit for mixed criticality
    • Problem has been bothering Gernot for 20 years
    • Notion of a passive server
      • No scheduling context
      • Runs on donated scheduling context (uses up client’s scheduling context)
      • So-called Migrating Thread Model
      • Problem: Timeslice runs out while server is running, what is the next behaviour?
      • A passive server may block because a low-criticality client's budget ran out, which then blocks a high-criticality client
      • Policy free method for solving this in seL4: raise an exception and let the user handle it
      • Timeout exceptions – policy-free mechanism for dealing with budget depletion
        • Possible policies
        • Emergency budget to leave critical section
        • Rollback, it’s client’s problem
  • SOSP ‘93 and SOSP ‘95: two papers that describe tricks and design decisions in the original L4 kernel
    • Some were kept, others have since been dropped

Reflecting on Changes

  • Original L4 shortcomings
    • Insufficient and impractical resource control
    • Over-optimised and needlessly complex IPC abstraction
  • seL4 completes a 30-year dream of OS verification

Lecture Week 6 Monday

“Security Fundamentals”

Security Design

  • Saltzer & Schroeder [SOSP ‘73, CACM ‘74]
  • Economy of mechanism
  • Fail-safe defaults
  • Complete mediation – check everything
  • Open design – not security by obscurity
  • Separation of privilege
  • Least privilege – principle of least authority
  • Least common mechanism
  • Psychological acceptability – if it’s hard to use it won’t be

Mechanisms & Policies

  • Access control is the central domain of operating systems
  • Confidentiality – prevent read of Hi
  • Integrity – prevent tampering of Hi
  • Availability

Risk Management

  • Mitigate (security mechanisms) or Transfer (insurance)

Covert Channels

  • Intentional leakage
  • Storage channel
    • Use an attribute of a shared resource, e.g. a filename that encodes secrets
  • Timing channel
    • Temporal order of use of shared resource
  • Other physical channels
    • Power draw etc.
  • Covert channels vs side channels
    • Covert channels are intentional
    • Side channels are unintentional

Access-Control Principles

  • Representing protection state
    • Think of a matrix
    • (Vertical) Access control list – list for everyone, gets big
      • Compacted – UNIX stores entries only for user, group, and everyone
    • (Horizontal) Capabilities
  • Capability
    • (Reference + Access Token) Combined
    • Implies a naming system to reference it
  • Capabilities must be unforgeable
    • Either protected by hardware (tagged memory protection)
    • Or protected by sparseness (a large number of bits)
    • Or held as privileged kernel data (e.g. the kernel can verify a cap through a tree)
  • ACLs & Capabilities – Duals of the access control matrix

Differences of ACLs and Capabilities

  • Naming and name spaces
    • ACLs: Objects referenced by name; access is orthogonal to reference
    • Capabilities: the cap is the name; reference and access are combined
    • Problem: Confused deputy
      • Imagine the compiler log-file problem (a privileged compiler is tricked into overwriting a file its invoker couldn't write)
      • Problem arises due to ambient authority
      • Fundamentally unsolvable with ACLs
    • Problem: Confused deputy with caps
      • No ambient authority
      • Alice can’t even pass the logfile
  • Meta-permissions
    • ACLs require meta-permissions to modify ACLs
    • Caps use delegation and revoking, built-in fine-grained meta-access
    • Caps: Right to delegate vs right to use
  • Process creation
    • ACLs: Child inherits all rights – opposite of least privilege
    • Caps: Child has no caps by default – defaults to least privilege
  • Auditing of protection state
    • “Who has rights to this thing?”
      • Easy with ACLs
      • Hard to solve with sparse caps
      • seL4: CSpace makes this possible
    • “What rights does this thing have?”
      • Caps: Just look at caps
      • ACLs: Close to impossible, need full scan
  • Interposing access
    • Caps are opaque, can masquerade
    • Transparent virtualisation of access (e.g. containers)
    • “Security Monitor”
    • Example: Lazy object creation
      • Objects can be lazy references
      • Server can create the corresponding state on first use

Mandatory vs Discretionary Access Control

  • Discretionary: As owner of file, I can change access
  • Mandatory: System administrator defines policy, owner cannot change
  • Bell & LaPadula (BLP) Model [1973]
    • Modelled on national security classification
    • Classification – clearances
    • Orthogonal compartments (need-to-know)
    • Rules:
      • No read up (obviously)
      • No write down (writing down would be equivalent to someone low reading up)
      • Example: military – WTF, you can't give orders down?!
      • In practice, need exceptions for de-classification
      • Attestation for permissions to send down
      • Trend to over-classify, not good
  • Biba model – similar, but enforces integrity
    • Now, don’t think of secrets but think of data corruption
    • No read down
    • No write up
    • Don’t want data flowing up, this time
  • Put them together and…
    • Now no information can flow in any direction :/
    • Strong *-Property (Star)
      • Allow read down, trusting the high-integrity side to take responsibility for not corrupting itself
  • Boebert’s Attack
    • Famous argument against capabilities
    • “On the inability of an unmodified capability machine to enforce the Star-property”
    • Doesn’t work on seL4 because caps are orthogonal to data
    • Needs the Grant right both on the cap and on the IPC endpoint (and the high-integrity side will deny this)
    • Need mechanism to limit cap propagation: take-grant model on seL4
  • Decidability
    • Given initial state, can it ever reach an unsafe state?

Lecture Week 6 Tuesday

“Security: Information Leakage”

Causes of Timing Channels

  • Algorithmic (e.g. hashing)
  • Resource contention (e.g. caches)
    • Stateless: e.g. interconnect bus speed
      • Limitations:
      • Only during concurrent access
      • Generally reveals no data
      • Requires intentional actor to be used as a covert channel
      • Not really possible to use as side channel
    • Stateful:
      • CPU Caches
      • Branch predictor
      • These are all capacity-limited resources

Timing Channels

  • Prime and Probe
    • Challenge: Access times
    • L1 is very high bandwidth
    • Higher-level (larger) caches take longer to traverse
    • Approach:
      • Probe a few cache lines at a time, find “interesting sets”
      • E.g. look for `square` code from GnuPG
      • Can read out bits of the exponent! [Liu S&P’15] (see the probe sketch after this list)
  • Prevention:
    • Spatial & temporal partitioning
    • Spatial: Cache colouring!
    • Temporal: Flush cache!
  • However: Cannot spatially partition on-core cache (L1 cache)
    • L1 usually has 1 colour
    • (if virtually-indexed)
  • Cache flushing for L1 has little effect cross-domain because the cache would be trashed anyway (for the other domain)
  • However: Flushing useless for multicore / SMT
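
A prime-and-probe sketch for a single cache set (x86, timed with rdtsc). Building the eviction set – addresses that all map to the same set – is the hard part and is assumed already done:

    #include <stdint.h>
    #include <x86intrin.h>

    /* addrs: one address per way, all mapping to the same cache set. */
    uint64_t probe_set(volatile char **addrs, int ways) {
        for (int i = 0; i < ways; i++)      /* prime: own the whole set */
            (void)*addrs[i];

        /* ... victim runs here ... */

        uint64_t start = __rdtsc();
        for (int i = 0; i < ways; i++)      /* probe: time our re-reads */
            (void)*addrs[i];
        return __rdtsc() - start;           /* slow => victim touched the set */
    }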

Evaluating Intra-Core Channels

  • Methodology: Channel Matrix
    • If the timing distribution differs as you move horizontally (varying the input, e.g. set size), there is a channel
  • Reference: https://ts.data61.csiro.au/publications/csiro_full_text//Ge_YCH_19.pdf
  • I-Cache Channel (even with flush) demonstration with channel matrix
  • Branch-History-Buffer (BHB)
    • 1-bit channel
  • This means an unclosable channel – temporal protection is impossible, even with flushing

Intel Spectre Defences

  • “Indirect Branch Control”
    • Tries to close BHB channel by flushing

Speculating Disaster

  • Instruction Pipelining
    • Problem: Dependencies
    • Out-of-Order Execution
      • Includes speculative branching
  • Problem: Speculation pollutes cache

Meltdown

  • Secrets leak because the kernel's address space maps all physical memory, so every userspace program can be leaked
  • Fix: trapping now switches to a separate kernel page table, which is expensive

Spectre: Branch Prediction (Variant 1)

  • The bounds check guarding an array reference is speculatively ignored, and the dependent access pollutes the cache (see the gadget below)
  • Invalid array index can leak any address in program
  • ISA has no contract on microarchitectural behaviour
  • We need a new hardware-software contract!
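
The published Variant 1 gadget shape, for reference:

    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    uint8_t array2[256 * 512];
    size_t  array1_size = 16;

    void victim(size_t x) {
        if (x < array1_size)               /* predicted taken even when x is
                                              out of bounds */
            (void)array2[array1[x] * 512]; /* which cache set gets loaded now
                                              encodes the secret byte array1[x] */
    }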

Lecture Week 7 Monday

“Linux, Locking, Many Processors”

(No lecture Week 7 Tuesday)

Some History

  • John Lions, taught UNIX at UNSW
  • A lot of people got their introduction to UNIX with John Lions annotated source code
  • BSD, Paged Virtual Memory 1979, etc.
  • UUCP, Unix-to-Unix Copy, MHSnet using phone calls (lol)
  • Process model, fork exec, memory sharing
  • dup, open file descriptors, inodes, close-on-exec
  • Inode model for directories
  • Namei
  • Linux extra keywords: __init, __exit, __percpu, __user, __rcu, __kernel, __iomem, __acquires, __releases, __bitwise (see the __user sketch after this list)
  • Other Linux C dialect / abstractions
  • Linux memory consistency model
  • Linux has its own memory model that abstracts the hardware memory model (i.e. loads, fences)
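
A sketch of what the __user annotation buys, in a hypothetical read() handler:

    #include <linux/fs.h>
    #include <linux/uaccess.h>

    /* buf is tagged __user: sparse flags any direct dereference, so access
       has to go through the checked copy helpers. */
    static ssize_t my_read(struct file *f, char __user *buf,
                           size_t len, loff_t *off) {
        static const char msg[] = "hello\n";
        if (len > sizeof(msg))
            len = sizeof(msg);
        if (copy_to_user(buf, msg, len))  /* validates the user pointer */
            return -EFAULT;
        return len;
    }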

Scheduling

  • Dispatch O(1)
  • Pretty fair
  • Good interactive response
  • Topology-aware (NUMA)
  • O(log n) scheduling
  • Currently CFS – Completely Fair Scheduler
  • Dual Entitlement Scheduler
    • Each thread given entitlement (CPU time)
    • Deserving and Undeserving queue

Memory Management

  • Struct page
  • Other various structs
  • How Linux handles page faults

Driver interface

  • Enumerable and non-enumerable devices
    • Enumerable: can ask the device “who are you?”; non-enumerable: can't find out
  • Device tree, exposed to userspace, describes non-enumerable devices

Containers

  • Namespace isolation
  • The lecturer did research on this in the 70s; a fair-share scheduler developed here was sold to Cray, then to Sun
  • Hierarchical scheduling competition
  • (Sounds like carving out a scheduling capability…?)

Scalability

  • Performance cliff at high loads due to coherence delay
  • Amdahl’s law and Gunther’s law
  • Doing without locks
    • Sequence locks – lots of readers, few writers (see the reader sketch below)
  • Read Copy Update
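
A reader-side sketch using Linux's seqlock API:

    #include <linux/seqlock.h>

    static DEFINE_SEQLOCK(data_lock);
    static struct { int x, y; } shared, snapshot;

    /* Readers never block the writer; they just retry if a write overlapped. */
    static void read_consistent(void) {
        unsigned seq;
        do {
            seq = read_seqbegin(&data_lock);
            snapshot = shared;
        } while (read_seqretry(&data_lock, seq));
    }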

Lecture Week 8 Monday

“Multicore and Memory Ordering”

Types of Multiprocessors

  • Uniform Memory Access
    • Processors with local caches
    • Cache coherency!
    • Unit of consistency is the cache line
    • Sequential Consistency
      • Result should be as if all the processors were executed in some sequential order
      • In other words, all reads see the latest write
    • Early solution: Write-through caches
      • CPU snoops bus activity to invalidate stale lines
      • Downside:
        • All writes go to bus
  • Non-Uniform Memory Access
    • Provides high local bandwidth and reduces bus contention
    • Other options like MESI (Modified, Exclusive, Shared, Invalid)
    • Directory-based coherence
      • For really big multiprocessing where broadcast bandwidth for MESI is not adequate

Memory Ordering

  • Example: the tail of a critical section
  • The stores must not be reordered past the lock release, or they'd effectively be outside the critical section!
  • Strong Ordering
  • Note: https://preshing.com/20120930/weak-vs-strong-memory-models/
  • Each CPU keeps itself consistent, but how is it viewed by others?
  • Example: Write buffers
    • x86: Total Store Ordering, stores are FIFO
    • Problem: CPU 0 writes to write buffer, CPU 1 reads from cache, inconsistency! Apparent reordering of instructions!
    • Memory fences effectively drain the write buffer
    • ARM: Partial Store Order
  • Existing primitives will have appropriate fences in place
    • In practice, correctly synchronised code can ignore memory model
  • But if you write racy code (e.g. lock-free algorithms) you must use fences and barriers (see the sketch below)
  • x86 is nice: only stores are reordered after loads
  • ARM is not so nice
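
The classic message-passing idiom that goes wrong without fences, sketched with C11 atomics:

    #include <stdatomic.h>

    int data;
    atomic_int flag;

    void producer(void) {
        data = 42;
        /* release: the data store cannot be reordered past this */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    int consumer(void) {
        /* acquire: later reads cannot be reordered before this */
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;
        return data;   /* guaranteed to see 42, even on ARM */
    }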

Kernel Locking

  • Trylock; prone to livelock
  • Different types of locks
  • Spinlocks (see the test-and-test-and-set sketch below)
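
A test-and-test-and-set spinlock sketch in C11 atomics (next lecture looks at where this falls over):

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        for (;;) {
            /* spin on plain reads: stays in the local cache, no bus traffic */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* only the atomic exchange contends for the cache line */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }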

Lecture Week 8 Tuesday

Continued…

  • Common spin lock
    • Beware of bus arbitration!
    • Bad because requires disabling interrupts for critical paths
  • Test and test and set is not a panacea
  • Test-and-set kinda falls off beyond about 7 processors
  • Implementation of MCS Locks
    • Buuut… MCS locks on ARM MPCore?
  • TT&S: low overhead under low contention
  • MCS: degrades less under high contention
  • Dynamically switch between TT&S and MCS depending on contention level
  • Case study: Linux performance collapse
    • Collapse for short sections
    • Non-scalable lock
    • Cache coherence protocol
    • MCS locks are the best! 🙂

Lecture Week 9 Monday

“Formal Verification and seL4”

Assurance and Verification

  • Common Criteria
  • Semi-Formal = use some form of formal language, not just English, not necessarily mathematical
  • Formal = mathematical
  • Protection Profiles
  • Which OS’s have been evaluated?
    • Mac OS X – EAL 3
    • Windows and RHEL – EAL 4
    • Just about everything else tops out at EAL 4, i.e. not a lot of assurance
    • 2008 Green Hills INTEGRITY – EAL 6
      • “While the system does not certify common criteria at any evaluation level” says the certification paper, WTF?!
    • 2019 Prove & Run PROVENCORE – EAL 7
  • Green Hills tried the One Box One Wire project for commodity-hardware VM isolation
    • March 2019: the NSA declared SKPP validation for commodity hardware platforms infeasible due to complexity
    • The NSA subsequently discontinued certifications at EAL 5 and above
  • Common Criteria – a very expensive way to prove nothing – $1000 per line of code
  • Too much focus on development process
  • Common Criteria – Effectively dead in 5-Eyes defence

Formal Verification

  • Started a long time ago, over-promised and under-delivered
  • Two approaches
  • Abstract Interpretation / Model Checking
    • Check a model for certain properties
    • Unsoundness – has false positives and false negatives
    • Suffers state-space explosion
    • Scales to large code bases
    • Doesn’t actually prove anything
  • Theorem Proving
    • Can deal with large state spaces
    • Requires refinement property (can be modelled in mathematics)
    • Can prove functional correctness against a formal specification
    • Very labour intensive
  • Note, recent work automatically proved functional correctness of simple systems using SMT solvers (model checking) [Hyperkernel, SOSP’17]

Model Checking and Linux: A Sad Story

  • [Chou & al, 2001] Applied static analysis to Linux
  • Device drivers are very buggy
  • Linux did nothing about it lol; even with purely open-source static checkers available, the faults persist

Story of seL4

  • August 2009, birth of seL4
  • Abstract Model < (Proof) > C Implementation
  • “Under the semantics of the C language, this program has behaviour that is allowed under the functional specification”
  • Means we can reason about the abstract model to verify properties
  • C Implementation < (Proof) > Binary Code
  • “Translation correctness”
  • Isolation Properties < (Proof) > Abstract Model
  • “Isolation Properties”
  • Also, “Worst-Case Execution Time” analysis – never before done for a protected-mode OS
  • Initially, had Haskell between AM and C
  • Kernel never fails, behaviour is always well-defined:
    • Assertions never fail
    • Will never de-ref null pointer
    • Will never access array out of bounds
    • Cannot be subverted by mis-formed input
    • All syscalls terminate, objects don't overlap, access control is decidable
  • CompCert – done in France; we can't use it because it uses a different mathematical framework, and we can't easily show that ours and theirs are equivalent (axioms, lemmas)
    • Being able to use gcc is also a real asset, different ISAs etc

Security Properties of seL4

  • To prove confidentiality:
    • Non-interference proof
    • Evolution of Low does not depend on High state
    • Infoflow is a very strong property. To prove it, remove non-determinism wherever it affects confidentiality, e.g. make the scheduler strictly round-robin.
    • May get really hard to work with.

Verification Assumptions

  • Hardware behaves as expected
  • Spec matches expectations
  • Proof checker is correct

Lecture Week 9 Tuesday

“Local OS Research”

“Use of a monolithic OS in security- or safety-critical scenarios is professional malpractice!” – Gernot

Quantifying OS Design Security Impact

  • Analyse Linux CVEs
  • If it were a microkernel, is it in the TCB?
    • Not in TCB, attack defeated
    • Else, try verification, attack defeated
    • DoS, strongly mitigated
    • Integrity or confidentiality violation: weakly mitigated (i.e. no longer a full-system compromise)
    • Full system compromise: e.g. firmware reflash, I2C bus hijack

Cogent

  • Code and proof co-generation to C
  • Not Turing-complete

Time Protection

  • Equivalent of memory protection but for time
  • i.e. timing channels
  • Channel through kernel code (i.e. shared kernel code I-cache)
  • Performance impact of cache flushing and colouring

Hardware Redundancy / Lockstep

  • Hardware lockstep vs Software lockstep
  • RCoE (Redundant Co-Execution) – work by PhD student Yanyan
  • Instead of replicating only user software, the kernel itself is replicated too (similar to what the Resource Manager needs for time protection), and the replicas vote among each other
  • Two variants:
    • Loosely-coupled
      • Don't synchronise too aggressively; only at kernel entry/exit
      • Low overhead, but can’t deal with racy code, threads
    • Closely-coupled
      • Sync on instruction
      • Precise preemptions
    • How closely-coupled works:
      • Leading replica? Wait for barrier
      • Trailing replica? Set breakpoint where leading replica is waiting, wait for barrier
      • Synced

Real World Use

  • DARPA HACMS
  • CAmkES
    • Component middleware
    • Higher level abstractions of low-level seL4 constructs
  • Cross Domain Desktop Compositor

Lecture Week 10 Monday

“Multiprocessor OS”

Scalability

  • Serialisation
  • By the OS
  • Locking introduces serialisation
  • Also atomic operations and cache coherency
  • Memory access can stall
  • Cache-related serialisation
  • False sharing – unrelated data items share a cache line and are accessed from different processors, causing cache-coherence traffic and delay
  • Cache line bouncing

Multiprocessor Hardware

  • Contemporary
  • ARM, big.LITTLE, Nehalem
  • Interconnects
  • Cluster on Die
  • Tile architecture
  • Rings

Optimisation for Scalability

  • Reduce critical sections
    • Fine grained locking or lock free data structures
    • (Lock data not code)
  • Reduce false sharing (pad data to cache-line size; see the sketch after this list)
  • Reduce cache line bouncing
  • Reduce cache misses; affinity scheduling
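
A padding sketch for per-CPU counters, assuming 64-byte cache lines:

    #define CACHE_LINE 64
    #define NCPUS      8   /* illustrative */

    /* One counter per cache line: updates from different CPUs no longer
       bounce the same line between caches. */
    struct padded_counter {
        unsigned long value;
        char pad[CACHE_LINE - sizeof(unsigned long)];
    } __attribute__((aligned(CACHE_LINE)));

    struct padded_counter counters[NCPUS];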

OS Design Guidelines for MP

  • Avoid shared data
  • Explicit communication (regain control over communication costs)
  • Reduce synchronisation
  • Allocate and schedule for locality
  • Uniprocessor performance vs scalability

Approaches for MP

  • Divide and conquer
    • Just use them as independent bits
    • Virtualisation
    • Exokernel
  • Reduced sharing
    • Brute force and heroic effort (fixing them one by one)
    • By design
  • No sharing
    • Distributed system, do extra work to share

Space-Time Partitioning

  • Tessellation OS
  • Gang Scheduling
    • Schedule all threads of a process in one scheduling interval

K42: Clustered Objects

  • Globally valid object reference, resolves to processor local representative
  • Controlled introduction of locality
  • Controlled trade-off between synchronise at write or synchronise at read

Corey / Linux Brute Force

  • Profiled Linux and achieved speedups
  • Applied lessons from previous research
    • Sloppy counters
    • Per-core data structs
    • Fine grained lock, lock free
    • Cache lines
  • How do we know if a system is scalable? Is there some fundamental blocker?
    • How do you really know without infinite benchmarks?
    • Is the OS API a potential bottleneck?
    • Scalable Commutativity Rule
      • If you can't tell in which order two operations ran (they commute), they can be implemented without sharing, i.e. scalably
      • Counterexample: open() must return the lowest available file descriptor (see below)
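
The counterexample in code:

    #include <fcntl.h>

    void demo(void) {
        /* POSIX requires the lowest free descriptor, so the results reveal
           the call order: swap the two calls and the fds swap. The fd
           allocator therefore has to be shared state. */
        int a = open("x", O_RDONLY);   /* say, fd 3 */
        int b = open("y", O_RDONLY);   /* fd 4 */
        (void)a; (void)b;
    }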

FlexSC

  • Asynchronous, batched system calls (exception-less syscalls)

Barrelfish

  • No sharing
  • Message passing

 
