Notes on “Frontend” Load Balancing

Chapter 19, “Load Balancing at the Frontend”, The Site Reliability Engineering Book.

Disclaimer: These notes are my own, and they may not fully appreciate the intention of the authors or the contents of the book.

Summary

The chapter describes a possible design to solve challenges when using multiple data-centres and multiple front-line machines within a data-centre to service user requests (collectively referred to as Google’s “Frontend”). Multiple geographically spread-out data-centres, each load-balancing internally, are necessary for redundancy, capacity, and user proximity. But this introduces challenges: deciding which data-centre serves a particular kind of user request given service-level and technical constraints, load-balancing in a way that scales with operational and capacity constraints, and how to implement such mechanisms.

A multi-level solution is used, and the first-level mechanic is DNS. Anycast DNS is used on the side of the authoritative nameserver to approximate a user’s location, with heuristics and newer techniques (EDNS0) to account for recursive resolvers. This information is used to direct the user to the “optimal” data-centre.

The second-level mechanic is virtual IP addresses, whereby multiple network interfaces receive requests addressed to one public IP address. Here, “virtual” means that from the user’s perspective the address is indistinguishable from an IP address bound to a single network card. Scalable mechanics (connection tracking, consistent hashing) are used by a network load balancer to distribute traffic and implement virtual IP addresses. Network address translation (NAT) is avoided in favour of tunnelled direct server return (L3 DSR) for scalability.

Frankly Asked Questions

(Things I didn’t quite get – maybe they were answered and I misread, don’t blame me if it was already answered or if I’m wrong about them!)

Question: Why not just use consistent hashing at the network load balancer all the time? Why only use it as a fallback and prefer connection state tracking?

Idea: Consistent hashing is not 100% stable. There is still a possibility that a connection is terminated due to changes in buckets (e.g. during deployments, autoscaling). Connection tracking doesn’t have this problem, which is why consistent hashing is only used as a fallback.

Question: Why not just apply anycast to the Virtual IP?

Idea: 1. Connection resets when roaming to different peering arrangements. 2. Unable to shape traffic, left at mercy of third party routing policies. [See: Anycast Is Not Load Balancing]

Question: What if there was a way to attempt connection to all addresses in a DNS record and just use the first one that accepts your connection? Can’t you just list all of your virtual IP addresses in a single DNS record served globally? (Happy Eyeballs)

Idea: Like anycast, by doing this you lose the ability to shape traffic. As the book mentions, if SRV records with weights had been widely adopted, maybe this could have worked.

Further Reading

Keeping the Balance: Internet-Scale Loadbalancing Demystified

Programming Challenges Exam Writeup

Problems I managed to solve in 6 hours for COMP4128 Programming Challenges. Got me 400 points out of 800 (the median score was also 400 if I’m reading the scoreboard correctly), so I’m very happy about that.

After doing pretty badly on the weekly problem sets, I honestly thought I would be screwed, but I guess persevering on the problem sets does make me better at this stuff.

Considering this is my first experience with programming challenges, it feels good to know I can also solve problems like these if I try. (But of course, I know these are really easy problems and I still have a long way to go!)

Problem A: Cryptography

Claude is developing a new system to encrypt information. In his system, a key is a tuple of five integers. The encryption is based on three fundamental operations.

Letting ã = (a1, a2, a3, a4, a5) and b̃ = (b1, b2, b3, b4, b5):
    • rotate(ã) = (a5, a1, a2, a3, a4),
    • multiply(ã, b̃) = (a1*b1, a2*b2, a3*b3, a4*b4, a5*b5),
      where each number in the result is taken modulo 1,000,000,007.
    • combine(ã, b̃) = multiply(ã, rotate(b̃)).

To encrypt a key, Claude iterates the combine operation n times as follows:
    combine(ã, n) = combine(combine(ã, n − 1), ã)
    combine(ã, 0) = ã.

Help Claude produce the encrypted key.

Constraints

• 1 ≤ n ≤ 10^18.
• 0 ≤ a1, a2, a3, a4, a5 < 1,000,000,007.

Subtasks

A1 (30 points): 1 ≤ n ≤ 100,000.
A2 (70 points): no restrictions.

How I solved this problem

Subtask A1

Simulation using tuples as a warm up problem.

Subtask A2

The problem text is confusing because combine() is overloaded (once over two tuples, once over a tuple and a count), so rename the iterating version to iter().

iter(a, n) = combine(iter(a, n-1), a)
iter(a, 0) = a

iter(a, n) = combine(iter(a, n-1), a)
= combine(combine(iter(a, n-2), a), a)
= combine(combine(combine(iter(a, n-3), a), a), a)

Expand and realise that each component is just the original value multiplied by its cyclic left neighbour n times: the first component becomes a1 * a5^n, the second a2 * a1^n, and so on.

Use fast exponentiation under modulus.

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

uint64_t pow(uint64_t x, uint64_t n, uint64_t m) {
    if (n == 0) return 1;

    uint64_t a = pow(x, n/2, m);
    a = a * a % m;
    if (n%2 == 1) a = a * x % m;
    return a;
}

using enctup = tuple<uint64_t, uint64_t, uint64_t, uint64_t, uint64_t>;
const uint64_t M = 1000000007llu;

enctup encryptfast(enctup const& tup, uint64_t n) {
    return enctup(
        (get<0>(tup) * pow(get<4>(tup), n, M)) % M,
        (get<1>(tup) * pow(get<0>(tup), n, M)) % M,
        (get<2>(tup) * pow(get<1>(tup), n, M)) % M,
        (get<3>(tup) * pow(get<2>(tup), n, M)) % M,
        (get<4>(tup) * pow(get<3>(tup), n, M)) % M
    );
}

ostream& operator<<(ostream& os, enctup const& tup) {
    return os << get<0>(tup) << ' '
    << get<1>(tup) << ' '
    << get<2>(tup) << ' '
    << get<3>(tup) << ' '
    << get<4>(tup);
}

int main(void)
{
    uint64_t n, a1, a2, a3, a4, a5;
    cin >> n >> a1 >> a2 >> a3 >> a4 >> a5;
    enctup origtup = enctup(a1, a2, a3, a4, a5);
    enctup anstup = encryptfast(origtup, n);
    cout << anstup << endl;
    return 0;
}

Problem B: Wires

Gilles is trying to wire a circuit board. There are four sides, each with n connection points labelled between 1 and 2n. Each of these labels occurs exactly twice.

For each pair of connection points with the same label, Gilles wants to place a wire to connect them. These wires may be curved, but they cannot go off the board and they cannot overlap or cross each other.

Is it possible for Gilles to place all 2n wires as desired?

Constraints

• 1 ≤ n ≤ 100,000.
• For all 1 ≤ i ≤ 4 and 1 ≤ j ≤ n, 1 ≤ ai,j ≤ 2n.
• Each value ai,j appears exactly twice.

Subtasks

B1 (30 points): 1 ≤ n ≤ 1000.
B2 (70 points): no restrictions.

How I solved this problem

At first, I started thinking with a breadth-first search approach – for every connection point, I can use breadth-first search to find the nearest connection point, with a bound that increases. For example, I would first try all connection points that can connect within 1 search, and then I would connect all those with 2 searches, and so on.

Then I realised that’s basically a greedy approach on the distance of connections and I was assuming that it would produce an optimal solution, which turned out to be right.

Now that we have an algorithm for wiring the board, all we have to figure out is whether any such wiring would be valid.

Subtask B1

Sort connection points by the distance to their counterpart. Greedy on the distance, lowest first. I can loop through all connection points in O(n) time and connect them if there are no issues.

For the decision problem of valid wiring, consider the input array of connection points arranged in clockwise order. For any connection point you consider, there is a counterpart, and between those two points, there is an “inside” and “outside” as shown on the diagram below.

. x . . x . . . x . . x connected?
1 3 4 6 3 2 2 1 5 4 6 5
    |--- inside --|
|--|               |--| outside

Loop through the “inside”. If any point in it is already connected, check whether its counterpart lies in the “outside”; if so, we cannot connect the two points under consideration (the two points that define the “inside”), as the wire would cross a prior greedy-optimal connection.

Subtask B2

The above linear sweep approach is not good enough, because our worst-case scenario is a linear sweep for every connection point, which will be 10B steps.

But we can simplify the problem. In terms of array indices, we can store the index to its counterpart if it is connected, or the index to itself otherwise.

. x . . x . . . x . . x connected?
0 1 2 3 4 5 6 7 8 9 a b array index
0 4 2 3 1 5 6 7 b 9 a 8 counterpart index
1 3 4 6 3 2 2 1 5 4 6 5
    |--- inside --|
|--|               |--| outside

Use two separate range trees to query the minimum and maximum value of the inside range. Check if either the minimum or maximum lies in the outside range. We have checked whether a conflict exists in O(log n) time.

Tricks: For some reason the range tree implementation I copied from the lecture slides overran its buffer into the one and two array, so I just increased its size massively and it passed… so yeah, I must have screwed up one of the calculations…?

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>
#include <cmath>
#include <iomanip>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

const int MAXN = 100000;

int n;
vector<int> clockwise;

// For given number as index, stores the index in the
// clockwise array, first occurring stored in one, and then two
int one[2*MAXN];
int two[2*MAXN];

// for a given number, returns the distance
// to the counterpart, accounting for counter-clockwise wraparound
int distance(int x) {
    assert(n > 0);
    int cw = two[x] - one[x];
    assert(cw > 0);
    int ccw = one[x] + ((4*n) - two[x]);
    // fmt::print("ccw={} x={} n={} one={} two={}\n", ccw, x, n, one[x], two[x]);
    assert(ccw > 0);
    return min(cw, ccw);
}

/* BEGIN RANGE TREE IMPLEMENTATIONS */

/* MAX RANGE TREE */
// the number of additional nodes created can be as high as the next power of two up from MAX_N (131,072)
int max_tree[(4*MAXN) + 524288 + 1000000];

// a is the index in the array. 0- or 1-based doesn't matter here, as long as it is nonnegative and less than MAX_N.
// v is the value the a-th element will be updated to.
// i is the index in the tree, rooted at 1 so children are 2i and 2i+1.
// instead of storing each node's range of responsibility, we calculate it on the way down.
// the root node is responsible for [0, MAX_N)
void max_update(int a, int v, int i = 1, int start = 0, int end = 4*MAXN) {
  // this node's range of responsibility is 1, so it is a leaf
  if (end - start == 1) {
    max_tree[i] = v;
    return;
  }
  // figure out which child is responsible for the index (a) being updated
  int mid = (start + end) / 2;
  if (a < mid) max_update(a, v, i * 2, start, mid);
  else max_update(a, v, i * 2 + 1, mid, end);
  // once we have updated the correct child, recalculate our stored value.
  max_tree[i] = max(max_tree[i*2], max_tree[i*2+1]);
}


/*
 * range tree query
 */

// query the sum in [a, b)
int max_query(int a, int b, int i = 1, int start = 0, int end = 4*MAXN) {
  // the query range exactly matches this node's range of responsibility
  if (start == a && end == b) return max_tree[i];
  // we might need to query one or both of the children
  int mid = (start + end) / 2;
  int av = 0;
  int bv = 0;
  // the left child can query [a, mid)
  if (a < mid) av = max_query(a, min(b, mid), i * 2, start, mid);
  // the right child can query [mid, b)
  if (b > mid) bv = max_query(max(a, mid), b, i * 2 + 1, mid, end);
  return max(av, bv);
}


/* MIN RANGE TREE */
// the number of additional nodes created can be as high as the next power of two up from MAX_N (131,072)
int min_tree[(4*MAXN) + 524288 + 1000000];

// a is the index in the array. 0- or 1-based doesn't matter here, as long as it is nonnegative and less than MAX_N.
// v is the value the a-th element will be updated to.
// i is the index in the tree, rooted at 1 so children are 2i and 2i+1.
// instead of storing each node's range of responsibility, we calculate it on the way down.
// the root node is responsible for [0, MAX_N)
void min_update(int a, int v, int i = 1, int start = 0, int end = 4*MAXN) {
  // this node's range of responsibility is 1, so it is a leaf
  if (end - start == 1) {
    min_tree[i] = v;
    return;
  }
  // figure out which child is responsible for the index (a) being updated
  int mid = (start + end) / 2;
  if (a < mid) min_update(a, v, i * 2, start, mid);
  else min_update(a, v, i * 2 + 1, mid, end);
  // once we have updated the correct child, recalculate our stored value.
  min_tree[i] = min(min_tree[i*2], min_tree[i*2+1]);
}


/*
 * range tree query
 */

// query the sum in [a, b)
int min_query(int a, int b, int i = 1, int start = 0, int end = 4*MAXN) {
  // the query range exactly matches this node's range of responsibility
  if (start == a && end == b) return min_tree[i];
  // we might need to query one or both of the children
  int mid = (start + end) / 2;
  int av = 1e9;
  int bv = 1e9;
  // the left child can query [a, mid)
  if (a < mid) av = min_query(a, min(b, mid), i * 2, start, mid);
  // the right child can query [mid, b)
  if (b > mid) bv = min_query(max(a, mid), b, i * 2 + 1, mid, end);
  return min(av, bv);
}

void test(void) {
    fmt::print("Testing...\n");
    max_update(0, 1);
    max_update(2, 10);
    assert(max_query(0, 3) == 10);
    assert(max_query(0, 5) == 10);
    assert(max_query(0, 2) == 1);
    assert(max_query(10, 50) == 0);
    fmt::print("Test success!\n");
    abort();
}

/* END RANGE TREE IMPLEMENTATIONS */

int32_t main(void)
{
    cout << fixed << setprecision(10);

    fill(one, one+(2*MAXN), -1);
    fill(two, two+(2*MAXN), -1);

    cin >> n;

    for (int i = 0; i < (4*n); ++i) {
        max_update(i, i);
        min_update(i, i);
    }

    for (int i = 0; i < (4*n); ++i) {
        int v;
        cin >> v;
        v -= 1;
        clockwise.push_back(v);
        if (one[v] == -1) {
            assert(two[v] == -1);
            one[v] = i;
        } else {
            assert(two[v] == -1);
            two[v] = i;
        }
    }

    vector<pair<int, int>> dpts; // (distance to counterpart, number)
    for (int i = 0; i < (2*n); ++i) {
        dpts.push_back(pair<int, int>(distance(i), i));
        fmt::print("distance for {}={}\n", i+1, distance(i));
    }
    sort(dpts.begin(), dpts.end());

    for (auto const& p : dpts) {
        int num = p.second;

        if (max_query(one[num]+1, two[num]) > two[num]) {
            cout << "NO" << endl;
            return 0;
        }
        if (min_query(one[num]+1, two[num]) < one[num]) {
            cout << "NO" << endl;
            return 0;
        }

        max_update(one[num], two[num]);
        min_update(one[num], two[num]);
        max_update(two[num], one[num]);
        min_update(two[num], one[num]);
    }

    cout << "YES" << endl;

    return 0;
}

Problem C: Assemble

Catherine is building a spaceship from a variety of parts. All the parts have been built in various cities, connected by a rail network. The last step is to transport all the spaceship parts to a single city for assembly.

There are n cities in the network, with m railroads each joining a pair of cities. It is possible to travel between any pair of cities by some sequence of railroads. Each railroad has an associated integer denoting its level. Buying a freight pass at level k allows the owner to send unlimited amounts of freight along any railroad of level at most k.

Freight passes are very expensive, so help Catherine find the lowest level of freight pass required to gather all spaceship parts in a single city.

Constraints

• 1 ≤ n ≤ 100,000.
• n − 1 ≤ m ≤ 100,000.
• 1 ≤ p ≤ 100,000.
• For all 1 ≤ i ≤ m, 1 ≤ ai < bi ≤ n and 1 ≤ li ≤ 100,000.
• For all 1 ≤ i ≤ p, 1 ≤ ci ≤ n.

Subtasks

C1 (50 points): 1 ≤ n ≤ 1,000, n − 1 ≤ m ≤ 1,000.
C2 (50 points): no restrictions.

How I solved this problem

Subtask C1

First, clearly a binary search and decision problem on the value of k.

Because the graph is undirected, if I can reach a city from the destination, then I can also reach the destination from that city as well.

That means if I run BFS or DFS, then I can check reachability in O(|V| + |E|) time which is about 200_000.

But the city where I must assemble things is not defined and there are up to 100_000 cities.

Binary search is log2(100_000) = around 17.

For full problem 340B steps vs for subtask 34M steps, so linear sweep of all cities will work for subtask but not for the full problem.

Subtask C2

Let’s be smarter. For each level that is undergoing binary search, I should be able to run something that is not BFS or DFS to find out what cities are reachable from a given city for all cities.

Perhaps I can use a disjoint-set union? If I loop through all the edges, and even if I scan through all cities once afterwards, that’s still only 200_000 operations. A union-find data structure with path compression and subtree balancing runs in O(log n) per operation; 17 (binary search) * 17 (union-find) * 200_000 is well within the bounds.

For each binary search decision, only consider edges that can be used with a level k pass.

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

class union_find
{
public:
    union_find(size_t size) : parent_(size), subtree_size_(size) {
        for (size_t i = 0; i < size; ++i) {
            parent_[i] = i; subtree_size_[i] = 1;
        }
    }
    auto root(int x) -> int {
        // only roots are their own parents
        // otherwise apply path compression
        return parent_[x] == x ? x : parent_[x] = root(parent_[x]);
    }
    auto join(int x, int y) -> void {
        // size heuristic
        // hang smaller subtree under root of larger subtree
        x = root(x); y = root(y);
        if (x == y) return;
        if (subtree_size_[x] < subtree_size_[y]) {
            parent_[x] = y;
            subtree_size_[y] += subtree_size_[x];
        } else {
            parent_[y] = x;
            subtree_size_[x] += subtree_size_[y];
        }
    }
private:
    vector<int> parent_;
    vector<int> subtree_size_;
};

int n, m, p;
using edge = tuple<int, int, int>;
vector<edge> edges; // city1, city2, level
vector<int> parts; // refers to city

bool decision(int k) {
    auto uf = union_find(n+1);

    for (auto const& e : edges) {
        auto c1 = get<0>(e);
        auto c2 = get<1>(e);
        auto l = get<2>(e);

        if (l > k) continue;

        uf.join(c1, c2);
    }

    int mustbe = -1;
    for (auto const& c : parts) {
        if (mustbe == -1) {
            mustbe = uf.root(c);
        } else if (uf.root(c) != mustbe) {
            return false;
        } else {
            continue;
        }
    }
    
    return true;
}

int binarysearch(void) {
    int l = -1; // invariant: f(l) = false
    int r = 100001; // invariant: f(r) = true
    while (l < r-1) {
        int m = (l+r)/2;
        if (decision(m)) r = m;
        else l = m;
    }
    return r;
}

int main(void)
{
    cin >> n >> m >> p;
    for (int i = 0; i < m; ++i) {
        int a, b, l;
        cin >> a >> b >> l;
        edges.push_back(edge(a, b, l));
    }
    for (int i = 0; i < p; ++i) {
        int c;
        cin >> c;
        parts.push_back(c);
    }
    cout << binarysearch() << endl;
    return 0;
}

Problem D: Interception

Yasmin is the administrator for a network of n computers. There are m connections in the network, each joining a different pair of distinct computers ai and bi and allowing messages to be sent in both directions. However, a message may be intercepted by nefarious hackers, with probability pi.

Yasmin knows that if not for these hackers, a message could be sent from any computer to any other, at least indirectly. Help her figure out the probability that such a message is not intercepted, assuming the optimal route is taken.

Constraints

• 1 ≤ n ≤ 200.
• n − 1 ≤ m ≤ n(n − 1)/2.
• For all 1 ≤ i ≤ m, 1 ≤ ai , bi ≤ n, ai != bi and 0 < pi < 1.

Subtasks

D1 (30 points): for all 1 ≤ i ≤ m, pi = 0.1.
D2 (70 points): no restrictions.

How I solved this problem

The first thing to note, if like me you didn’t read the output section carefully, is that there is no starting node. We have to find the probability of no interception across all pairs of nodes.

Subtask D1

The probability of interception is a simple function of the shortest path. We can find the all-pairs shortest paths using Floyd-Warshall in O(V^3), which will run in time. Precompute a table of probabilities with respect to a given distance.

Subtask D2

Note that Floyd-Warshall is a relaxation algorithm, which means we can pretty easily modify how it builds up the all-pairs shortest paths.

Instead of adding weights at the point of relaxation, multiply the weights instead.

Calculate the graph with complement probabilities to minimise floating-point error (i.e. instead of 0.01 * 0.01, do 0.99 * 0.99 and find the max path).

Tricks: __float128, cout << fixed << setprecision(10).

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>
#include <cmath>
#include <iomanip>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

const int MAXN = 200;

__float128 dist[MAXN][MAXN]; // probability of NO INTERCEPTION
using edge = tuple<int, int, __float128>; // undirected edge, probability of NO INTERCEPTION
vector<edge> edges;

int n, m;

void floydwarshall(void) {
    // initially, the probability of no interception is 0 (no route)
    for (int u = 0; u < n; ++u) for (int v = 0; v < n; ++v) {
        dist[u][v] = 0.0;
    }

    // update the probabilities for every directed edge
    for (auto const& e : edges) {
        int u = get<0>(e), v = get<1>(e);
        __float128 p = get<2>(e);
        dist[u][v] = p;
        dist[v][u] = p;
    }

    // every vertex can reach itself
    for (int u = 0; u < n; ++u) dist[u][u] = 1.0;

    for (int i = 0; i < n; i++) {
        for (int u = 0; u < n; u++) {
            for (int v = 0; v < n; v++) {
                dist[u][v] = max(dist[u][v], dist[u][i] * dist[i][v]);
            }
        }
    }
}

int main(void)
{
    cout << fixed << setprecision(10);
    cin >> n >> m;
    for (int i = 0; i < m; ++i) {
        int a, b;
        long double p;
        cin >> a >> b >> p;
        a -= 1; b -= 1;
        edges.push_back(edge(a, b, (__float128) 1.0 - p));
    }
    floydwarshall();
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            cout << (long double) dist[i][j] << ' ';
        }
        cout << endl;
    }
    return 0;
}

Quick Notes on Consistent Hashing

What happens to hash functions when the range changes? Can we avoid chaos?

The paper shows the existence of such a “consistent hash function family” with the below properties.

Consistency Properties

Here, the buckets are the total range of the hash function (i.e. the modulus); this can increase to accommodate load.
Here, a view is a subset of the buckets. As the number of buckets increases, pre-existing knowledge about the buckets naturally becomes a subset of the whole system.

Balance
– The hash function hashes items within a view with approximately uniform probability.

Monotonicity
– If a view expands and an item hashes to a bucket contained in the previous view, they are hashed to the same bucket.
– (Items are either hashed to a new bucket or stay in the old one.)

Spread
– Across substantial views, an item hashes to only a small number of distinct buckets (i.e. minimally distinct).

Load
– Across substantial views, there’s a reasonable limit to the number of items hashed to each bucket.
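
As a concrete illustration of how these properties are usually realised, here is a minimal hash-ring sketch (my own toy C++, not from the paper; the replica count and the use of std::hash are arbitrary choices). Each bucket is hashed to many points on a ring, and an item belongs to the first bucket at or after its own hash point; adding a bucket only claims the items that land on its new points, which is monotonicity in action.

#include <cstddef>
#include <functional>
#include <map>
#include <string>

class HashRing {
    std::map<std::size_t, std::string> ring_; // ring position -> bucket name

    static std::size_t point(const std::string& bucket, int replica) {
        return std::hash<std::string>{}(bucket + "#" + std::to_string(replica));
    }

public:
    // each bucket occupies many points, so Balance approximately holds
    void add_bucket(const std::string& name, int replicas = 100) {
        for (int r = 0; r < replicas; ++r) ring_[point(name, r)] = name;
    }

    // an item belongs to the first bucket at or after its hash point
    const std::string& bucket_for(const std::string& item) const {
        auto it = ring_.lower_bound(std::hash<std::string>{}(item));
        if (it == ring_.end()) it = ring_.begin(); // wrap around (ring must be non-empty)
        return it->second;
    }
};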

Reading – Device Tree

What is Device Tree and how are they used?

Device Tree is a specification used to perform boot-time configuration of a kernel, whereby information about the running system is provided by the firmware to the kernel instead of being hard-coded or handled through the kernel module loading/blacklisting process (i.e. “load this kernel module for this device”).

This Raspberry Pi document describes the format in introductory form. This blog post describes how it gets used.

Before Device Tree, there existed a format called ATAG (tagged list). It is essentially a variable-length array of variable-length structs with identifying tag types. It is terminated by an ATAG_NONE tag. This document describes this format in detail.

This Linux ARM booting document describes both methods, as well as the overall booting process as expected by the Linux kernel.

Device Tree is passed to the kernel upon boot as an in-memory structure. With Linux booting on ARM, the r2 register is used for Device Tree with a magic value stored at that address to indicate its usage.

Device Tree overlays exist for further configuration of device tree files. This document describes the methods on Raspberry Pi.

Device Tree is mainly used on systems where device enumeration isn’t possible, so this information must be provided by other means.

This document lays out how to write a Device Tree source file.

What is an example of information provided by Device Tree?

This is the Flattened Device Tree (DTB, FDT) file for Raspberry Pi 4 B. This is the Device Tree source (DTS) provided by Linux. I think the idea is that the kernel provides the source for the boards it supports (i.e. ensuring it contains the information the kernel expects, like register locations), which is then compiled to a blob as part of the kernel build process; a version of that is then distributed by the Raspberry Pi firmware repository. This has some references to it.

This is a template inclusion file (DTSI). In particular, it contains the ARM memory-mapped register address for interfacing with the system timer.

system_timer: timer@7e003000 {
    compatible = "brcm,bcm2835-system-timer";
    reg = <0x7e003000 0x1000>;
    interrupts = <1 0>, <1 1>, <1 2>, <1 3>;
    /* This could be a reference to BCM2835_CLOCK_TIMER,
     * but we don't have the driver using the common clock
     * support yet.
     */
    clock-frequency = <1000000>;
};

The kernel can use this register address and interrupt configuration instead of a hard-coded address. For Linux, it knows to load the timer driver for the bcm2835 platform because of the compatible property, which will then make use of the rest of the information.

Commands used to interact with Device Tree files

Most of the interaction can be done with the dtc command line tool.
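
For example, round-tripping a blob looks like this (-I and -O select the input and output formats; the file names are just the Raspberry Pi 4 ones):

dtc -I dtb -O dts bcm2711-rpi-4-b.dtb -o decompiled.dts
dtc -I dts -O dtb decompiled.dts -o recompiled.dtb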

Adding columns to an existing JsonLines-backed Athena schema

If you use AWS Athena with JsonLines-backed files whose schema changes occasionally, you may think that the only way to change the corresponding Presto schema is to drop the table and create it again. Normally this is not a problem, except you have to spend hours reloading all the partitions…

Also, ALTER TABLE doesn’t work with Athena because it fails with “no viable alternative at input”. (sigh) [Update Early 2020: Now I know this has to do with Athena’s SQL parser! It just means that such a command simply does not exist.]

Not to worry, there is another way! If your Athena instance is upgraded to be backed by AWS Glue Catalog, then simply changing the schema there automatically reflects it on your AWS Athena instance.

(Screenshot: the button you need to press)

This is no different from having an AWS Glue Crawler periodically detect the schema of your data store.

My First Python3 Bug – Breaking Asynchronous Generators

Recently I filed my first bug report for CPython, which turned out to be a release blocker! Here is the code snippet to produce this issue for <=3.7.4.

import asyncio

class TestAsyncGenerator:
    def __init__(self):
        pass

    async def aiter(self):
        yield 1
        yield 2

async def main():
    gen = TestAsyncGenerator()
    async for number in gen.aiter():
        break

asyncio.run(main())

# unhandled exception during asyncio.run() shutdown
# task: <Task finished coro=<()> exception=RuntimeError("can't send non-None value to a just-started coroutine")>
# RuntimeError: can't send non-None value to a just-started coroutine

I’m not too familiar with CPython internals, but looking at the patch it seems that immediately exiting the async for loop, without letting the asynchronous iterator “checkpoint”, trips a check that mistakes the just-started coroutine for an uninitialised one.

This bug fix is coming soon to a CPython distribution near you, along with some semblance of my test code 🙂

Thoughts on the Advanced Operating Systems course

So I’ve reached a point in time where it appears I’ve completed UNSW’s Advanced Operating Systems (COMP9242 2019/T2) course! After many nights of hopelessness towards the end of the course, here are my thoughts.

Don’t do continuations in C… seriously

In the first few weeks of the course, we had to make a decision on the execution model of our operating system. Because we were building our OS on top of a microkernel (The seL4 Microkernel) we were free to choose any model of execution we wanted. In other words, we weren’t limited to the “kernel-on-top-of-userspace” that you encounter in normal OS courses.

Since the initially provided code came with a while loop, we thought the fastest way to get started was an event loop! Remembering that all system calls are IPC messages on a microkernel, if we had a while loop processing these messages already, then an event loop based server is just a simple switch-statement on top! Surely!

My AOS partner initially suggested that we try a multi-threaded model, saying it would be easier to run long-running operations with threads. But I convinced him that our velocity would be higher with the simple event loop model. Oh, how wrong I was…

Turns out, there is so much opportunity for an operating system to be blocked and require continuation at a later time. I thought it would just be limited to I/O syscalls but later I realised that because of demand paging, pretty much any kernel operation required continuations.

Later in the course, we had spent so much time converting our existing code-base to use continuations everywhere (i.e. lots of mallocs, manually defined structs, loops turned into recursion and 2-state machines…) that we were burnt out from ever wanting to look at asynchronous C again…

Also, I realised too late that it’s not fun to write an operating system in this way. Having a single OS thread pretty much restricts you to a monolithic kernel, and we missed opportunities to experiment with a more creative design. (Admittedly, it was better to be kinda on-track than be creative, though.)

What I liked about the course

By far the best part about the course is the power you feel as you progress through it. If you have been puzzled by how systems software is written, and just what the hell those deep-down aspects of Linux and SRE war stories mean and how they actually work and how the hell people come to know about them (“are they just too smart?”), this course solves all your thirst issues!

By solving practical OS design issues where there is no-one to back you up (everyone’s project has different design choices – from execution models, to data structures, and project layout; the tutors can’t help you beyond basic guidelines), you have genuine exploratory discussions with your project partner, and you learn how to navigate trade-offs between simplicity of implementation, performance, and your own sanity.

The lectures aren’t too relevant for your project except for the first few weeks, but their contents are topics you wouldn’t really learn from “web learning” a.k.a. blog posts. The breadth of papers mentioned in the course and the final exam format (critique a published paper) makes it clear to you where you can now start to learn more.

Overall, I’m glad the course is over. “I can have my weekends back”, in the words of my project partner. I also won’t miss debugging where the instruction pointer has a bunch of f’s, when your muslibc mysteriously calls munmap, and looking through the 2000 page ARM manual to find the correct Fault Status Register documentation between all the EL levels… On the other hand, it’s been a course that has finally satiated my curiosity on what “OS-level programming” really means. I can’t wait for the distributed systems course now!

Bonus: Here’s what we had to document for our final submission.

Design and Implementation of Simple Operating System

Advanced Operating Systems Notes

These are some notes I took while taking the Advanced Operating Systems course (COMP9242 19T2) at UNSW. Since the lectures cover a variety of somewhat obscure topics, I’m posting them here so I don’t forget about them later on in my career!

Disclaimer: These notes are my own, and they may not fully appreciate the intention of the lecturers or the contents of the course.

The lecture slides are Creative Commons, and they are available here: https://www.cse.unsw.edu.au/~cs9242/19/lectures.shtml


Lecture Week 1 Monday

“Introduction”

Capabilities

  • Invoke a method on a capability (like an object)

CSpace and CNodes

  • A CSpace represents Mapping[OID, AccessRight]
  • Each CNode is an array of capability “slots”
  • Capabilities: Read, Write, Execute, Grant
  • Operations on Capabilities: Invoke, Copy/Mint/Grant (Duplicate super/lesser), Move/Mutate (Transfer), Delete, Revoke

IPC (Endpoints)

  • More like “Cross-Address-Space Invocation”
  • Endpoints: Strictly synchronous
    • Receiver must be ready
    • This means data is only copied once
    • There is no buffering
    • No asynchronous kernel buffering
  • Do not use IPC for shipping data (don’t use for copy, obviously, copy-cost exists)
  • It is:
    • Protected procedure call
    • User-initiated context switch (because it’s synchronous in lock-step)
  • Endpoint, if the recipient is not highest priority and free, will queue FIFO
  • Think of Endpoints as a Producer-Consumer Pipe, One-Way
  • Need two endpoints for a sane two-way communication
  • How to reply:
    • Call automatically generates a Reply capability for the Server
    • Once-use
    • Reply to a separate Endpoint specially created for this Call
    • Hence, two Endpoints for one Call
  • Call is atomic and cannot be reproduced by Send/Receive
    • No automatic creation of Reply Endpoint
    • Send / Server Preemption (Higher Priority) / Receive was Preempted?!, etc.
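
To make the Call/Reply mechanics concrete, here is a rough user-space sketch, going off my memory of the (pre-MCS) seL4 API; assume ep is an endpoint capability that both sides already hold:

/* client: Call sends, then blocks on the auto-generated reply capability */
seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 1); /* label 0, 1 message word */
seL4_SetMR(0, 42);                           /* arguments travel in virtual registers */
seL4_MessageInfo_t reply = seL4_Call(ep, info);

/* server: Recv blocks until a sender arrives (strictly synchronous, no buffering) */
seL4_Word badge;
seL4_MessageInfo_t req = seL4_Recv(ep, &badge); /* the badge identifies the client */
seL4_SetMR(0, seL4_GetMR(0) + 1);
seL4_Reply(seL4_MessageInfo_new(0, 0, 0, 1));   /* consumes the once-use reply cap */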

Badges

  • Maintaining session state for Endpoints without having every Server implement their own session mechanism
  • A Badged Capability is Minted from the original Capability, and then gives it to a Client
  • This Badged Capability identifies a consumer of the Server

Virtual Registers

  • “Argument” storage for Endpoints
  • Included in thread state

Notifications

  • Equivalent to Binary Semaphores made using bit fields
  • Can be polled (multiple, like select)
  • Maps nicely to interrupts
  • To allow waiting on an Endpoint and a Notification simultaneously, a Notification can be bound to a thread through TCB_BindNotification()

Lecture Week 1 Tuesday

  • Recap of similar lecture content from yesterday

Lecture Week 2 Tuesday

“Operating System Execution Models”

Note: Week 2 Monday was a public holiday

Interrupt Handling

  • IRQHandler and IRQControl
  • IRQControl can create IRQHandler
  • Application given IRQHandler to make them device drivers

Execution Models, Threads, Coroutines

  • Events
    • Event loops and event handler
    • Waits for events
    • Only requires a single stack (no yields)
    • How to handle user level code?
      • A kernel timer can be one of the events (this is the scheduler)
    • No blocking in kernel code
    • Only works on a uniprocessor
  • Coroutines
    • A stack per routine
    • Requires explicit yields
  • Threads
    • Preemptive multithreading
  • Continuations
    • Used in functional languages
    • Much more high level than the above
    • Basically an explicit context switch, but in functional languages
    • Done by getting given a “switch context back to caller” function, and invoking it
    • Basically just multiple return in functional languages
    • Basically just Python generators
    • Can be used to build (yield) where it doesn’t exist

Per-Thread Kernel Stack

  • The UNIX model
  • The state (e.g. per syscall) is explicitly stored by the stack
  • Can postpone execution by yielding (regardless of execution model)
  • Preemption is easy – just nest on the stack

Single Kernel Stack

  • How is blocking handled?
    • Either continuations
    • Or stateless kernels
      • “All it does is it makes it hard to write code” – Kevin

Lecture Week 3 Monday

“Events and Threads”

Two competing papers to guide us to choose which model to use for our own OS project.

Why Threads Are A Bad Idea

(for most purposes)

  • 1995
  • Threads hard to program in
  • For most purposes events are better
  • Only use threads when true CPU concurrency needed
  • Requires
    • Synchronisation
    • Deadlock
  • Hard to debug (data dependencies, timing dependencies)

Why Events Are A Bad Idea

(for high-concurrency servers)

  • 2003
  • Proposes Thread-Event duality
  • No inherent problem with threads
  • Benchmark user-level threads package; still fast

Lecture Week 3 Tuesday

“Caches”

Cache Indexing

  • Cache lines
  • Set associative lookup
    • Tag | Set | Byte
  • 2-way associative
  • Fully associative
    • No indexing, just compare tag
    • Obviously very expensive to compare in parallel
  • Cache associativity vs Paging
    • Traditionally, the cache index bits overlap both the page-number bits and the page-offset bits (e.g. the top 2 index bits come from the page number), so the index comes from both
    • Implications:
      • Odd/Even of page number determines top / lower half of 2-way associative cache
      • Pages fall into equivalence classes as caching is concerned
      • Gives rise to an implicit notion of page colouring – any pages that have different colours cannot compete for cache
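
To make the Tag | Set | Byte split concrete, here is a toy decomposition; the parameters (32 KiB, 2-way, 64-byte lines, 4 KiB pages) are my own example, not from the lecture:

#include <cstdint>
#include <iostream>

int main() {
    const uint64_t line = 64;                     // 64-byte lines: 6 offset bits
    const uint64_t sets = 32 * 1024 / (line * 2); // 32 KiB, 2-way: 256 sets, 8 index bits
    uint64_t addr = 0x12345678;
    uint64_t byte = addr % line;                  // bits 0-5
    uint64_t set  = (addr / line) % sets;         // bits 6-13
    uint64_t tag  = addr / (line * sets);         // bits 14 and up
    std::cout << "tag=" << tag << " set=" << set << " byte=" << byte << "\n";
    // With 4 KiB pages the page offset is bits 0-11, so index bits 12-13 come
    // from the page number: that 2-bit overlap yields four page "colours", and
    // pages of different colours can never compete for the same cache sets.
    return 0;
}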

Cache Misses

  • “The Four Cs”
    • Compulsory miss
      • First access
    • Capacity miss
      • Would not miss on infinite-size cache
    • Conflict miss
      • Would not miss on fully-associative cache
    • Coherence miss
      • Hardware coherence protocol
  • Cache replacement policy
    • Upon replacement, dirty cache lines must be flushed
    • Algorithm must be simple for speed
  • Cache write policy
    • Write-back / write-through
    • A store that misses cache
      • Write-allocate / no allocate
      • (Bypass cache on write cache-miss?)

Cache Addressing Schemes

  • Four possible schemes:
    • Virtually-indexed, virtually-tagged (VV)
    • Virtually-indexed, physically-tagged (VP)
    • Physically-indexed, virtually-tagged (PV) <= Usually nonsensical
    • Physically-indexed, physically-tagged (PP)
  • Virtually-indexed, virtually-tagged
    • Virtually-addressed cache
    • Various incorrect names
      • Virtual cache
      • Virtual address cache
    • Can operate concurrently with MMU
    • Pitfalls
      • Must be flushed on context switch because virtual address
      • Permissions must be processed via the MMU anyway, fast translation but still need to wait for MMU
      • Rarely used these days
  • Virtually-indexed, physically-tagged
    • Indexing concurrent with MMU
    • Used for on-core L1
  • Physically-indexed, physically-tagged
    • Intel claims physically indexed L1 cache
    • Cache size is picked so that it behaves like one
    • If index bits are all within offset bits, they are invariant under translation

Cache Issues

  • Structure software for cache performance
  • If virtual addresses are used for indexing, then these:
  • Homonyms
    • Same address, different data
    • May access incorrect data!
    • Prevention: tag VA with ASID
  • Synonyms (aliases)
    • Different address, same data
    • May access stale data!
    • Not a problem for R/O data or I-caches
    • Late write-back “Cache bomb”
    • Also DMA inconsistency issues kind of like synonym issues
    • Fixes
      • Flush cache – doesn’t fix aliasing within address space
      • Detect synonyms in either hardware or software
      • In other words, the hardware restricts VM mappings so synonyms map to the same cache set
      • Rely on hardware 😛

Write Buffer

  • Stores take time to complete
  • Avoid stalling the CPU by buffering writes FIFO before it hits the cache
  • Gives rise to weak memory ordering
    • Local CPU can grab things out of local write-buffer, remote CPUs may not see this write-buffer
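
A sketch of the classic store-buffer litmus test that demonstrates this (my own example; whether the weak outcome is actually observable depends on the hardware memory model):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void cpu0() {
    x.store(1, std::memory_order_relaxed); // may sit in CPU0's write buffer...
    r1 = y.load(std::memory_order_relaxed);
}

void cpu1() {
    y.store(1, std::memory_order_relaxed); // ...while CPU1's store sits in its own
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t0(cpu0), t1(cpu1);
    t0.join(); t1.join();
    // Each load can complete before the other CPU's buffered store becomes
    // visible, so r1 == 0 && r2 == 0 is a permitted outcome under weak ordering.
    return 0;
}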

Translation Lookaside Buffer

  • TLB sizes have remained relatively similar historically

Lecture Week 4 Monday

“Virtualisation”

Virtual Machine

  • “Efficient, isolated duplicate of a real machine”
  • Duplicate
    • Should behave almost identically, except for resources or timing differences
  • Isolated
    • Multi homed without interference
  • Efficient
    • Run on bare metal
  • Hypervisor / Virtual Machine Monitor
    • Interchangeable terms
    • Software layer implementing the virtual machine

Types of Virtualisation

  • Platform (Type 1) VM
  • System (Type 2) VM
  • OS-level VM
  • Programming Language VM
  • We’ll mostly look at Type 1, and some of Type 2

Why VMs?

  • First designed to multiplex the hardware before true time-sharing OS
  • Then reborn to solve the same problem, poor resource isolation by mainstream OS
  • Use case: Cloud computing
    • Many benefits
  • Type 2 has very high exception handling cost (e.g. signal handling)

Virtualisation Mechanics

  • Instruction Emulation
    • Trap-and-Emulate approach
      • Expensive (i.e. trap frame)
      • But few instructions trap, so cost is amortised
    • Hypercalls
      • Hypervisor-specific ISA?
    • Trap-and-Emulate requires specific host ISA support
      • Privileged and Sensitive Instructions
      • T&E virtualisable = All sensitive instructions are privileged
    • “Impure” virtualisation if T&E not supported
      • Binary translation of privileged instructions
        • Replace with trap “hypercall”
        • Insert in-line emulation for some others

Virtualisation Address Translation

  • Two levels of address translation
  • Introduce a notion of “Guest Physical Memory”
  • But, must be implemented with single MMU translation
  • 1. Shadow Page Table
    • Hypervisor shadows (keeps a copy of) real guest page table
    • VMM resolves two page table layers at once (see the sketch after this list)
    • Only works on hardware-loaded page tables so the VMM can find out where the guest’s page table base pointer is
    • (Of course, this instruction is privileged so it’s trapped)
    • We write-protect the guest page table, so we trap every write to it
    • Shadow PT has TLB semantics (weak consistency)
      • Only gets synced at synchronisation points like a TLB
        • Page fault
        • TLB flush
  • 2. Real Page Table
    • Directly translate on every read/write to Guest PT so that it contains the actual physical mapping, but abstract so that it sees only guest-level mappings
    • Instead of trapping on every read/write, do inline binary translation so we don’t trap
    • Or para-virtualise, use a hypercall
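
To spell out what resolving two layers at once means, here is a toy composition of the two mappings (page tables reduced to plain maps; all names and numbers are mine):

#include <cstdint>
#include <map>

using PageTable = std::map<uint64_t, uint64_t>; // page number -> page number

uint64_t translate(const PageTable& pt, uint64_t addr) {
    const uint64_t page = addr >> 12, offset = addr & 0xfff; // 4 KiB pages
    return (pt.at(page) << 12) | offset;
}

int main() {
    PageTable guest_pt{{0x10, 0x20}}; // guest VA page 0x10 -> guest PA page 0x20
    PageTable vmm_pt{{0x20, 0x30}};   // guest PA page 0x20 -> host PA page 0x30
    // The shadow page table must encode the composition of both layers, so the
    // single hardware MMU translation goes guest VA -> host PA in one step:
    uint64_t host_pa = translate(vmm_pt, translate(guest_pt, 0x10234));
    return host_pa == 0x30234 ? 0 : 1;
}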

Guest Self-Virtualisation

  • Minimise traps by holding some virtual state inside guest
  • For example, interrupt handling virtual interrupt flag
  • Only if guest and VMM agree on using virtual interrupt flag
  • No need to trap interrupt flag edits

Device Models

  • Emulated, Split (Para-Virtualised), Passthrough
  • Emulated
    • Emulate device register accesses
    • Very slow, because requires trapping on every device access
    • If too slow, real device has time constraints, breaks abstraction
    • For example, Linux does boot time speed checks
  • Split Driver
    • Must port drivers to hypervisor
    • Low overhead
    • Basically a driver RPC
    • Example: Xen Dom0
      • Trusted Computing Base runs “Driver OS”
  • Pass Through
    • Can have hardware support for I/O MMU for hardware guest separation
    • Can also share hardware itself if hardware itself supports it

ARM Virtualisation Extensions

  • Hyp Mode
  • Configurable Traps
    • Can configure the hardware to trap directly into the guest OS, instead of the hypervisor having to forward the trap to the guest OS
    • Similar for x86
  • Emulation support
    • Because of separation of I-Cache and D-Cache, we get a compulsory cache miss upon any instruction trap
    • HW decodes instruction so we don’t have to access cache
  • 2-stage Translation
    • Hardware walks both guest and VMM page tables
    • Eliminates “Virtual TLB”
    • A bit slower, even though VMM PT is shallow
    • Worst case is massively slower, i.e. page fault on every single level
    • But generally a good idea
  • Virtual Interrupts
  • System MMU (I/O MMU)
    • Like an MMU for DMA

World Switch

  • Intel
    • VM state is <=4KiB
    • VMentry / VMexit
    • Hardware saves and restores
  • ARM
    • VM state is 488B
    • Save and restore done by hypervisor

KVM

  • KVM is not Type-2!
  • A “Hybrid Hypervisor”
  • Fundamentally a Type-1
  • Works by using the fact that virtualisation privilege is orthogonal to normal ring privilege
  • “Really bloody HUGE” Trusted Computing Base
  • Reuse Linux drivers

Fun Things with VMs

  • See Slide

Lecture Week 4 Tuesday

“Measuring and Analysing Performance”

Benchmarking in R&D

  • Conservative: No regression
  • Progressive: Actual improvement in important cases
  • Must analyse and explain results
    • Discuss model
    • Hypothesis for behaviour
    • Test and confirm hypothesis
  • Microbenchmarks (component) and Macrobenchmarks (full system)
  • Obtaining an overall score
    • Absolute score rarely matters
    • Compare systems relative to each other
    • BUT: Arithmetic mean is meaningless for relative measures
    • Use: Geometric mean – invariant under normalisation
  • Benchmarking crime: Throughput degradation = overhead
    • Use proper unit of figure: processing cost per unit data

Profiling

  • Tool: gprof, but can’t be used inside OS
  • Performance Monitoring Unit (PMU)
    • Hardware-based profiling unit
    • Collects certain events
    • A counter that counts events
    • Linux: oprofile
    • Mainly CPU based events

Performance Analysis

Understanding Performance

  • Some benchmarking tables and how to interpret them

Lecture Week 5 Monday

“Real-Time Systems”

Real-Time Tasks

  • Event, Jitter, Release, Processing Time, Completion, Deadline
  • Deadlines are frequently implicit – process things without missing an event
  • Safety-critical: Failure; Death, serious injury
  • Mission-critical: Failure; Massive financial damage

Weakly-Hard Real-Time

  • Where occasional deadline misses can be occasionally compensated
  • A cost of a miss doesn’t go to infinity (i.e. recoverable or probabilistically ok)
  • But; What does it mean to have a probability? Is it safe to base it on observed times?
  • To Gernot, probabilistic real time is bogus because you usually can’t assign real probabilities

Firm Real-Time Systems

  • Typically represented by a gain function (lower latency is advantageous), not a cost function (upon failure)
  • For example, trading systems

Soft Real-Time

  • Tail latencies
  • For example, cloud providers, web services, video streaming
  • Characterised by Quality of Service
  • “Bounded tardiness” – just some term

Best-Effort

  • In practice, duration is rarely totally irrelevant

What does it mean for OS?

  • Designed to support real-time
    • Fast and predictable interrupt handling
  • Main duty is scheduling tasks to meet their deadline
  • RT vs OS terminology
    • Task vs Thread
    • Job vs Event
  • Problem is harder than bin-packing; because time is not fungible
  • Feasible – there is a known algorithm that will schedule them
  • Optimal – can schedule any task that is feasible

On-Line RT Scheduling

  • Execution order is not predetermined
  • Fixed priority scheduling – L4
    • Scheduler always runs highest-priority thread
    • Will preempt to run a higher-priority thread
  • Rate Monotonic Priority Assignment (RMPA)
    • Period, Rate, Priority, Utilisation
    • Schedulability test guarantee
      • For n → ∞, schedulable if utilisation is <= 69.3% (see the check after this list)
      • Sufficient but not necessary (can get “lucky”)
    • RMPA is optimal for FPS
  • Dynamic priorities: Earliest Deadline First
    • Also has a schedulability test guarantee: U <= 100%
    • Assumption: All the tasks are completely independent of each other
    • Not recommended by Gernot
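
The 69.3% figure is the limit of the Liu & Layland utilisation bound U <= n(2^(1/n) - 1) as n grows; a quick numerical check (my own, not from the lecture):

#include <cmath>
#include <iostream>

int main() {
    // Liu & Layland schedulability bound for RMPA with n tasks
    for (int n : {1, 2, 3, 10, 100}) {
        double bound = n * (std::pow(2.0, 1.0 / n) - 1.0);
        std::cout << "n=" << n << "  U <= " << bound << "\n";
    }
    // n=1 gives 1.0, n=2 gives ~0.828, and the bound falls towards
    // ln 2 ~ 0.693 as n grows: the 69.3% figure above.
    return 0;
}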

Resource Sharing

  • Locking vs Delegation
    • RT Terminology: Resource Server
  • How to implement delegation for seL4
    • seL4 IPC for single-core (Hoare-style monitor), notifications for multi-core (semaphores)
  • Problem: Priority Inversion
    • May have multiple layers of dependencies, if high-priority thread depends on low-priority
    • How do you minimise it?
  • Solution 1: Priority inheritance (“Helping”)
    • Do what’s most beneficial for highest-priority task
    • Bump priority of subtasks to the priority of the waiting thread
    • Transitive inheritance => Possibility for long blocking chains
    • Potential deadlocks
    • Messy, easy to get wrong for safety-critical systems
  • Solution 2: (Immediate) Priority Ceiling Protocol (IPCP)
    • Predefined priority associated with a shared resource (lock)
    • Make thread exit critical section as soon as possible (use “ceiling” priority)
    • Each task must declare resources at admission time
      • In a real time system, this is okay to require
      • Easy to enforce with capabilities
    • FPS: bump priority, EDF: floor of deadlines

Scheduling Overloaded RT Systems

  • Cannot trust everything to obey declared worst-case execution time
  • Which job will miss the deadline?
  • Fixed Priority Scheduling: Loser is well-defined
  • Priority and Criticality, in FPS the two are the same
  • In EDF, priority is implicit so there’s no notion of criticality, so the loser is not well-defined

Mixed-Criticality Systems

  • High priority but low criticality tasks (device drivers)
  • Low priority but high criticality tasks (control loops)

Lecture Week 5 Tuesday

“Microkernel Design & Implementation”

“If you find yourself hand waving a lot this is a sure sign you’re missing a diagram” – Gernot

“Security is no excuse for bad performance” – Gernot

Traditional L4 and seL4 Improvements

  • Recursive address spaces
    • If you have a mapping, you can just give it to any other process through IPC
    • Some issues exist with this model apparently
  • Thread IDs as IPC addressing
    • seL4: Mailbox model with badges; communication partners are anonymous
    • Also more useful to have badges, which has more flexible semantics
    • Traditional: Forces you to have a load balancer thread, which complicates programming
    • seL4: Transfer endpoint capability upfront for load balancing, get them back later, like a kernel-managed queue for workers
    • Traditional: No interposing; lest violate truth of Thread ID authentication
  • Long IPC
    • Kernel-managed zero-copy IPC
    • Page faults? Which takes ages for a microkernel…
    • Very complicated
    • Don’t need a kernel to have a trusted third party with access to both processes’ mappings
  • Timeouts
    • IPC calls with timeouts
    • How does a user determine the timeout values? Only really reasonable in networking
    • seL4: New semantics for call/reply reduces this problem for server/client cases
  • IPC Fastpath
    • 1. Prologue
    • 2. Identify destination
      • Check caps
    • 3. Get receiver TCB
      • Check priority
    • 4. Mark sender blocked and enqueue
      • Create reply cap
    • 5. Switch to receiver
      • Leave Message Registers untouched
    • 6. Epilogue
    • Notice:
      • No scheduler invocation, just direct lookup
      • No time-slice donation
  • Kernel runs with interrupts disabled
    • No concurrency control means simpler kernel
      • Easier reasoning about correctness
      • Better average case
    • Long-running system calls? Bad.
    • Interrupt latency? Bad.
    • Most protected-mode RTOS are fully preemptible
      • Just save registers (trapframe) and re-enable interrupts
      • According to Gernot, preemption during IPC makes the code very hard to guarantee correctness, not a good idea for high assurance environments
      • It’s also very slow
    • How to ensure liveness if we don’t enable interrupts in the kernel?
    • Incremental Consistency Paradigm
      • To ensure liveness, break operations down into a bunch of O(1) operations, ensure kernel consistency after each, and poll for interrupts in between
      • Consistency, restartable progress
    • Worst-Case Execution Time Analysis
      • 1. Program Binary
      • 2. Control-flow graph, Micro-architecture model, Loop bounds
      • 3. Analysis tool
      • 4. Integer linear equations, Infeasible path (mutually exclusive paths) info
      • 5. ILP solver
      • 6. WCET
    • ARM stopped providing WCET information…
    • RISC-V implementations may supply them!
  • Scheduler optimisation – Lazy scheduling
    • Frequent blocking and unblocking in microkernel model
    • Don’t want to modify scheduler queue at every IPC
    • Leave blocked threads in ready queue, leave for scheduler to clean up
    • Worst case unbounded scheduler operation! O(n) for threads in system
    • Not the way to do real time
    • Now uses Benno Scheduling
  • Scheduler optimisation – Direct Process Switch
    • Idea: Don’t invoke the scheduler if you know who’ll be chosen, i.e. if the receiver has higher priority
    • Implication: Because scheduler doesn’t run, implies time slice donation – receiver runs on sender’s time slice
    • Solution: Time represented by capabilities
      • Scheduling context capability
      • Represents a right to use time
        • T: Period
        • C: Budget <= T
      • Represents basic unit for mixed criticality
    • Problem has been bothering Gernot for 20 years
    • Notion of a passive server
      • No scheduling context
      • Runs on donated scheduling context (uses up client’s scheduling context)
      • So-called Migrating Thread Model
      • Problem: Timeslice runs out while server is running, what is the next behaviour?
      • Passive server may be blocked due to a low criticality consumer’s timeslice which will then block the high criticality consumer
      • Policy free method for solving this in seL4: raise an exception and let the user handle it
      • Timeout exceptions – policy-free mechanism for dealing with budget depletion
        • Possible policies
        • Emergency budget to leave critical section
        • Rollback, it’s client’s problem
  • SOSP ‘93 and SOSP ‘95, two papers that describe tricks and design decisions in the original L4 kernel
    • Some were kept, some were removed

Reflecting on Changes

  • Original L4 shortcomings
    • Insufficient and impractical resource control
    • Over-optimised and needlessly complex IPC abstraction
  • Completes a 30-year dream of a verified OS

Lecture Week 6 Monday

“Security Fundamentals”

Security Design

  • Saltzer & Schroeder [SOSP ‘73, CACM ‘74]
  • Economy of mechanism
  • Fail-safe defaults
  • Complete mediation – check everything
  • Open design – not security by obscurity
  • Separation of privilege
  • Least privilege – principle of least authority
  • Least common mechanism
  • Psychological acceptability – if it’s hard to use, it won’t be used

Mechanisms & Policies

  • Access control is the central domain of operating systems
  • Confidentiality – prevent read of Hi
  • Integrity – prevent tampering of Hi
  • Availability

Risk Management

  • Mitigate (security mechanisms) or Transfer (insurance)

Covert Channels

  • Intentional leakage
  • Storage channel
    • Use an attribute of a shared resource, e.g. a filename that contains secrets
  • Timing channel
    • Temporal order of use of shared resource
  • Other physical channels
    • Power draw etc.
  • Covert channels vs side channels
    • Covert channels are intentional
    • Side channels are unintentional

Access-Control Principles

  • Representing protection state
    • Think of a matrix
    • (Vertical) Access control list – a list per object covering everyone; gets big
      • Compacted – UNIX stores entries only for user, group, everyone
    • (Horizontal) Capabilities
  • Capability
    • A reference and an access token, combined
    • Implies a naming system to reference objects
  • Capabilities must be unforgeable
    • Either protected by hardware (tagged memory protection)
    • Or protected by sparseness (high number of bits)
    • Or protected as privileged kernel data (e.g. the kernel can verify a cap through a tree)
  • ACLs & Capabilities – Duals of the access control matrix

Differences of ACLs and Capabilities

  • Naming and name spaces
    • ACLs: Objects referenced by name; access is orthogonal to reference
    • Capabilities: the cap is the name – reference and access are combined
    • Problem: Confused deputy (see the sketch after this list)
      • The classic gcc log-file example
      • Arises because of ambient authority
      • Fundamentally unsolvable with ACLs
    • With caps, no confused deputy
      • No ambient authority
      • Alice can’t even pass the log file without holding a capability for it
  • Meta-permissions
    • ACLs require meta-permissions to modify ACLs
    • Caps use delegation and revoking, built-in fine-grained meta-access
    • Caps: Right to delegate vs right to use
  • Process creation
    • ACLs: Child inherits all rights – opposite of least privilege
    • Caps: Child has no caps by default – defaults to least privilege
  • Auditing of protection state
    • “Who has rights to this thing?”
      • Easy with ACLs
      • Hard to solve with sparse caps
      • seL4: CSpace makes this possible
    • “What rights does this thing have?”
      • Caps: Just look at caps
      • ACLs: Close to impossible, need full scan
  • Interposing access
    • Caps are opaque, so a server can interpose or masquerade behind them
    • Transparent virtualisation of access (e.g. containers)
    • “Security Monitor”
    • Example: Lazy object creation
      • Objects can be lazy references
      • Server can create the corresponding state on first use
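
Here’s a toy Python rendition of the confused-deputy point (the file names and the compiler framing are illustrative, not from the lecture): under ACLs the deputy opens whatever name the client supplies using its own ambient authority, while with capabilities the client can only hand over authority it actually holds.

```python
def compile_acl(source: str, logfile: str) -> None:
    # ACL world: open() is checked against the *deputy's* rights, so a
    # client can name a file (say, a billing file) that the client
    # itself could never write.
    with open(logfile, "w") as log:
        log.write(f"compiled {source}\n")


def compile_cap(source: str, log) -> None:
    # Capability world: the client passes an already-writable object.
    # No ambient authority means nothing for the deputy to misuse.
    log.write(f"compiled {source}\n")


# The client can only construct this "capability" for files it may open:
with open("/tmp/build.log", "w") as log:
    compile_cap("hello.c", log)
```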

Mandatory vs Discretionary Access Control

  • Discretionary: As owner of file, I can change access
  • Mandatory: System administrator defines policy, owner cannot change
  • Bell & LaPadula (BLP) Model [1966]
    • Modelled on national security classification
    • Classification – clearances
    • Orthogonal compartments (need-to-know)
    • Rules:
      • No read up (obviously)
      • No write down (otherwise High could leak to Low – equivalent to Low reading up)
      • Example: Military, WTF, can’t give orders down?!
      • In practice, need exceptions for de-classification
      • Attestation for permissions to send down
      • Trend to over-classify, not good
  • Biba model – similar, but enforces integrity
    • Now, don’t think of secrets but think of data corruption
    • No read down
    • No write up
    • Don’t want data flowing up, this time
  • Put them together and… (see the sketch after this list)
    • Now no information can flow in any direction :/
    • Strong *-Property (Star)
      • Allow read down, trusting the high-integrity side to be responsible enough not to corrupt itself
  • Boebert’s Attack
    • Famous argument against capabilities
    • “On the inability of an unmodified capability machine to enforce the Star-property”
    • Doesn’t work on seL4 because caps are orthogonal to data
    • Needs the Grant right both on the cap and on the IPC endpoint (and the Hi-integrity side will deny this)
    • Need mechanism to limit cap propagation: take-grant model on seL4
  • Decidability
    • Given initial state, can it ever reach an unsafe state?
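
The four rules are compact enough to state as executable checks. A minimal sketch, with levels as plain integers (higher = more classified, or higher integrity):

```python
def blp_can_read(subj: int, obj: int) -> bool:
    return subj >= obj   # no read up


def blp_can_write(subj: int, obj: int) -> bool:
    return subj <= obj   # no write down


def biba_can_read(subj: int, obj: int) -> bool:
    return subj <= obj   # no read down


def biba_can_write(subj: int, obj: int) -> bool:
    return subj >= obj   # no write up


# Enforce both models at once and only same-level flows survive,
# which is exactly the "no flow in any direction" problem above:
def can_read(s, o): return blp_can_read(s, o) and biba_can_read(s, o)
def can_write(s, o): return blp_can_write(s, o) and biba_can_write(s, o)

assert not can_read(1, 2) and not can_read(2, 1)
assert can_read(1, 1) and can_write(1, 1)
```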

Lecture Week 6 Tuesday

“Security: Information Leakage”

Causes of Timing Channels

  • Algorithmic (e.g. hashing)
  • Resource contention (e.g. caches)
    • Stateless: e.g. interconnect bus speed
      • Limitations:
      • Only during concurrent access
      • Generally reveals no data
      • Requires intentional actor to be used as a covert channel
      • Not really possible to use as side channel
    • Stateful:
      • CPU Caches
      • Branch predictor
      • These are all capacity-limited resources

Timing Channels

  • Prime and Probe
    • Challenge: access times
    • L1 is very high bandwidth
    • The larger, outer-level caches take longer to traverse when probing
    • Approach:
      • Probe a few cache lines at a time, find “interesting sets”
      • E.g. look for `square` code from GnuPG
      • Can read out bits of exponent! [Liu S&P’15]
  • Prevention:
    • Spatial & temporal partitioning
    • Spatial: Cache colouring!
    • Temporal: Flush cache!
  • However: Cannot spatially partition on-core cache (L1 cache)
    • L1 usually has only 1 colour (if virtually indexed)
  • Flushing L1 on a domain switch costs little, because the incoming domain would find the cache trashed anyway
  • However: Flushing useless for multicore / SMT

Evaluating Intra-Core Channels

  • Methodology: Channel Matrix
    • Difference when moving horizontally (changing set size) means channel
  • Reference: https://ts.data61.csiro.au/publications/csiro_full_text//Ge_YCH_19.pdf
  • I-Cache Channel (even with flush) demonstration with channel matrix
  • Branch-History-Buffer (BHB)
    • 1-bit channel
  • Means the channel can’t be closed – temporal protection is impossible here even with flushing

Intel Spectre Defences

  • “Indirect Branch Control”
    • Tries to close BHB channel by flushing

Speculating Disaster

  • Instruction Pipelining
    • Problem: Dependencies
    • Out-of-Order Execution
      • Includes speculative branching
  • Problem: Speculation pollutes cache

Meltdown

  • Secrets leak because the kernel address space, which maps all of physical memory, is present in every process – every single userspace program leaks
  • Fix (KPTI): trapping now switches to a separate kernel page table, which is expensive

Spectre: Branch Prediction (Variant 1)

  • The CPU speculatively executes past an array bounds check (if statement), and the out-of-bounds access pollutes the cache
  • An invalid array index can thus leak any address in the program
  • ISA has no contract on microarchitectural behaviour
  • We need a new hardware-software contract!

Lecture Week 7 Monday

“Linux, Locking, Many Processors”

(No lecture Week 7 Tuesday)

Some History

  • John Lions, taught UNIX at UNSW
  • A lot of people got their introduction to UNIX with John Lions annotated source code
  • BSD, Paged Virtual Memory 1979, etc.
  • UUCP, Unix-to-Unix Copy, MHSnet using phone calls (lol)
  • Process model, fork exec, memory sharing
  • Dup, Open file descriptor, inode, CLOSE_ON_EXEC
  • Inode model for directories
  • Namei
  • Linux extra keywords, __init, __exit, __percpu, __user, __rcu, __kernel, __iomem, __acquires, __releases, __bitwise
  • Other Linux C dialect / abstractions
  • Linux memory consistency model
  • Linux has its own memory model that abstracts the hardware memory model (i.e. loads, fences)

Scheduling

  • Dispatch O(1)
  • Pretty fair
  • Good interactive response
  • Topology-aware (NUMA)
  • O(log n) scheduling
  • Currently CFS – Completely Fair Scheduler
  • Dual Entitlement Scheduler
    • Each thread given entitlement (CPU time)
    • Deserving and Undeserving queue

Memory Management

  • Struct page
  • Other various structs
  • How Linux handles page faults

Driver interface

  • Enumerable and non-enumerable
    • Enumerable: the device can say who it is; non-enumerable: you can’t find out by asking
  • Device tree for userspace

Containers

  • Namespace isolation
  • The lecturer did research on this in the 70s: a fair-share scheduler developed here was sold to Cray, then to Sun
  • Hierarchical scheduling competition
  • (Sounds like carving out a scheduling capability…?)

Scalability

  • Performance cliff at high loads due to coherence delay
  • Amdahl’s law and Gunther’s Universal Scalability Law
  • Doing without locks
    • Sequence locks – lots of readers, few writers (see the sketch below)
  • Read Copy Update
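
A sketch of a sequence lock in Python (illustrative only: a real seqlock needs memory barriers around the counter accesses, which CPython’s GIL papers over here). Writers are serialised and hold the counter at an odd value while writing; readers never block, they just retry if the counter was odd or changed underneath them.

```python
import threading


class SeqLock:
    def __init__(self):
        self._seq = 0                        # odd while a write is in flight
        self._write_lock = threading.Lock()

    def write(self, update) -> None:
        with self._write_lock:               # few writers, serialised
            self._seq += 1                   # odd: tell readers to retry
            update()
            self._seq += 1                   # even: write complete

    def read(self, snapshot):
        while True:                          # readers take no lock at all
            start = self._seq
            if start % 2 == 0:
                value = snapshot()
                if self._seq == start:       # no writer ran in between
                    return value


data = {"x": 0, "y": 0}
lock = SeqLock()
lock.write(lambda: data.update(x=1, y=1))
print(lock.read(lambda: (data["x"], data["y"])))  # (1, 1), a consistent pair
```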

Lecture Week 8 Monday

“Multicore and Memory Ordering”

Types of Multiprocessors

  • Uniform Memory Access
    • Processors with local caches
    • Cache coherency!
    • Unit of consistency is the cache line
    • Sequential Consistency
      • The result should be as if all operations were executed in some sequential interleaving
      • In other words, every read sees the most recent write
    • Early solution: Write-through caches
      • CPU snoops bus activity to invalidate stale lines
      • Downside:
        • All writes go to bus
  • Non-Uniform Memory Access
    • Provides high local bandwidth and reduces bus contention
    • Other options like MESI (Modified, Exclusive, Shared, Invalid)
    • Directory-based coherence
      • For really big multiprocessing where broadcast bandwidth for MESI is not adequate

Memory Ordering

  • Example: tail of a critical section
  • Memory accesses can’t be reordered past the unlock, or they’d escape the critical section!
  • Strong Ordering
  • Note: https://preshing.com/20120930/weak-vs-strong-memory-models/
  • Each CPU keeps itself consistent, but how is it viewed by others?
  • Example: Write buffers
    • x86: Total Store Ordering, stores are FIFO
    • Problem: CPU 0’s store sits in its write buffer while CPU 1 reads the stale value from its cache – inconsistency! Apparent reordering of instructions!
    • Memory fences effectively drain the write buffer
    • ARM: Partial Store Order
  • Existing primitives will have appropriate fences in place
    • In practice, correctly synchronised code can ignore memory model
  • But if you write racy code (e.g. lock-free algorithms) you must use fences and barriers
  • x86 is nice: just stores reordered after loads
  • ARM is not so nice

Kernel Locking

  • Trylock; prone to livelock
  • Different types of locks
  • Spinlocks

Lecture Week 8 Tuesday

Continued…

  • Common spin lock
    • Beware of bus arbitration!
    • Bad because requires disabling interrupts for critical paths
  • Test and test and set is not a panacea
  • Performance kinda falls off beyond the 7th processor or so
  • Implementation of MCS Locks
    • Buuut… MCS locks on ARM MPCore?
  • TT&S low overhead on low contention
  • MCS grows less on high contention
  • Dynamically switch between TT&S and MCS depending on the contention level (see the sketch after this list)
  • Case study: Linux performance collapse
    • Collapse for short sections
    • Non-scalable lock
    • Cache coherence protocol
    • MCS locks are the best! 🙂
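
For reference, here’s a conceptual test-and-test-and-set lock in Python (purely illustrative: CPython’s GIL makes user-level spinning pointless, and the atomic test-and-set is emulated with a non-blocking Lock.acquire where real hardware uses a single atomic instruction). The point of the outer read-only spin is that waiting cores hammer their own cached copy of the flag instead of the bus.

```python
import threading


class TTASLock:
    def __init__(self):
        self._held = False               # plain flag: cheap, read-only spin
        self._atomic = threading.Lock()  # stands in for an atomic TAS

    def acquire(self) -> None:
        while True:
            while self._held:            # "test": spin on a local read;
                pass                     # no bus traffic on real hardware
            if self._atomic.acquire(blocking=False):  # "test-and-set"
                self._held = True
                return                   # we won the race
            # lost the race: someone else grabbed it, back to spinning

    def release(self) -> None:
        self._held = False
        self._atomic.release()
```

An MCS lock would instead queue per-CPU nodes so each waiter spins on its own cache line, which is why it degrades so much more gracefully under contention.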

Lecture Week 9 Monday

“Formal Verification and seL4”

Assurance and Verification

  • Common Criteria
  • Semi-Formal = use some form of formal language, not just English, not necessarily mathematical
  • Formal = mathematical
  • Protection Profiles
  • Which OS’s have been evaluated?
    • Mac OS X – EAL 3
    • Windows and RHEL – EAL 4
    • Just about everything tops out at EAL 4, a.k.a. not a lot of assurance
    • 2008 Green Hills INTEGRITY – EAL 6
      • “While the system does not certify common criteria at any evaluation level” says the certification paper, WTF?!
    • 2019 Prove & Run PROVENCORE – EAL 7
  • Green Hills tried the One Box One Wire project (VM isolation on commodity hardware) for the NSA; in March 2019 the NSA said SKPP validation for commodity hardware platforms is infeasible due to complexity, and subsequently discontinued certifying >= EAL 5
  • Common Criteria – a very expensive way to prove nothing – $1000 per line of code
  • Too much focus on development process
  • Common Criteria – Effectively dead in 5-Eyes defence

Formal Verification

  • Started a long time ago, over-promised and under-delivered
  • Two approaches
  • Abstract Interpretation / Model Checking
    • Check a model for certain properties
    • Unsoundness – has false positives and false negatives
    • Suffers state-space explosion
    • Scales to large code bases
    • Doesn’t actually prove anything
  • Theorem Proving
    • Can deal with large state spaces
    • Requires refinement property (can be modelled in mathematics)
    • Can prove functional correctness against a formal specification
    • Very labour intensive
  • Note, recent work automatically proved functional correctness of simple systems using SMT solvers (model checking) [Hyperkernel, SOSP’17]

Model Checking and Linux: A Sad Story

  • [Chou & al, 2001] Applied static analysis to Linux
  • Device drivers are very buggy
  • Linux did nothing lol; static checkers over the open-source code still find faults

Story of seL4

  • August 2009, birth of seL4
  • Abstract Model < (Proof) > C Implementation
  • “Under the semantics of the C language, this program has behaviour that is allowed under the functional specification”
  • Means we can reason about the abstract model to verify properties
  • C Implementation < (Proof) > Binary Code
  • “Translation correctness”
  • Isolation Properties < (Proof) > Abstract Model
  • “Isolation Properties”
  • Also “Worst-Case Execution Time” analysis – never been done for a protected-mode OS before
  • Initially there was a Haskell prototype between the abstract model and the C implementation
  • Kernel never fails, behaviour is always well-defined:
    • Assertions never fail
    • Will never de-ref null pointer
    • Will never access array out of bounds
    • Cannot be subverted by malformed input
    • All syscalls terminate, objects don’t overlap, access control is decidable
  • CompCert – done in France; we can’t use it because it uses a different mathematical framework (Coq rather than Isabelle/HOL), and we can’t easily show the two sets of axioms and lemmas equivalent
    • Being able to use gcc is also a real asset, e.g. for different ISAs

Security Properties of seL4

  • To prove confidentiality:
    • Non-interference proof
    • Evolution of Low does not depend on High state
    • Information flow is a very strong property; to prove it, remove non-determinism wherever it affects confidentiality, e.g. make the scheduler strictly round-robin
    • This may get really hard to work with

Verification Assumptions

  • Hardware behaves as expected
  • Spec matches expectations
  • Proof checker is correct

Lecture Week 9 Tuesday

“Local OS Research”

“Use of a monolithic OS in security- or safety-critical scenarios is professional malpractice!” – Gernot

Quantifying OS Design Security Impact

  • Analyse Linux CVEs
  • If it were a microkernel, is it in the TCB?
    • Not in TCB, attack defeated
    • Else, try verification, attack defeated
    • DoS, strongly mitigated
    • Integrity or confidentiality violation, weakly mitigated (i.e. not full system compromise)
    • Full system compromise, e.g. firmware reflash or I2C bus hijack

Cogent

  • Code and proof co-generation to C
  • Not Turing-complete

Time Protection

  • Equivalent of memory protection but for time
  • i.e. timing channels
  • Channel through kernel code (i.e. shared kernel code I-cache)
  • Performance impact of cache flushing and colouring

Hardware Redundancy / Lockstep

  • Hardware lockstep vs Software lockstep
  • RCoE (Redundant Co-Execution) – student Yanyan’s work
  • Instead of replicating only the application software, the kernel itself is replicated (much as the Resource Manager must be for time protection), and the replicas vote among each other
  • Two variants:
    • Loosely-coupled
      • Don’t synchronise too eagerly – only at kernel entry/exit
      • Low overhead, but can’t deal with racy code or threads
    • Closely-coupled
      • Sync on instruction
      • Precise preemptions
    • How closely-coupled works:
      • The leading replica waits at a barrier
      • The trailing replica sets a breakpoint where the leading replica is waiting, then joins the barrier
      • Now they’re synced

Real World Use

  • DARPA HACMS
  • CAmkES
    • Component middleware
    • Higher level abstractions of low-level seL4 constructs
  • Cross Domain Desktop Compositor

Lecture Week 10 Monday

“Multiprocessor OS”

Scalability

  • Serialisation
  • By the OS
  • Locking introduces serialisation
  • Also atomic operations and cache coherency
  • Memory access can stall
  • Cache-related serialisation
  • False sharing – unrelated data share the same cache line accessed from different processors – cache coherence traffic and delay
  • Cache line bouncing

Multiprocessor Hardware

  • Contemporary
  • ARM, big.LITTLE, Nehalem
  • Interconnects
  • Cluster on Die
  • Tile architecture
  • Rings

Optimisation for Scalability

  • Reduce critical sections
    • Fine grained locking or lock free data structures
    • (Lock data not code)
  • Reduce false sharing (pad to cache lines)
  • Reduce cache line bouncing
  • Reduce cache misses; affinity scheduling

OS Design Guidelines for MP

  • Avoid shared data
  • Explicit communication (regain control over communication costs)
  • Reduce synchronisation
  • Allocate and schedule for locality
  • Uniprocessor performance vs scalability

Approaches for MP

  • Divide and conquer
    • Just use them as independent bits
    • Virtualisation
    • Exokernel
  • Reduced sharing
    • Brute force and heroic effort (fixing them one by one)
    • By design
  • No sharing
    • Distributed system, do extra work to share

Space-Time Partitioning

  • Tessellation OS
  • Gang Scheduling
    • Schedule all threads of a process in one scheduling interval

K42: Clustered Objects

  • A globally valid object reference resolves to a processor-local representative
  • Controlled introduction of locality
  • Controlled trade-off between synchronise at write or synchronise at read

Corey / Linux Brute Force

  • Profiled Linux and achieved speedups
  • Applied lessons from previous research
    • Sloppy counters
    • Per-core data structs
    • Fine grained lock, lock free
    • Cache lines
  • How do we know if a system is scalable? Is there some fundamental blocker?
    • How do you really know without infinite benchmarks?
    • Is the OS API a potential bottleneck?
    • Scalable Commutativity Rule
      • If you can’t tell in which order two operations ran (they commute), they can be implemented in a way that scales
      • Counterexample: the lowest-available-file-descriptor rule (see the sketch below)
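
The counterexample is easy to see from Python, since POSIX requires open() to return the lowest unused descriptor: the results reveal the order of the calls, so two opens don’t commute and, by the rule, can’t be implemented without coordination.

```python
import os

a = os.open("/dev/null", os.O_RDONLY)
b = os.open("/dev/null", os.O_RDONLY)
assert b > a   # whoever ran second got the higher number

os.close(a)    # free the lower number...
c = os.open("/dev/null", os.O_RDONLY)
assert c == a  # ...and the kernel must hand it straight back

os.close(b)
os.close(c)
```

An allocator free to return any unused descriptor would commute, and so could be made to scale.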

FlexSC

  • Async syscall, batch syscalls

Barrelfish

  • No sharing
  • Message passing


Disappearing LogDNA messages on AWS Lambda

At work, I encountered a strange issue where LogDNA was missing some log entries. This was very annoying since the whole point of logs is to have a full picture when debugging issues.

At first I chalked it up to the LogDNA service being flaky, but over time, I noticed this was a systemic issue and so I decided to investigate.

As my assigned projects are in Python, we use the logging module with LogDNA’s Python handler to ingest our log messages. If you look into the module carefully, you’ll notice that it flushes its buffers from a thread in a self-scheduling callback loop.

Because AWS Lambda currently uses Linux’s process freezer, this is highly problematic if not managed carefully. The freezer suspends all threads in our process as soon as the handler function returns, which means the flush callback never runs, the log buffer never gets flushed, and log messages are lost.

To solve this problem, you can implement a simple decorator that flushes all handlers attached to the root logger. Attach this decorator to the Lambda handler function, so that all loggers are flushed before control is returned to the Lambda executor and before our process gets frozen.
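
Here’s a minimal sketch of such a decorator (the names are mine; it assumes, as described above, that each handler’s flush() synchronously drains its buffer):

```python
import functools
import logging


def flush_logs(handler_fn):
    """Flush every handler on the root logger before the Lambda handler
    returns, i.e. before the execution environment can be frozen."""
    @functools.wraps(handler_fn)
    def wrapper(event, context):
        try:
            return handler_fn(event, context)
        finally:
            # Runs even if the handler raised, so error logs survive too.
            for handler in logging.getLogger().handlers:
                handler.flush()
    return wrapper


@flush_logs
def lambda_handler(event, context):
    logging.getLogger(__name__).info("handling %s", event)
    return {"statusCode": 200}
```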

This ensures that the flush thread is joined before continuing, flushing the log buffers and ingesting our log messages. No more lost logs!