Notes on “Frontend” Load Balancing

Chapter 19, “Load Balancing at the Frontend”, The Site Reliability Engineering Book.

Disclaimer: These notes are my own, and they may not fully appreciate the intention of the authors or the contents of the book.

Summary

The chapter describes a possible design to solve challenges when using multiple data-centres and multiple front-line machines within a data-centre to service user requests (collectively referred to as Google’s “Frontend”). Multiple geographically spread-out data-centres, each load-balancing internally, are necessary for redundancy, capacity, and user proximity. But this introduces challenges: deciding which data-centre serves a particular kind of user request given service-level and technical constraints, load-balancing in a way that scales with operational and capacity constraints, and how to implement such mechanisms.

A multi-level solution is used, and the first-level mechanic is DNS. Anycast DNS is used on the side of the authoritative nameserver to approximate a user’s location, with heuristics and newer techniques (EDNS0) to account for recursive resolvers. This information is used to direct the user to the “optimal” data-centre.

The second-level mechanic is virtual IP addresses, whereby multiple network interfaces receive requests addressed to one public IP address. Here, “virtual” means that from the user’s perspective the address is indistinguishable from an IP address bound to a single network card. Scalable mechanics (connection tracking, consistent hashing) are used by a network load balancer to distribute traffic and implement virtual IP addresses. Network address translation (NAT) is avoided in favour of tunnelled direct server return (L3 DSR) for scalability.

Frankly Asked Questions

(Things I didn’t quite get – maybe they were answered and I misread, don’t blame me if it was already answered or if I’m wrong about them!)

Question: Why not just use consistent hashing at the network load balancer all the time? Why only use it as a fallback and prefer connection state tracking?

Idea: Consistent hashing is not 100% stable. There is still a possibility that a connection is terminated due to changes in buckets (e.g. during deployments, autoscaling). Connection tracking doesn’t have this problem, which is why consistent hashing is only used as a fallback.

Question: Why not just apply anycast to the Virtual IP?

Idea: 1. Connection resets when roaming to different peering arrangements. 2. Unable to shape traffic, left at mercy of third party routing policies. [See: Anycast Is Not Load Balancing]

Question: What if there was a way to attempt connection to all addresses in a DNS record and just use the first one that accepts your connection? Can’t you just list all of your virtual IP addresses in a single DNS record served globally? (Happy Eyeballs)

Idea: Like anycast, by doing this you lose the ability to shape traffic. As the book mentions, if SRV records with weights had been widely adopted, maybe this could have worked.

Further Reading

Keeping the Balance: Internet-Scale Loadbalancing Demystified

Programming Challenges Exam Writeup

Problems I managed to solve in 6 hours for COMP4128 Programming Challenges. Got me 400 points out of 800 (the median score was also 400 if I’m reading the scoreboard correctly), so I’m very happy about that.

After doing pretty badly on the weekly problem sets, I honestly thought I would be screwed, but I guess persevering on the problem sets does make me better at this stuff.

Considering this is my first experience with programming challenges, it feels good to know I can also solve problems like these if I try. (But of course, I know these are really easy problems and I still have a long way to go!)

Problem A: Cryptography

Claude is developing a new system to encrypt information. In his system, a key is a tuple of five integers. The encryption is based on three fundamental operations.

Letting ã = (a1, a2, a3, a4, a5) and b̃ = (b1, b2, b3, b4, b5):
    • rotate(ã) = (a5, a1, a2, a3, a4),
    • multiply(ã, b̃) = (a1*b1, a2*b2, a3*b3, a4*b4, a5*b5),
      where each number in the result is taken modulo 1,000,000,007.
    • combine(ã, b̃) = multiply(ã, rotate(b̃)).

To encrypt a key, Claude iterates the combine operation n times as follows:
    combine(ã, n) = combine(combine(ã, n − 1), ã)
    combine(ã, 0) = ã.

Help Claude produce the encrypted key.

Constraints

• 1 ≤ n ≤ 10^18.
• 0 ≤ a1, a2, a3, a4, a5 < 1,000,000,007.

Subtasks

A1 (30 points): 1 ≤ n ≤ 100,000.
A2 (70 points): no restrictions.

How I solved this problem

Subtask A1

Simulation using tuples as a warm up problem.

Subtask A2

The problem text is confusing because combine() is overloaded (once over two tuples, once over a tuple and a count), so rename the iterating version to iter().

iter(a, n) = combine(iter(a, n-1), a)
iter(a, 0) = a

iter(a, n) = combine(iter(a, n-1), a)
= combine(combine(iter(a, n-2), a), a)
= combine(combine(combine(iter(a, n-3), a), a), a)

Expand and realise that each component is just the original value multiplied by its cyclic left neighbour n times: the first component becomes a1 * a5^n, the second a2 * a1^n, and so on.

Use fast exponentiation under modulus.

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

uint64_t pow(uint64_t x, uint64_t n, uint64_t m) {
    if (n == 0) return 1;

    uint64_t a = pow(x, n/2, m);
    a = a * a % m;
    if (n%2 == 1) a = a * x % m;
    return a;
}

using enctup = tuple<uint64_t, uint64_t, uint64_t, uint64_t, uint64_t>;
const uint64_t M = 1000000007llu;

enctup encryptfast(enctup const& tup, uint64_t n) {
    return enctup(
        (get<0>(tup) * pow(get<4>(tup), n, M)) % M,
        (get<1>(tup) * pow(get<0>(tup), n, M)) % M,
        (get<2>(tup) * pow(get<1>(tup), n, M)) % M,
        (get<3>(tup) * pow(get<2>(tup), n, M)) % M,
        (get<4>(tup) * pow(get<3>(tup), n, M)) % M
    );
}

ostream& operator<<(ostream& os, enctup const& tup) {
    return os << get<0>(tup) << ' '
    << get<1>(tup) << ' '
    << get<2>(tup) << ' '
    << get<3>(tup) << ' '
    << get<4>(tup);
}

int main(void)
{
    uint64_t n, a1, a2, a3, a4, a5;
    cin >> n >> a1 >> a2 >> a3 >> a4 >> a5;
    enctup origtup = enctup(a1, a2, a3, a4, a5);
    enctup anstup = encryptfast(origtup, n);
    cout << anstup << endl;
    return 0;
}

Problem B: Wires

Gilles is trying to wire a circuit board. There are four sides, each with n connection points labelled between 1 and 2n. Each of these labels occurs exactly twice.

For each pair of connection points with the same label, Gilles wants to place a wire to connect them. These wires may be curved, but they cannot go off the board and they cannot overlap or cross each other.

Is it possible for Gilles to place all 2n wires as desired?

Constraints

• 1 ≤ n ≤ 100,000.
• For all 1 ≤ i ≤ 4 and 1 ≤ j ≤ n, 1 ≤ ai,j ≤ 2n.
• Each value ai,j appears exactly twice.

Subtasks

B1 (30 points): 1 ≤ n ≤ 1000.
B2 (70 points): no restrictions.

How I solved this problem

At first, I started thinking with a breadth-first search approach – for every connection point, I can use breadth-first search to find the nearest connection point, with a bound that increases. For example, I would first try all connection points that can connect within 1 search, and then I would connect all those with 2 searches, and so on.

Then I realised that’s basically a greedy approach on the distance of connections and I was assuming that it would produce an optimal solution, which turned out to be right.

Now that we have an algorithm for wiring the board, all we have to figure out is whether any such wiring would be valid.

Subtask B1

Sort connection points by the distance to their counterpart. Greedy on the distance, lowest first. I can loop through all connection points in O(n) time and connect them if there are no issues.

For the decision problem of valid wiring, consider the input array of connection points arranged in clockwise order. For any connection point you consider, there is a counterpart, and between those two points, there is an “inside” and “outside” as shown on the diagram below.

. x . . x . . . x . . x connected?
1 3 4 6 3 2 2 1 5 4 6 5
    |--- inside --|
|--|               |--| outside

Loop through the “inside”. If any point in it is already connected, check whether its counterpart lies in the “outside”; if so, we cannot connect the two points under consideration (the two points that define the “inside”), as the wire would cross a prior greedy-optimal connection.

Subtask B2

The above linear sweep approach is not good enough, because our worst-case scenario is a linear sweep for every connection point, which will be 10B steps.

But we can simplify the problem. In terms of array indices, we can store the index to its counterpart if it is connected, or the index to itself otherwise.

. x . . x . . . x . . x connected?
0 1 2 3 4 5 6 7 8 9 a b array index
0 4 2 3 1 5 6 7 b 9 a 8 counterpart index
1 3 4 6 3 2 2 1 5 4 6 5
    |--- inside --|
|--|               |--| outside

Use two separate range trees to query the minimum and maximum value of the inside range. Check if either the minimum or maximum lies in the outside range. We have checked whether a conflict exists in O(log n) time.

Tricks: For some reason the range tree implementation I copied from the lecture slides overran its buffer into the one and two array, so I just increased its size massively and it passed… so yeah, I must have screwed up one of the calculations…?

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>
#include <cmath>
#include <iomanip>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

const int MAXN = 100000;

int n;
vector<int> clockwise;

// For given number as index, stores the index in the
// clockwise array, first occurring stored in one, and then two
int one[2*MAXN];
int two[2*MAXN];

// for a given number, returns the distance
// to the counterpart, accounting for counter-clockwise wraparound
int distance(int x) {
    assert(n > 0);
    int cw = two[x] - one[x];
    assert(cw > 0);
    int ccw = one[x] + ((4*n) - two[x]);
    // fmt::print("ccw={} x={} n={} one={} two={}\n", ccw, x, n, one[x], two[x]);
    assert(ccw > 0);
    return min(cw, ccw);
}

/* BEGIN RANGE TREE IMPLEMENTATIONS */

/* MAX RANGE TREE */
// the number of additional nodes created can be as high as the next power of two up from MAX_N (131,072)
int max_tree[(4*MAXN) + 524288 + 1000000];

// a is the index in the array. 0- or 1-based doesn't matter here, as long as it is nonnegative and less than MAX_N.
// v is the value the a-th element will be updated to.
// i is the index in the tree, rooted at 1 so children are 2i and 2i+1.
// instead of storing each node's range of responsibility, we calculate it on the way down.
// the root node is responsible for [0, MAX_N)
void max_update(int a, int v, int i = 1, int start = 0, int end = 4*MAXN) {
  // this node's range of responsibility is 1, so it is a leaf
  if (end - start == 1) {
    max_tree[i] = v;
    return;
  }
  // figure out which child is responsible for the index (a) being updated
  int mid = (start + end) / 2;
  if (a < mid) max_update(a, v, i * 2, start, mid);
  else max_update(a, v, i * 2 + 1, mid, end);
  // once we have updated the correct child, recalculate our stored value.
  max_tree[i] = max(max_tree[i*2], max_tree[i*2+1]);
}


/*
 * range tree query
 */

// query the sum in [a, b)
int max_query(int a, int b, int i = 1, int start = 0, int end = 4*MAXN) {
  // the query range exactly matches this node's range of responsibility
  if (start == a && end == b) return max_tree[i];
  // we might need to query one or both of the children
  int mid = (start + end) / 2;
  int av = 0;
  int bv = 0;
  // the left child can query [a, mid)
  if (a < mid) av = max_query(a, min(b, mid), i * 2, start, mid);
  // the right child can query [mid, b)
  if (b > mid) bv = max_query(max(a, mid), b, i * 2 + 1, mid, end);
  return max(av, bv);
}


/* MIN RANGE TREE */
// the number of additional nodes created can be as high as the next power of two up from MAX_N (131,072)
int min_tree[(4*MAXN) + 524288 + 1000000];

// a is the index in the array. 0- or 1-based doesn't matter here, as long as it is nonnegative and less than MAX_N.
// v is the value the a-th element will be updated to.
// i is the index in the tree, rooted at 1 so children are 2i and 2i+1.
// instead of storing each node's range of responsibility, we calculate it on the way down.
// the root node is responsible for [0, MAX_N)
void min_update(int a, int v, int i = 1, int start = 0, int end = 4*MAXN) {
  // this node's range of responsibility is 1, so it is a leaf
  if (end - start == 1) {
    min_tree[i] = v;
    return;
  }
  // figure out which child is responsible for the index (a) being updated
  int mid = (start + end) / 2;
  if (a < mid) min_update(a, v, i * 2, start, mid);
  else min_update(a, v, i * 2 + 1, mid, end);
  // once we have updated the correct child, recalculate our stored value.
  min_tree[i] = min(min_tree[i*2], min_tree[i*2+1]);
}


/*
 * range tree query
 */

// query the sum in [a, b)
int min_query(int a, int b, int i = 1, int start = 0, int end = 4*MAXN) {
  // the query range exactly matches this node's range of responsibility
  if (start == a && end == b) return min_tree[i];
  // we might need to query one or both of the children
  int mid = (start + end) / 2;
  int av = 1e9;
  int bv = 1e9;
  // the left child can query [a, mid)
  if (a < mid) av = min_query(a, min(b, mid), i * 2, start, mid);
  // the right child can query [mid, b)
  if (b > mid) bv = min_query(max(a, mid), b, i * 2 + 1, mid, end);
  return min(av, bv);
}

void test(void) {
    fmt::print("Testing...\n");
    max_update(0, 1);
    max_update(2, 10);
    assert(max_query(0, 3) == 10);
    assert(max_query(0, 5) == 10);
    assert(max_query(0, 2) == 1);
    assert(max_query(10, 50) == 0);
    fmt::print("Test success!\n");
    abort();
}

/* END RANGE TREE IMPLEMENTATIONS */

int32_t main(void)
{
    cout << fixed << setprecision(10);

    fill(one, one+(2*MAXN), -1);
    fill(two, two+(2*MAXN), -1);

    cin >> n;

    for (int i = 0; i < (4*n); ++i) {
        max_update(i, i);
        min_update(i, i);
    }

    for (int i = 0; i < (4*n); ++i) {
        int v;
        cin >> v;
        v -= 1;
        clockwise.push_back(v);
        if (one[v] == -1) {
            assert(two[v] == -1);
            one[v] = i;
        } else {
            assert(two[v] == -1);
            two[v] = i;
        }
    }

    vector<pair<int, int>> dpts; // (distance to counterpart, number)
    for (int i = 0; i < (2*n); ++i) {
        dpts.push_back(pair<int, int>(distance(i), i));
        fmt::print("distance for {}={}\n", i+1, distance(i));
    }
    sort(dpts.begin(), dpts.end());

    for (auto const& p : dpts) {
        int num = p.second;

        if (max_query(one[num]+1, two[num]) > two[num]) {
            cout << "NO" << endl;
            return 0;
        }
        if (min_query(one[num]+1, two[num]) < one[num]) {
            cout << "NO" << endl;
            return 0;
        }

        max_update(one[num], two[num]);
        min_update(one[num], two[num]);
        max_update(two[num], one[num]);
        min_update(two[num], one[num]);
    }

    cout << "YES" << endl;

    return 0;
}

Problem C: Assemble

Catherine is building a spaceship from a variety of parts. All the parts have been built in various cities, connected by a rail network. The last step is to transport all the spaceship parts to a single city for assembly.

There are n cities in the network, with m railroads each joining a pair of cities. It is possible to travel between any pair of cities by some sequence of railroads. Each railroad has an associated integer denoting its level. Buying a freight pass at level k allows the owner to send unlimited amounts of freight along any railroad of level at most k.

Freight passes are very expensive, so help Catherine find the lowest level of freight pass required to gather all spaceship parts in a single city.

Constraints

• 1 ≤ n ≤ 100,000.
• n − 1 ≤ m ≤ 100,000.
• 1 ≤ p ≤ 100,000.
• For all 1 ≤ i ≤ m, 1 ≤ ai < bi ≤ n and 1 ≤ li ≤ 100,000.
• For all 1 ≤ i ≤ p, 1 ≤ ci ≤ n.

Subtasks

C1 (50 points): 1 ≤ n ≤ 1,000, n − 1 ≤ m ≤ 1,000.
C2 (50 points): no restrictions.

How I solved this problem

Subtask C1

First, clearly a binary search and decision problem on the value of k.

Because the graph is undirected, if I can reach a city from the destination, then I can also reach the destination from that city as well.

That means if I run BFS or DFS, then I can check reachability in O(|V| + |E|) time which is about 200_000.

But the city where I must assemble things is not defined and there are up to 100_000 cities.

Binary search is log2(100_000) = around 17.

For full problem 340B steps vs for subtask 34M steps, so linear sweep of all cities will work for subtask but not for the full problem.

Subtask C2

Let’s be smarter. For each level that is undergoing binary search, I should be able to run something that is not BFS or DFS to find out what cities are reachable from a given city for all cities.

Perhaps I can use a disjoint-set union? If I loop through all the edges, and even if I scan through all cities once afterwards, that’s still only 200_000 operations. A union-find data structure with path compression and subtree balancing runs in O(log n) per operation; 17 (binary search) * 17 (union-find) * 200_000 is well within the bounds.

For each binary search decision, only consider edges that can be used with a level k pass.

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

class union_find
{
public:
    union_find(size_t size) : parent_(size), subtree_size_(size) {
        for (size_t i = 0; i < size; ++i) {
            parent_[i] = i; subtree_size_[i] = 1;
        }
    }
    auto root(int x) -> int {
        // only roots are their own parents
        // otherwise apply path compression
        return parent_[x] == x ? x : parent_[x] = root(parent_[x]);
    }
    auto join(int x, int y) -> void {
        // size heuristic
        // hang smaller subtree under root of larger subtree
        x = root(x); y = root(y);
        if (x == y) return;
        if (subtree_size_[x] < subtree_size_[y]) {
            parent_[x] = y;
            subtree_size_[y] += subtree_size_[x];
        } else {
            parent_[y] = x;
            subtree_size_[x] += subtree_size_[y];
        }
    }
private:
    vector<int> parent_;
    vector<int> subtree_size_;
};

int n, m, p;
using edge = tuple<int, int, int>;
vector<edge> edges; // city1, city2, level
vector<int> parts; // refers to city

bool decision(int k) {
    auto uf = union_find(n+1);

    for (auto const& e : edges) {
        auto c1 = get<0>(e);
        auto c2 = get<1>(e);
        auto l = get<2>(e);

        if (l > k) continue;

        uf.join(c1, c2);
    }

    int mustbe = -1;
    for (auto const& c : parts) {
        if (mustbe == -1) {
            mustbe = uf.root(c);
        } else if (uf.root(c) != mustbe) {
            return false;
        } else {
            continue;
        }
    }
    
    return true;
}

int binarysearch(void) {
    int l = -1; // invariant: f(l) = false
    int r = 100001; // invariant: f(r) = true
    while (l < r-1) {
        int m = (l+r)/2;
        if (decision(m)) r = m;
        else l = m;
    }
    return r;
}

int main(void)
{
    cin >> n >> m >> p;
    for (int i = 0; i < m; ++i) {
        int a, b, l;
        cin >> a >> b >> l;
        edges.push_back(edge(a, b, l));
    }
    for (int i = 0; i < p; ++i) {
        int c;
        cin >> c;
        parts.push_back(c);
    }
    cout << binarysearch() << endl;
    return 0;
}

Problem D: Interception

Yasmin is the administrator for a network of n computers. There are m connections in the network, each joining a different pair of distinct computers ai and bi and allowing messages to be sent in both directions. However, a message may be intercepted by nefarious hackers, with probability pi.

Yasmin knows that if not for these hackers, a message could be sent from any computer to any other, at least indirectly. Help her figure out the probability that such a message is not intercepted, assuming the optimal route is taken.

Constraints

• 1 ≤ n ≤ 200.
• n − 1 ≤ m ≤ n(n − 1)/2.
• For all 1 ≤ i ≤ m, 1 ≤ ai , bi ≤ n, ai != bi and 0 < pi < 1.

Subtasks

D1 (30 points): for all 1 ≤ i ≤ m, pi = 0.1.
D2 (70 points): no restrictions.

How I solved this problem

The first thing to note, if like me you didn’t read the output section carefully, is that there is no starting node. We have to find the probability of no interception across all pairs of nodes.

Subtask D1

The probability of interception is a simple function of the shortest path. We can find the all-pairs shortest paths using Floyd-Warshall in O(V^3), which will run in time. Precompute a table of probabilities with respect to a given distance.

Subtask D2

Note that Floyd-Warshall is a relaxation algorithm, which means we can pretty easily modify how it builds up the all-pairs shortest paths.

Instead of adding weights at the point of relaxation, multiply the weights instead.

Calculate the graph with complement probabilities to minimise floating-point error (i.e. instead of 0.01 * 0.01, do 0.99 * 0.99 and find the max path).

Tricks: __float128, cout << fixed << setprecision(10).

Solution (Original, C++)

#include <iostream>
#include <algorithm>
#include <vector>
#include <sstream>
#include <cassert>
#include <queue>
#include <set>
#include <map>
#include <cmath>
#include <iomanip>

#ifdef DEBUG
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#else
namespace fmt { void print(...) {} }
#endif

using namespace std;

const int MAXN = 200;

__float128 dist[MAXN][MAXN]; // probability of NO INTERCEPTION
using edge = tuple<int, int, __float128>; // undirected edge, probability of NO INTERCEPTION
vector<edge> edges;

int n, m;

void floydwarshall(void) {
    // initially, the probability of no interception is 0 (no route)
    for (int u = 0; u < n; ++u) for (int v = 0; v < n; ++v) {
        dist[u][v] = 0.0;
    }

    // update the probabilities for every directed edge
    for (auto const& e : edges) {
        int u = get<0>(e), v = get<1>(e);
        __float128 p = get<2>(e);
        dist[u][v] = p;
        dist[v][u] = p;
    }

    // every vertex can reach itself
    for (int u = 0; u < n; ++u) dist[u][u] = 1.0;

    for (int i = 0; i < n; i++) {
        for (int u = 0; u < n; u++) {
            for (int v = 0; v < n; v++) {
                dist[u][v] = max(dist[u][v], dist[u][i] * dist[i][v]);
            }
        }
    }
}

int main(void)
{
    cout << fixed << setprecision(10);
    cin >> n >> m;
    for (int i = 0; i < m; ++i) {
        int a, b;
        long double p;
        cin >> a >> b >> p;
        a -= 1; b -= 1;
        edges.push_back(edge(a, b, (__float128) 1.0 - p));
    }
    floydwarshall();
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            cout << (long double) dist[i][j] << ' ';
        }
        cout << endl;
    }
    return 0;
}

Quick Notes on Consistent Hashing

What happens to hash functions when the range changes? Can we avoid chaos?

The paper shows the existence of such a “consistent hash function family” with the below properties.

Consistency Properties

Here, the buckets are the total range of the hash function (i.e. the modulus); this can increase to accommodate load.
Here, a view is a subset of the buckets. As the number of buckets increases, pre-existing knowledge about the buckets naturally becomes a subset of the whole system.

Balance
– The hash function hashes items within a view with approximately uniform probability.

Monotonicity
– If a view expands and an item hashes to a bucket contained in the previous view, they are hashed to the same bucket.
– (Items are either hashed to a new bucket or stay in the old one.)

Spread
– Across substantial views, an item hashes to only a small number of distinct buckets (i.e. minimally distinct).

Load
– Across substantial views, there’s a reasonable limit to the number of items hashed to each bucket.
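
As a concrete illustration of how these properties are usually realised, here is a minimal hash-ring sketch (my own toy C++, not from the paper; the replica count and the use of std::hash are arbitrary choices). Each bucket is hashed to many points on a ring, and an item belongs to the first bucket at or after its own hash point; adding a bucket only claims the items that land on its new points, which is monotonicity in action.

#include <cstddef>
#include <functional>
#include <map>
#include <string>

class HashRing {
    std::map<std::size_t, std::string> ring_; // ring position -> bucket name

    static std::size_t point(const std::string& bucket, int replica) {
        return std::hash<std::string>{}(bucket + "#" + std::to_string(replica));
    }

public:
    // each bucket occupies many points, so Balance approximately holds
    void add_bucket(const std::string& name, int replicas = 100) {
        for (int r = 0; r < replicas; ++r) ring_[point(name, r)] = name;
    }

    // an item belongs to the first bucket at or after its hash point
    const std::string& bucket_for(const std::string& item) const {
        auto it = ring_.lower_bound(std::hash<std::string>{}(item));
        if (it == ring_.end()) it = ring_.begin(); // wrap around (ring must be non-empty)
        return it->second;
    }
};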

Reading – Device Tree

What is Device Tree and how are they used?

Device Tree is a specification used to perform boot-time configuration of a kernel, whereby information about the running system is provided by the firmware to the kernel instead of being hard-coded or handled through the kernel module loading/blacklisting process (i.e. “load this kernel module for this device”).

This Raspberry Pi document describes the format in introductory form. This blog post describes how it gets used.

Before Device Tree, there existed a format called ATAG (tagged list). It is essentially a variable-length array of variable-length structs with identifying tag types. It is terminated by an ATAG_NONE tag. This document describes this format in detail.

This Linux ARM booting document describes both methods, as well as the overall booting process as expected by the Linux kernel.

Device Tree is passed to the kernel upon boot as an in-memory structure. With Linux booting on ARM, the r2 register is used for Device Tree with a magic value stored at that address to indicate its usage.

Device Tree overlays exist for further configuration of device tree files. This document describes the methods on Raspberry Pi.

Device Tree is mainly used on systems where device enumeration isn’t possible, so this information must be provided by other means.

This document lays out how to write a Device Tree source file.

What is an example of information provided by Device Tree?

This is the Flattened Device Tree (DTB, FDT) file for Raspberry Pi 4 B. This is the Device Tree source (DTS) provided by Linux. I think the idea is that the kernel provides the source for the boards it supports (i.e. ensuring it contains the information the kernel expects, like register locations), which is then compiled to a blob as part of the kernel build process; a version of that is then distributed by the Raspberry Pi firmware repository. This has some references to it.

This is a template inclusion file (DTSI). In particular, it contains the ARM memory-mapped register address for interfacing with the system timer.

system_timer: timer@7e003000 {
    compatible = "brcm,bcm2835-system-timer";
    reg = <0x7e003000 0x1000>;
    interrupts = <1 0>, <1 1>, <1 2>, <1 3>;
    /* This could be a reference to BCM2835_CLOCK_TIMER,
     * but we don't have the driver using the common clock
     * support yet.
     */
    clock-frequency = <1000000>;
};

The kernel can use this register address and interrupt configuration instead of a hard-coded address. For Linux, it knows to load the timer driver for the bcm2835 platform because of the compatible property, which will then make use of the rest of the information.

Commands used to interact with Device Tree files

Most of the interaction can be done with the dtc command line tool.
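
For example, round-tripping a blob looks like this (-I and -O select the input and output formats; the file names are just the Raspberry Pi 4 ones):

dtc -I dtb -O dts bcm2711-rpi-4-b.dtb -o decompiled.dts
dtc -I dts -O dtb decompiled.dts -o recompiled.dtb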

Adding columns to an existing JsonLines-backed Athena schema

If you use AWS Athena with JsonLines-backed files whose schema changes occasionally, you may think that the only way to change the corresponding Presto schema is to drop the table and create it again. Normally this is not a problem, except you have to spend hours reloading all the partitions…

Also, ALTER TABLE doesn’t work with Athena because it fails with “no viable alternative at input”. (sigh) [Update Early 2020: Now I know this has to do with Athena’s SQL parser! It just means that such a command simply does not exist.]

Not to worry, there is another way! If your Athena instance is upgraded to be backed by AWS Glue Catalog, then simply changing the schema there automatically reflects it on your AWS Athena instance.

(Screenshot: the button you need to press)

This is no different from having an AWS Glue Crawler periodically detect the schema of your data store.

My First Python3 Bug – Breaking Asynchronous Generators

Recently I filed my first bug report for CPython, which turned out to be a release blocker! Here is the code snippet to produce this issue for <=3.7.4.

import asyncio

class TestAsyncGenerator:
    def __init__(self):
        pass

    async def aiter(self):
        yield 1
        yield 2

async def main():
    gen = TestAsyncGenerator()
    async for number in gen.aiter():
        break

asyncio.run(main())

# unhandled exception during asyncio.run() shutdown
# task: <Task finished coro=<()> exception=RuntimeError("can't send non-None value to a just-started coroutine")>
# RuntimeError: can't send non-None value to a just-started coroutine

I’m not too familiar with CPython internals, but looking at the patch it seems that immediately exiting the async for loop, without letting the asynchronous iterator “checkpoint”, trips a check that mistakes the just-started coroutine for an uninitialised one.

This bug fix is coming soon to a CPython distribution near you, along with some semblance of my test code 🙂

Thoughts on the Advanced Operating Systems course

So I’ve reached a point in time where it appears I’ve completed UNSW’s Advanced Operating Systems (COMP9242 2019/T2) course! After many nights of hopelessness towards the end of the course, here are my thoughts.

Don’t do continuations in C… seriously

In the first few weeks of the course, we had to make a decision on the execution model of our operating system. Because we were building our OS on top of a microkernel (The seL4 Microkernel) we were free to choose any model of execution we wanted. In other words, we weren’t limited to the “kernel-on-top-of-userspace” that you encounter in normal OS courses.

Since the initially provided code came with a while loop, we thought the fastest way to get started was an event loop! Remembering that all system calls are IPC messages on a microkernel, if we had a while loop processing these messages already, then an event loop based server is just a simple switch-statement on top! Surely!

My AOS partner initially suggested that we try a multi-threaded model, saying it would be easier to run long-running operations with threads. But I convinced him that our velocity would be higher with the simple event loop model. Oh, how wrong I was…

Turns out, there is so much opportunity for an operating system to be blocked and require continuation at a later time. I thought it would just be limited to I/O syscalls but later I realised that because of demand paging, pretty much any kernel operation required continuations.

Later in the course, we had spent so much time converting our existing code-base to use continuations everywhere (i.e. lots of mallocs, manually defined structs, loops turned into recursion and 2-state machines…) that we were burnt out from ever wanting to look at asynchronous C again…

Also, I realised too late that it’s not fun to write an operating system in this way. Having a single OS thread pretty much restricts you to a monolithic kernel, and we missed opportunities to experiment with a more creative design. (Admittedly, it was better to be kinda on-track than be creative, though.)

What I liked about the course

By far the best part about the course is the power you feel as you progress through it. If you have been puzzled by how systems software is written, and just what the hell those deep-down aspects of Linux and SRE war stories mean and how they actually work and how the hell people come to know about them (“are they just too smart?”), this course solves all your thirst issues!

By solving practical OS design issues where there is no-one to back you up (everyone’s project has different design choices – from execution models, to data structures, and project layout; the tutors can’t help you beyond basic guidelines), you have genuine exploratory discussions with your project partner, and you learn how to navigate trade-offs between simplicity of implementation, performance, and your own sanity.

The lectures aren’t too relevant for your project except for the first few weeks, but their contents are topics you wouldn’t really learn from “web learning” a.k.a. blog posts. The breadth of papers mentioned in the course and the final exam format (critique a published paper) makes it clear to you where you can now start to learn more.

Overall, I’m glad the course is over. “I can have my weekends back”, in the words of my project partner. I also won’t miss debugging where the instruction pointer has a bunch of f’s, when your muslibc mysteriously calls munmap, and looking through the 2000 page ARM manual to find the correct Fault Status Register documentation between all the EL levels… On the other hand, it’s been a course that has finally satiated my curiosity on what “OS-level programming” really means. I can’t wait for the distributed systems course now!

Bonus: Here’s what we had to document for our final submission.

Design and Implementation of Simple Operating System

Advanced Operating Systems Notes

These are some notes I took while taking the Advanced Operating Systems course (COMP9242 19T2) at UNSW. Since the lectures cover a variety of somewhat obscure topics, I’m posting them here so I don’t forget about them later on in my career!

Disclaimer: These notes are my own, and they may not fully appreciate the intention of the lecturers or the contents of the course.

The lecture slides are Creative Commons, and they are available here: https://www.cse.unsw.edu.au/~cs9242/19/lectures.shtml


Lecture Week 1 Monday

“Introduction”

Capabilities

  • Invoke a method on a capability (like an object)

CSpace and CNodes

  • A CSpace represents Mapping[OID, AccessRight]
  • Each CNode is an array of capability “slots”
  • Capabilities: Read, Write, Execute, Grant
  • Operations on Capabilities: Invoke, Copy/Mint/Grant (Duplicate super/lesser), Move/Mutate (Transfer), Delete, Revoke

IPC (Endpoints)

  • More like “Cross-Address-Space Invocation”
  • Endpoints: Strictly synchronous
    • Receiver must be ready
    • This means data is only copied once
    • There is no buffering
    • No asynchronous kernel buffering
  • Do not use IPC for shipping data (don’t use for copy, obviously, copy-cost exists)
  • It is:
    • Protected procedure call
    • User-initiated context switch (because it’s synchronous in lock-step)
  • Endpoint, if the recipient is not highest priority and free, will queue FIFO
  • Think of Endpoints as a Producer-Consumer Pipe, One-Way
  • Need two endpoints for a sane two-way communication
  • How to reply:
    • Call automatically generates a Reply capability for the Server
    • Once-use
    • Reply to a separate Endpoint specially created for this Call
    • Hence, two Endpoints for one Call
  • Call is atomic and cannot be reproduced by Send/Receive
    • No automatic creation of Reply Endpoint
    • Send / Server Preemption (Higher Priority) / Receive was Preempted?!, etc.
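
To make the Call/Reply mechanics concrete, here is a rough user-space sketch, going off my memory of the (pre-MCS) seL4 API; assume ep is an endpoint capability that both sides already hold:

/* client: Call sends, then blocks on the auto-generated reply capability */
seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 1); /* label 0, 1 message word */
seL4_SetMR(0, 42);                           /* arguments travel in virtual registers */
seL4_MessageInfo_t reply = seL4_Call(ep, info);

/* server: Recv blocks until a sender arrives (strictly synchronous, no buffering) */
seL4_Word badge;
seL4_MessageInfo_t req = seL4_Recv(ep, &badge); /* the badge identifies the client */
seL4_SetMR(0, seL4_GetMR(0) + 1);
seL4_Reply(seL4_MessageInfo_new(0, 0, 0, 1));   /* consumes the once-use reply cap */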

Badges

  • Maintaining session state for Endpoints without having every Server implement their own session mechanism
  • A Badged Capability is Minted from the original Capability, and then gives it to a Client
  • This Badged Capability identifies a consumer of the Server

Virtual Registers

  • “Argument” storage for Endpoints
  • Included in thread state

Notifications

  • Equivalent to Binary Semaphores made using bit fields
  • Can be polled (multiple, like select)
  • Maps nicely to interrupts
  • To allow waiting on an Endpoint and a Notification simultaneously, a Notification can be bound to a thread through TCB_BindNotification()

Lecture Week 1 Tuesday

  • Recap of similar lecture content from yesterday

Lecture Week 2 Tuesday

“Operating System Execution Models”

Note: Week 2 Monday was a public holiday

Interrupt Handling

  • IRQHandler and IRQControl
  • IRQControl can create IRQHandler
  • Application given IRQHandler to make them device drivers

Execution Models, Threads, Coroutines

  • Events
    • Event loops and event handler
    • Waits for events
    • Only requires a single stack (no yields)
    • How to handle user level code?
      • A kernel timer can be one of the events (this is the scheduler)
    • No blocking in kernel code
    • Only works on a uniprocessor
  • Coroutines
    • A stack per routine
    • Requires explicit yields
  • Threads
    • Preemptive multithreading
  • Continuations
    • Used in functional languages
    • Much more high level than the above
    • Basically an explicit context switch, but in functional languages
    • Done by getting given a “switch context back to caller” function, and invoking it
    • Basically just multiple return in functional languages
    • Basically just Python generators
    • Can be used to build (yield) where it doesn’t exist

Per-Thread Kernel Stack

  • The UNIX model
  • The state (e.g. per syscall) is explicitly stored by the stack
  • Can postpone execution by yielding (regardless of execution model)
  • Preemption is easy – just nest on the stack

Single Kernel Stack

  • How is blocking handled?
    • Either continuations
    • Or stateless kernels
      • “All it does is it makes it hard to write code” – Kevin

Lecture Week 3 Monday

“Events and Threads”

Two competing papers to guide us to choose which model to use for our own OS project.

Why Threads Are A Bad Idea

(for most purposes)

  • 1995
  • Threads hard to program in
  • For most purposes events are better
  • Only use threads when true CPU concurrency needed
  • Requires
    • Synchronisation
    • Deadlock
  • Hard to debug (data dependencies, timing dependencies)

Why Events Are A Bad Idea

(for high-concurrency servers)

  • 2003
  • Proposes Thread-Event duality
  • No inherent problem with threads
  • Benchmark user-level threads package; still fast

Lecture Week 3 Tuesday

“Caches”

Cache Indexing

  • Cache lines
  • Set associative lookup
    • Tag | Set | Byte
  • 2-way associative
  • Fully associative
    • No indexing, just compare tag
    • Obviously very expensive to compare in parallel
  • Cache associativity vs Paging
    • Traditionally, the cache index bits overlap both the page-number bits and the page-offset bits (e.g. the top 2 index bits come from the page number), so the index comes from both
    • Implications:
      • Odd/Even of page number determines top / lower half of 2-way associative cache
      • Pages fall into equivalence classes as caching is concerned
      • Gives rise to an implicit notion of page colouring – any pages that have different colours cannot compete for cache
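
To make the Tag | Set | Byte split concrete, here is a toy decomposition; the parameters (32 KiB, 2-way, 64-byte lines, 4 KiB pages) are my own example, not from the lecture:

#include <cstdint>
#include <iostream>

int main() {
    const uint64_t line = 64;                     // 64-byte lines: 6 offset bits
    const uint64_t sets = 32 * 1024 / (line * 2); // 32 KiB, 2-way: 256 sets, 8 index bits
    uint64_t addr = 0x12345678;
    uint64_t byte = addr % line;                  // bits 0-5
    uint64_t set  = (addr / line) % sets;         // bits 6-13
    uint64_t tag  = addr / (line * sets);         // bits 14 and up
    std::cout << "tag=" << tag << " set=" << set << " byte=" << byte << "\n";
    // With 4 KiB pages the page offset is bits 0-11, so index bits 12-13 come
    // from the page number: that 2-bit overlap yields four page "colours", and
    // pages of different colours can never compete for the same cache sets.
    return 0;
}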

Cache Misses

  • “The Four Cs”
    • Compulsory miss
      • First access
    • Capacity miss
      • Would not miss on infinite-size cache
    • Conflict miss
      • Would not miss on fully-associative cache
    • Coherence miss
      • Hardware coherence protocol
  • Cache replacement policy
    • Upon replacement, dirty cache lines must be flushed
    • Algorithm must be simple for speed
  • Cache write policy
    • Write-back / write-through
    • A store that misses cache
      • Write-allocate / no allocate
      • (Bypass cache on write cache-miss?)

Cache Addressing Schemes

  • Four possible schemes:
    • Virtually-indexed, virtually-tagged (VV)
    • Virtually-indexed, physically-tagged (VP)
    • Physically-indexed, virtually-tagged (PV) <= Usually nonsensical
    • Physically-indexed, physically-tagged (PP)
  • Virtually-indexed, virtually-tagged
    • Virtually-addressed cache
    • Various incorrect names
      • Virtual cache
      • Virtual address cache
    • Can operate concurrently with MMU
    • Pitfalls
      • Must be flushed on context switch because virtual address
      • Permissions must be processed via the MMU anyway, fast translation but still need to wait for MMU
      • Rarely used these days
  • Virtually-indexed, physically-tagged
    • Indexing concurrent with MMU
    • Used for on-core L1
  • Physically-indexed, physically-tagged
    • Intel claims physically indexed L1 cache
    • Cache size is picked so that it behaves like one
    • If index bits are all within offset bits, they are invariant under translation

Cache Issues

  • Structure software for cache performance
  • If virtual addresses are used for indexing, then these:
  • Homonyms
    • Same address, different data
    • May access incorrect data!
    • Prevention: tag VA with ASID
  • Synonyms (aliases)
    • Different address, same data
    • May access stale data!
    • Not a problem for R/O data or I-caches
    • Late write-back “Cache bomb”
    • Also DMA inconsistency issues kind of like synonym issues
    • Fixes
      • Flush cache – doesn’t fix aliasing within address space
      • Detect synonyms in either hardware or software
      • In other words, the hardware restricts VM mappings so synonyms map to the same cache set
      • Rely on hardware 😛

Write Buffer

  • Stores take time to complete
  • Avoid stalling the CPU by buffering writes FIFO before it hits the cache
  • Gives rise to weak memory ordering
    • Local CPU can grab things out of local write-buffer, remote CPUs may not see this write-buffer
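
A sketch of the classic store-buffer litmus test that demonstrates this (my own example; whether the weak outcome is actually observable depends on the hardware memory model):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void cpu0() {
    x.store(1, std::memory_order_relaxed); // may sit in CPU0's write buffer...
    r1 = y.load(std::memory_order_relaxed);
}

void cpu1() {
    y.store(1, std::memory_order_relaxed); // ...while CPU1's store sits in its own
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t0(cpu0), t1(cpu1);
    t0.join(); t1.join();
    // Each load can complete before the other CPU's buffered store becomes
    // visible, so r1 == 0 && r2 == 0 is a permitted outcome under weak ordering.
    return 0;
}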

Translation Lookaside Buffer

  • TLB sizes have remained relatively similar historically

Lecture Week 4 Monday

“Virtualisation”

Virtual Machine

  • “Efficient, isolated duplicate of a real machine”
  • Duplicate
    • Should behave almost identically, except for resources or timing differences
  • Isolated
    • Multi homed without interference
  • Efficient
    • Run on bare metal
  • Hypervisor / Virtual Machine Monitor
    • Interchangeable terms
    • Software layer implementing the virtual machine

Types of Virtualisation

  • Platform (Type 1) VM
  • System (Type 2) VM
  • OS-level VM
  • Programming Language VM
  • We’ll mostly look at Type 1, and some of Type 2

Why VMs?

  • First designed to multiplex the hardware before true time-sharing OS
  • Then reborn to solve the same problem, poor resource isolation by mainstream OS
  • Use case: Cloud computing
    • Many benefits
  • Type 2 has very high exception handling cost (e.g. signal handling)

Virtualisation Mechanics

  • Instruction Emulation
    • Trap-and-Emulate approach
      • Expensive (i.e. trap frame)
      • But few instructions trap, so cost is amortised
    • Hypercalls
      • Hypervisor-specific ISA?
    • Trap-and-Emulate requires specific host ISA support
      • Privileged and Sensitive Instructions
      • T&E virtualisable = All sensitive instructions are privileged
    • “Impure” virtualisation if T&E not supported
      • Binary translation of privileged instructions
        • Replace with trap “hypercall”
        • Insert in-line emulation for some others

Virtualisation Address Translation

  • Two levels of address translation
  • Introduce a notion of “Guest Physical Memory”
  • But, must be implemented with single MMU translation
  • 1. Shadow Page Table
    • Hypervisor shadows (keeps a copy of) real guest page table
    • VMM resolves two page table layers at once (see the sketch after this list)
    • Only works on hardware-loaded page tables so the VMM can find out where the guest’s page table base pointer is
    • (Of course, this instruction is privileged so it’s trapped)
    • We write-protect the guest page table, so we trap every write to it
    • Shadow PT has TLB semantics (weak consistency)
      • Only gets synced at synchronisation points like a TLB
        • Page fault
        • TLB flush
  • 2. Real Page Table
    • Directly translate on every read/write to Guest PT so that it contains the actual physical mapping, but abstract so that it sees only guest-level mappings
    • Instead of trapping on every read/write, do inline binary translation so we don’t trap
    • Or para-virtualise, use a hypercall
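
To spell out what resolving two layers at once means, here is a toy composition of the two mappings (page tables reduced to plain maps; all names and numbers are mine):

#include <cstdint>
#include <map>

using PageTable = std::map<uint64_t, uint64_t>; // page number -> page number

uint64_t translate(const PageTable& pt, uint64_t addr) {
    const uint64_t page = addr >> 12, offset = addr & 0xfff; // 4 KiB pages
    return (pt.at(page) << 12) | offset;
}

int main() {
    PageTable guest_pt{{0x10, 0x20}}; // guest VA page 0x10 -> guest PA page 0x20
    PageTable vmm_pt{{0x20, 0x30}};   // guest PA page 0x20 -> host PA page 0x30
    // The shadow page table must encode the composition of both layers, so the
    // single hardware MMU translation goes guest VA -> host PA in one step:
    uint64_t host_pa = translate(vmm_pt, translate(guest_pt, 0x10234));
    return host_pa == 0x30234 ? 0 : 1;
}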

Guest Self-Virtualisation

  • Minimise traps by holding some virtual state inside guest
  • For example, interrupt handling virtual interrupt flag
  • Only if guest and VMM agree on using virtual interrupt flag
  • No need to trap interrupt flag edits

Device Models

  • Emulated, Split (Para-Virtualised), Passthrough
  • Emulated
    • Emulate device register accesses
    • Very slow, because requires trapping on every device access
    • If too slow, real device has time constraints, breaks abstraction
    • For example, Linux does boot time speed checks
  • Split Driver
    • Must port drivers to hypervisor
    • Low overhead
    • Basically a driver RPC
    • Example: Xen Dom0
      • Trusted Computing Base runs “Driver OS”
  • Pass Through
    • Can have hardware support for I/O MMU for hardware guest separation
    • Can also share hardware itself if hardware itself supports it

ARM Virtualisation Extensions

  • Hyp Mode
  • Configurable Traps
    • Can configure the hardware to trap directly into the guest OS, instead of the hypervisor having to forward the trap to the guest OS
    • Similar for x86
  • Emulation support
    • Because of separation of I-Cache and D-Cache, we get a compulsory cache miss upon any instruction trap
    • HW decodes instruction so we don’t have to access cache
  • 2-stage Translation
    • Hardware walks both guest and VMM page tables
    • Eliminates “Virtual TLB”
    • A bit slower, even though VMM PT is shallow
    • Worst case is massively slower, i.e. page fault on every single level
    • But generally a good idea
  • Virtual Interrupts
  • System MMU (I/O MMU)
    • Like an MMU for DMA

World Switch

  • Intel
    • VM state is <=4KiB
    • VMentry / VMexit
    • Hardware saves and restores
  • ARM
    • VM state is 488B
    • Save and restore done by hypervisor

KVM

  • KVM is not Type-2!
  • A “Hybrid Hypervisor”
  • Fundamentally a Type-1
  • Works by using the fact that virtualisation privilege is orthogonal to normal ring privilege
  • “Really bloody HUGE” Trusted Computing Base
  • Reuse Linux drivers

Fun Things with VMs

  • See Slide

Lecture Week 4 Tuesday

“Measuring and Analysing Performance”

Benchmarking in R&D

  • Conservative: No regression
  • Progressive: Actual improvement in important cases
  • Must analyse and explain results
    • Discuss model
    • Hypothesis for behaviour
    • Test and confirm hypothesis
  • Microbenchmarks (component) and Macrobenchmarks (full system)
  • Obtaining an overall score
    • Absolute score rarely matters
    • Compare systems relative to each other
    • BUT: Arithmetic mean is meaningless for relative measures
    • Use: Geometric mean – invariant under normalisation
  • Benchmarking crime: Throughput degradation = overhead
    • Use proper unit of figure: processing cost per unit data

Profiling

  • Tool: gprof, but can’t be used inside OS
  • Performance Monitoring Unit (PMU)
    • Hardware-based profiling unit
    • Collects certain events
    • A counter that counts events
    • Linux: oprofile
    • Mainly CPU based events

Performance Analysis

Understanding Performance

  • Some benchmarking tables and how to interpret them

Lecture Week 5 Monday

“Real-Time Systems”

Real-Time Tasks

  • Event, Jitter, Release, Processing Time, Completion, Deadline
  • Deadlines are frequently implicit – process things without missing an event
  • Safety-critical: Failure; Death, serious injury
  • Mission-critical: Failure; Massive financial damage

Weakly-Hard Real-Time

  • Where occasional deadline misses can be occasionally compensated
  • A cost of a miss doesn’t go to infinity (i.e. recoverable or probabilistically ok)
  • But; What does it mean to have a probability? Is it safe to base it on observed times?
  • To Gernot, probabilistic real time is bogus because you usually can’t assign real probabilities

Firm Real-Time Systems

  • Typically represented by a gain function (lower latency is advantageous), not a cost function (upon failure)
  • For example, trading systems

Soft Real-Time

  • Tail latencies
  • For example, cloud providers, web services, video streaming
  • Characterised by Quality of Service
  • “Bounded tardiness” – just some term

Best-Effort

  • In practice, duration is rarely totally irrelevant

What does it mean for OS?

  • Designed to support real-time
    • Fast and predictable interrupt handling
  • Main duty is scheduling tasks to meet their deadline
  • RT vs OS terminology
    • Task vs Thread
    • Job vs Event
  • Problem is harder than bin-packing; because time is not fungible
  • Feasible – there is a known algorithm that will schedule them
  • Optimal – can schedule any task that is feasible

On-Line RT Scheduling

  • Execution order is not predetermined
  • Fixed priority scheduling – L4
    • Scheduler always runs highest-priority thread
    • Will preempt to run a higher-priority thread
  • Rate Monotonic Priority Assignment (RMPA)
    • Period, Rate, Priority, Utilisation
    • Schedulability test guarantee
      • For n → ∞, schedulable if utilisation is <= 69.3% (see the check after this list)
      • Sufficient but not necessary (can get “lucky”)
    • RMPA is optimal for FPS
  • Dynamic priorities: Earliest Deadline First
    • Also has a schedulability test guarantee: U <= 100%
    • Assumption: All the tasks are completely independent of each other
    • Not recommended by Gernot
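
The 69.3% figure is the limit of the Liu & Layland utilisation bound U <= n(2^(1/n) - 1) as n grows; a quick numerical check (my own, not from the lecture):

#include <cmath>
#include <iostream>

int main() {
    // Liu & Layland schedulability bound for RMPA with n tasks
    for (int n : {1, 2, 3, 10, 100}) {
        double bound = n * (std::pow(2.0, 1.0 / n) - 1.0);
        std::cout << "n=" << n << "  U <= " << bound << "\n";
    }
    // n=1 gives 1.0, n=2 gives ~0.828, and the bound falls towards
    // ln 2 ~ 0.693 as n grows: the 69.3% figure above.
    return 0;
}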

Resource Sharing

  • Locking vs Delegation
    • RT Terminology: Resource Server
  • How to implement delegation for seL4
    • seL4 IPC for single-core (Hoare-style monitor), notifications for multi-core (semaphores)
  • Problem: Priority Inversion
    • May have multiple layers of dependencies, if high-priority thread depends on low-priority
    • How do you minimise it?
  • Solution 1: Priority inheritance (“Helping”)
    • Do what’s most beneficial for highest-priority task
    • Bump priority of subtasks to the priority of the waiting thread
    • Transitive inheritance => Possibility for long blocking chains
    • Potential deadlocks
    • Messy, easy to get wrong for safety-critical systems
  • Solution 2: (Immediate) Priority Ceiling Protocol (IPCP)
    • Predefined priority associated with a shared resource (lock)
    • Make thread exit critical section as soon as possible (use “ceiling” priority)
    • Each task must declare resources at admission time
      • In a real time system, this is okay to require
      • Easy to enforce with capabilities
    • FPS: bump priority, EDF: floor of deadlines

Scheduling Overloaded RT Systems

  • Cannot trust everything to obey declared worst-case execution time
  • Which job will miss the deadline?
  • Fixed Priority Scheduling: Loser is well-defined
  • Priority and Criticality, in FPS the two are the same
  • In EDF, priority is implicit so there’s no notion of criticality, so the loser is not well-defined

Mixed-Criticality Systems

  • High priority but low criticality tasks (device drivers)
  • Low priority but high criticality tasks (control loops)

Lecture Week 5 Tuesday

“Microkernel Design & Implementation”

“If you find yourself hand waving a lot this is a sure sign you’re missing a diagram” – Gernot

“Security is no excuse for bad performance” – Gernot

Traditional L4 and seL4 Improvements

  • Recursive address spaces
    • If you have a mapping, you can just give it to any other process through IPC
    • Some issues exist with this model apparently
  • Thread IDs as IPC addressing
    • seL4: Mailbox model with badges; communication partners are anonymous
    • Also more useful to have badges, which has more flexible semantics
    • Traditional: Forces you to have a load balancer thread, which complicates programming
    • seL4: Transfer endpoint capability upfront for load balancing, get them back later, like a kernel-managed queue for workers
    • Traditional: No interposing; lest violate truth of Thread ID authentication
  • Long IPC
    • Kernel-managed zero-copy IPC
    • Page faults? Which takes ages for a microkernel…
    • Very complicated
    • Don’t need a kernel to have a trusted third party with access to both processes’ mappings
  • Timeouts
    • IPC calls with timeouts
    • How does a user determine the timeout values? Only really reasonable in networking
    • seL4: New semantics for call/reply reduces this problem for server/client cases
  • IPC Fastpath
    • 1. Prologue
    • 2. Identify destination
      • Check caps
    • 3. Get receiver TCB
      • Check priority
    • 4. Mark sender blocked and enqueue
      • Create reply cap
    • 5. Switch to receiver
      • Leave Message Registers untouched
    • 6. Epilogue
    • Notice:
      • No scheduler invocation, just direct lookup
      • No time-slice donation
  • Kernel runs with interrupts disabled
    • No concurrency control means simpler kernel
      • Easier reasoning about correctness
      • Better average case
    • Long-running system calls? Bad.
    • Interrupt latency? Bad.
    • Most protected-mode RTOS are fully preemptible
      • Just save registers (trapframe) and re-enable interrupts
      • According to Gernot, preemption during IPC makes the code very hard to guarantee correctness, not a good idea for high assurance environments
      • It’s also very slow
    • How to ensure liveness if we don’t enable interrupts in the kernel?
    • Incremental Consistency Paradigm
      • To ensure liveness, break operations down into a bunch of O(1) operations, ensure kernel consistency after each, and poll for interrupts in between
      • Consistency, restartable progress
    • Worst-Case Execution Time Analysis
      • 1. Program Binary
      • 2. Control-flow graph, Micro-architecture model, Loop bounds
      • 3. Analysis tool
      • 4. Integer linear equations, Infeasible path (mutually exclusive paths) info
      • 5. ILP solver
      • 6. WCET
    • ARM stopped providing WCET information…
    • RISC-V implementations may supply them!
  • Scheduler optimisation – Lazy scheduling
    • Frequent blocking and unblocking in microkernel model
    • Don’t want to modify scheduler queue at every IPC
    • Leave blocked threads in ready queue, leave for scheduler to clean up
    • Worst case unbounded scheduler operation! O(n) for threads in system
    • Not the way to do real time
    • Now uses Benno Scheduling
  • Scheduler optimisation – Direct Process Switch
    • Idea: Don’t invoke the scheduler if you know who’ll be chosen, i.e. if the receiver has higher priority
    • Implication: Because scheduler doesn’t run, implies time slice donation – receiver runs on sender’s time slice
    • Solution: Time represented by capabilities
      • Scheduling context capability
      • Represents a right to use time
        • T: Period
        • C: Budget <= T
      • Represents basic unit for mixed criticality
    • Problem has been bothering Gernot for 20 years
    • Notion of a passive server
      • No scheduling context
      • Runs on donated scheduling context (uses up client’s scheduling context)
      • So-called Migrating Thread Model
      • Problem: Timeslice runs out while server is running, what is the next behaviour?
      • Passive server may be blocked due to a low criticality consumer’s timeslice which will then block the high criticality consumer
      • Policy free method for solving this in seL4: raise an exception and let the user handle it
      • Timeout exceptions – policy-free mechanism for dealing with budget depletion
        • Possible policies
        • Emergency budget to leave critical section
        • Rollback, it’s client’s problem
  • SOSP ‘93 and SOSP ‘95, two papers that describe tricks and design decisions in the original L4 kernel
    • Some were kept, some were removed

Reflecting on Changes

  • Original L4 shortcomings
    • Insufficient and impractical resource control
    • Over-optimised and needlessly complex IPC abstraction
  • Completes a 30-year dream of a verified OS

Lecture Week 6 Monday

“Security Fundamentals”

Security Design

  • Saltzer & Schroeder [SOSP ‘73, CACM ‘74]
  • Economy of mechanism
  • Fail-safe defaults
  • Complete mediation – check everything
  • Open design – not security by obscurity
  • Separation of privilege
  • Least privilege – principle of least authority
  • Least common mechanism
  • Psychological acceptability – if it’s hard to use, it won’t be used

Mechanisms & Policies

  • Access control is the central domain of operating systems
  • Confidentiality – prevent read of Hi
  • Integrity – prevent tampering of Hi
  • Availability

Risk Management

  • Mitigate (security mechanisms) or Transfer (insurance)

Covert Channels

  • Intentional leakage
  • Storage channel
    • Use an attribute of a shared resource, e.g. a filename that contains secrets
  • Timing channel
    • Temporal order of use of shared resource
  • Other physical channels
    • Power draw etc.
  • Covert channels vs side channels
    • Covert channels are intentional
    • Side channels are unintentional

Access-Control Principles

  • Representing protection state
    • Think of a matrix
    • (Vertical) Access control list – a list per object covering everyone; gets big
      • Compacted – UNIX stores entries only for user, group, everyone
    • (Horizontal) Capabilities
  • Capability
    • A reference and an access token, combined
    • Implies a naming system to reference objects
  • Capabilities must be unforgeable
    • Either protected by hardware (tagged memory protection)
    • Or protected by sparseness (high number of bits)
    • Or protected as privileged kernel data (e.g. the kernel can verify a cap through a tree)
  • ACLs & Capabilities – Duals of the access control matrix

Differences of ACLs and Capabilities

  • Naming and name spaces
    • ACLs: Objects referenced by name; access is orthogonal to reference
    • Capabilities: the cap is the name – reference and access are combined
    • Problem: Confused deputy (see the sketch after this list)
      • The classic gcc log-file example
      • Arises because of ambient authority
      • Fundamentally unsolvable with ACLs
    • With caps, no confused deputy
      • No ambient authority
      • Alice can’t even pass the log file without holding a capability for it
  • Meta-permissions
    • ACLs require meta-permissions to modify ACLs
    • Caps use delegation and revoking, built-in fine-grained meta-access
    • Caps: Right to delegate vs right to use
  • Process creation
    • ACLs: Child inherits all rights – opposite of least privilege
    • Caps: Child has no caps by default – defaults to least privilege
  • Auditing of protection state
    • “Who has rights to this thing?”
      • Easy with ACLs
      • Hard to solve with sparse caps
      • seL4: CSpace makes this possible
    • “What rights does this thing have?”
      • Caps: Just look at caps
      • ACLs: Close to impossible, need full scan
  • Interposing access
    • Caps are opaque, so a server can interpose or masquerade behind them
    • Transparent virtualisation of access (e.g. containers)
    • “Security Monitor”
    • Example: Lazy object creation
      • Objects can be lazy references
      • Server can create the corresponding state on first use
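
Here’s a toy Python rendition of the confused-deputy point (the file names and the compiler framing are illustrative, not from the lecture): under ACLs the deputy opens whatever name the client supplies using its own ambient authority, while with capabilities the client can only hand over authority it actually holds.

```python
def compile_acl(source: str, logfile: str) -> None:
    # ACL world: open() is checked against the *deputy's* rights, so a
    # client can name a file (say, a billing file) that the client
    # itself could never write.
    with open(logfile, "w") as log:
        log.write(f"compiled {source}\n")


def compile_cap(source: str, log) -> None:
    # Capability world: the client passes an already-writable object.
    # No ambient authority means nothing for the deputy to misuse.
    log.write(f"compiled {source}\n")


# The client can only construct this "capability" for files it may open:
with open("/tmp/build.log", "w") as log:
    compile_cap("hello.c", log)
```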

Mandatory vs Discretionary Access Control

  • Discretionary: As owner of file, I can change access
  • Mandatory: System administrator defines policy, owner cannot change
  • Bell & LaPadula (BLP) Model [1966]
    • Modelled on national security classification
    • Classification – clearances
    • Orthogonal compartments (need-to-know)
    • Rules:
      • No read up (obviously)
      • No write down (otherwise High could leak to Low – equivalent to Low reading up)
      • Example: Military, WTF, can’t give orders down?!
      • In practice, need exceptions for de-classification
      • Attestation for permissions to send down
      • Trend to over-classify, not good
  • Biba model – similar, but enforces integrity
    • Now, don’t think of secrets but think of data corruption
    • No read down
    • No write up
    • Don’t want data flowing up, this time
  • Put them together and… (see the sketch after this list)
    • Now no information can flow in any direction :/
    • Strong *-Property (Star)
      • Allow read down, trusting the high-integrity side to be responsible enough not to corrupt itself
  • Boebert’s Attack
    • Famous argument against capabilities
    • “On the inability of an unmodified capability machine to enforce the Star-property”
    • Doesn’t work on seL4 because caps are orthogonal to data
    • Needs the Grant right both on the cap and on the IPC endpoint (and the Hi-integrity side will deny this)
    • Need mechanism to limit cap propagation: take-grant model on seL4
  • Decidability
    • Given initial state, can it ever reach an unsafe state?
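
The four rules are compact enough to state as executable checks. A minimal sketch, with levels as plain integers (higher = more classified, or higher integrity):

```python
def blp_can_read(subj: int, obj: int) -> bool:
    return subj >= obj   # no read up


def blp_can_write(subj: int, obj: int) -> bool:
    return subj <= obj   # no write down


def biba_can_read(subj: int, obj: int) -> bool:
    return subj <= obj   # no read down


def biba_can_write(subj: int, obj: int) -> bool:
    return subj >= obj   # no write up


# Enforce both models at once and only same-level flows survive,
# which is exactly the "no flow in any direction" problem above:
def can_read(s, o): return blp_can_read(s, o) and biba_can_read(s, o)
def can_write(s, o): return blp_can_write(s, o) and biba_can_write(s, o)

assert not can_read(1, 2) and not can_read(2, 1)
assert can_read(1, 1) and can_write(1, 1)
```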

Lecture Week 6 Tuesday

“Security: Information Leakage”

Causes of Timing Channels

  • Algorithmic (e.g. hashing)
  • Resource contention (e.g. caches)
    • Stateless: e.g. interconnect bus speed
      • Limitations:
      • Only during concurrent access
      • Generally reveals no data
      • Requires intentional actor to be used as a covert channel
      • Not really possible to use as side channel
    • Stateful:
      • CPU Caches
      • Branch predictor
      • These are all capacity-limited resources

Timing Channels

  • Prime and Probe
    • Challenge: access times
    • L1 is very high bandwidth
    • The larger, outer-level caches take longer to traverse when probing
    • Approach:
      • Probe a few cache lines at a time, find “interesting sets”
      • E.g. look for `square` code from GnuPG
      • Can read out bits of exponent! [Liu S&P’15]
  • Prevention:
    • Spatial & temporal partitioning
    • Spatial: Cache colouring!
    • Temporal: Flush cache!
  • However: Cannot spatially partition on-core cache (L1 cache)
    • L1 usually has only 1 colour (if virtually indexed)
  • Flushing L1 on a domain switch costs little, because the incoming domain would find the cache trashed anyway
  • However: Flushing useless for multicore / SMT

Evaluating Intra-Core Channels

  • Methodology: Channel Matrix
    • Difference when moving horizontally (changing set size) means channel
  • Reference: https://ts.data61.csiro.au/publications/csiro_full_text//Ge_YCH_19.pdf
  • I-Cache Channel (even with flush) demonstration with channel matrix
  • Branch-History-Buffer (BHB)
    • 1-bit channel
  • Means the channel can’t be closed – temporal protection is impossible here even with flushing

Intel Spectre Defences

  • “Indirect Branch Control”
    • Tries to close BHB channel by flushing

Speculating Disaster

  • Instruction Pipelining
    • Problem: Dependencies
    • Out-of-Order Execution
      • Includes speculative branching
  • Problem: Speculation pollutes cache

Meltdown

  • Secrets leak because the kernel address space, which maps all of physical memory, is present in every process – every single userspace program leaks
  • Fix (KPTI): trapping now switches to a separate kernel page table, which is expensive

Spectre: Branch Prediction (Variant 1)

  • The CPU speculatively executes past an array bounds check (if statement), and the out-of-bounds access pollutes the cache
  • An invalid array index can thus leak any address in the program
  • ISA has no contract on microarchitectural behaviour
  • We need a new hardware-software contract!

Lecture Week 7 Monday

“Linux, Locking, Many Processors”

(No lecture Week 7 Tuesday)

Some History

  • John Lions, taught UNIX at UNSW
  • A lot of people got their introduction to UNIX with John Lions annotated source code
  • BSD, Paged Virtual Memory 1979, etc.
  • UUCP, Unix-to-Unix Copy, MHSnet using phone calls (lol)
  • Process model, fork exec, memory sharing
  • Dup, Open file descriptor, inode, CLOSE_ON_EXEC
  • Inode model for directories
  • Namei
  • Linux extra keywords, __init, __exit, __percpu, __user, __rcu, __kernel, __iomem, __acquires, __releases, __bitwise
  • Other Linux C dialect / abstractions
  • Linux memory consistency model
  • Linux has its own memory model that abstracts the hardware memory model (i.e. loads, fences)

Scheduling

  • Dispatch O(1)
  • Pretty fair
  • Good interactive response
  • Topology-aware (NUMA)
  • O(log n) scheduling
  • Currently CFS – Completely Fair Scheduler
  • Dual Entitlement Scheduler
    • Each thread given entitlement (CPU time)
    • Deserving and Undeserving queue

Memory Management

  • Struct page
  • Other various structs
  • How Linux handles page faults

Driver interface

  • Enumerable and non-enumerable
    • Enumerable: the device can say who it is; non-enumerable: you can’t find out by asking
  • Device tree for userspace

Containers

  • Namespace isolation
  • The lecturer did research on this in the 70s: a fair-share scheduler developed here was sold to Cray, then to Sun
  • Hierarchical scheduling competition
  • (Sounds like carving out a scheduling capability…?)

Scalability

  • Performance cliff at high loads due to coherence delay
  • Amdahl’s law and Gunther’s Universal Scalability Law
  • Doing without locks
    • Sequence locks – lots of readers, few writers (see the sketch below)
  • Read Copy Update
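
A sketch of a sequence lock in Python (illustrative only: a real seqlock needs memory barriers around the counter accesses, which CPython’s GIL papers over here). Writers are serialised and hold the counter at an odd value while writing; readers never block, they just retry if the counter was odd or changed underneath them.

```python
import threading


class SeqLock:
    def __init__(self):
        self._seq = 0                        # odd while a write is in flight
        self._write_lock = threading.Lock()

    def write(self, update) -> None:
        with self._write_lock:               # few writers, serialised
            self._seq += 1                   # odd: tell readers to retry
            update()
            self._seq += 1                   # even: write complete

    def read(self, snapshot):
        while True:                          # readers take no lock at all
            start = self._seq
            if start % 2 == 0:
                value = snapshot()
                if self._seq == start:       # no writer ran in between
                    return value


data = {"x": 0, "y": 0}
lock = SeqLock()
lock.write(lambda: data.update(x=1, y=1))
print(lock.read(lambda: (data["x"], data["y"])))  # (1, 1), a consistent pair
```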

Lecture Week 8 Monday

“Multicore and Memory Ordering”

Types of Multiprocessors

  • Uniform Memory Access
    • Processors with local caches
    • Cache coherency!
    • Unit of consistency is the cache line
    • Sequential Consistency
      • The result should be as if all operations were executed in some sequential interleaving
      • In other words, every read sees the most recent write
    • Early solution: Write-through caches
      • CPU snoops bus activity to invalidate stale lines
      • Downside:
        • All writes go to bus
  • Non-Uniform Memory Access
    • Provides high local bandwidth and reduces bus contention
    • Other options like MESI (Modified, Exclusive, Shared, Invalid)
    • Directory-based coherence
      • For really big multiprocessing where broadcast bandwidth for MESI is not adequate

Memory Ordering

  • Example: tail of a critical section
  • Memory accesses can’t be reordered past the unlock, or they’d escape the critical section!
  • Strong Ordering
  • Note: https://preshing.com/20120930/weak-vs-strong-memory-models/
  • Each CPU keeps itself consistent, but how is it viewed by others?
  • Example: Write buffers
    • x86: Total Store Ordering, stores are FIFO
    • Problem: CPU 0’s store sits in its write buffer while CPU 1 reads the stale value from its cache – inconsistency! Apparent reordering of instructions!
    • Memory fences effectively drain the write buffer
    • ARM: Partial Store Order
  • Existing primitives will have appropriate fences in place
    • In practice, correctly synchronised code can ignore memory model
  • But if you write racy code (e.g. lock-free algorithms) you must use fences and barriers
  • x86 is nice: just stores reordered after loads
  • ARM is not so nice

Kernel Locking

  • Trylock; prone to livelock
  • Different types of locks
  • Spinlocks

Lecture Week 8 Tuesday

Continued…

  • Common spin lock
    • Beware of bus arbitration!
    • Bad because requires disabling interrupts for critical paths
  • Test and test and set is not a panacea
  • Performance kinda falls off beyond the 7th processor or so
  • Implementation of MCS Locks
    • Buuut… MCS locks on ARM MPCore?
  • TT&S low overhead on low contention
  • MCS grows less on high contention
  • Dynamically switch between TT&S and MCS depending on the contention level (see the sketch after this list)
  • Case study: Linux performance collapse
    • Collapse for short sections
    • Non-scalable lock
    • Cache coherence protocol
    • MCS locks are the best! 🙂
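
For reference, here’s a conceptual test-and-test-and-set lock in Python (purely illustrative: CPython’s GIL makes user-level spinning pointless, and the atomic test-and-set is emulated with a non-blocking Lock.acquire where real hardware uses a single atomic instruction). The point of the outer read-only spin is that waiting cores hammer their own cached copy of the flag instead of the bus.

```python
import threading


class TTASLock:
    def __init__(self):
        self._held = False               # plain flag: cheap, read-only spin
        self._atomic = threading.Lock()  # stands in for an atomic TAS

    def acquire(self) -> None:
        while True:
            while self._held:            # "test": spin on a local read;
                pass                     # no bus traffic on real hardware
            if self._atomic.acquire(blocking=False):  # "test-and-set"
                self._held = True
                return                   # we won the race
            # lost the race: someone else grabbed it, back to spinning

    def release(self) -> None:
        self._held = False
        self._atomic.release()
```

An MCS lock would instead queue per-CPU nodes so each waiter spins on its own cache line, which is why it degrades so much more gracefully under contention.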

Lecture Week 9 Monday

“Formal Verification and seL4”

Assurance and Verification

  • Common Criteria
  • Semi-Formal = use some form of formal language, not just English, not necessarily mathematical
  • Formal = mathematical
  • Protection Profiles
  • Which OS’s have been evaluated?
    • Mac OS X – EAL 3
    • Windows and RHEL – EAL 4
    • Just about everything tops out at EAL 4, a.k.a. not a lot of assurance
    • 2008 Green Hills INTEGRITY – EAL 6
      • “While the system does not certify common criteria at any evaluation level” says the certification paper, WTF?!
    • 2019 Prove & Run PROVENCORE – EAL 7
  • Green Hills tried the One Box One Wire project (VM isolation on commodity hardware) for the NSA; in March 2019 the NSA said SKPP validation for commodity hardware platforms is infeasible due to complexity, and subsequently discontinued certifying >= EAL 5
  • Common Criteria – a very expensive way to prove nothing – $1000 per line of code
  • Too much focus on development process
  • Common Criteria – Effectively dead in 5-Eyes defence

Formal Verification

  • Started a long time ago, over-promised and under-delivered
  • Two approaches
  • Abstract Interpretation / Model Checking
    • Check a model for certain properties
    • Unsoundness – has false positives and false negatives
    • Suffers state-space explosion
    • Scales to large code bases
    • Doesn’t actually prove anything
  • Theorem Proving
    • Can deal with large state spaces
    • Requires refinement property (can be modelled in mathematics)
    • Can prove functional correctness against a formal specification
    • Very labour intensive
  • Note, recent work automatically proved functional correctness of simple systems using SMT solvers (model checking) [Hyperkernel, SOSP’17]

Model Checking and Linux: A Sad Story

  • [Chou & al, 2001] Applied static analysis to Linux
  • Device drivers are very buggy
  • Linux did nothing lol; static checkers over the open-source code still find faults

Story of seL4

  • August 2009, birth of seL4
  • Abstract Model < (Proof) > C Implementation
  • “Under the semantics of the C language, this program has behaviour that is allowed under the functional specification”
  • Means we can reason about the abstract model to verify properties
  • C Implementation < (Proof) > Binary Code
  • “Translation correctness”
  • Isolation Properties < (Proof) > Abstract Model
  • “Isolation Properties”
  • Also “Worst-Case Execution Time” analysis – never been done for a protected-mode OS before
  • Initially there was a Haskell prototype between the abstract model and the C implementation
  • Kernel never fails, behaviour is always well-defined:
    • Assertions never fail
    • Will never de-ref null pointer
    • Will never access array out of bounds
    • Cannot be subverted by malformed input
    • All syscalls terminate, objects don’t overlap, access control is decidable
  • CompCert – done in France; we can’t use it because it uses a different mathematical framework (Coq rather than Isabelle/HOL), and we can’t easily show the two sets of axioms and lemmas equivalent
    • Being able to use gcc is also a real asset, e.g. for different ISAs

Security Properties of seL4

  • To prove confidentiality:
    • Non-interference proof
    • Evolution of Low does not depend on High state
    • Information flow is a very strong property; to prove it, remove non-determinism wherever it affects confidentiality, e.g. make the scheduler strictly round-robin
    • This may get really hard to work with

Verification Assumptions

  • Hardware behaves as expected
  • Spec matches expectations
  • Proof checker is correct

Lecture Week 9 Tuesday

“Local OS Research”

“Use of a monolithic OS in security- or safety-critical scenarios is professional malpractice!” – Gernot

Quantifying OS Design Security Impact

  • Analyse Linux CVEs
  • If it were a microkernel, is it in the TCB?
    • Not in TCB, attack defeated
    • Else, try verification, attack defeated
    • DoS, strongly mitigated
    • Integrity or confidentiality violation, weakly mitigated (i.e. not full system compromise)
    • Full system compromise, e.g. firmware reflash or I2C bus hijack

Cogent

  • Code and proof co-generation to C
  • Not Turing-complete

Time Protection

  • Equivalent of memory protection but for time
  • i.e. timing channels
  • Channel through kernel code (i.e. shared kernel code I-cache)
  • Performance impact of cache flushing and colouring

Hardware Redundancy / Lockstep

  • Hardware lockstep vs Software lockstep
  • RCoE (Redundant Co-Execution) – student Yanyan’s work
  • Instead of replicating only the application software, the kernel itself is replicated (much as the Resource Manager must be for time protection), and the replicas vote among each other
  • Two variants:
    • Loosely-coupled
      • Don’t synchronise too eagerly – only at kernel entry/exit
      • Low overhead, but can’t deal with racy code or threads
    • Closely-coupled
      • Sync on instruction
      • Precise preemptions
    • How closely-coupled works:
      • The leading replica waits at a barrier
      • The trailing replica sets a breakpoint where the leading replica is waiting, then joins the barrier
      • Now they’re synced

Real World Use

  • DARPA HACMS
  • CAmkES
    • Component middleware
    • Higher level abstractions of low-level seL4 constructs
  • Cross Domain Desktop Compositor

Lecture Week 10 Monday

“Multiprocessor OS”

Scalability

  • Serialisation
  • By the OS
  • Locking introduces serialisation
  • Also atomic operations and cache coherency
  • Memory access can stall
  • Cache-related serialisation
  • False sharing – unrelated data share the same cache line accessed from different processors – cache coherence traffic and delay
  • Cache line bouncing

Multiprocessor Hardware

  • Contemporary
  • ARM, big.LITTLE, Nehalem
  • Interconnects
  • Cluster on Die
  • Tile architecture
  • Rings

Optimisation for Scalability

  • Reduce critical sections
    • Fine grained locking or lock free data structures
    • (Lock data not code)
  • Reduce false sharing (pad to cache lines)
  • Reduce cache line bouncing
  • Reduce cache misses; affinity scheduling

OS Design Guidelines for MP

  • Avoid shared data
  • Explicit communication (regain control over communication costs)
  • Reduce synchronisation
  • Allocate and schedule for locality
  • Uniprocessor performance vs scalability

Approaches for MP

  • Divide and conquer
    • Just use them as independent bits
    • Virtualisation
    • Exokernel
  • Reduced sharing
    • Brute force and heroic effort (fixing them one by one)
    • By design
  • No sharing
    • Distributed system, do extra work to share

Space-Time Partitioning

  • Tessellation OS
  • Gang Scheduling
    • Schedule all threads of a process in one scheduling interval

K42: Clustered Objects

  • A globally valid object reference resolves to a processor-local representative
  • Controlled introduction of locality
  • Controlled trade-off between synchronise at write or synchronise at read

Corey / Linux Brute Force

  • Profiled Linux and achieved speedups
  • Applied lessons from previous research
    • Sloppy counters
    • Per-core data structs
    • Fine grained lock, lock free
    • Cache lines
  • How do we know if a system is scalable? Is there some fundamental blocker?
    • How do you really know without infinite benchmarks?
    • Is the OS API a potential bottleneck?
    • Scalable Commutativity Rule
      • If you can’t tell in which order two operations ran (they commute), they can be implemented in a way that scales
      • Counterexample: the lowest-available-file-descriptor rule (see the sketch below)
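
The counterexample is easy to see from Python, since POSIX requires open() to return the lowest unused descriptor: the results reveal the order of the calls, so two opens don’t commute and, by the rule, can’t be implemented without coordination.

```python
import os

a = os.open("/dev/null", os.O_RDONLY)
b = os.open("/dev/null", os.O_RDONLY)
assert b > a   # whoever ran second got the higher number

os.close(a)    # free the lower number...
c = os.open("/dev/null", os.O_RDONLY)
assert c == a  # ...and the kernel must hand it straight back

os.close(b)
os.close(c)
```

An allocator free to return any unused descriptor would commute, and so could be made to scale.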

FlexSC

  • Async syscall, batch syscalls

Barrelfish

  • No sharing
  • Message passing


Disappearing LogDNA messages on AWS Lambda

At work, I encountered a strange issue where LogDNA was missing some log entries. This was very annoying since the whole point of logs is to have a full picture when debugging issues.

At first I chalked it up to the LogDNA service being flaky, but over time, I noticed this was a systemic issue and so I decided to investigate.

As my assigned projects are in Python, we use the logging module with LogDNA’s Python handler to ingest our log messages. If you look into the module carefully, you’ll notice that it flushes its buffers from a thread in a self-scheduling callback loop.

Because AWS Lambda currently uses Linux’s process freezer, this is highly problematic if not managed carefully. The freezer suspends all threads in our process as soon as the handler function returns, which means the flush callback never runs, the log buffer never gets flushed, and log messages are lost.

To solve this problem, you can implement a simple decorator that flushes all handlers attached to the root logger. Attach this decorator to the Lambda handler function, so that all loggers are flushed before control is returned to the Lambda executor and before our process gets frozen.
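
Here’s a minimal sketch of such a decorator (the names are mine; it assumes, as described above, that each handler’s flush() synchronously drains its buffer):

```python
import functools
import logging


def flush_logs(handler_fn):
    """Flush every handler on the root logger before the Lambda handler
    returns, i.e. before the execution environment can be frozen."""
    @functools.wraps(handler_fn)
    def wrapper(event, context):
        try:
            return handler_fn(event, context)
        finally:
            # Runs even if the handler raised, so error logs survive too.
            for handler in logging.getLogger().handlers:
                handler.flush()
    return wrapper


@flush_logs
def lambda_handler(event, context):
    logging.getLogger(__name__).info("handling %s", event)
    return {"statusCode": 200}
```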

This ensures that the flush thread is joined before continuing, flushing the log buffers and ingesting our log messages. No more lost logs!