C++ is such a contrived language

I moved from C to C++ a couple of years ago, and every time I try to do something a little bit fancy I hit the limitations of the language, where I have to waste time doing really contrived things to make it work, because “architected” language features are often hacked-together solutions more than designs with intent. They look OK in a textbook, not so much when you try to get work done.

If web search engines and Stack Overflow were not readily available, it would be nearly impossible to figure out the arcane behavior of the compiler.

C++ always promises to let me do something, and in the end the little devil comes out, makes it all super complicated, and forces me to either give up or write some obscure, hard-to-parse solution.

Right now I want to make a simple macro that prints any type, numeric, string, or class object, into a stream as JSON, properly pretty printed. It almost works, except that char is printed as a character, not as an integer value, so the output is mangled. Unary + promotes a char so it prints properly, but it is not defined for all types, so it can’t be used in a macro that doesn’t differentiate the input type. And it’s impossible to special-case char, using to_string(), reinterpret_cast<>, or any other hack, without being a master of C++ arcane bullshit.
Here’s an example of C++ BS.
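A minimal sketch of the problem and of one way out (the names are hypothetical, not from my library): a generic print helper works fine for int and float, but mangles char, and a plain overload fixes it by promoting char with unary +:

#include <iostream>

// Hypothetical sketch: a generic JSON value printer.
template <typename T> void printValue(std::ostream & os, T v) { os << v << "\n"; }

// Overload for char: unary + promotes it so it prints as an integer.
void printValue(std::ostream & os, char v) { os << +v << "\n"; }

int main() {
    char c = 65;
    printValue(std::cout, 1234); // prints 1234
    printValue(std::cout, c);    // prints 65, not 'A'
    return 0;
}

The overload wins over the template because it is an exact match, which is exactly the kind of per-type special-casing a plain macro can’t express.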

Another example is the irksome requirement that class declarations in headers include their private sections, which breaks makefile dependency checks and triggers full rebuilds in large projects. Leaking shortcomings of the compiler into the language is a fallacy. I don’t care if the private section entries are needed by the compiler to properly handle linking: hide that shit. Same with templates, which in some cases force you to put source code in headers.

C++ is powerful, but also obscure, with obscure side effects: artifacts of piling features onto a language over time without reworking them to become seamless.

With all the clutter of features, and the piles of standard libraries that are more or less optimized, leak memory, and have corner-case behaviors, C++ still compiles code that is faster and more efficient than all the newer languages. The bar is not very high…

It’s amazing we’re able to build massive projects in C++.

We need a genius to revisit C and make it object-oriented again, but without all the failures of C++, or of Java and the other interpreted languages.


Buck converters

I’ve become more aware of buck converters, and have started buying a crap ton of different ones for the multiple projects where I need to change a voltage. They are super easy to use: connect Vin, Vout, and ground. Often there are 2 ground pins, to help you wire both your load and your battery.

Random comparisons of buck converters to other solutions

A buck converter is a lot more efficient (up to 96%) than a linear LM78xx voltage regulator (LM7805, LM78L05…), though the efficiency curve is often unclear: 80% to 96%, with heat dissipation implications when it is not operating on the good part of its efficiency curve. A buck converter basically switches the input voltage through an inductor at a clock rate (often around 340kHz), and the switching duty cycle sets the lower output voltage. The output current is smoothed/regulated using a choke and capacitors.

I believe an LM chip burns the watts you don’t use as heat: Wdissipated = (Vdd − Vload) × Iload. If your chip uses more than a few milliamps, it drains your batteries and the regulator gets hot. For reference, a 50mW controller draws 10mA at 5V; with a 12V supply, the LM78L05 wastes 70mW. That is the case of an ATtiny85 driving 12V WS2811 LEDs.

An LM voltage regulator is still better than using a plain resistor bridge to lower the voltage, Vload = Vdd × R1 / (R1 + R2): that relation holds only if Iload is negligible vs. Ibridge, and the current in the bridge is just wasted energy.

Other bad alternatives: connecting your controller to Vdd directly through a resistor; if the current draw of your chip varies a lot, its voltage will too, and could blow the chip when the current drops (because Vload = Vdd − R × Iload). Or connecting your controller to Vdd through diodes, since each one drops about 0.6V: you’d need 12 diodes in series to connect your 5V controller to a 12V power supply, and you’d still waste (Vdd − Vload) × Iload.

Cool buck converters

There’s a couple of tiny buck converters that pack a punch:

  • AMS1117 (strictly a linear LDO regulator rather than a true buck converter, but it fills the same niche)
    Perfect to power a micro controller or other low-power circuit piggybacked on a higher-power, higher-voltage setup.
    It supports up to 1A, and comes either in an adjustable version (you use a resistor bridge on the output, knowing Vref is 1.25V), or in fixed voltages (3.3V and 5V are common), where you wire the ref pin directly to ground to get that output voltage, which is super convenient, precise and stable. They come pre-wired on mini boards for a dime, with the required capacitors (SMD 104 and 106 caps in parallel on each side).
    If you have a 3.3V board but no 5V one, you can put 3 diodes between its ground pin and the main ground to make it regulate to about 5V.
    Vin is limited to 17V; with the diode trick above, that stretches to 18.8V for 5V regulation.
  • Mini 360
    Higher power, at 1.8A continuous and 3A peak, but it heats up real fast and may do a thermal shutdown so it doesn’t melt.
    It is adjustable (1 to 17V) via a tiny trimpot you tweak (so maybe more prone to de-tuning over time).
    The board is about as wide as an AMS1117 board but shorter, so it is super tiny (half a postage stamp, if that)! It costs about 30 cents a piece.

It’s too bad you can’t wire buck converters in parallel to get more amps: it causes instability. However, even 5A buck converter boards are still relatively small and cheap (a couple of dollars).

So many programmable LED types!

As time goes by, I’m discovering more and more types of programmable LEDs!

Do you know of any other kinds, and their specifics or compatibility? Please comment on those!

  • Non-programmable, analog (not supported by FAB_LED)
    • Standard single-color LED: requires a resistor, R = (Vdd − Vdiode) / Idiode, with Vdiode often 3.4V and Idiode often 20mA (regular) or 30mA (bright), so for 5V power, roughly 53 to 80 Ohm.
    • RGB with 4 pins, one pin per color plus a common pin; the colors may have different voltage requirements, from 2.2V to 3.6V, hence different resistors on each pin.
    • Fast- or slow-blink RGB LEDs, wired the same way as above, super convenient for quick color fun. You can’t dim them, because the PWM would reset their blink sequence.
  • One wire (all have compatible clocks, but different color orders)
    • WS2812 / WS2812B
    • WS2811 (often 12V power, with 5V signal)
    • SK6812 (supports up to 4 colors, commonly GRBW)
    • APA104
    • APA106
  • SPI (data+clock pins, easy to control, faster, but more wires)
    • APA102 (the first byte has its 3 top bits set, plus the low 5 bits as a global brightness, applied with a separate, superimposed PWM rate)
    • WS2801 (no first byte)
    • P9813 (first byte has 2-bit checksum of the other color bytes)

Since I will need those for projects, I plan to support WS2801 and P9813 in FAB_LED. That requires revisiting the SPI support I have, as part of my overhaul of the library.
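As a reference for that work, here’s a minimal sketch of an APA102 pixel frame per the notes above (the struct and helper names are hypothetical):

#include <stdint.h>

// One APA102 pixel: a brightness byte (0b111 + 5 bits), then blue, green, red.
struct apa102Pixel {
    uint8_t brightness; // 3 top bits set, low 5 bits = global brightness
    uint8_t blue, green, red;
};

apa102Pixel makePixel(uint8_t brightness5, uint8_t r, uint8_t g, uint8_t b) {
    apa102Pixel p;
    p.brightness = 0xE0 | (brightness5 & 0x1F); // 0xE0 sets the 3 top bits
    p.blue = b;
    p.green = g;
    p.red = r;
    return p;
}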

The hidden math of extra mortgage payments

Don’t prepay to shorten the loan; invest the money instead!

It’s good to lock in a 30Y fixed rate, because interest rates are going to go up and up, and the bank allows you to repay more principal than the mortgage payment requires, to end the mortgage early. But is it to your benefit?

One thing I was told only AFTER signing is that prepaying DOES NOT change your payment schedule… So what does that mean? It means that if you prepay, you’re lending money to the bank for free. You would be better off holding onto that money (possibly investing it), and using it only at the end, to close your mortgage a few years early.

If I’m right, then mortgage calculators like this nice one don’t tell you the truth about early payments.

It turns out that when you ask around, your realtor and the other folks supposedly there to help you purchase a house won’t really help you; they may not even know where the trickery is. It’s hidden in the pudding, but it’s simple, if I figured it out correctly.

Here’s some generic loan numbers, for a $500K loan at 3.5%:

$500K, 30Y fixed at 3.5% = $2,245.22/month for 360 months,
$308,280.34 total interest paid (61.65% of principal)

Now what about shortening the loan by a decade?

$500K, 20Y fixed at 3.5% = $2,899.80/month for 240 months,
$195,951.66 total interest paid (39.19% of principal)

So with a 20Y fixed, you save $112K. In fact, you would likely have gotten a lower 3.25% or 3% rate ($180,634.91 or $165,517.12 interest, so another $15K to $30K in savings).

What about prepaying the same monthly sum as the 20Y payment? Silly me, I thought you could just prepay and get the same math, just at the cost of the slightly higher interest rate of the longer loan… So let’s do the math: what if you pay $2,899.80 every month on the 30Y loan? That’s an extra $654.58 per month…

Instead of ending in 2047, the loan ends in Nov 2039, two years after the 20-year mortgage would, having paid $284,226.20 in interest (that’s because the schedule is fixed!). That’s $88.3K more to the bank.

OK, so if you overpay for 2 more years, you should compare to a 22-year loan then! Well, that loan would have cost $217,657.34 in interest. Still $67K less.

To make it clear:

I think a loan schedule computes the 3.5% interest you owe each year and splits it by 12: $500K × (3.5% / 12) = $1,458 of interest for the first month. Then the principal part of each payment is set so that the loan lasts exactly 30 years.

This generates a schedule like so. Now the schedule is fixed. This means that if you pay extra early, the interest you owe at that time is not recalculated (it should be, since you now owe less principal); it just keeps following the schedule. You have only reduced the principal, without changing the interest you pay monthly.
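For comparison, here is a minimal sketch (simplified: monthly compounding, no fees or escrow) of the standard amortization rule, where interest IS recomputed on the remaining balance every month; on a frozen-schedule loan like the one described above, the interest column keeps following the original table instead:

#include <cstdio>

int main(void) {
    double balance = 500000.0;            // principal
    const double rate = 0.035 / 12.0;     // monthly interest rate
    const double payment = 2899.80;       // the 20Y payment, sent to the 30Y loan
    double totalInterest = 0.0;
    int months = 0;
    while (balance > 0.0) {
        double interest = balance * rate; // recomputed on what you still owe
        totalInterest += interest;
        balance -= payment - interest;    // the rest of the payment is principal
        months++;
    }
    printf("Paid off in %d months, $%.2f total interest\n", months, totalInterest);
    return 0;
}

Under that rule, the extra payments end the loan in about 240 months with roughly the 20Y interest total, which is why a frozen schedule, if that’s really what the lender applies, is such a bad deal.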

If you instead kept the money in the bank and just plunked those $176,736.60 down to close the mortgage in Nov 2039, you would not have given the lender a penny more…

In fact, if you kept that money invested at 3.5%, you would break even with the 20Y mortgage, gaining $91,538.79 of interest for a total of $268,275.39 by Oct 2039. My math must be wrong somewhere, because it shows you could end the loan in April 2027, not Dec 2037…

Windows hidden tricks

I came across this old article and discovered a few things:

  • Microsoft Virtual WiFi Miniport lets a PC act as a wifi access point, e.g. by installing Connectify.
  • Erase free disk space on C: with “cipher /w:c”.
  • Move the active window with Windows + arrow keys.
  • Pin folders to the taskbar with right-click + drag to the taskbar. Access them by right-clicking the taskbar’s File Explorer icon.
  • Get a power efficiency report: search for cmd.exe, launch it as admin (right-click), run “powercfg -energy”, then move \windows\system32\energy-report.html out and open it.
  • Record the steps you take in Windows to send to someone: search for and run “psr”, which creates an MHTML file to share.
  • Verify the reliability of apps by searching for and launching “View reliability history”. It shows a score graph with the issues listed.
  • WordPad supports .docx and OpenOffice .odt files.
  • Calculator has unit conversion, date calculation, mpg, and mortgage modes.
  • Create a “GodMode” folder anywhere, named as below, and it will list all the Windows configuration options:

GodMode.{ED7BA470-8E54-465E-825C-99712043E01C}


C++ typeof and typeid

This is a small discussion of a solution I posted on Stack Overflow, related to a need in my FAB_LED coding.

I came across the need to compare variable types while writing my C++ LED library class, because of my reliance on template types. This capability becomes useful when you have a template class which itself defines a template function: the user may set both to the same type, or to different types.

If you read through my example and think of other ways to implement the same thing with simple function polymorphism, you will notice that this solution is more elegant: there is only one code block handling all the use cases, and it can be automatically specialized/optimized at compile time. It is also efficient and compact, as well as much simpler to maintain.

C++ shortcomings

  • typeid() (C++ only) is meant to work the way sizeof() does, but is apparently a runtime check: it works on classes because they automatically define a type field with a unique type ID, and it is implemented for base types too (int, float, etc.). Since it doesn’t resolve statically even in cases where all the info is known at compile time, it can’t be used for code optimization, and is of little use for my Arduino library.
  • typeof() seems rather useless to me, except maybe to write contrived C macros with hidden temporary variables. From my attempts, you cannot use it to do a compile-time type check; it can only be used as a replacement for a type in a variable declaration.

So the built-in support is not there. Bummer 😦 It would have avoided managing types by hand.
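For what it’s worth, a C++11 toolchain (not a given for Arduino builds at the time) does offer a real compile-time comparison via std::is_same; a minimal sketch:

#include <type_traits>

// Resolves entirely at compile time; usable to guard code blocks.
static_assert(std::is_same<int, int>::value, "types match");
static_assert(!std::is_same<int, char>::value, "types differ");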

My solution…

I henceforth came up with a hybrid solution, which works for my use case. My structures now all declare a static type field, loaded with a unique value:

class X { static const uint8_t type = 2; ...}

Because it is a static const, it is not allocated with each instance of the class/struct. Once the compiler analyzes the code, it also ends up being removed completely, having filled its purpose of guarding unused code blocks.

Now I can use that new static property to handle all my use cases:

template <class TC> class foo {
    private:
    static TC * buffer; // static, so the static method bar() can use it

    template <class TF> static inline void bar(TF * data) {
        // type is a static const, so this test resolves at compile time
        if (TF::type == TC::type) {
            bcopy(data, buffer, ...);
        } else {
            slowConvert(data);
        }
    }
};

Calling it with an explicit template type handles all my use cases:

foo<type1> foo1;

type1 * v1;
type2 * v2;
uint8_t * v3;
foo1.bar<type1>(v1); // replaced by fast bcopy()
foo1.bar<type2>(v2); // replaced by slow convert()
foo1.bar<uint8_t>(v3); // should fail!

Of course, as an API this is still very ugly: the caller has to spell out the type of the input parameter in the template. The next step is to provide a clean API for all the types supported by the class. Notice how simple it becomes to support basic types, because overload resolution, a built-in C++ mechanism, circumvents the need to compare those types ourselves:

    static inline void bar(rgb * data) { bar<rgb>(data); }
    static inline void bar(bgrw * data) { bar<bgrw>(data); }
    static inline void bar(uint8_t * data) { bar<rgb>((rgb *) data); }

You can now cleanly call the API:

foo1.bar(v1); // replaced by fast bcopy()
foo1.bar(v2); // replaced by slow convert()
foo1.bar(v3); // replaced by fast bcopy()

The user now has no need to worry about the data types, unless performance matters for the use case.



Classic data structures

Data Structures

http://cseweb.ucsd.edu/~kube/cls/100/Lectures

Terminology: Child, Descendant, Parent, Ancestor, Node, Root, Leaf, sub-tree, Depth, Height.

Best/Worst/Average performance, and memory cost estimations:

Big-O (upper bound): g(N) = O(f(N)) if there exist constants k and n0 such that g(n) <= k·f(n) for all n > n0.
Big-Omega (lower bound): f(N) = Ω(g(N)) if g(N) = O(f(N)).
Big-Theta: f(N) = Θ(g(N)) if both of the above hold.

sorting heuristics:
http://www.sorting-algorithms.com/

interview questions:
http://www.geeksforgeeks.org/write-a-c-program-to-print-all-permutations-of-a-given-string/
(follow links on right)

C++ STL containers:

  • vector
  • deque
  • list (doubly linked list)
  • stack
  • queue
  • priority_queue
  • set (a BST, typically red-black)
  • multiset
  • map
  • multimap
  • bitset

Trees

  • Linked lists: A->B->C
    • Skip lists: A->B->C->D, plus an A->D accelerator link.
      If every node 2^n also skips ahead to node 2^(n+1), this behaves like a binary search tree, but that is overkill.
      Node level = number of forward pointers; the level of an inserted node is chosen randomly.
      Insert/Delete/Search: O(log N)
    • Doubly-linked list
  • Binary Trees:  c<-a->b
    Each node holds a key to manipulate. For a balanced tree, height = log2(nodes), so H = O(log N).

    • BST, Binary Search Tree: keys are ordered; all keys to the left are smaller, all to the right are bigger. A node holds (*p, *l, *r, key/data, property).
      • Find: compare the search key vs. the node’s key, descend left or right.
      • Insert: find until NULL, insert there (no dupes).
      • Delete: a leaf is just removed; a node with one child is spliced out; a node with two children is replaced by its in-order successor, which is then deleted.
      • Find next key (successor): may be down the right child, or may be a parent linked via a left link.
    • Balanced BST
      • AVL tree: a balanced BST with sub-tree depths off by no more than 1.
        Keep sub-trees balanced with rotations of the nodes: swap a child and its parent, and the middle sub-tree changes parent, for the l-l and r-r cases. For an l-r or r-l case, do a double rotation to pull up the node. Keep the sub-tree height in each node.
      • Red-black tree: O(log N) worst case for insert, find, delete. More lax rules than AVL: the root is black; if a parent is red, its children are black; all paths from a node to NULL must have the same number of black nodes.
    • Splay tree: a BST, but a find moves the found node to the root, using splaying, to improve the speed of subsequent searches.
    • K-D, k-dimensional trees: the branch logic at level L uses key L mod k. Used to store nodes with k-dimensional coordinates, to allow sorting and region partitioning.
    • Heap: a partially sorted binary tree (or equivalent), where A is the parent of B if A is greater (max-heap) or lesser (min-heap). It is an efficient implementation of a priority queue, allowing fast find/delete of the max or min.
      Often implemented with an array, putting the children of node i at 2i+1 and 2i+2, and its parent at (i-1)/2, with swaps to move nodes up/down. Insertion is done at the first free spot, then the heap is fixed with swaps (see the sketch after this list).
    • Treaps: a BST on the keys, where each node also carries a priority, and a parent’s priority is >= all its children’s. In other words, instead of AVL rotations, nodes are rotated based on priority. Inserts are O(height of tree). Used for split/join operations: to split at key K, insert node (K, infinity); to join two trees, make a dummy root holding both, then remove it.
      • RST, Randomized search tree: use a priority drawn at random on insert. Depth = O(log N) = cost of find, delete, insert, split, join…
  • B-trees, balanced trees: nodes have a variable number of children, m to M, so all leaves stay at the same level. A B-tree node fits in a 4KB disk block; the high fan-out limits the number of disk IOs.
    • B+ tree: data is kept in the leaves, which are linked together for range scans; internal nodes only hold keys.
    • 2-3 tree: M=L=3; a node has 2 or 3 children.
  • Tries (or radix trees):
    Find a key, held in a leaf node, by traversing the tree; each node represents one letter/bit of the word/number. The digit or letter used is the radix, which determines how many edges leave a node. If an element has at most B digits, a search takes B comparisons (the height).

    • Alphabet trie: {a..z} -> {a..z} (represents word dictionaries, 26 edges per node)
    • Ternary trie: each node holds one digit; on a match, move down the middle to the next digit, else go left/right.
    • Bitwise trie (represents numbers, 2 edges, binary)
      A shortcut is to give the root 32 edges, encoding the leading zeroes in it to reduce the search depth.
    • DST, digital search trees: use the i-th bit to choose the vertex at depth i; each node holds a full key.
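Here is the heap sketch referenced above: the array layout puts the children of i at 2i+1 and 2i+2, and insertion sifts the new key up with swaps (a minimal max-heap, names hypothetical):

#include <cstddef>
#include <vector>
#include <utility>

// Insert into a max-heap stored in an array: append at the first free
// spot, then swap upward while the parent is smaller.
void heapInsert(std::vector<int> & h, int key) {
    h.push_back(key);
    size_t i = h.size() - 1;
    while (i > 0 && h[(i - 1) / 2] < h[i]) {
        std::swap(h[(i - 1) / 2], h[i]);
        i = (i - 1) / 2;
    }
}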

Graphs

A node is a vertex; vertices are connected by edges.

  • DAG, Directed Acyclic Graph
    Edges are directed, and the flow does not loop. Cyclic graphs are often broken into DAGs in compilers.
  • Dense graph: V^2 edges
    adjacency matrix: boolean 2D array saying if 2 vertices are connected by an edge.
  • Sparse graph: ~V edges
    Adjacency list: each vertex has a list of its adjacent vertices.
  • Weighted graph: edge has weight

Shortest path in a graph G(V,E) from a node to any other.

  • Depth first
    Pre-order traversal of a tree. You need to backtrack using a stack. If you find a shorter path to a node, you need to retry the paths from that node.
  • Breadth first: O(E + V)
    In a tree it means scanning all of one level first. It is better for shortest-path detection in a graph. Visit all adjacent nodes using a queue; once a node is visited, don’t revisit it.
  • Best first (for weighted shortest path)

Shortest path in weighted graph G(V,E)

Dijkstra’s greedy algorithm extends the best path found so far. It uses a priority queue of <pointer to vertex, cost>, and marks each vertex with its current best cost/distance, the previous node on that best path, and done once solved (see the sketch below).
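A minimal sketch of it with the STL (adjacency list of (neighbor, weight) pairs; names hypothetical):

#include <vector>
#include <queue>
#include <limits>
#include <utility>
#include <functional>

// Dijkstra: always extend the cheapest path found so far, using a
// min-heap of (cost, vertex). Stale queue entries are skipped.
std::vector<int> dijkstra(const std::vector<std::vector<std::pair<int, int> > > & adj, int src) {
    typedef std::pair<int, int> costVertex; // (cost, vertex)
    std::vector<int> dist(adj.size(), std::numeric_limits<int>::max());
    std::priority_queue<costVertex, std::vector<costVertex>, std::greater<costVertex> > pq;
    dist[src] = 0;
    pq.push(costVertex(0, src));
    while (!pq.empty()) {
        costVertex top = pq.top();
        pq.pop();
        int d = top.first, u = top.second;
        if (d > dist[u]) continue; // stale entry: a better path was already found
        for (size_t k = 0; k < adj[u].size(); k++) {
            int v = adj[u][k].first, w = adj[u][k].second;
            if (d + w < dist[v]) {
                dist[v] = d + w;       // relax the edge
                pq.push(costVertex(dist[v], v));
            }
        }
    }
    return dist;
}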

Hash Tables & Maps

A map is also called a table, dictionary, or associative memory. It holds <key, value> pairs.

Searching by index is O(1); searching a sorted list is O(log N), an unsorted one O(N). To implement a map, create B buckets, slightly more than the number of keys stored. On a collision, chain entries with a linked list, or store them in an unused bucket (open addressing).

Load factor lambda: the number of entries n over the number of buckets m, λ = n/m.

A hash function must be deterministic, fast, uniformly distributed, and consistent with equality.

Hash(string) = sum of (ASCII value × 31^position); use Horner’s rule to reduce the math work.
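A minimal sketch of that hash with Horner’s rule (one multiply and one add per character, no pow(); helper name hypothetical):

#include <string>
#include <stdint.h>

// h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated left to right.
uint32_t hashString(const std::string & s) {
    uint32_t h = 0;
    for (size_t i = 0; i < s.size(); i++)
        h = h * 31 + (uint8_t)s[i];
    return h;
}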

Searching:

  • binary search of sorted data: O(log N)
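A minimal sketch of it (helper name hypothetical):

#include <vector>

// Binary search on sorted data: O(log N). Returns the index, or -1.
int binarySearch(const std::vector<int> & v, int key) {
    int lo = 0, hi = (int)v.size() - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2; // avoids (lo + hi) overflow
        if (v[mid] == key) return mid;
        if (v[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1; // not found
}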

Kernel Memory Management notes

More study notes.

Pages: physical memory is managed in pages by the CPU’s MMU, usually 4KB or 8KB. The kernel tracks every physical page with struct page from <linux/mm.h>, which holds the status flags, the count of references to that page (see page_count(); zero = unused), the kernel’s virtual address of the page, and the owner (user, dynamically allocated kernel, static kernel, page cache, etc.).

Zones: the physical address range limits some activity. DMA devices may only reach a limited range (ZONE_DMA), like ISA devices, which can only access the first 16MB; there may be more physical memory than can be mapped into kernel virtual space, so it is not always mapped (ZONE_HIGHMEM, all memory above 896MB on x86); the rest is for normal use (ZONE_NORMAL).

struct page * alloc_pages(mask, order) allocates 2^order contiguous physical pages. Convert to a logical/virtual address with void * page_address(struct page * phys). See also get_zeroed_page(mask) and free_pages(vaddr, order).

kmalloc()/kfree() return byte-granularity, physically contiguous memory for the kernel. Flags control actions (no sleep, no retry, etc.), zones (DMA, high) and types (atomic for interrupt handlers, DMA, user space…).

vmalloc()/vfree() return memory contiguous in kernel virtual memory. It may fix up the page table to map scattered physical pages, and it may sleep. This causes TLB thrashing, so it is used only for very large allocations that are unlikely to work with kmalloc().

Slab allocator: a system-managed allocator used to allocate specific structures and cache them for reuse, replacing custom free lists. Some of the caching can be per-CPU, removing the need for locks; if NUMA-aware, it can allocate memory on the same node as the CPU. Objects can be colored so they don’t map to the same cache line.

Static stack allocation: kernel threads have small stacks, unlike user space; the default is usually 2 physically contiguous pages (8KB or 16KB). Linux moved to 1 page, and since interrupt handlers no longer fit, a one-page, per-processor interrupt stack was added. Recursion and alloca() are not allowed. Overrunning the stack will damage thread_info.

High memory: physical pages past 896MB are only temporarily mapped into the kernel’s VM, in the 3GB to 4GB range. Permanent mapping is done in process context via void * kmap(struct page * p). Temporary mapping is done without sleeping (for interrupts) via void * kmap_atomic(struct page * p, km_type type), which disables preemption because the mapping is bound to a CPU.

Per-CPU allocation: on SMP, to avoid locking, you can use CPU-local variables. While accessing such a variable, your thread must not change CPU, hence use a primitive that turns off preemption.

Legacy method: declare a variable as an array with one entry per CPU, myType myVar[NR_CPUS], and detect your current CPU with get_cpu()/put_cpu(). get_cpu() turns off preemption to stay on the same CPU (smp_processor_id() only returns the CPU number, without that protection).
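A minimal sketch of that legacy pattern (myType/myVar are hypothetical placeholders):

myType myVar[NR_CPUS];

int cpu = get_cpu(); // disables preemption, returns the current CPU
myVar[cpu]++;        // safe: we cannot migrate to another CPU here
put_cpu();           // reenables preemption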

percpu: simpler, more systematic and powerful. Cache alignment is implicitly handled to avoid thrashing. Consider the section after the get call as critical: you cannot sleep until the variable is released.

Static declarations won’t work for dynamic kernel modules.

#include <linux/percpu.h>
DEFINE_PER_CPU(myType, myVar);
// DECLARE_PER_CPU(myType, myVar); for the header file
get_cpu_var(myVar)++;
// Critical section with preemption off.
put_cpu_var(myVar);
// Updates var of another CPU without changing preemption:
per_cpu(myVar, cpu)++;

Dynamic: the dynamically allocated version; the allocator aligns the data using the type’s alignment:

void * ptr = alloc_percpu(myType); // __alloc_percpu(sizeof(myType), __alignof__(myType))
myType * myVar = get_cpu_ptr(ptr);
// Critical section with preemption off.
put_cpu_ptr(myVar);

// Updates var of another CPU without changing preemption:
per_cpu_ptr(ptr, cpu);


Multi-processor synchronization notes

These are brief refresher notes on code synchronization concepts, which apply to the Unix/Linux/Windows/OSX operating systems.

  • Single-processor (uni-processor, or UP) concurrency comes from interrupts, or from cooperative multitasking (the code yields the CPU). Variables shared with interrupts must be preempt-safe.
  • Multi-processor systems can use cooperative multitasking, where a thread yields the CPU when it decides to, and domain locks (only one CPU can control a feature).
  • Symmetric Multi-Processing (SMP): the same code can run on multiple CPUs, so shared data access must be protected/synchronized to keep coherency. Most PCs and servers are now SMP (multiple sockets, multiple cores, multiple hyper-threads).
  • All general-purpose OSes for SMP systems use preemptive multitasking: the scheduler can start/stop a thread at any time on any CPU, so every possible coherency scenario eventually happens. Shared data must be SMP-safe.

A critical region is a code block accessing shared data. Data access in such a region must be atomic. If 2 threads run the same critical region concurrently, it is a race condition and likely a bug. It is avoided with synchronization primitives (i.e. code). A good practice is to write the smallest critical section possible: just access the data needed, and do all the processing outside of the critical section.

Atomic instructions are in the instruction set of all SMP CPUs, and run a complex operation as one indivisible action as seen by any other CPU. They apply to native integer types and bit manipulations. Typical ones are fetch-and-add, compare-and-swap, test-and-set, and load-link/store-conditional. Load and store instructions on basic types like a word are usually atomic by nature, but note that a 32-bit CPU will likely not have an atomic 64-bit load or store!

Memory barriers: atomicity only ensures that a memory access is consistent and was not corrupted by another CPU. It does not enforce ordering, i.e. which access comes first. Barriers enforce that an action is done before, or after, a certain point.

Linux kernel: it uses the atomic_t integer type (see <asm/atomic.h> for ints, <asm/bitops.h> for bits), and defines inline ASM macros for the common atomic instructions.
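A few of those primitives in use (a sketch; the counter name is hypothetical):

#include <asm/atomic.h>

atomic_t count = ATOMIC_INIT(0);

atomic_inc(&count);              // atomically count++
atomic_add(5, &count);           // atomically count += 5
if (atomic_dec_and_test(&count)) // atomically count--, true if it hit 0
    ; /* last user: tear down */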

One can use atomic instructions to create lock-free code (pay attention to cache-line thrashing). Programmers will usually declare the shared variables in lock-free code as volatile, which tells the compiler that every load or store written in the source must be performed exactly as written, never optimized away; otherwise, if the compiler can keep the variable in a register to make the code go fast, it will. In lock-free code, the “critical region” is always a single access to a native data type with an atomic instruction.

Locks are higher constructs implemented using atomic instructions (their use has higher overhead).

Locks are advisory and voluntary. The programmer must know a variable is shared and use the corresponding lock to protect the critical section. Note: a lock protects shared data, not a critical region of code. If 2 critical regions access the same data, both must grab the corresponding lock. Questions to ask about any piece of data:

  • is it a global variable?
  • is it shared with an interrupt handler?
  • is it shared with another process (like a child process)?
  • can the process sleep (block on a syscall)? If so, in what state is the data left?
  • can it be freed by another thread?
  • can the same code be called by other threads?

A deadlock is when the code cannot make progress because it waits for a resource that is never released. For example, a thread tries to grab a lock it already holds; or two threads each own one of two resources and need to grab the other before releasing both (deadly embrace, or ABBA). These are hard to debug.

Lock order: to avoid ABBA deadlocks, always acquire the locks in the same order. HP-UX actually enforces ordering by giving each lock a priority, and flagging an order reversal at compile time. Lock release order is not important, though common practice is to release in reverse order.
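A sketch of the ABBA case (lock names hypothetical):

// Thread 1 does:  spin_lock(&A); spin_lock(&B);
// Thread 2 does:  spin_lock(&B); spin_lock(&A);
// Each now holds the lock the other needs: ABBA deadlock.
// Fix: both threads must always take A before B.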

Scalability: lock contention, when 2 threads try to acquire the same lock at the same time, causes poor performance because the work of multiple threads gets serialized. Contention comes from the frequency of access AND the size of the critical section. Scalability implies increasing the number of threads, the size of memory, and/or the number of CPUs.

Lock granularity: finer grain means that a lock controls access to a smaller region of memory, for example a field in a structure instead of an instance of the structure, or even a whole list. We call this breaking a lock; it is only needed when performance is otherwise impacted, and may incur a small penalty on systems with few CPUs.

Recursive locks: allow a thread to grab the lock multiple times. (avoid)

Spinlocks: if a thread tries to grab a lock that is already held (contended), it will spin in a busy loop trying to acquire it. On UP, spinlocks are compiled out of the code completely.

Locks can be used in interrupt handlers (no sleeping), but that requires disabling interrupts on that CPU prior to locking, or an interrupt could try to acquire a lock already held by the code it interrupted (see spin_lock_irqsave()). On UP, the enabling/disabling of interrupts is kept even though the lock itself is compiled out. Note that locking disables kernel preemption, so a critical section is never stopped. Interrupt handlers with bottom halves have similar requirements (see spin_lock_bh()); the behaviors and solutions vary with tasklets and softIRQs.

<linux/spinlock.h> // see CONFIG_DEBUG_SPINLOCK macro for debugging
spinlock_t l = SPIN_LOCK_UNLOCKED;
// unsigned long flags;
// spin_trylock(&l); spin_is_locked(&l);
spin_lock(&l);  // or spin_lock_irqsave(&l, flags);
// critical section
spin_unlock(&l); // or spin_unlock_irqrestore(&l, flags);

Read-write locks: a spinlock that allows multiple readers to access the data at the same time, but gives exclusive access to writers. They come in the same flavors as spinlocks. You cannot upgrade a read lock to a write lock. You can, however, recursively obtain a read lock (for interrupts this means you can use read_lock() instead of read_lock_irqsave(); the irqsave variant is still needed for writes, to avoid a deadlock).

<linux/spinlock.h>
rwlock_t l = RW_LOCK_UNLOCKED;
read_lock(&l); 
// critical section
read_unlock(&l);
write_lock(&l); 
// critical section
write_unlock(&l);

Instead of spinning, a waiting thread can go to sleep until the lock becomes available.

Semaphores: waiting threads are put to sleep, with the overhead of 2 context switches. A process holding a semaphore can sleep and can be preempted by the scheduler; other waiters will just sleep too, so there is no deadlock. Well suited for locks held a long time. They can’t be used in interrupts or in code not subject to the OS scheduler. They are often used for locks held while synchronizing with user space.

Semaphores can’t be acquired while holding a spinlock, as the process may sleep.

Counting semaphores can have a use count greater than one, to allow multiple processes to hold the lock at the same time; they are used to enforce limits.
A binary semaphore, or mutex, has a count of one, like a spinlock.

Semaphore operators are traditionally called P(), to probe (decrement and acquire), and V(), to increment (release); in Linux they are down(), to decrement and acquire, and up(), to increment and release. If down() would make the count negative, the lock is not acquired, and the process goes into a wait queue and sleeps.

#include <asm/semaphore.h>
static DECLARE_SEMAPHORE_GENERIC(s, count);
// static DECLARE_MUTEX(s); // binary semaphore shortcut
// semaphore s;
// sema_init(&s, count);
// init_MUTEX(&s);
down_interruptible(&s); // acquire sema or sleep in TASK_INTERRUPTIBLE state
// down_trylock(&s);
// critical section
up(&s); // release the semaphore

Read-write semaphores: same as semaphores, but with multiple readers and a single writer. Note these come with a downgrade_write() method to convert a held write lock into a read lock.

#include <linux/rwsem.h>
static DECLARE_RWSEM(s);
down_read(&s);
// critical section
up_read(&s);

Completion variables: Somewhat similar to a semaphore, threads waiting for another to finish can sleep on one. Used by vfork().

#include <linux/completion.h>
DECLARE_COMPLETION(v);
// init_completion(&v);
wait_for_completion(&v);

// other thread runs:
complete(&v);

Big Kernel Lock (BKL): in Linux, a global spinlock used to transition from coarse SMP to fine-grained locking. A process can sleep while holding the BKL: it is dropped and reacquired automatically for the process when it is scheduled. It is recursive. It disables scheduler process preemption. It works only in process context. It is controlled with <linux/smp_lock.h> lock_kernel()/unlock_kernel().

Seq locks: a simple shared-data access mechanism that favors writers, which never stall. On a write, the data is locked and a sequence number is incremented. On a read, the sequence number is checked before and after the read: if the value is unchanged, the data was not modified during the read. If the value is even, no write is pending; taking the write lock makes the sequence odd.

#include <linux/seqlock.h>
seqlock_t l = SEQLOCK_UNLOCKED;
write_seqlock(&l);
// critical section
write_sequnlock(&l);

do {
  n = read_seqbegin(&l);
  // critical section
} while (read_seqretry(&l, n));

Disabling preemption: spinlocks disable preemption. Per-CPU data doesn’t need locking, but may need preemption disabled across a critical region, to prevent another thread from being scheduled on this CPU while the data should be updated atomically.

preempt_disable();
// CPU specific critical section
preempt_enable();

int cpu;
cpu = get_cpu(); // disables preemption
// CPU specific critical section
put_cpu(); // reenables preemption

Barriers: both compiler optimizations and out-of-order CPU execution can reorder memory accesses that have no dependencies. Barrier instructions enforce ordering.

Read barrier (rmb()): no load can be reordered across that point; previous loads will all complete before it, and those after it will not start before it. The write barrier (wmb()) does the same for stores, and the read/write barrier (mb()) for both.

read_barrier_depends() is a read barrier that applies only to loads on which subsequent loads depend. On CPUs that do not reorder dependent loads, it is a no-op.
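A classic two-CPU sketch (data/ready are hypothetical shared variables): without the barriers, CPU 1 could see ready set yet still read a stale data.

// CPU 0:
data = 42;
wmb();      // the store to data completes before the store to ready
ready = 1;

// CPU 1:
while (!ready)
    ;       // wait for the flag
rmb();      // the load of ready completes before the load of data
use(data);  // guaranteed to see 42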

Other topics:

Alpha and Beta semaphores (HP-UX)
Lock queues (ordering the threads spinning on a lock, to avoid the thundering-herd effect)
POSIX threads and their locking primitives
Condition variables