The hidden math of extra mortgage payments

Don’t prepay to shorten the loan; invest the money instead!

Locking in a 30Y fixed sounds great when interest rates are going to go up and up, and the bank lets you repay more principal than the scheduled mortgage payment to end the mortgage early. But is it to your benefit?

One thing I was told only AFTER closing is that prepaying DOES NOT change your payment schedule… So what does that mean? It means that if you prepay, you’re lending money to the bank for free. You would be better off holding onto that money (possibly investing it), and using it only at the end to close your mortgage a few years early.

If I’m right then mortgage calculators like this nice one don’t tell you the truth about early payments.

It turns out that when you ask around, your realtor and the other folks supposedly there to help you purchase a house won’t really help you; they may not even know where the trickery is. It’s hidden in the pudding, but it’s simple, if I figured it out correctly.

Here are some generic loan numbers, for a $500K loan at 3.5%:

$500K, 30Y fixed 3.5% = $2,245.22/month, for 360 months,
$308,280.34 interest paid in total (61.65% of the principal)

Now what about shortening the loan by a decade?

$500K, 20Y fixed 3.5% = $2,899.80/month, for 240 months,
$195,951.66 interest paid in total (39.19% of the principal)

So with a 20Y fixed, you save $112K. In fact, you would likely have gotten a lower 3.25% or 3% rate ($180,634.91 or $165,517.12 in interest, so that’s another $15K~$30K in savings).

What about prepaying the same monthly sum as the 20Y monthly payment? Well, silly me, I thought you could just prepay and get the same math, just at the cost of the slightly higher interest rate of a longer loan… So let’s do the math: what if you pay $2,899.80 every month on the 30Y loan? That’s $654.58 per month extra…

Instead of ending in 2047, the loan ends in Nov 2039, two years after the 20-year mortgage, paying $284,226.20 in interest (that’s because the schedule is fixed!), roughly $88K more to the bank.

OK, so if you overpay for 2 extra years, you should compare to a 22-year loan then! Well, that loan would have cost $217,657.34 in interest. Still $67K less.

To make it clear:

I think a loan schedule calculates the 3.5% interest you owe every year and splits the interest by 12: $500K * (3.5% / 12) = $1,458 for the first month’s interest. Since we want the loan to last 30 years, each payment must also include a set amount of principal.
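For reference, here is the standard amortization arithmetic in a small sketch (my own illustration, not taken from any particular calculator), which reproduces the payments quoted above:

```cpp
#include <cmath>

// Fixed monthly payment for a principal P at an annual rate
// (e.g. 0.035) over n months: P * r / (1 - (1+r)^-n).
double monthly_payment(double principal, double annual_rate, int months) {
    double r = annual_rate / 12.0; // monthly rate
    return principal * r / (1.0 - std::pow(1.0 + r, -months));
}

// Interest owed in the first month: balance * monthly rate.
double first_month_interest(double principal, double annual_rate) {
    return principal * annual_rate / 12.0;
}
```

Running it for $500K at 3.5% gives $2,245.22/month over 360 months and $2,899.80 over 240, matching the figures above.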

This generates a fixed schedule. If you pay early, the interest you owe at that time is not recalculated (it should be, since you now owe less principal); it stays the same, just following the schedule. You only reduced the principal, without changing the interest you pay monthly.

If you had kept the money in the bank and just plunked down these $176,736.60 to close the mortgage in Nov 2039, you would not have given the lender a penny more…

In fact, if you kept that money invested at 3.5%, you would break even with a 20Y mortgage, gaining $91,538.79 of interest for a total of $268,275.39 in Oct 2039. My math must be wrong, because it shows you could end the loan in April 2027, not Dec 2037…

Windows hidden tricks

I came across this old article and discovered a few things:

  • Microsoft Virtual WiFi Miniport lets a PC act as a wifi access point, e.g. by installing Connectify.
  • Erase free disk space on C: with “cipher /w:c”.
  • Move the active window with Windows + Arrow keys.
  • Pin folders to the taskbar with right-click + drag to the taskbar. Access them with a right-click on the taskbar file-explorer icon.
  • Get a power efficiency report by searching for cmd.exe and launching it as admin (right-click), then run “powercfg -energy”, move \windows\system32\energy-report.html and open it.
  • Record the steps you take in Windows to send to someone: search for and run “psr”, which creates an MHTML file to share.
  • Verify the reliability of apps by searching for and launching “view reliability history”. It shows a score graph with the issues listed.
  • Wordpad supports .docx and OpenOffice .odt files.
  • Calculator has unit conversion, date calculation, mpg and mortgage modes.
  • Create a “GodMode” folder anywhere (named GodMode.{ED7BA470-8E54-465E-825C-99712043E26C}) and it will list all the Windows configuration options.



C++ typeof and typeid

This is a small discussion about a solution I posted on stack overflow, related to my need for FAB_LED coding.

I came across the need to compare variable types while writing my C++ LED library class, because of my reliance on template types. This capability becomes useful when you have a template class which itself defines a template function: the user may set both to the same type or to different types.

If you read through my example, and think of other ways to implement the same thing through simple function polymorphism, you will notice that this solution is more elegant: There is only one code block handling all use cases, which can be automatically specialized/optimized at compile time. It is also efficient, compact, as well as much simpler to maintain.

 C++ shortcomings

  • typeid() (C++ only) is meant to work like sizeof() does, but it is apparently a runtime check: it works on classes because they automatically define a type field with a unique type ID, and it is also implemented for base types (int, float, etc.). Since it doesn’t resolve statically even in cases where all the info is known at compile time, it can’t be used for code optimization, and would be of little use for my Arduino library.
  • typeof() seems rather useless to me, except maybe to write contrived C macros with hidden temporary variables. From my attempts, you cannot use it to do a compile-time type check; it can only be used as a replacement for a type in a variable declaration.

So the built-in support is not there. Bummer 😦 It would have avoided managing types.

My solution…

I therefore came up with a hybrid solution, which works in my use case. My structures now all declare a static type field, loaded with a unique value:

class X { static const uint8_t type = 2; ... };

Because it is a static const, it is not allocated with each instance of the class/struct. Once the compiler analyzes the code, it also ends up being deleted completely once it has filled its purpose of guarding unused code blocks.

Now I can use that new static property to handle all my use cases:

template <class TC> class foo {
    TC * buffer;
    template <class TF>
    static inline void bar(TF * data) {
        if (TF::type == TC::type) {
            // Same type: fast raw memory copy.
            bcopy(data, buffer, ...);
        } else {
            // Different types: slow per-element conversion.
            ...
        }
    }
};
Calling it then looks like this:

foo<type1> foo1;

type1 * v1;
type2 * v2;
uint8_t * v3;

foo1.bar<type1>(v1);   // replaced by fast bcopy()
foo1.bar<type2>(v2);   // replaced by slow convert()
foo1.bar<uint8_t>(v3); // should fail!

Of course, as an API this is still very ugly: your template has to declare the type of your input parameter. The next step is to provide a clean API for all the types supported by the class. Notice how it becomes simple to support basic types, because overload resolution, a built-in C++ mechanism, circumvents the need to compare those types:

    static inline void bar(rgb * data) { bar<rgb>(data); }
    static inline void bar(bgrw * data) { bar<bgrw>(data); }
    static inline void bar(uint8_t * data) { bar<rgb>((rgb *) data); }

You can now cleanly call the API:

foo1.bar(v1); // replaced by fast bcopy()
foo1.bar(v2); // replaced by slow convert()
foo1.bar(v3); // replaced by fast bcopy()

The user now has no need to worry about his data types unless performance matters for the use case.



Classic data structures

Data Structures

Terminology: Child, Descendant, Parent, Ancestor, Node, Root, Leaf, sub-tree, Depth, Height.

Best/Worst/Average performance, and memory cost estimations:

Big-O: g(N) = O(f(N)) if g(n) <= k·f(n) for all n > n0.
Big-Omega: f(N) = Omega(g(N)) if f(n) >= k·g(n) for all n > n0.
Big-Theta: if both of the above hold.

sorting heuristics:

interview questions:

C++ STL containers:

  • vector
  • deque
  • list (linked list)
  • stack
  • queue
  • priority_queue
  • set (BST)
  • multiset
  • map
  • multimap
  • bitset


  • Linked lists: A->B->C
    • Skip lists A->B->C->D, and A->D accelerator
      If every 2^n-th node skips ahead 2^n nodes, this behaves like a binary search tree. But that is overkill.
      node level = # of forward ptrs. The level of an inserted node is chosen randomly.
      Insert/Delete/Search O(log N)
    • Double-linked list
  • Binary Trees:  c<-a->b
    Each node holds a key to manipulate. For a balanced tree, height = log2(nodes) => H = O(log N).

    • BST Binary Search Tree: keys k are ordered, all keys left are smaller, right are bigger. A node holds (*p, *l, *r, key/data, property)
      • Find: compare key vs k.
      • Insert: find till NULL (no dupes)
      • Delete:
      • Find next key: may be right child, or may be a parent linked via a left link.
    • Balanced BST
      • AVL tree: A balanced BST with depth off by no more than 1.
        keep sub-trees balanced with a rotation of the nodes: swap a child and its parent, and the middle sub-tree changes parent, for l-l and r-r children. If it’s an l-r or r-l case, do a double rotation to pull the node up. Keep the sub-tree height in each node.
      • Red-black tree: O(log N) worst case for insert, find, delete. More lax rules than AVL: the root is black; if a parent is red, its children are black; all paths from a node to NULL must have the same number of black nodes.
    • Splay tree: a BST, but on a find, the found node is moved to the root using splaying, to improve the speed of the next search.
    • K-D K-dimensional trees: Branch logic at level L is done with key L mod k. Used to store nodes with k-dimension coordinates, to allow sorting and region partitions.
    • Heap: A partially sorted binary tree or equivalent, where A is parent of B if A is greater (maxHeap), or lesser (minHeap). It’s an efficient implementation of a priority queue. It allows fast find/delete of max or min.
      Often implemented with an array, putting the child of n at 2n+1,2n+2, or parent at (i-1)/2, and swaps to move nodes up/down. Insertion is done at 1st free spot and fixing the heap with swaps.
    • Treaps: It is a BST for the keys, but node also has a priority, where parent has priority >= all children. In other words, instead of AVL, nodes are swapped based on priority. Inserts are O(Height of tree). Used for split/join operations. To split at key K, insert node (K, infinity). To join, make a dummy root with both trees, and remove it.
      • RST: Randomized search tree: use a priority that is random at insert. Depth = O(log N) = cost of find, delete, insert, split, join…
  • B-tree balanced trees: nodes have a variable number of children, m to M, so all leaves are at the same level. A B-tree node fits in a 4KB block; the high branch-out limits the number of disk IOs.
    • B+ tree:
    • 2-3 tree: M=L=3, a node has 2 or 3 children.
  • Tries (or radix tree):
    Finds a key, held in the leaf node, by traversing the tree, in which each node represents one letter/bit of the word/number. The digit or letter used is the radix, and determines how many edges come out of a node (R edges for radix R, e.g. 2 for bitwise). If an element has at most B digits -> B comparisons/height.

    • Alphabet trie: {a..z} -> {a…z} (represent word dictionaries, 26 edges)
    • Ternary trie: Each node is a digit, and move down middle on a match and do next digit, else left/right.
    • Bitwise tree (represent numbers, 2 edges/binary)
      Note the shortcut is to have a root with 32 edges, and encoding the leading zeroes in it to reduce the search depth.
    • (DST (digital search trees): use the ith bit to choose the ith vertex, each node holds a key)
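The array-based heap layout described above (children of i at 2i+1 and 2i+2, parent at (i-1)/2, insert at the first free spot then fix with swaps) can be sketched as:

```cpp
#include <vector>
#include <utility>

// Min-heap in an array: children of index i live at 2i+1 and 2i+2,
// the parent at (i-1)/2. Insert at the first free spot, then sift up.
void heap_push(std::vector<int> &h, int key) {
    h.push_back(key);
    std::size_t i = h.size() - 1;
    while (i > 0 && h[(i - 1) / 2] > h[i]) { // parent bigger: swap up
        std::swap(h[(i - 1) / 2], h[i]);
        i = (i - 1) / 2;
    }
}
```

After any sequence of pushes, the minimum sits at index 0, which is what makes this an efficient priority queue.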


Graphs: a node is a vertex, connected by edges.

  • DAG Directed Acyclic Graph
    edges are directed, and the flow does not loop. Circular graphs are often broken into DAGs in compilers.
  • Dense graph: V^2 edges
    adjacency matrix: boolean 2D array saying if 2 vertices are connected by an edge.
  • Sparse graph: ~ V edges
    adjacency list: each vertex has a list of the other adjacent vertices.
  • Weighted graph: edge has weight

Shortest path in a graph G(V,E) from a node to any other.

  • depth first
    pre-order traversal of a tree. You need to backtrack using a stack. If you find a shorter path to a node, you need to retry the paths from that node.
  • breadth first O(E + V)
    In a tree this means scanning all of the same level first. It is better for shortest-path detection in a graph. Visit all adjacent nodes using a queue. Once a node is visited, don’t revisit it.
  • best first (for weighted shortest path)
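A sketch of the breadth-first shortest path described above, with a queue and a visited check:

```cpp
#include <vector>
#include <queue>

// Breadth-first search from src: visit vertices level by level using a
// queue, never revisiting, which yields shortest (fewest-edge) paths.
std::vector<int> bfs_dist(const std::vector<std::vector<int>> &adj, int src) {
    std::vector<int> dist(adj.size(), -1); // -1 = not visited
    std::queue<int> q;
    dist[src] = 0;
    q.push(src);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int v : adj[u])
            if (dist[v] == -1) {
                dist[v] = dist[u] + 1;
                q.push(v);
            }
    }
    return dist;
}
```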

Shortest path in weighted graph G(V,E)

Dijkstra’s greedy algorithm extends the best path found so far. It uses a priority queue of <cost, pointer to vertex>, and marks each vertex with its current best cost/distance, the previous node on that best path, and done if solved.
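A minimal sketch of Dijkstra with an STL priority_queue (skipping stale queue entries instead of doing a decrease-key, a common simplification):

```cpp
#include <vector>
#include <queue>
#include <limits>
#include <functional>
#include <utility>

// Dijkstra: greedily extend the cheapest path found so far, using a
// min priority queue of <cost, vertex>. adj[u] holds <neighbor, weight>.
std::vector<long> dijkstra(const std::vector<std::vector<std::pair<int, int>>> &adj,
                           int src) {
    const long INF = std::numeric_limits<long>::max();
    std::vector<long> dist(adj.size(), INF);
    using Item = std::pair<long, int>; // <cost so far, vertex>
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        long d = pq.top().first;
        int u = pq.top().second;
        pq.pop();
        if (d > dist[u]) continue; // stale entry, vertex already done
        for (const auto &e : adj[u])
            if (d + e.second < dist[e.first]) { // found a cheaper path
                dist[e.first] = d + e.second;
                pq.push({dist[e.first], e.first});
            }
    }
    return dist;
}
```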

Hash Tables & Maps

A map is also called a table, dictionary, or associative memory. It holds <key, value> pairs.

Searching by index is O(1); searching a sorted list is O(log N), O(N) if unsorted. To implement, create B buckets, slightly more than the number of keys stored. On a collision, use a linked list (chaining) or store in an unused bucket (open addressing).

load factor lambda: number of entries n / number of buckets m, λ = n/m.

hash function: must be deterministic, fast, uniformly distributed, and consistent with equality.

Hash(string) = sum of ASCII values * 31^pos // use Horner’s rule to reduce the math work
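The string hash above, evaluated with Horner’s rule:

```cpp
#include <cstdint>
#include <string>

// Polynomial hash sum(c_i * 31^(n-1-i)), evaluated with Horner's rule
// so each character costs one multiply and one add (mod 2^32 for free).
uint32_t hash_string(const std::string &s) {
    uint32_t h = 0;
    for (unsigned char c : s)
        h = h * 31u + c;
    return h;
}
```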


  • binary search of sorted data: O(log N)

Kernel Memory Management notes

More study notes.

Pages: Physical memory is managed in pages by the CPU’s MMU, usually 4KB or 8KB. The kernel tracks every physical page with <linux/mm.h> struct page, which holds the status flags, the count of references to that page (see page_count(); zero = unused), the kernel’s virtual address of the page, and the owner (user, dynamically allocated kernel, static kernel, page cache, etc.).

Zones: Physical address ranges limit some activity: DMA devices may only have access to a smaller range (ZONE_DMA), like ISA devices which can only access the first 16MB; there may be more physical memory than can be mapped into kernel virtual space, so some is not always mapped (ZONE_HIGHMEM, on x86 all memory above 896MB); the rest is for normal use (ZONE_NORMAL).

struct page * alloc_pages(mask, pow) allocates 2^pow contiguous physical pages. Convert to a logical/virtual page with void * page_address(struct page * phys);. See also get_zeroed_page(mask), free_pages(vaddr, order).

kmalloc()/kfree() return byte-granularity, physically contiguous memory for the kernel. Flags control actions (no sleep, no retry, etc.), zones (DMA, high) and types (atomic for interrupt handlers, DMA, user space…).

vmalloc()/vfree() return memory contiguous in kernel virtual memory. They may fix up the page tables to make non-contiguous physical pages appear contiguous, and they may sleep. This causes TLB thrashing, so vmalloc() is used only for very large allocations that are unlikely to work with kmalloc().

Slab allocator: a system managed allocator used to allocate specific structures and cache them for reuse. This replaces usage of custom free lists. Some of the caching can be per CPU, removing the need for locks, if it is NUMA aware it can allocate memory from the same node as the CPU. Objects can be colored to not map to the same cache line.

static stack allocation: Kernel threads have small stacks, unlike user space; the default is usually 2 physically contiguous pages (8KB~16KB). Linux moved to 1 page, so interrupt handlers no longer fit, and a one-page, per-processor interrupt stack was added. Recursion and alloca() are not allowed. Overrunning the stack will damage thread_info.

High memory: Physical pages past 896MB are only temporarily mapped into the kernel’s VM, in the 3GB~4GB range. Permanent mapping is done in process context via void * kmap(struct page * p). Temporary mapping is done without sleeping (for interrupts) via void * kmap_atomic(struct page * p, enum km_type type), and disables preemption because the mapping is bound to a CPU.

per-CPU allocation: For SMP, to avoid locking, you can use CPU-local variables. While accessing the variable, your thread must not change CPU, hence use a primitive that turns off preemption.

Legacy method: Declare a variable as an array with one entry per CPU, myType myVar[NR_CPUS], and find your current CPU with get_cpu()/put_cpu(). get_cpu() turns off preemption to stay on the same CPU (otherwise use smp_processor_id()).

percpu: Simpler more systematic and powerful. Cache alignment is implicitly handled to avoid thrashing. Consider the section after the call as critical, you cannot sleep until the variable is released.

Static declarations won’t work for dynamic kernel modules.

#include <linux/percpu.h>
DEFINE_PER_CPU(myType, myVar);
// DECLARE_PER_CPU(myType, myVar); for the header file

get_cpu_var(myVar)++; // critical section with preemption off
put_cpu_var(myVar);   // reenable preemption

// Update another CPU’s copy without changing preemption:
per_cpu(myVar, cpu)++;

dynamic: the dynamically allocated version; alignment can be forced explicitly with the low-level call:

void * ptr = alloc_percpu(myType); // = __alloc_percpu(sizeof(myType), __alignof__(myType))
myType * myVar = get_cpu_ptr(ptr); // critical section with preemption off
put_cpu_ptr(ptr);                  // reenable preemption

// Access another CPU’s copy without changing preemption:
per_cpu_ptr(ptr, cpu);


Multi-processor synchronization notes

These are brief refresher on code synchronization concepts, which apply to Unix/Linux/Windows/OSX operating systems.

  • Single processor (uni-processor or UP) concurrency comes from interrupts, or cooperative multi-tasking (the code yields the CPU). Variables shared with interrupts must be preempt-safe.
  • Multi-processor can do cooperative multi-tasking, where a thread yields the CPU when it decides to, and domain locks (only one CPU can control a feature).
  • Symmetric Multi-Processing (SMP), same code can run on multiple CPUs, so shared data access must be protected/synchronized to keep coherency. Most PCs and servers are now SMP (multiple sockets, multiple cores, multiple hyper-threads).
  • All general-purpose OSes for SMP systems use preemptive multi-tasking: the scheduler can start/stop a thread at any time on any CPU, causing all possible coherency scenarios to eventually happen. Shared data must be SMP-safe.

A critical region is a code block accessing shared data. Data access in a region must be atomic. If 2 threads run the same critical region concurrently, it is a race condition and likely a bug. It is avoided with synchronization primitives (i.e. code). A good practice is to write the smallest critical section possible: just access the data needed, and do all the processing outside of the critical section.

Atomic instructions are in the instruction-set for all SMP CPUs, and run a complex operation as one indivisible action as seen by any other CPU. They apply to native integer types and bit manipulations. Typical ones are fetch-and-add, compare-and-swap, test-and-set, load-link/store-conditional. Load and store instructions for basic types like a word are usually atomic by nature. Note that a 32-bit CPU will likely not have an atomic 64-bit load or store!
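These notes are kernel-oriented, but as a user-space illustration (my own sketch, not kernel code), C++’s std::atomic exposes the same fetch-and-add and compare-and-swap primitives:

```cpp
#include <atomic>

// Lock-free counter: fetch_add is an indivisible read-modify-write
// as seen by other CPUs.
std::atomic<int> counter{0};

int bump() { return counter.fetch_add(1); } // returns the previous value

// Compare-and-swap: store desired only if the value still equals expected.
bool cas(std::atomic<int> &v, int expected, int desired) {
    return v.compare_exchange_strong(expected, desired);
}
```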

Memory barriers: Atomicity only ensures that the memory access is consistent and was not corrupted by another CPU. It does not enforce ordering, aka which one comes first. Barriers enforce an action is done before, or after a certain point.

Linux kernel: uses the atomic_t integer type, see <asm/atomic.h> for ints or <asm/bitops.h> for bits, and defines inline ASM macros for common atomic instructions.

One can use atomic instructions to create lock-free code (pay attention to cache line thrashing). Usually programmers will declare shared variables in lock-free code as volatile, which tells the compiler that if the source code does a load or a store on that variable, the compiled code must perform that same exact action, without trying to optimize the code away. Otherwise, if the compiler can keep the variable in a register to make the code go fast, it will. For lock-free code, the “critical region” is always accessing a native data type with an atomic instruction.

Locks are higher constructs implemented using atomic instructions (their use has higher overhead).

Locks are advisory and voluntary. The programmer must know a variable is shared and use the corresponding lock to protect the critical section. Note: A lock protects shared data, and not a critical region of code. If 2 critical regions access the same data, they must grab the corresponding lock.

Questions to ask about any piece of data:

  • Is it a global variable?
  • Is it shared with an interrupt handler?
  • Is it shared with another process (like a child process)?
  • Can the process sleep (block on a syscall)? If so, what state is the data in?
  • Can it be freed by another thread?
  • Can the same code be called by other threads?

A deadlock is when the code cannot make progress because it waits for a resource that is never released. For example, a thread tries to grab a lock it already holds; or two threads each own one of two resources and need to grab the other before releasing both (deadly embrace, or ABBA deadlock). These are hard to debug.

Lock order: To avoid ABBA deadlocks, always acquire the locks in the same order. HP-UX actually enforces ordering by giving each lock a priority, and flagging an order reversal at compile time. Lock release order is not important, though common practice is to release in reverse order.
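A user-space C++ sketch of the ordering rule (my own illustration; the mutex names and the fixed A-before-B order are arbitrary):

```cpp
#include <mutex>

// Two locks with a fixed global order (A before B). Acquiring them in
// the same order everywhere prevents the ABBA deadly embrace.
std::mutex A, B;
int shared_count = 0;

int ordered_work() {
    std::lock_guard<std::mutex> la(A); // always A first...
    std::lock_guard<std::mutex> lb(B); // ...then B
    return ++shared_count;
}

// Alternative: std::lock() acquires a set of locks deadlock-free
// even if call sites list them in different orders.
int lock_all_work() {
    std::lock(A, B);
    std::lock_guard<std::mutex> la(A, std::adopt_lock);
    std::lock_guard<std::mutex> lb(B, std::adopt_lock);
    return ++shared_count;
}
```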

Scalability: Lock contention, when 2 threads try to acquire the same lock at the same time, causes poor performance because the work of multiple threads is serialized. Contention comes from the frequency of access AND the size of the critical section. Scalability implies an increasing number of threads, size of memory and/or number of CPUs.

Lock granularity: Finer grain means that a lock controls access to a smaller region of memory, for example a field in a structure, instead of an instance of the structure, or even a whole list. Finer-grain locking is only needed when performance is otherwise impacted. We call this breaking a lock, and it may incur a small penalty on systems with few CPUs.

Recursive locks: allow a thread to grab the lock multiple times. (avoid)

Spinlocks: If a thread tries to grab a lock already held/contended, it will spin in a busy loop trying to acquire it. In UP, spinlocks are compiled out of the code completely.

Locks can be used in interrupt handlers (no sleep) but in that case require disabling interrupts prior to locking on that CPU or an interrupt could try to acquire the lock too while it is held by interrupted code (see spin_lock_irqsave()). In UP, the en/dis of interrupts would be kept, even with the lock compiled out. Note that locking disables kernel preemption. A critical section is not stopped. Interrupt handlers with bottom halves will have similar requirements (see spin_lock_bh()), and behaviors and solutions will vary with tasklets and softIRQs.

<linux/spinlock.h> // see CONFIG_DEBUG_SPINLOCK macro for debugging
spinlock_t l = SPIN_LOCK_UNLOCKED;
// unsigned long flags;
// spin_trylock(&l); spin_is_locked(&l);
spin_lock(&l);  // or spin_lock_irqsave(&l, flags);
// critical section
spin_unlock(&l); // or spin_unlock_irqrestore(&l, flags);

read-write locks: A spinlock that allows multiple readers to access the data at the same time, but exclusive access for writers. They come in the same flavors as spinlocks. You cannot upgrade a read lock to a write lock. However, you can recursively obtain a read lock (for interrupts this means you can use read_lock() instead of read_lock_irqsave(); the irqsave variant is still needed for writes, to avoid a deadlock).

rwlock_t l = RW_LOCK_UNLOCKED;

read_lock(&l);
// read critical section
read_unlock(&l);

write_lock(&l);
// write critical section
write_unlock(&l);

The thread waiting can instead of spinning, go to sleep until the lock becomes available.

Semaphores: Waiting threads are put to sleep, with the overhead of 2 context switches. A process holding a semaphore can sleep and can be preempted by the scheduler. Other waiters will sleep too, so there is no deadlock. Well suited for locks held a long time. They can’t be used in interrupts or code that is not subject to the OS scheduler. They are often used for locks held while sync’ing with user space.

Semaphores can’t be acquired while holding a spinlock, as the process may sleep.

Counting semaphores can have a use count greater than one to allow multiple processes to hold the lock at the same time, and are used to enforce limits.
A binary semaphore or mutex has a count of one, like spinlocks.

Semaphore operations are traditionally called P(), to probe/decrement, and V(), to increment; in Linux, down() decrements to acquire the semaphore, and up() increments to release it. If down() returns a negative value, the lock is not acquired, and the process goes onto a wait queue and sleeps.

#include <asm/semaphore.h>
// static DECLARE_MUTEX(s); // binary semaphore shortcut
// semaphore s;
// sema_init(&s, count);
// init_MUTEX(&s);
down_interruptible(&s); // acquire sema or sleep in TASK_INTERRUPTIBLE state
// down_trylock(&s);
// critical section
up(&s); // release sema

Read-write semaphores: Same as semaphore, but with multiple readers and single writer. Note these come with a downgrade_writer() method to convert holding a write lock to a read lock.

#include <linux/rwsem.h>
static DECLARE_RWSEM(s);

down_read(&s);
// read critical section
up_read(&s);
// writers use down_write(&s)/up_write(&s)

Completion variables: Somewhat similar to a semaphore, threads waiting for another to finish can sleep on one. Used by vfork().

#include <linux/completion.h>
DECLARE_COMPLETION(v);
// or: init_completion(&v);

wait_for_completion(&v); // sleep until signaled

// the other thread runs:
complete(&v);
Big Kernel Lock (BKL): In Linux it is a global spinlock, used to transition from simple SMP to fine-grained locking. A process can sleep holding the BKL; it is dropped and reacquired automatically for the process when it is scheduled. It is recursive. It disables scheduler process preemption. It works only in process context. It is controlled with <linux/smp_lock.h> lock_kernel()/unlock_kernel().

Seq locks: a simple shared-data access mechanism that favors writers, which never stall. On a write, the data is locked and a sequence number is incremented. On a read, the sequence number is checked before and after the read; if the value is unchanged, the data was not modified during the read. If the value is even, no write is pending: taking the write lock makes the sequence odd.

#include <linux/seqlock.h>
seqlock_t l = SEQLOCK_UNLOCKED;

write_seqlock(&l);
// write critical section
write_sequnlock(&l);

do {
  n = read_seqbegin(&l);
  // critical section
} while (read_seqretry(&l, n));

Disabling preemption: spinlocks disable preemption. Per-CPU data doesn’t need locking, but may need preemption disabled for a critical region, to prevent another thread from being scheduled on the CPU while the data should be updated atomically.

int cpu;
cpu = get_cpu(); // disables preemption
// CPU-specific critical section
put_cpu(); // reenables preemption

Barriers: Both compiler optimizations and out-of-order CPU execution can reorder memory accesses without dependencies. Barriers instructions will enforce ordering.

read barrier (rmb()): no load can be reordered across that point. Loads before it will all complete first, and those after will not start before it. The write barrier (wmb()) does the same for stores, and the read/write barrier (mb()) for both.

read_barrier_depends() is a read barrier that applies only to loads for which subsequent loads have a dependency. On non out-of-order CPUs this is a noop.

Other topics:

Alpha and Beta semaphores (HP-UX)
Lock queue (order the threads spinning to avoid thundering herd effect).
Posix threads and locking primitives
Conditional Variables

APA102 specs

A thorough study of APA102 is on Tim’s blog. I’m just rehashing his findings and the spec sheet here for my own purpose.

APA102C spec sheet:

  • Max clock is 800 kHz to 1.2 MHz
  • A frame is controlled by one uint32_t set to Zero (START_FRAME) and one set to MaxInt (END_FRAME):
    0x0000.0000 LED1 LED2 … LEDn 0xFFFF.FFFF
  • A LED value is a uint32_t whose first byte has its 3 most significant bits set to 111, followed by a 5-bit brightness (0..31); the remaining bytes hold the color values.
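Putting the frame layout together, here is a sketch of a frame builder (my own code, not from the spec sheet; I am assuming the common blue, green, red byte order, and a fixed 32-bit end frame, which only suffices for short strips given the end-frame sizing rule below):

```cpp
#include <cstdint>
#include <vector>

// Builds one APA102 SPI frame: a 32-bit zero START_FRAME, one 32-bit
// word per LED (0b111 + 5-bit brightness, then blue, green, red bytes),
// and a 32-bit all-ones END_FRAME.
struct rgb8 { uint8_t r, g, b; };

std::vector<uint8_t> apa102_frame(const std::vector<rgb8> &px,
                                  uint8_t brightness /* 0..31 */) {
    std::vector<uint8_t> out(4, 0x00);             // START_FRAME 0x0000.0000
    for (const rgb8 &p : px) {
        out.push_back(0xE0 | (brightness & 0x1F)); // 111bbbbb
        out.push_back(p.b);
        out.push_back(p.g);
        out.push_back(p.r);
    }
    out.insert(out.end(), 4, 0xFF);                // END_FRAME 0xFFFF.FFFF
    return out;
}
```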

Noteworthy tidbits:

  • You can’t display a white LED at max brightness. Indeed, the first byte would be 11111111, with R, G, B all set to 0xFF, so the pixel would be 0xFFFF.FFFF, the same signature as the END_FRAME. That seems very odd, until you read Tim’s blog, which explains why it doesn’t matter.
  • The brightness setting seems to be implemented as a 580Hz PWM, which is noticeable when the brightness is set to 30 or lower, while the RGB settings are PWM’d at 19.2kHz.
    => Note: to notice discrepancies between the LED setting and the actual display, we’d have to refresh the LED faster than every 50 microseconds, i.e. with a clock period of less than 1.5 microseconds (614kHz), turning the LED on and off immediately. In that case it is possible the LED does not have time to light up.
  • SPI clocks of up to 10MHz have been tested. No upper limit was reported.
  • Tim shows the clock latches when the clock goes LOW to HIGH. CKout passed to the next LED is inverted, and SDO is set on the CKin latch. That allows the next LED to latch SDO when it is stable.
    => From this I infer CK should be as close to a 50% duty cycle as possible, to maximize the stability of the LED strip. However the circuits support over 10MHz.
  • The START_FRAME of 0x0000.0000 tells a LED there’s no data to read. The first bit set to 1 means read the data and store it as the LED value. That information is not pushed to SDO, which stays at zero that whole time. Once the LED’s data is consumed, all extra data is passed on to SDO. Hence the LED strip is refreshed from the closest to the most remote LED.
  • The END_FRAME is only used to push CK for the last pixels. The number of bits in that frame must be at least 1/2 the number of pixels driven. It can contain 1’s or 0’s but not both.


FAB_LED: A WS2812B Library with Palette Support

FAB_LED is a Fast Arduino Bit-banging library for addressable LEDs, like the WS2812B.


The Library can be downloaded here from GitHub, or directly as a zip file.

To install the library, unzip the source files into your Arduino “libraries” directory, like any other library. Rename the directory FAB_LED. Restart your Arduino IDE. You’re ready to test it.

Load an example, plug your WS2812B LEDs into pin D6 of your board (pin “~6” on an Arduino Uno), power it, then compile and load the program. The LED strip should cycle through different patterns. If you don’t use pin D6, just change the declaration in the program.

  • Jun 19, 2016: Refactoring the library to make it more compact and maintainable. Simplifying the API to draw() and send() from send_pixels().
  • May 2016: support for palettes, pixel remapping, parallel port updates (up to 6 ports on a 16MHz AVR), beginning support for APA102 and ARM.
  • Apr 1, 2016: updated repository. Added APA104, WS2812 support, fixed Mac compile bugs, slowed aggressive timing to handle APA104.
  • Mar 29, 2016: Library should adapt to CPU frequency, added 1K pixels demo. Forgot to update GIT repository!
  • Mar 23, 2016: The beta is available, with one code example.
  • Feb 29, 2016: I’ve finally debugged my WS2812B library to the point where it all works on a 16MHz Arduino UNO. Now why would anyone care for a me too library?


The FAB_LED library has a few benefits attached to it versus other libraries (AdaFruit NeoPixel or FastLED libraries):

  • The low level bit-banging code is written in gcc C, without any inline assembly. This means it should support any CPU frequency, not just the frequencies that I have hardcoded.
  • The LED class supports multiple pixel formats, which provide great flexibility to choose between pixel manipulation convenience and pixel memory size. A pixel can use from 1 bit of memory to 32 bits, using different schemes:
    • 3 uint8_t per pixel (24 bits)
    • One uint32_t word per pixel (32 bits)
    • 1, 2, 4 or 8 bits per pixel mapped to a color palette of 2, 4, 16 or 256 colors
    • One uint16_t word per pixel (16 bits, 5 bits per color), with a choice of 3 levels of brightness
    • Separate pixel clear method and a 256-gray levels method allow you to clear or set a LED strip of any size without using up any memory.
  • The methods can be called back to back to duplicate patterns on infinitely long LED strips, tested on a 16MHz Arduino Uno. Timings for the palette method work but are very tight.
  • Alternatively, you can define multiple arrays to draw frames of an animation, and just change which array is drawn at every clock tick. There is no need to re-draw the frames, unlike other LED libraries.
  • Using the class requires nothing more than knowing which port your LED strip is attached to. A physical port on Arduino is 2 elements, the port letter A to F, and the port pin, from 0 to 7.

At this point FAB_LED is not vaporware anymore. It is fully working.


The following example uses the palette routine, using 2 bits per pixel.

  • It uses a color palette taking only 12 bytes of RAM to hold the 4 color values.
  • It defines a 16-pixel pattern which fits in another 4 bytes of RAM.
  • The 16-pixel pattern is sent 11 times in a tight loop, to light up the whole 170-LED strip.
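For illustration, the 2-bits-per-pixel bookkeeping could look like this sketch (get_pixel/set_pixel are hypothetical names for this post, not the FAB_LED API):

```cpp
#include <cstdint>

// 16 pixels pack into 4 bytes at 2 bits per pixel, and the 4-entry
// RGB palette takes 12 bytes, matching the RAM figures above.
struct rgb { uint8_t r, g, b; };

rgb palette[4];     // 4 colors x 3 bytes = 12 bytes of RAM
uint8_t pattern[4]; // 16 pixels x 2 bits = 4 bytes of RAM

uint8_t get_pixel(uint8_t i) { // palette index of pixel i (0..15)
    return (pattern[i / 4] >> ((i % 4) * 2)) & 0x3;
}

void set_pixel(uint8_t i, uint8_t color) { // color is 0..3
    uint8_t shift = (i % 4) * 2;
    pattern[i / 4] = (pattern[i / 4] & ~(0x3 << shift)) | ((color & 0x3) << shift);
}
```

Animating then only requires rewriting the 12-byte palette, not the pixel buffer.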

The animation is done by manipulating the palette colors after the tight sendPixel loop:

  • We rotate one of the four colors (blue-green pulse).
  • We blink on/off the red color.

All this uses a total of 16 bytes for the data arrays, and a one-page user code loop to animate. The code page is shown as a screen capture.

The bottom loop does the display directly; there is no need for a show() routine, since pixels are painted on the fly as soon as sendPixel() is called.

Here’s a video of this code running:

I am slowly adding 2D support. This is my first rendering. I had to put a cloth over the LEDs to diffuse them for the photo. It’s an 8×8 pixel board:


Other links


Ooh, and while posting my video I saw someone posted how to pipeline data directly from a USB port… I’ll have to look into that, to maybe stream from USB or an SD card! They can handle thousands of LEDs.

Arduino Reference Links

I’ll try to track interesting reads here: