art with code


Unified Interconnect

Playing with InfiniBand got me thinking. This thing is basically a PCIe-to-PCIe bridge. The old kit runs at PCIe 3.0 x4 speeds, the new stuff at x16. The next generation is x16 PCIe 4.0 and 5.0.

Why jump through all the hoops? Thunderbolt 3 is x4 PCIe over a USB-C connector. What if you bundle four of those and throw in some fiber and transceivers for long distances? You get x16 PCIe between two devices.

And once you start thinking of computers as a series of components hanging off a PCIe bus, your system architecture clarifies dramatically. A CPU is a PCIe 3 device with 40 to 64 lanes. DRAM uses around 16 lanes per channel.

GPUs are now hooked up as 16-lane devices, but could saturate 256 to 1024 lanes. Because of that, GPU RAM is on the GPU board. If the GPU had enough lanes, you could hook GPU RAM up to the PCIe bus with 32 lanes per GDDR5 chip. HBM is probably too close to the GPU to separate.

You could build a computer with 1024 lanes, then start filling them up with the mix of GPUs, CPUs, DRAM channels, NVMe and connectivity that you require. Three CPUs with seven DRAM channels? Sure. Need an extra CPU or two? Just plug them in. How about a GPU-only box, with the CPU and OS on another node? Or CPUs connected as accelerators with 8 lanes per socket, if you'd rather use the extra lanes for other stuff. Or a mix of x86 and ARM CPU cards to let you run mixed workloads at max speed and power efficiency.

Think of a rack of servers, sharing a single PCIe bus. It'd be like one big computer with everything hotpluggable. Or a data center, running a single massive OS instance with 4 million cores and 16 PB of RAM.

Appendix: more devices

Then you've got the rest of the devices, and they're pretty well on the bus already. NVMe comes in 4-lane blocks. Thunderbolt 3, Thunderbolt 2 and USB 3.1 are 4-, 2- and 1-lane devices. SAS and SATA are a bit awkward, taking up a bit more than one lane and a bit more than half a lane, respectively. I'd replace them with NVMe connectors.

Display connectors could live on the bus as well (given some QoS to keep up the interactivity). HDMI 2.1 uses 6 lanes, HDMI 2 is a bit more than 2 lanes. DisplayPort's next generation might go up to 7-8 lanes.
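
The lane counts above are just bandwidth arithmetic: one PCIe 3.0 lane moves roughly 1 GB/s after encoding overhead, so a device's bandwidth in GB/s is more or less its lane budget. A quick sketch of the sums (0.985 GB/s per lane is the 8 GT/s times 128b/130b figure):

# HDMI 2.1: 48 Gbit/s link -> GB/s -> PCIe 3.0 lanes
echo "48 / 8 / 0.985" | bc -l    # ~6.1 lanes
# SATA 3: 6 Gbit/s with 8b/10b encoding -> ~0.6 GB/s -> a bit more than half a lane
echo "6 * 0.8 / 8 / 0.985" | bc -l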

Existing kit

[Edit] Hey, someone's got products like this already. Dolphin ICS produces a range of PCIe network devices. They've even got an IP-over-PCIe driver.

[Edit #2] Hey, the Gen-Z Interconnect looks a bit like this: Gen-Z Interconnect Core Specification 1.0 Published


InfiniBanding, pt. 2

InfiniBand benchmarks with ConnectX-2 QDR cards (PCIe 2.0 x8 -- a very annoying lane spec outside of server gear: either the card eats up a PCIe 3.0 x16 slot, or you end up running it at half speed, and PCIe 2.0 x8 is too slow to hit the full 32 Gbps of QDR InfiniBand anyway. Oh yes, I plugged one into a slot with only four lanes wired up, and it gets half the RDMA bandwidth in tests.)

Ramdisk-to-ramdisk, 1.8 GB/s with Samba and IP-over-InfiniBand.
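
For reference, this is roughly how a ramdisk-to-ramdisk Samba test can be set up (share name, mount points and file size are made up); tmpfs on both ends takes the disks out of the equation:

# On the server, put the shared directory on tmpfs:
mount -t tmpfs -o size=8g tmpfs /srv/ramshare

# On the client, mount the share over IPoIB and copy into a local tmpfs:
mkdir -p /mnt/ramshare /mnt/ramdisk
mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk
mount -t cifs //ib-server/ramshare /mnt/ramshare -o guest,vers=3.0
time dd if=/mnt/ramshare/testfile of=/mnt/ramdisk/testfile bs=1M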

IPoIB iperf2 does 2.9 GB/s with four threads. The single-threaded iperf3 goes 2.5 GB/s if I luck out on the CPU affinity lottery (some cores / NUMA nodes only do 1.8 GB/s).
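
The commands behind those numbers were roughly these (hostnames are placeholders); pinning iperf3 near the HCA's NUMA node with numactl takes the luck out of the affinity lottery:

# iperf2, four parallel streams over IPoIB:
iperf -c ib-server -P 4

# iperf3 is effectively single-threaded, so pin it to a specific NUMA node:
numactl --cpunodebind=0 --membind=0 iperf3 -c ib-server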

NFS over RDMA, 64k random reads with QD8 and one thread, fio tells me read bw=2527.4MB/s. Up to 2.8 GB/s with four threads. Up to 3 GB/s with 1MB reads.
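
For the curious, a sketch of the kind of mount and fio job behind those numbers; the server name, export path and file size are placeholders (NFS over RDMA listens on port 20049 by default):

mount -t nfs -o rdma,port=20049,vers=3 ib-server:/export /mnt/nfs

# 64k random reads, queue depth 8, one thread:
fio --name=randread --filename=/mnt/nfs/testfile --rw=randread --bs=64k \
    --iodepth=8 --numjobs=1 --ioengine=libaio --direct=1 --size=4g \
    --runtime=30 --time_based --group_reporting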

The practical bandwidth limit of the PCIe 2.0 x8 link these InfiniBand QDR cards use is around 25.6 Gbps, or 3.2 GB/s (4 GB/s raw, minus protocol overhead). Testing with ib_read_bw, it maxes out at around 3 GB/s.
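
ib_read_bw comes from the perftest package and measures raw RDMA read bandwidth between two hosts, which makes it a decent ceiling check for the card and the PCIe link:

# On one host:
ib_read_bw -a
# On the other (sweeps message sizes and reports bandwidth for each):
ib_read_bw -a ib-server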

So. Yeah. There's 200 MB/s of theoretical performance left on the table (might be ESXi PCIe passthrough exposing only 128 byte MaxPayload), but can't complain.
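
The MaxPayload suspicion is easy to check from the guest; the PCI address below is a placeholder for whatever lspci reports for the HCA:

lspci | grep -i mellanox
sudo lspci -vv -s 04:00.0 | grep -i -E 'maxpayload|maxreadreq'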

And... There's an upgrade path composed of generations of obsoleted enterprise gear: FDR gets you PCIe 3.0 x4 cards and should also get you the full 4 GB/s bandwidth of the QDR switch. FDR switches aren't too expensive either, for a boost to 5.6 GB/s per link. Then, pick up EDR / 100 GbE kit...

Now the issue (if you can call it that) is that the server is going to have 10 GB/s of disk bandwidth available, which is going to be bottlenecked (if you can call it that) by the 3 GB/s network.

I could run multiple IB adapters, but I'll run out of PCIe slots. Possibilities: bifurcate an x16 slot into two x8 slots for IB or four x4 slots for NVMe. Or bifurcate both slots. Or get a dual-FDR/EDR card with an x16 connector to get 8 GB/s on the QDR switch. Or screw it and figure out how to make money out of this and use it to upgrade to dual-100 GbE everywhere.

(Yes, so, we go from "set up NAS for office so that projects aren't lying around everywhere" to "let's build a 400-machine distributed pool of memory, storage and compute with GPU-accelerated compute nodes and RAM nodes and storage nodes and wire it up with fiber for 16-48 GB/s per-node bandwidth". Soon I'll plan some sort of data center and then figure out that we can't afford it and go back to making particle flowers in Unity.)


Quick test with old InfiniBand kit

Two IBM ConnectX-2 cards, hooked up to a Voltaire 4036 switch that sounds like a turbocharged hair dryer. CentOS 7, one host on bare metal, the other on top of ESXi 6.5.

Best I saw thus far: 3009 MB/s RDMA transfer. Around 2.4 GB/s with iperf3. These things seem to be CPU capped, top is showing 100% CPU use. Made an iSER ramdisk too, it was doing 1.5 GB/s-ish with ext4.
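
One way to back an iSER ramdisk like that is a kernel ram block device (sizes and device names here are made up; LIO/targetcli also has a ramdisk backstore type). The block device gets exported over iSER, and the initiator puts ext4 on the disk that shows up:

# One 8 GiB ram block device to use as the iSER backing store (rd_size is in KiB):
modprobe brd rd_nr=1 rd_size=8388608
# On the initiator, after logging in to the target:
mkfs.ext4 /dev/sdX && mount /dev/sdX /mnt/iser-ramdisk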

Will examine more next week. With later kernel and firmwares and whatnot.

The end goal here would be to get 2+ GB/s file transfers over Samba or NFS. Probably not going to happen, but eh, give it a try.

That switch though. Need a soundproof cabinet.


OpenVPN settings for 1 Gbps tunnel

Here are the relevant parts of the OpenVPN 2.4 server config that got me 900+ Mbps with iperf3 on a GbE LAN. The tunnel was between two PCs with high single-core performance, a Xeon 2450v2 and an i7-3770. OpenVPN uses about 50% of a CPU core on both the client and the server when the tunnel is busy. For reference, I tried running the OpenVPN server on my WiFi router; it peaked out at 60 Mbps.

# Use TCP, I couldn't get good perf out of UDP. 

proto tcp

# tun or tap, roughly same perf
dev tun 

# Use AES-256-GCM:
#  - more secure than 128 bit
#  - GCM has built-in authentication, see https://en.wikipedia.org/wiki/Galois/Counter_Mode
#  - AES-NI accelerated, the raw crypto runs at GB/s speeds per core.

cipher AES-256-GCM

# Don't split the jumbo packets traversing the tunnel.
# This is useful when tun-mtu is different from 1500.
# With default value, my tunnel runs at 630 Mbps, with mssfix 0 it goes to 930 Mbps.

mssfix 0

# Use jumbo frames over the tunnel.
# This reduces the number of packets sent, which reduces CPU load.
# On the other hand, now you need 6-7 MTU 1500 packets to send one tunnel packet. 
# If one of those gets lost, it delays the entire jumbo packet.
# Digression:
#   Testing between two VBox VMs on a i7-7700HQ laptop, MTU 9000 pegs the vCPUs to 100% and the tunnel runs at 1 Gbps.
#   A non-tunneled iperf3 runs at 3 Gbps between the VMs.
#   Upping this to 65k got me 2 Gbps on the tunnel and half the CPU use.

tun-mtu 9000

# Send packets right away instead of bundling them into bigger packets.
# Improves latency over the tunnel.

tcp-nodelay

# Increase the transmission queue length.
# Keeps the TUN busy to get higher throughput.
# Without QoS, you should get worse latency though.

txqueuelen 15000

# Increase the TCP queue size in OpenVPN.
# When OpenVPN overflows the TCP queue, it drops the overflow packets.
# Which kills your bandwidth unless you're using a fancy TCP congestion algo.
# Increase the queue limit to reduce packet loss and TCP throttling.

tcp-queue-limit 256

And here is the client config, pretty much the same except that we only need to set tcp-nodelay on the server:

proto tcp
cipher AES-256-GCM
mssfix 0
tun-mtu 9000
txqueuelen 15000
tcp-queue-limit 256

To test, run iperf3 -s on the server and connect to it over the tunnel from the client with iperf3 -c <server tunnel IP>. For more interesting tests, run the iperf3 server on a different host on the endpoint LAN, or try to access network shares.

I'm still tuning this (and learning about the networking stack) to get a Good Enough connection between the two sites, so let me know if you have any tips or corrections.

P.S. Here's the iperf3 output.

$ iperf3 -c
Connecting to host, port 5201
[  4] local port 39590 connected to port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   112 MBytes   942 Mbits/sec    0   3.01 MBytes
[  4]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   6.00-7.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.08 GBytes   928 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec                  receiver

iperf Done.


Fast-ish OpenVPN tunnel

500 Mbps OpenVPN throughput over the Internet, nice. Was aiming for 900 Mbps, which seems to work on LAN, but no cigar. [Edit: --tcp-nodelay --tcp-queue-limit 256 got me to 680 Mbps. Which is very close to non-tunneled line speed as measured by an HTTP download.]

OpenVPN performance is very random too. I seem to be getting different results just because I restarted the client or the server.

The config is two wired machines, each with a 1 Gbps fibre Internet connection. The server is a Xeon E3-1231v3, a 3.4 GHz Haswell Xeon. The client is my laptop with a USB3 GbE adapter and i7-7700HQ. Both machines get 900+ Mbps on Speedtest, so the Internet connections are fine.

My OpenVPN is set to the TCP protocol (faster and more reliable than UDP in my testing) and uses AES-256-GCM as the cipher. Both machines are capable of pushing multiple gigaBYTES per second over openssl AES-256, so crypto isn't a bottleneck AFAICT. The tun-mtu is set to 9000, which performs roughly as well as 48000 or 60000 but has smaller packets, which seems to be less flaky than big MTUs. The mssfix setting is set to 0, and fragment to 0 as well, though fragment shouldn't matter over TCP.
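
The "crypto isn't a bottleneck" bit is easy to verify: openssl's built-in benchmark reports per-core throughput, and with AES-NI the bigger block sizes should land in the GB/s range:

openssl speed -evp aes-256-gcm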

Over iperf3 I get 500 Mbps between the endpoints. With HTTP, roughly that too. Copying from a remote SMB share on another host goes at 30 MB per second, but the remote endpoint can transfer from the file server at 110 MB per second (protip: mount.cifs -o vers=3.0). Thinking about it a bit, I need to test with a second NIC in the VPN box, since right now VPN traffic might be competing with LAN traffic.
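
For reference, the SMB3 mount on the working endpoint looks something like this (server, share and user are placeholders):

sudo mount.cifs //fileserver/projects /mnt/projects -o vers=3.0,username=me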


Building a storage pyramid

The office IT infrastructure plan is something like this: build interconnected storage pyramids with compute. The storage pyramids consist of compute hooked up to fast memory, then solid state memory to serve mid-IOPS and mid-bandwidth workloads, then big spinning disks as archive. The different layers of the pyramid are hooked up via interconnects that can be machine-local or over the network.

The Storage Pyramid

Each layer of the storage pyramid has different IOPS and bandwidth characteristics. Starting from the top, you've got GPUs with 500 GB/s memory, connected via a 16 GB/s PCIe bus to the CPU, which has 60 GB/s DRAM. The next layer is also on the PCIe bus: Optane-based NVMe SSDs, which can hit 3 GB/s on streaming workloads and 250 MB/s on random workloads (parallelizable to maybe 3x that). After Optane, you've got flash-based SSDs that push 2-3 GB/s streaming accesses and 60 MB/s random accesses. At the next level, you could have SAS/SATA SSDs which are limited to 1200/600 MB/s streaming performance by the bus. And at the bottom lie the HDDs that can do somewhere between 100 to 240 MB/s streaming accesses and around 0.5-1 MB/s random accesses.

The device speeds guide us in picking the interconnects between them. Each HDD can fill a 120 MB/s GbE port. SAS/SATA SSDs plug into 10GbE ports, with their 1 GB/s performance. For PCIe SSDs and Optane, you'd go with either 40GbE or InfiniBand QDR, and hit 3-4 GB/s. After the SSD layer, the interconnect bottlenecks start rearing their ugly heads.

You could use 200Gb InfiniBand to connect single DRAM channels at 20 GB/s, but even then you're starting to get bottlenecked at high DDR4 frequencies. Plus you have to traverse the PCIe bus, which further knocks you down to 16 GB/s over PCIe 3.0 x16. It's still sort of feasible to hook up a cluster with shared DRAM pool, but you're pushing the limits.

Usually you're stuck inside the local node for performance at the DRAM level. The other storage layers you can run over the network without much performance loss.

The most unbalanced bottleneck in the system is the CPU-GPU interconnect. The GPU's 500 GB/s memory is hooked to the CPU's 60 GB/s memory via a 16 GB/s PCIe bus. Nvidia's NVLink can hook up two GPUs together at 40 GB/s (up to 150 GB/s for Tesla V100), but there's nothing to get faster GPU-to-DRAM access. This is changing with the advent of PCIe 4.0 and PCIe 5.0, which should be able to push 128 GB/s and create a proper DRAM interconnect between nodes and between the GPU and the CPU. The remaining part of the puzzle would be some sort of 1 TB/s interconnect to link GPU memories together.

The Plan

Capacity-wise, my plan is to get 8 GB of GPU RAM, 64 GB of CPU RAM, 256 GB of Optane, 1 TB of NVMe flash, and 16 TB of HDDs. For a nicer-cleaner-more-satisfying progression, you could throw in a 4 TB SATA flash layer but SATA flash is kind of DOA as long as you have NVMe and PCI-E slots to use -- the price difference between NVMe flash and SATA flash is too small compared to the performance difference.

If I can score an InfiniBand interconnect or 40GbE, I'll stick everything from Optane on down into a storage server. It should perform at near-local speeds and simplify storage management. Shared pool of data that can be expanded and upgraded without having to touch the workstations. Would be cool to have a shared pool of DRAM too but eh.

Now, our projects are actually small enough (half a gig each, maybe 2-3 of them under development at once) that I don't believe we will ever hit disk in daily use. All the daily reads and writes should be to client DRAM, which gets pushed to server DRAM and written down to flash / HDD at some point later. That said, those guys over there *points*, they're doing some video work now...

The HDDs are mirrored to an off-site location over GbE. The HDDs are likely capable of saturating a single GbE link, so 2-3 GbE links would be better for live mirroring. For off-site backup (maybe one that runs overnight), 1 GbE should be plenty.
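
One simple way to run the overnight mirror would be rsync over SSH from a cron job; the paths and host here are hypothetical, and ZFS send/receive would do the job just as well if the pools are ZFS:

# Run nightly from cron on the storage server:
rsync -aH --delete /tank/projects/ backup-site:/tank/projects-mirror/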

In addition to the off-site storage mirror, there's some clouds and stuff for storing compiled projects, project code and documents. These either don't need to sync fast or are small enough to do so quickly.

Business Value

Dubious. But it's fun. And opens up possible uses that are either not doable on the cloud or way too expensive to maintain on the cloud. (As in, a single month of AWS is more expensive than what I paid for the server hardware...)



Ultraviolet Fairies

"Can you make them dance?", Pierre asked. Innocent question, but this was a cloud of half a million particles. Dance? If I could make the thing run in the first place it would be cause for celebration.

The red grid of the Kinect IR illuminator came on. Everything worked perfectly again. Exactly twelve seconds later, it blinked out, as it had done a dozen times before. The visiting glass artist wasn't impressed with our demo.

Good tidings from France. The app works great on Pierre's newly-bought Acer laptop. A thunderhead was building on the horizon. The three-wall cave projection setup comes out with the wrong aspect ratio. I sipped my matcha latte and looked at the sun setting behind the cargo ships moored off West Kowloon. There's still 20 hours before the gig.

The motion was mesmerizing. Tiny fairies weaving around each other, hands swatting them aside on long trajectories off-screen. I clenched my fist and the fairies formed a glowing ring of power, swirling around my hand like a band of living light. The keyboard was bringing the escaping clouds to life, sending electric pulses through the expanding shells of fairies knocked off-course.

Beat. The music and Isabelle's motion become one, the cloud of fairies behind her blows apart from the force of her hands, like sand thrown in the air. Cut to the musicians, illuminated by the gridlines of the projection. Fingers beating the neon buttons of the keyboard, shout building in the microphone. The tension running through the audience is palpable. Beat. The flowing dancer's dress catches a group of fairies. Isabelle spins and sends them flying.

The AI

A dot. I press space. The dot blinks out and reappears. One, two, three, four, five, six, seven, eight, nine, ten. I press space. The dot blinks out and reappears. Human computation.

Sitting at a desk in Zhuzhou. The visual has morphed into a jumble of sharp lines, rhythmically growing every second. The pulse driving it makes it glow. Key presses simulate the drums we want to hook up to it. Rotate, rotate, zoom, disrupt, freeze. The rapid typing beat pushes the visual to fill-rate destroying levels and my laptop starts chugging.

Sharp lines of energy, piercing the void around them. A network of connections shooting out to other systems. Linear logic, strictly defined, followed with utmost precision. The lines begin to _bend_, almost imperceptibly at first. A chaotic flow pulls the lines along its turbulent path. And, stop. Frozen in time, the lines turn. Slowly, they begin to grow again, but straight and rigid, linear logic revitalized. Beginning of the AI Winter.

The fucking Kinect dropped out again! I start kicking the wire and twisting the connector. That fucker, it did it once in the final practice already, of course it has to drop out at the start of the performance as well. Isabelle's taking it in stride, she's such a pro. If the AI doesn't want to dance with her, she'll dance around it. I push myself down from my seat. How about if I push the wire at the Kinect end, maybe taping it to the floor did ... oh it works again. I freeze, not daring to move, lying on the floor of the theater. Don't move don't move don't move, keep working you bastard! The glowing filter bubbles envelop Isabelle's hands, the computer is responsive again. Hey, it's not all bad. We could use this responsiveness toggle for storytelling, one more tool in the box.

We're the pre-war interexpressionist movement. Beautiful butterflies shining in the superheated flashes of atomic explosions.
