fhtr

art with code

2018-04-21

rdma-pipe

Haven't you always wanted to create UNIX pipes that run from one machine to another? Well, you're in luck. Of sorts. For I have spent my Saturday hacking on an InfiniBand RDMA pipeline utility that lets you pipe data between commands running on remote hosts at multi-GB/s speeds.

Unimaginatively named, rdma-pipe comes with the rdpipe utility, which coordinates the rdsend and rdrecv programs that do the actual data piping. The rdpipe program uses SSH as the control channel and starts the send and receive programs on the remote hosts, piping the data through your commands.

For example:


  # The fabulous "uppercase on host1, reverse on host2"-pipeline.
  $ echo hello | rdpipe host1:'tr [:lower:] [:upper:]' host2:'rev'
  OLLEH

  # Send ZFS snapshots fast-like from fileserver to backup_server.
  $ rdpipe backup@fileserver:'zfs send -I tank/media@last_backup tank/media@head' backup_server:'zfs recv tank-backup/media'

  # Decode video on localhost, stream raw video to remote host.
  $ ffmpeg -i sintel.mpg -pix_fmt yuv420p -f rawvideo - | rdpipe playback_host:'ffplay -f rawvideo -pix_fmt yuv420p -s 720x480 -'

  # And who could forget the famous "pipe page-cached file over the network to /dev/null" benchmark!
  $ rdpipe -v host1:'</scratch/zeroes' host2:'>/dev/null'
  Bandwidth 2.872 GB/s

Anyhow, it's quite raw, new, and exciting, and it needs more eyeballs and tire-kicking. Have a look if you're on InfiniBand and need to pipe data across hosts.

2018-04-17

IO limits

It's all about latency, man. Latency, latency, latency. Latency drives your max IOPS. The other aspects are how big your IOs are and how many you can do in parallel. But, dude, it's mostly about latency. That's the thing, the big kahuna, the ultimate limit.

Suppose you've got a workload. Just chasing some pointers. This is a horrible workload. It just chases tiny 8-byte pointers around an endless expanse of memory, like some sort of demented camel doing a random walk in the Empty Quarter.

This camel, this workload, it's all about latency. How fast can you go from one pointer to the next? That gives you your IOPS. If it's from a super-fast spinning disk with a 10 ms latency, you'll get maybe 100 IOPS. From an NVMe flash SSD with 0.1 ms latency, 10000 IOPS. Optane's got 6-10 us latency, which gets you 100-170k IOPS. If it's, I don't know, a camel. Yeah. Man. How many IOPS can a camel do? A camel caravan can travel 40 kilometers per day. The average distance between two points in the Rub' al Khali? Well, it's like a 500x1000 km rectangle, right? About 400 kilometers[1] then. So on average it'd take the camel ten days to do an IO. That comes down to, yeah, about 1160 nanoIOPS.

Camels aren't the best for random access workloads.

There's also the question of the IO size. If you can only read one byte at a time, you aren't going to get huge throughput no matter how fast your access times are. Imagine a light speed interconnect with a length of 1.5 millimeters. That's about a 10 picosecond roundtrip. One bit at a time, you could do 12.5 GB per second. So, while that's super fast, it's still an understandable number. And that's the best-case scenario raw physical limit.

Now, imagine our camel. Trudging along in the sand, carrying a saddle bag with decorative stitchwork, tassels swinging from side to side. Inside the pouches of the saddle bag are 250 kilograms of MicroSD cards at 250 mg each, tiny brightly painted chips protected from the elements in anti-static bags. Each card can store 256 GB of data and the camel is carrying a million of them. The camel's IO size is 256 petabytes. At 1160 nanoIOPS, its bandwidth is about 300 GB/s. The camel has a higher bandwidth than our light speed interconnect. It's an FTL camel.

Let's add IO parallelism to the mix. Imagine a caravan of twenty camels, each camel carrying 256 petabytes of data. An individual camel has a bandwidth of about 300 GB/s, so if you multiply that by 20, you get the aggregate caravan bandwidth: about 6 TB/s. These camels are a rocking interconnect in a high-latency, high-bandwidth world.

Back to chasing 8-byte pointers. All we want to do is find one tiny pointer, read it, and go to the next one. Now it doesn't really matter how many camels you have or how much each can carry, all that matters is how fast they can go from place to place. In this kind of scenario, the light speed interconnect would still be doing 12.5 GB/s (heck, it'd be doing 12.5 GB/s at any IO size larger than a bit), but our proud caravan of camels would be reduced to 0.0000093 bytes per second. Yes, that's bytes. 9.3 microbytes per second.
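If you want to poke at the numbers, here's the whole back-of-the-envelope in one small C program. The constants are just the assumptions from above (40 km/day, ~400 km per hop, a million 256 GB cards, a bit-serial light speed link with a 10 ps roundtrip), nothing more authoritative than that.

/* Back-of-the-envelope: IOPS = 1 / latency, bandwidth = IOPS * IO size * parallelism. */
#include <stdio.h>

int main(void) {
    /* One camel IO: walk ~400 km at 40 km/day. */
    double latency_s  = (400.0 / 40.0) * 86400.0;   /* ~10 days, in seconds */
    double camel_iops = 1.0 / latency_s;

    double io_size  = 1e6 * 256e9;                  /* a million 256 GB cards */
    double camel_bw = camel_iops * io_size;

    printf("camel: %.0f nanoIOPS\n", camel_iops * 1e9);
    printf("camel bandwidth: %.0f GB/s\n", camel_bw / 1e9);
    printf("20-camel caravan: %.1f TB/s\n", 20 * camel_bw / 1e12);
    printf("camel chasing 8-byte pointers: %.1f microbytes/s\n",
           camel_iops * 8 * 1e6);

    /* The bit-serial light speed link: one bit per 10 ps roundtrip. */
    double link_bw = (1.0 / 10e-12) / 8.0;          /* bits/s -> bytes/s */
    printf("light speed link: %.1f GB/s\n", link_bw / 1e9);
    return 0;
}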

If you wanted to speed up the camel network, you could spread them evenly over the desert. Now the maximum distance a camel has to travel to the data is divided by the number of camels serving the requests. This works like a Redundant Array of Independent Camels, or RAIC for short. We handwave away the question of how the camels synchronize with each other.

Bringing all this back to the mundane world of disks and chips, the throughput of a chip device goes through two phases: first, at QD1, it runs at its maximum IOPS and throughput grows with block size, up to its maximum IO block size; after that it runs at flat IOPS and throughput grows with parallelism, up to its maximum parallelism. In theory this would give you a linear throughput increase with increasing block size until you run into the device throughput limit or the bus throughput limit.

You can roughly calculate the maximum throughput of a device by multiplying its IOPS by its IO block size and its parallelism. E.g. if a flash SSD can do ten thousand 8k IOPS and 16 parallel requests, its throughput would be 1.28 GB/s. If you keep the controller and the block size and replace the flash chips with Optane that can do 10x as many QD1 IOPS, you could reach 12.8 GB/s throughput. PCIe x16 Optane cards anyone?

To take it a step further, DRAM runs at 50 ns latency, which would give you 20 million IOPS, or 200x that of Optane. So why don't we see RAM throughput in the 2.5 TB/s region? First, DDR block size is 64 bits (or 8 bytes). Second, CPUs only have two to four memory channels. Taking those numbers at face value, we should only be seeing 320 MB/s to 640 MB/s memory bandwidth.

"But that doesn't make sense", I hear you say, "my CPU can do 90 GB/s reads from RAM!" Glad you asked! After the initial access latency, DRAM actually operates in a streaming mode that ups the block size eight-fold to 64 bytes and uses the raw 400 MHz bus IOPS [2]. Plugging that number into our equation, we get a four channel setup running at 102.4 GB/s.

To go higher than that, you have to boost that bus. E.g. HBM uses a 1024-bit bus, which gets you up to 400 GB/s over a single channel. With dual memory channels, you're nearly at 1 TB/s. Getting to camel caravan territory. You'll still be screwed on pointer-chasing workloads though. For those, all you want is max MHz.

[1] var x=0, samples=100000; for (var i=0; i < samples; i++) { var dx = 500*(Math.random() - Math.random()), dy = 1000*(Math.random() - Math.random()); x += Math.sqrt(dx*dx + dy*dy); } x /= samples;

[2] Please tell me how it actually works; this is based on my incomplete understanding of Wikipedia's incomplete explanation. As in, what kind of workload can you run from DRAM at burst rate?

2018-04-12

RDMA cat

Today I wrote a small RDMA test program using libibverbs. That library has a pretty steep learning curve.

Anyhow. To use libibverbs and librdmacm on CentOS, install rdma-core-devel and compile your things with -lrdmacm -libverbs.
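A minimal sanity check that the install worked is a little program that just lists the verbs devices it can see; compile with gcc probe.c -libverbs (probe.c being whatever you call the file).

/* probe.c: list the verbs devices libibverbs can see. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < n; i++)
        printf("%s\n", ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}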

My test setup is two IBM-branded Mellanox ConnectX-2 QDR InfiniBand adapters connected over a Voltaire 4036 QDR switch. These things are operating at PCIe 2.0 x8 speed, which is around 3.3 GB/s. Netcat and friends get around 1 GB/s transfer rates piping data over the network. Iperf3 manages around 2.9 GB/s. With that in mind, let's see what we can reach.

I was basing my test programs on these amazingly useful examples: https://github.com/linzion/RDMA-example-application , https://github.com/jcxue/RDMA-Tutorial , http://www.digitalvampire.org/rdma-tutorial-2007/notes.pdf , and of course http://www.rdmamojo.com/ . At one point, after banging my head on the ibverbs library for a bit too long, I was thinking of just using MPI to write the thing and wound up on http://mpitutorial.com - but I didn't have the agility to jump from host-to-host programs to strange new worlds, so I kept on using ibverbs for these tests.

First light


The first test program just read some data from STDIN and sent it to the server, which reversed it and sent it back. From there I worked towards sending multiple blocks of data (my goal here was to write an RDMA version of cat).

I had some trouble figuring out how to make the two programs have a repeatable back-and-forth dialogue. First I was listening to too many events with the blocking ibv_get_cq_event call, and that was hanging the program. Only call it as many times as you're expecting replies.

The other gotcha was that my send and receive work requests shared the same sge struct, and the send part of the dialogue was setting the sge buffer length to 1, since it was only sending acks back to the other side. Setting it back to the right size before posting each work request solved the problem.
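The shape of the fix is roughly this (a sketch, not the actual code from my programs; the qp, mr, and buffer arguments stand in for whatever your connection setup created): each direction gets its own sge, and the length is set right before posting.

/* Sketch: separate sge structs per direction, length set at post time. */
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a send of msg_len bytes from send_buf. */
int post_send(struct ibv_qp *qp, struct ibv_mr *send_mr,
              void *send_buf, uint32_t msg_len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)send_buf,
        .length = msg_len,            /* the real message size, every time */
        .lkey   = send_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}

/* The receive direction gets its own sge and its own buffer. */
int post_recv(struct ibv_qp *qp, struct ibv_mr *recv_mr,
              void *recv_buf, uint32_t buf_len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)recv_buf,
        .length = buf_len,
        .lkey   = recv_mr->lkey,
    };
    struct ibv_recv_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad = NULL;
    return ibv_post_recv(qp, &wr, &bad);
}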

Optimization


Once I got the rdma-cat working, performance wasn't great. I was reading in a file from page cache, sending it to the receiver, and writing it to the STDOUT of the receiver. The program was sending 4k messages, doing 4k acks, and a mutex-requiring event ack after each message. This ran at around 100 MB/s. Changing the 4k acks to single-byte acks and doing the event acks for all the events at once got me to 140 MB/s.
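The event handling ended up looking roughly like this sketch (again, not the actual code): arm the completion queue, call the blocking ibv_get_cq_event only once per notification, poll out everything that's there, and ack the whole batch of events in one call at the end instead of taking the mutex per message.

/* Sketch: drain 'expected' completions from a CQ via a completion channel. */
#include <stdio.h>
#include <infiniband/verbs.h>

int wait_for_completions(struct ibv_comp_channel *chan,
                         struct ibv_cq *cq, int expected)
{
    int done = 0;
    unsigned int unacked = 0;

    if (ibv_req_notify_cq(cq, 0))            /* arm before waiting */
        return -1;

    while (done < expected) {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        if (ibv_get_cq_event(chan, &ev_cq, &ev_ctx))
            return -1;
        unacked++;

        if (ibv_req_notify_cq(ev_cq, 0))      /* re-arm, then poll */
            return -1;

        struct ibv_wc wc;
        int n;
        while ((n = ibv_poll_cq(ev_cq, 1, &wc)) > 0) {
            if (wc.status != IBV_WC_SUCCESS) {
                fprintf(stderr, "work completion failed: %s\n",
                        ibv_wc_status_str(wc.status));
                return -1;
            }
            done++;
        }
        if (n < 0)
            return -1;
    }
    ibv_ack_cq_events(cq, unacked);           /* one batched, mutex-taking ack */
    return 0;
}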

How about doing larger messages? Change the message size to 65k and the cat goes at 920 MB/s. That's promising! One-megabyte messages got it to 1.4 GB/s. With eight meg messages I was up to 1.78 GB/s and stuck there.

I did another test program that was just sending an 8 meg buffer to the other machine, which didn't do anything to the data. This is useful to get an optimal baseline and gauge perf for a single process use case. The test program was running at 2.9 GB/s.

Adding a memcpy to the receive loop roughly halved the bandwidth to 1.3 GB/s. Moving to a round-robin setup with one buffer receiving data while another buffer is having the data copied out of it boosted the bandwidth to 3 GB/s.
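A sketch of the round-robin receive loop, leaning on the post_recv and wait_for_completions helpers sketched above (the real code differs, and the connection setup is elided): the next receive is posted into the other buffer before any time is spent copying data out of the current one.

/* Double-buffered receive sketch: qp/chan/cq/mr/buf come from elided setup. */
#include <stdint.h>
#include <unistd.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (8 << 20)   /* eight meg messages */

int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len);
int wait_for_completions(struct ibv_comp_channel *chan, struct ibv_cq *cq, int expected);

int recv_loop(struct ibv_qp *qp, struct ibv_comp_channel *chan,
              struct ibv_cq *cq, struct ibv_mr *mr[2], char *buf[2],
              long num_messages)
{
    int cur = 0;
    if (post_recv(qp, mr[cur], buf[cur], BUF_SIZE))
        return -1;

    for (long i = 0; i < num_messages; i++) {
        int next = cur ^ 1;
        /* Wait for the message to land in buf[cur]... */
        if (wait_for_completions(chan, cq, 1))
            return -1;
        /* ...immediately post the next receive into the other buffer... */
        if (i + 1 < num_messages && post_recv(qp, mr[next], buf[next], BUF_SIZE))
            return -1;
        /* ...and only then copy the data out (memcpy in the benchmark,
         * a write to stdout in the cat). */
        if (write(STDOUT_FILENO, buf[cur], BUF_SIZE) < 0)
            return -1;
        cur = next;
    }
    return 0;
}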

The send loop could read in data at 5.8 GB/s from the page cache, but the RDMA pipe was only doing 1.8 GB/s. Moving the read to happen right after each send got them both moving in parallel, which got the full rdma_send < inputfile ; rdma_recv | wc -c pipeline running at 2.8 GB/s.
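The matching sketch for the send side, with the same caveats as above (post_send and wait_for_completions are the earlier helpers, setup elided): post the send, read the next chunk while the previous send is still in flight, then wait for the completion.

/* Send-side sketch: overlap the page-cache read of the next chunk with the
 * RDMA send of the current one. */
#include <stdint.h>
#include <unistd.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (8 << 20)

int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len);
int wait_for_completions(struct ibv_comp_channel *chan, struct ibv_cq *cq, int expected);

int send_loop(int fd, struct ibv_qp *qp, struct ibv_comp_channel *chan,
              struct ibv_cq *cq, struct ibv_mr *mr[2], char *buf[2])
{
    int cur = 0;
    ssize_t len = read(fd, buf[cur], BUF_SIZE);

    while (len > 0) {
        if (post_send(qp, mr[cur], buf[cur], (uint32_t)len))
            return -1;
        int next = cur ^ 1;
        /* This read runs while the HCA is still pushing buf[cur] out. */
        ssize_t next_len = read(fd, buf[next], BUF_SIZE);
        if (wait_for_completions(chan, cq, 1))
            return -1;
        cur = next;
        len = next_len;
    }
    return len < 0 ? -1 : 0;
}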

There was an issue with the send buffer contents getting mangled by an incoming receive. Gee, it's almost like I shouldn't use the same buffer for sending and receiving messages. Using a different buffer for the received messages resolved the issue.

It works!


I sent a 4 gig file and ran diff on it, no probs. Ditto for files smaller than the buffer size and for small strings sent with echo.

RDMA cat! 2.9 GB/s over the network.

Let's try sending video frames next. Based on these CUDA bandwidth timings, I should be able to do 12 GB/s up and down. Now I just need to get my workstation on the IB network (read: buy a new workstation with more than one PCIe slot.)

[Update] For the heck of it, I tried piping through two hosts.

[A]$ rdma_send B < inputfile
[B]$ rdma_recv | rdma_send C
[C]$ rdma_recv | wc -c

2.5 GB/s. Not bad, could do networked stream processing. Wonder if it would help if I zero-copy passed the memory regions along the pipe.

And IB is capable of multicast as well... 




2018-04-06

4k over IB

So, technically, I could stream uncompressed 4k@60Hz video over the InfiniBand network. 4k60 needs about 2 GB/s of bandwidth (at 32 bits per pixel, 3840 x 2160 x 4 bytes x 60 Hz is just under 2 GB/s), and the network goes at 3 GB/s.

This... how would I try this?

I'd need a source of 4k frames. Draw on the GPU to a framebuffer, then glReadPixels (or CUDA GPUDirect RDMA). Then use IB verbs to send the framebuffer to another machine. Upload it to the GPU to display with glTexImage (or GPUDirect from the IB card).

And make sure that everything in the data path runs at above 2 GB/s.

Use cases? Extreme VNC? Combining images from a remote GPU and local GPU? With a 100Gb network, you could pull in 6 frames at a time and composite in real time I guess. Bringing in raw 4k camera streams to a single box over a switched COTS fabric.

Actually, this would be "useful" for me, I could stream interactive video from a big beefy workstation to a small installation box. The installation box could handle stereo camera processing and other input, then send the control signals to the rendering station. (But at that point, why not just get longer HDMI and USB cables.)

2018-04-02

Quick timings

NVMe and NFS, cold cache on client and server. 4.3 GiB in under three seconds.

$ cat /nfs/nvme/Various/UV.zip | pv > /dev/null
 4.3GiB 0:00:02 [1.55GiB/s]

The three-disk HDD pool gets around 300 MB/s, but once the ARC picks up the data it goes at NFS + network speed. Cold cache on the client.

$ echo 3 > /proc/sys/vm/drop_caches
$ cat /nfs/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:10 [ 1.5GiB/s]

Samba is heavier somehow.

$ cat /smb/hdd/Videos/*.mp4 | pv > /dev/null
16.5GiB 0:00:13 [1.26GiB/s]

NFS over RDMA from the ARC, direct to /dev/null (which, well, it's not a very useful benchmark). But 2.8 GB/s!

$ time cat /nfs/hdd/Videos/*.mp4 > /dev/null

real    0m6.269s
user    0m0.007s
sys     0m4.056s
$ cat /nfs/hdd/Videos/*.mp4 | wc -c
17722791869
$ python -c 'print(17.7 / 6.269)'
2.82341681289

$ time cat /nfs/hdd/Videos/*.mp4 > /nfs/nvme/bigfile

real    0m15.538s
user    0m0.016s
sys     0m9.731s

# Streaming read + write at 1.13 GB/s

How about some useful work? Parallel grep at 3 GB/s. Ok, we're at the network limit, call it a day.

$ echo 3 > /proc/sys/vm/drop_caches
$ time (for f in /nfs/hdd/Videos/*.mp4; do grep -o --binary-files=text XXXX "$f" & done; for job in `jobs -p`; do wait $job; done)
XXXX
XXXX
XXXX
XXXX
XXXX

real    0m5.825s
user    0m3.567s
sys     0m5.929s

2018-03-26

Infinibanding, pt 4. Almost there

Got my PCIe-M.2 adapters, plugged 'em in, one of them runs at PCIe 1.0 lane speeds instead of PCIe 3.0, capping perf to 850 MB/s. And causes a Critical Interrupt #0x18 | Bus Fatal Error that resets the machine. Then the thing overheats, melts its connection solder, shorts, releases the magic smoke and makes the SSD PCB look like wet plastic. Yeah, it's dead. The SSD's dead too.

[Edit: OR.. IS IT? Wiped the melted goop off the SSD and tried it in the other adapter and it seems to be fine. Just has high controller temps, apparently a thing with the 960s. Above 90 C after 80 GB of writes. It did handle 800 GB of /dev/zero written to it fine and read everything back in order as well. Soooo, maybe my two-NVMe pool lives to ride once more? Shock it, burn it, melt it, it just keeps on truckin'. Disk label: "The Terminator"]

I only noticed this because the server rebooted and one of the mirrors in the ZFS pool was offline. Hooray for redundancy?

Anyway, the fast NVMe work pool is online, and it can kinda saturate the connection. It's got two (well, currently one) Samsung 960 EVOs in it, which are fast for reads, if maybe not the best for synced writes.

I also got a 280 gig Optane 900p. It feels like Flash Done Right. Low queue depths, low parallelism, whatever, it just burns through everything. Optane also survives many more writes than flash; it's really quite something. And it hasn't caught on fire yet. I set up two small partitions (10 GB) as ZFS slog devices for the two pools, and now the pools can handle five-second write bursts at 2 GB/s.

Played with BeeGFS over the weekend. Easyish to get going, sort of resilient to nodes going down if you tune it that way, good performance with RDMA (netbench mode dd went at 2-3 GB/s). The main thing lacking seems to be the "snapshots, backups, and rebuild from nodes gone bye" story.

Samba and NFS get to 1.4 to 1.8 GB/s per client, around 2.5 GB/s aggregate, with high CPU usage on the server somehow, even with NFSoRDMA. I'll see if I can hit 2 GB/s on a single client. Not really native-level random access perf though. The NVMe drive can do two 1 GB/s clients. Fast filestore experiment mostly successful, if not completely.

Next up, wait for a bit of a lull in work, and switch my workstation to a machine that's got more than one PCIe slot. And actually hook it up to the fast network to take advantage of all this bandwidth. Then plonk the backup disks into one box, take it home, off-site backups yaaay.

Somehow I've got nine cables, four network cards, three computers, and a 36-port switch. Worry about that in Q3, time to wrap up with this build.

2018-03-20

InfiniBanding, pt. 3, now with ZFS

Latest on the weekend fileserver project: ib_send_bw 3.3 GB/s between two native CentOS boxes. The server has two cards so it should manage 6+ GB/s aggregate bandwidth and hopefully feel like a local NVMe SSD to the clients. (Or more like remote page cache.)

Got a few 10m fiber cables to wire the office. Thinking of J-hooks dropping orange cables from the ceiling and "Hey, how about a Thunderbolt-to-PCIe chassis with an InfiniBand card to get laptops on the IB."

Flashing the firmware to the latest version on the ConnectX-2 cards makes ESXi detect them, which somehow breaks the PCI pass-through. With ESXi drivers, they work as ~20 GbE network cards that can be used by all of the VMs. But trying to use the pass-through from a VM fails with an IRQ error and with luck the entire machine locks up. So, I dropped ESXi from the machines for now.

ZFS

Been playing with ZFS with Sanoid for automated hourly snapshots and Syncoid for backup sync. Tested disks getting removed, pools destroyed, pool export and import, disks overwritten with garbage, resilvering to recover, disk replacement, scrubs, rolling back to snapshot, backup to local replica, backup to remote server, recovery from backup, per-file recovery from .zfs/snapshot, hot spares. Backup syncs seem to even work between pools of different sizes, I guess as long as the data doesn't exceed pool size. Hacked Sanoid to make it take hourly snapshots only if there are changes on the disk (zfs get written -o value -p -H).

Copied over data from the workstations, then corrupted the live mounted pool with dd if=/dev/zero over a couple disks, destroyed the pool, and restored it from the backup server, all without rebooting. The Syncoid restore even restored the snapshots, A+++.

After the successful restore, my Windows laptop bluescreened on me and corrupted the effect file I was working on. Git got me back to a 30-min-old version, which wasn't so great. So, hourly snapshots aren't good enough. Dropbox would've saved me there with its per-file version history.

I'm running three 6TB disks in RAIDZ1. Resilvering 250 gigs takes half an hour. Resilvering a full disk should be somewhere between 12 and 24 hours. During which I pray to the Elder Gods to keep either of the two remaining disks from becoming corrupted by malign forces beyond human comprehension. And if they do, buy new disks and restore from the backup server :(

I made a couple of cron jobs. One does hourly Syncoid syncs from production to backup. The others run scrubs. An over-the-weekend scrub for the archive pool, and a nightly scrub on the fast SSD work pool. That is the fast SSD work pool that doesn't exist yet, since my SSDs are NVMe and my server, well, ain't. And the NVMe-to-PCIe adapters are still on back order.

Plans?

I'll likely go with two pools: a work pool with two SSDs in RAID-1, and an archive pool with three HDDs in RAIDZ1. The work pool would be backed up to the archive pool, and the archive pool would be backed up to an off-site mirror.

The reason for the two-volume system is to get predictable performance out of the SSD volume, without the hassle of SLOG/L2ARC.

So, for "simplicity", keep current projects on the work volume. After a few idle months, automatically evict them to the archive and leave behind a symlink. Or do it manually (read: only do it when we run out of SSD.) Or just buy more SSD as the SSD runs out, use the archive volume only as backup.

I'm not sure if parity RAID is the right solution for the archive. By definition, the archive won't be getting a lot of reads and writes, and the off-site mirroring run is over GbE, so performance is not a huge issue (120 MB/s streaming reads is enough). Capacity-wise, a single HDD is 5x the current project archive. Having 10 TB of usable space would go a long way. Doing parity array rebuilds on 6 TB drives, ugh. Maybe a three-disk mirror instead.

And write some software to max out the IOPS and bandwidth on the RAM, the disks and the network.
