I realize I haven't posted in a while. Not even during the pre-release frenzy surrounding Frost, Krakatoa's BFF. Having spent the last 4 days in Las Vegas showing off Deadline, Krakatoa and Frost to anyone who came to see them, I thought it was time to write something about the future.
I am sure those of you who remember my "How Fast Is Fast" blog about Krakatoa 1.6 wouldn't expect yet another similar jump in performance from the next version. Well, you would be very very wrong, and I have the numbers to prove it. In fact, the upcoming version of Krakatoa might provide a bigger speed-up over 1.6 than 1.6 had over 1.5!
Creating Partitions and loading PRT file sequences from disk are among the most typical workflows in the Krakatoa pipeline, and the time to load the particles has traditionally been about half the time of the rendering process. Several factors affect the speed of loading - the speed of the hard drive or network connection serving the files, the speed of the CPU reading the data and decompressing the stream into memory, and the number of additional operations performed on the particles while loading, like MagmaFlow and Material evaluation, deformations and culling. In the past, the latter operations were gradually updated to support multi-threading, but the initial loading has remained limited to two threads, and typically saturated only one core.
Not anymore. I had the pleasure to benchmark an early alpha build of what might become Krakatoa 2.0 on a variety of machines with a multitude of storage solutions. To test the pure loading speed, I created a Box with dimensions 100x100x100, converted it to a PRT Volume and partitioned it to disk as 100 partitions, each one with 1 million particles. Note that this introduces some overhead to the loading process - loading 10 partitions with 10 million each, or one partition with all 100 million would be somewhat faster, but I wanted to produce a more realistic case which is nearly the worst case of partitioning since Krakatoa currently limits the maximum number of partitions to 100. Also, with 100 files, Krakatoa would be able to create enough threads for any number of cores. I kept the default channels layout - Position, Velocity, Normals, Color, Density and ID - to simulate a typical case even though I did not need some of these like Velocity or ID for the actual rendering. I then created a single PRT Loader from all 100 partitions, created a default Spotlight and rendered.
My first test used the slightly outdated hardware tasked to perform the Thinkbox demos at the NAB show in Las Vegas - it was a dual Intel Core Duo, in other words four physical cores, no Hyperthreading. The interesting thing about this machine though was that it contained one 7200 RPM hard drive, two striped 10000 RPM drives, one SSD drive and a Fusion-io card, all connected to the same hardware. This gave me the ability to find out how the storage medium affects the new software.
For comparison, I used the current 1.6.1 build. Loading the 100 million particles with it took 57.3 seconds and the total rendering time was 2 minutes 46 seconds. It did not matter what drive I loaded the particles from because the speed was fully determined by the performance of the one core reading the ZIP stream from the PRT file.
Loading the particles from the 7200 RPM drive using the new build cut the loading time down to 38.2 seconds and the total rendering time to 2 minutes 29 seconds. This is not a very impressive speed-up, but it reached the physical limitations of the hard drive, while loading the CPUs as much as the I/O bottleneck allowed. Having 4 cores, 4 threads were created to read 4 PRT streams at once, but the drive could not keep up with the demand.
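The "one reader thread per core, each decompressing its own stream" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Krakatoa is implemented: the partitions are faked as in-memory zlib blobs instead of real PRT files, and the names `make_fake_partitions` and `load_all` are made up for the example.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_fake_partitions(count, size):
    """Build `count` zlib-compressed blobs standing in for partition files.

    Hypothetical stand-in: real PRT files have a header and typed channels,
    none of which is modelled here.
    """
    return [zlib.compress(bytes([i % 256]) * size) for i in range(count)]


def load_all(blobs, workers=None):
    """Decompress every blob, keeping at most `workers` streams in flight."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so partition i stays at index i.
        return list(pool.map(zlib.decompress, blobs))


if __name__ == "__main__":
    blobs = make_fake_partitions(count=8, size=1_000_000)
    partitions = load_all(blobs, workers=4)
    print(len(partitions), "partitions,", sum(len(p) for p in partitions), "bytes")
```

With real files, each worker would also have to read its stream from disk, which is exactly where a single 7200 RPM drive stops keeping up with four readers.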
Loading the particles from the two 10000 RPM drives brought the 4 CPUs to about 80% saturation before the bandwidth of the hard disks became the bottleneck again. The time to load the 100 MP went down to an impressive 16.1 seconds, but it was obvious that there was more to be expected from the solid state drives. And indeed, running exactly the same tests from the SSD drive gave me 11.2 seconds for loading and 1 minute 49.9 seconds total render time, while saturating all 4 cores completely! Trying the same with the even faster Fusion-io card produced the same loading and rendering time, clearly proving I had reached the CPU bottleneck.
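To put the loading numbers from the four-core machine side by side, here is a small Python snippet that divides the 1.6.1 loading time by each new loading time quoted above. The variable and function names are mine; the timings are the ones from this post, and the Fusion-io card is left out because it matched the SSD.

```python
# Loading times in seconds for 100 million particles on the 4-core demo machine.
BASELINE_LOAD_S = 57.3  # Krakatoa 1.6.1, same on every drive

new_build_load_s = {
    "7200 RPM drive": 38.2,
    "2x 10000 RPM striped": 16.1,
    "SSD": 11.2,
}


def speedup(baseline, measured):
    """How many times faster `measured` is than `baseline`."""
    return baseline / measured


for storage, seconds in new_build_load_s.items():
    factor = speedup(BASELINE_LOAD_S, seconds)
    print(f"{storage:22s} {seconds:5.1f} s  {factor:.1f}x")
```

That works out to roughly 1.5x from the single 7200 RPM drive, 3.6x from the striped pair and 5.1x from the SSD.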
Thankfully, Fusion-io was well represented at NAB and I got the chance to run the benchmark on an 8-core machine to test the waters. My gut feeling told me I should expect about half the loading time with twice as many cores so I wasn't exactly surprised when the faster system loaded the 100 million particles in 6 seconds and finished rendering in only 52 seconds! (Un)fortunately, all 8 cores were once again at 100%, making the result CPU-bound instead of I/O-bound, leaving me wanting to test on a 16 or 32 core machine to see what a Fusion-io card can really do for Krakatoa. My gut feeling tells me again we could expect loading times of 3 seconds or less for 100 million particles on such a system, but until I actually get to try one, I can live with the results pretty well. Supposedly, a good SSD drive could keep up with an 8-core system to produce the 6 seconds loading time, too, so you don't have to spend the equivalent of a new car to get that performance...
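The gut feeling is easy to write down as arithmetic. The sketch below assumes the load stays fully CPU-bound and scales perfectly with core count, which is an idealization and not something I have measured; `ideal_load_time` is a hypothetical helper, and the starting point is the 6 seconds on 8 cores quoted above.

```python
def ideal_load_time(measured_s, measured_cores, target_cores):
    """Estimate load time assuming it is inversely proportional to core count.

    Assumption: CPU-bound loading with perfect scaling and storage that
    never becomes the bottleneck.
    """
    return measured_s * measured_cores / target_cores


for cores in (16, 32):
    estimate = ideal_load_time(6.0, 8, cores)
    print(f"{cores} cores: about {estimate:.1f} s")
```

Under those assumptions, 16 cores would come in at about 3 seconds and 32 cores at about 1.5 seconds, provided the storage can still feed them.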
There are several other areas that have seen some speed-up in the new version - on my home i7 quad-core machine, the sorting for both lighting and drawing of 100 MP went down from 12 seconds to about 8, and the drawing was reduced from 5 to 4 seconds. That machine is not very good for testing the loading improvements due to a slow hard drive though, so if you could imagine a modern computer with a lot of cores and fast SSD drives, Krakatoa will literally fly on it later this year!
Obviously, this is just the tip of the iceberg when it comes to what will be new in the next version of Krakatoa. Wait for Siggraph and be very, very excited - I know I am...