You've probably all seen the new TPC-H benchmark result for the SPARC T5-4 submitted to TPC on June 7. Our benchmark guys over at "BestPerf" have already pointed out the major takeaways from the result. However, I believe there's more to make note of.
Scalability
TPC doesn't promote the comparison of TPC-H results with different storage sizes. So let's just look at the 3000GB results:
- SPARC T4-4 with 4 CPUs (that's 32 cores at 3.0 GHz) delivers 205,792 QphH.
- SPARC T5-4 with 4 CPUs (that's 64 cores at 3.6 GHz) delivers 409,721 QphH.
That's just 1,863 QphH, or 0.45%, short of perfect scaling if you expect a doubling of cores to deliver twice the result. Of course, one could expect a factor of 2.4 once the increased clock rate is taken into account (2x the cores times 3.6/3.0 GHz). That would set the bar at 493,901 QphH, and the SPARC T5-4 would be at 83% of that. So what didn't scale? Most likely the storage! Let's look at that a little closer:
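For anyone who wants to check the arithmetic, here is a minimal sketch of the two scaling targets. It only uses the QphH values, core counts, and clock rates quoted above; the function name `scaling_report` and the dictionary keys are made up for illustration, and "linear scaling" is of course a naive model, not a prediction.

```python
# Published QphH@3000GB results and CPU details as quoted in the text.
T4_QPHH, T4_CORES, T4_GHZ = 205_792, 32, 3.0
T5_QPHH, T5_CORES, T5_GHZ = 409_721, 64, 3.6


def scaling_report(base_qphh, base_cores, base_ghz, new_qphh, new_cores, new_ghz):
    """Compare a measured result against two naive linear-scaling targets.

    Hypothetical helper: one target scales by core count only, the other
    by core count and clock rate together.
    """
    core_factor = new_cores / base_cores    # 2.0 for T4-4 -> T5-4
    clock_factor = new_ghz / base_ghz       # 1.2 for 3.0 GHz -> 3.6 GHz
    by_cores = base_qphh * core_factor
    by_cores_and_clock = by_cores * clock_factor
    return {
        "target_by_cores": by_cores,
        "shortfall_qphh": by_cores - new_qphh,
        "shortfall_pct": 100 * (by_cores - new_qphh) / by_cores,
        "target_by_cores_and_clock": by_cores_and_clock,
        "achieved_pct": 100 * new_qphh / by_cores_and_clock,
    }


report = scaling_report(T4_QPHH, T4_CORES, T4_GHZ, T5_QPHH, T5_CORES, T5_GHZ)
for key, value in report.items():
    print(f"{key:>28}: {value:,.2f}")
```

The core-only target lands at 411,584 QphH, and the combined target at roughly 493,901 QphH, which is where the 0.45% and 83% figures come from.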
Storage
The report on BestPerf and the full disclosure report (FDR) provide some interesting insight into the storage configuration. For the SPARC T4-4 run, they used 12 2540-M2 arrays, each delivering around 1.5 GB/s for a total of 18 GB/s. These were evidently connected directly to the 24 8 Gbit FC ports of the SPARC T4-4, using two cables per storage array. Given the 8 Gbit ports of the 2540-M2, this setup would be good for a theoretical maximum of 2 GB/s per array. With 1.5 GB/s actual throughput, they were pretty much maxed out.
In the SPARC T5-4 run, they report twice the number of disks (via expansion trays for the 2540-M2 arrays) for a total of 33 GB/s peak throughput, which isn't quite 2x the number achieved on the SPARC T4-4. To actually reach 2x the throughput (36 GB/s), each array would have had to deliver 3 GB/s over its four 8 Gbit ports. The FDR only lists 12 dual-port FC HBAs, which explains the use of Brocade FC switches: all four 8 Gbit ports of each storage array are connected to the FC switch, which bundles them into 24 16 Gbit HBA ports. Again, the theoretical maximum of four 8 Gbit ports on each storage array would be 4 GB/s, but considering all the protocol and "reality overhead", the 2.75 GB/s they actually delivered isn't bad at all. Given this, reaching twice the overall benchmark performance is good, and storage is a very likely explanation for not going all the way to 2.4x.
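The per-array numbers can be reproduced with a few lines. This is only a sketch of the back-of-the-envelope math in the two paragraphs above: it uses the same simplification as the text (8 Gbit per port counted as 1 GB/s, with no allowance for encoding or protocol overhead), and the helper name `per_array_throughput` is invented for this example.

```python
ARRAYS = 12      # 2540-M2 arrays in both runs, per the text
PORT_GBIT = 8    # nominal FC port speed on each array


def per_array_throughput(total_gb_s, ports_per_array):
    """Return (actual GB/s per array, nominal ceiling in GB/s, utilization).

    The ceiling is the naive figure used in the text: ports x 8 Gbit / 8,
    ignoring encoding and protocol overhead.
    """
    actual = total_gb_s / ARRAYS
    ceiling = ports_per_array * PORT_GBIT / 8
    return actual, ceiling, actual / ceiling


t4 = per_array_throughput(18, ports_per_array=2)   # SPARC T4-4 run
t5 = per_array_throughput(33, ports_per_array=4)   # SPARC T5-4 run
needed_for_2x = 2 * 18 / ARRAYS                    # GB/s per array for 36 GB/s total
storage_scaling = 33 / 18                          # storage throughput, T5-4 vs. T4-4

for name, (actual, ceiling, util) in (("T4-4", t4), ("T5-4", t5)):
    print(f"{name}: {actual:.2f} of {ceiling:.1f} GB/s per array ({util:.0%})")
print(f"needed per array for 2x: {needed_for_2x:.1f} GB/s")
print(f"storage scaling factor:  {storage_scaling:.2f}x")
```

Under these assumptions the storage throughput grew by about 1.83x between the two runs, while the benchmark result grew by very nearly 2x.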
By the way - neither the SPARC T4-4 nor the SPARC T5-4 used any flash in these benchmarks.
Competition
Ever since the T4s came on the market, our competitors have done their best to assure everyone that the SPARC core still lacks performance, and that large caches and high clock rates are the only key to real server performance. Now, when I look at public TPC-H results, I see this:
TPC-H @3000GB, Non-Clustered Systems

System | QphH |
---|---|
SPARC T5-4 3.6 GHz SPARC T5 4/64 – 2048 GB | 409,721.8 |
SPARC T4-4 3.0 GHz SPARC T4 4/32 – 1024 GB | 205,792.0 |
IBM Power 780 4.1 GHz POWER7 8/32 – 1024 GB | 192,001.1 |
HP ProLiant DL980 G7 2.27 GHz Intel Xeon X7560 8/64 – 512 GB | 162,601.7 |
So, in short, with the 32-core SPARC T4-4 (which is 3 GHz and 4 MB L3 cache), we deliver more QphH@3000GB than IBM with their 32-core Power7 (which is 4.1 GHz and 32 MB L3 cache) and also more than HP with the 64-core Intel Xeon system (2.27 GHz and 24 MB L3 cache). So where exactly is SPARC lacking?
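Since the argument is about cores, it helps to put the published results on a per-core basis. The snippet below is just a sketch that divides the QphH values from the table by the core counts listed there; `qphh_per_core` is a made-up helper, and QphH per core is not an official TPC metric.

```python
# (system, QphH@3000GB, cores) as listed in the table of published results.
results = [
    ("SPARC T5-4", 409_721.8, 64),
    ("SPARC T4-4", 205_792.0, 32),
    ("IBM Power 780", 192_001.1, 32),
    ("HP ProLiant DL980 G7", 162_601.7, 64),
]


def qphh_per_core(qphh, cores):
    """Plain division; a rough comparison aid, not a TPC-defined metric."""
    return qphh / cores


for name, qphh, cores in results:
    print(f"{name:<22} {qphh_per_core(qphh, cores):>8,.1f} QphH per core")
```

Both SPARC systems come out at roughly 6,400 QphH per core, the Power 780 at about 6,000, and the Xeon X7560 system at about 2,540.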
Right, one could argue that both competing results aren't exactly new. So let's do some speculation:
IBM's current Performance Report lists the above-mentioned p780 with an rPerf value of 425.5. A successor to that p780 with P7+ CPUs would be the p780+ with 64 cores, which is available at 3.72 GHz. It is listed with an rPerf value of 690.1, which is 1.62x higher. So based on IBM's own performance estimates, and assuming that storage will not be the limiting factor (IBM tested with 177 SSDs in the submitted result; they're welcome to increase that to 400), we might expect a theoretical result of 311,398 QphH@3000GB. That is still far from where we are with the SPARC T5-4 today, and even further behind in the "per core" metric IBM values so highly.
For x86, the story isn't any better. Unfortunately, Intel doesn't have such handy rPerf charts, so I'll have to fall back to SPECint_rate2006 for this one. (Note that I am not a big fan of using one benchmark to estimate another. SPEC CPU in particular is not well suited to estimating database performance, as there is almost no I/O involved.) The above HP system is listed with 1580 SPECint_rate2006. The best result as of 2013-06-14 for an 8-CPU system with the new Intel Xeon E7-4870 is 2180 SPECint_rate2006. That's an improvement of 1.38x. (If we just take the increase in clock rate and core count, it would give us 1.32x.) I'll stop here and let you do the math yourself, but for the quick reader, I'll summarize this in a little overview:
TPC-H @3000GB Performance Speculations

System | QphH* | Generational Improvement |
---|---|---|
SPARC T4-4 32 cores SPARC T4 | 205,792 | 2x |
SPARC T5-4 64 cores SPARC T5 | 409,721 | |
IBM Power 780 32 cores Power7 | 192,001 | 1.62x |
IBM Power 780+ 64 cores Power7+ | 311,398* | |
HP ProLiant DL980 G7 64 cores Intel Xeon X7560 | 162,601 | 1.38x |
HP ProLiant DL980 G7 80 cores Intel Xeon E7-4870 | 224,348* | |
* Not real results: these numbers are speculative, based on rPerf (Power7+) or SPECint_rate2006 (HP).
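The starred numbers are nothing more than the published QphH scaled by the ratio of two other benchmark scores. Here is a rough sketch of that calculation. It assumes, as the speculation above does, that TPC-H would scale linearly with rPerf or SPECint_rate2006, which is exactly the kind of cross-benchmark estimate to be wary of. The function name `project` is invented, and the base QphH values are the truncated ones from the overview table.

```python
def project(base_qphh, old_score, new_score):
    """Scale a published QphH by the ratio of two other benchmark scores.

    Purely speculative: assumes TPC-H scales linearly with the other
    benchmark, which nobody has demonstrated.
    """
    return base_qphh * new_score / old_score


# IBM: rPerf 425.5 (p780, Power7) -> 690.1 (64-core Power7+ successor)
ibm_projected = project(192_001, 425.5, 690.1)
# HP: SPECint_rate2006 1580 (Xeon X7560) -> 2180 (Xeon E7-4870)
hp_projected = project(162_601, 1580, 2180)
# Naive x86 estimate from core count and clock rate alone
x86_naive_factor = (80 / 64) * (2.4 / 2.27)

T5_QPHH = 409_721
print(f"IBM projected: {ibm_projected:,.0f} QphH "
      f"({ibm_projected / 64:,.0f} per core, {ibm_projected / T5_QPHH:.0%} of the T5-4)")
print(f"HP projected:  {hp_projected:,.0f} QphH "
      f"({hp_projected / 80:,.0f} per core, {hp_projected / T5_QPHH:.0%} of the T5-4)")
print(f"x86 cores x clock factor: {x86_naive_factor:.2f}x")
```

Even under these generous assumptions, both projections stay well below the published SPARC T5-4 result.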
Of course, IBM and others are welcome to prove me wrong - but as of today, I'm waiting for recent publications in this data range.
So what have we learned?
- There's some evidence that storage was the limiting factor that prevented the SPARC T5-4 from scaling beyond 2x.
- The myth that SPARC cores don't perform is just that: a myth. Next time you meet your IBM sales rep, ask when they'll publish TPC-H for Power7+.
- Cache memory isn't the magic performance switch some people think it is.
- Scaling a CPU architecture (and the OS on top of it) beyond a certain limit is hard. It seems to be a little harder in the x86 world.
What did I miss? Well, price/performance is something I'll let you discuss with your sales reps ;-)
And finally, before people ask - no, I haven't moved to marketing. But sometimes I just can't resist...
Disclosure Statements
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
TPC-H, QphH, $/QphH are trademarks of Transaction Processing Performance Council (TPC). For more information, see www.tpc.org, results as of 6/7/13. Prices are in USD. SPARC T5-4 409,721.8 QphH@3000GB, $3.94/QphH@3000GB, available 9/24/13, 4 processors, 64 cores, 512 threads; SPARC T4-4 205,792.0 QphH@3000GB, $4.10/QphH@3000GB, available 5/31/12, 4 processors, 32 cores, 256 threads; IBM Power 780 192,001.1 QphH@3000GB, $6.37/QphH@3000GB, available 11/30/11, 8 processors, 32 cores, 128 threads; HP ProLiant DL980 G7 162,601.7 QphH@3000GB, $2.68/QphH@3000GB, available 10/13/10, 8 processors, 64 cores, 128 threads.
SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of June 18, 2013 from www.spec.org. HP ProLiant DL980 G7 (2.27 GHz, Intel Xeon X7560): 1580 SPECint_rate2006; HP ProLiant DL980 G7 (2.4 GHz, Intel Xeon E7-4870): 2180 SPECint_rate2006.