
Hardware Threading Makes a Difference in Clouds

Today’s server designs are based on multi-core, multithreaded processors, and these characteristics have a fundamental impact on the server performance that users observe when running enterprise workloads. Performance is not simply a function of the number of threads and the performance per thread, but of balanced design and intelligent architectural choices.

A Bit of History

For decades, microprocessor designers improved performance by increasing clock frequency and adding capabilities to the pipeline. This did indeed improve single-thread performance, and processor speeds advanced rapidly. However, memory technologies could not keep up: memory latency became a major bottleneck to improving application performance. Very often, compute threads were starved, stalled waiting for data from memory. Some processor designs (such as Intel Xeon and IBM Power) addressed the issue by continuously increasing cache sizes and adding various performance schemes to further improve single-thread performance.

Microprocessor cache architectures were developed to mask memory latency, and many smart strategies have been created on multiple level cache structures to allow processor speeds to continue to advance faster than memory technology. But it’s still not commercially or technically viable to increase the on-chip cache size enough to completely avoid cache misses. Therefore, hardware threads will be idle periodically, waiting for data coming from memory, wasting cycles and resources. The difference in processor and memory speeds causes some processor designs to spend as much as three quarters of their cycles waiting for the memory subsystem to return data, instead of doing useful work.

Therefore, efforts to improve single-thread performance help only with the “compute” portion of the hardware thread’s cycles. Following the example above, if single-thread compute performance is improved by 20%, overall single-thread execution is only about 5% faster (20% of one quarter). In other words, if the memory latencies remain the same, the thread is barely faster than before: the processor just rushes to wait longer for data.
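The arithmetic above can be checked with a quick back-of-the-envelope model. This is only a sketch using the example’s assumed 25%/75% compute/stall split, not measured data:

```python
# Back-of-the-envelope model of the single-thread speedup described above.
# Assumed split (from the example): 25% of cycles doing useful compute,
# 75% stalled waiting on memory.
compute_frac = 0.25
stall_frac = 0.75

# A 20% improvement in compute performance shrinks only the compute portion.
improved_compute = compute_frac * (1 - 0.20)   # 0.25 -> 0.20

new_total = stall_frac + improved_compute      # 0.75 + 0.20 = 0.95
overall_gain = 1 - new_total                   # ~5% shorter execution time

print(f"overall single-thread gain: {overall_gain:.0%}")
```

This is the familiar Amdahl’s-law effect: optimizing the 25% compute portion, however aggressively, can never buy back the 75% spent stalled on memory.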

The Multi-core, Multithreaded SPARC T5 Processor

Meanwhile, new SPARC processor designs were developed that directly addressed the root cause of the performance problem, and acknowledged the trend towards parallelism in workloads and software design. One key goal of these modern SPARC processors was to deliver the highest possible throughput and efficiency in systems running multiple threaded workloads, instead of focusing just on single thread performance. The precious real estate on the chip was used to increase processing capacity (cores and threads), instead of increasing cache sizes and creating overly complex pipelines with limited applicability.

The SPARC multithreading architecture starts multiple jobs simultaneously, enabling efficient use of the computing resources in each processor core. The current SPARC S3 core (present in the SPARC T4, T5, M5 and M6 processors) can run up to 8 simultaneous threads. In addition, each core contains two integer execution pipelines, so a single SPARC S3 core can execute two threads in parallel. This means the pipelines stay fully utilized even if each thread spends only 25% of its time processing and 75% waiting for data from memory.
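As a rough check on that utilization claim (a sketch using the figures quoted above, not vendor data), the expected number of runnable threads per core matches the number of integer pipelines exactly:

```python
# Expected number of runnable threads in one S3 core, assuming each of the
# 8 hardware threads is computing 25% of the time and stalled 75%.
threads_per_core = 8
compute_fraction = 0.25

avg_runnable = threads_per_core * compute_fraction   # 8 * 0.25 = 2.0
integer_pipelines = 2

# On average exactly as many threads are runnable as there are pipelines,
# so both integer pipelines stay busy despite each thread's memory stalls.
print(avg_runnable == integer_pipelines)
```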

Note that this efficiency is achieved only because these hardware threads reside in the same core. Contrast this with, for instance, Intel Xeon processors, which support only up to 2 threads per core. There, threads will frequently be waiting for data to arrive from memory. If the operating system tries to oversubscribe threads to those cores, parking some outside the core and swapping them in when time becomes available, there is a price to pay: flushing one thread out and bringing another in wastes something like 50 cycles.

Dynamic Threading and Critical Threads Optimization

Although most applications today have a high degree of multiprocessing, or are made up of several processes or copies running together, there are certain code portions that are dependent on a single thread. The breakthrough architecture of the SPARC S3 core includes significant capabilities to deliver the highest possible single thread performance, by utilizing the processor structures intelligently. The current SPARC processor provides a robust out-of-order instruction execution, advanced branch prediction, and most importantly, Dynamic Threading.

The Dynamic Threading functionality was introduced with the S3 core in the SPARC T4 processor, and it enables real-time optimization of per-thread performance. Through Critical Threads Optimization, software can activate up to eight hardware threads (also called “strands”) on each core, and the processor hardware dynamically and seamlessly allocates core resources among the active threads. Typically, users do not have to worry about controlling this, and can let the Oracle Solaris operating system manage thread optimization based on the workload.

Since the processor core dynamically allocates resources among the active threads, there is no explicit single-thread mode or multithread mode. If only one thread is active, the core devotes all its resources to that sole running thread. Thus, that thread will run as quickly as possible, i.e., faster than if it were running together with others in the same core. Similarly, if software declares six out of eight threads as noncritical, the remaining two active threads share the core’s execution resources.
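The behavior can be pictured with a toy model (purely illustrative, not how the hardware arbitration actually works): the core’s execution resources divide equally among whichever strands are currently active.

```python
# Toy model (illustrative only) of dynamic threading: core execution
# resources are shared among whichever strands are active, with no
# explicit single-thread or multithread mode to switch between.
def resource_share(active_strands: int, total_strands: int = 8) -> float:
    """Fraction of core resources each active strand receives."""
    assert 1 <= active_strands <= total_strands
    return 1.0 / active_strands

print(resource_share(1))  # a sole active thread gets the whole core
print(resource_share(2))  # two active threads split the core's resources
```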

For example, feature enhancements in Oracle Database release 11.2.0.4 on SPARC T4, T5, M5 and M6 servers automatically activate the Critical Threads scheduler class for optimal database performance, enhancing execution of the log writer, LMS, and VKTM background processes.

Existing applications can take advantage of the SPARC dynamic threading performance benefits without having to be rewritten or recompiled. Read more about Dynamic Threading in Oracle’s SPARC T5 Server Architecture White Paper.

This architectural design enabled the latest-generation multi-core, multithreaded SPARC T5 processor to be the world’s fastest microprocessor at the time it was introduced. Even after more than a year, it is still the fastest for many enterprise applications. Oracle’s SPARC servers running Oracle Solaris have led enterprise application performance with over 20 record performance results - see here for more information.

A High Density of Hardware Threads is Perfect for Clouds

The large number of hardware threads in SPARC systems is an important advantage when deploying a virtualized cloud environment. Given the desire to share resources among many users, and the power of current processors, Virtual Machines (VMs) frequently consume only a fraction of a processor core. So you end up with many VMs to support on a single processor.

When you pack sixteen S3 cores into a SPARC T5 processor, you get up to 128 threads per processor, which provides a fine granularity to be enjoyed by many VMs while maintaining efficiency.

For example, if we dedicate two hardware threads to each VM, a 2-socket SPARC T5-2 server can support over 120 VMs. To support this many VMs, other 2-socket servers would have to give each VM less than a single thread, or resort to oversubscription of processor resources. Both options bring performance losses: virtualization overhead, context switching as threads try to execute multiple jobs simultaneously, and unpredictable performance when the workload exceeds available capacity.
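The VM density arithmetic works out as follows (a sketch using the core and thread counts quoted above; the article’s “over 120” presumably leaves a few threads in reserve):

```python
# VM density on a 2-socket SPARC T5-2, per the figures in the text.
sockets = 2
cores_per_socket = 16
threads_per_core = 8

total_threads = sockets * cores_per_socket * threads_per_core  # 256
threads_per_vm = 2

max_vms = total_threads // threads_per_vm  # 128, i.e. "over 120" VMs
print(max_vms)
```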

