Oracle VM Server for SPARC is a high performance virtualization technology for SPARC
servers. It provides native CPU performance without the virtualization overhead typical
of hypervisors. The way memory and CPU resources are assigned to domains avoids
problems often seen in other virtual machine environments, and there are intentionally
few "tuning knobs" to adjust.
However, there are best practices that can enhance or ensure performance. This blog
post lists and briefly explains performance tips and best practices that should be used
in most environments. Detailed instructions are in the Oracle VM Server for SPARC
Administration Guide. Other important information is in the Release Notes. (The
Oracle VM Server for SPARC documentation home page is here.)
Even though there are deliberately few tuning knobs, there is advice that can be valuable for providing performance and availability:
- Keep firmware, Logical Domains Manager, and Solaris up to date - Performance
enhancements are continually added to Oracle VM Server for SPARC, so staying
current is important. For example, Oracle VM Server for SPARC 3.1 and 3.1.1 both
added important performance enhancements.
That also means keeping firmware current. Firmware is easy to "install once and forget",
but it contains much of the logical domains infrastructure, so it should be kept current too.
The Release Notes list
minimum and recommended firmware and software levels needed for each platform.
Some enhancements improve performance automatically just by installing the new
versions. Others require administrators to configure and enable new features. The
following items will mention them as needed. A quick way to check the versions in use
is shown below.
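As a quick check of what you are running, the ldm command reports version information
for the Logical Domains Manager and the underlying hypervisor and firmware (a minimal
sketch; the exact output varies by release and platform):

    # Show Logical Domains Manager, hypervisor, and system firmware versions
    ldm -V

Compare the reported versions against the minimums listed in the Release Notes for
your platform.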
- Allocate sufficient CPU and memory resources to each domain, especially
control, I/O, and service domains - This cannot be
overemphasized. If a service domain is short on CPU, then all of its clients are
delayed. Don't starve the service domains!
For the control domain and other service domains, use a minimum of 1 core (8 vCPUs) and 4GB to 8GB of memory.
Two cores and 8GB of RAM are a good starting point if there is substantial I/O load, but be prepared to allocate
more resources as needed.
Actual requirements must be based on system load:
small CPU and memory allocations were appropriate on the older, smaller LDoms-capable systems,
but larger values are better choices for the demanding, higher-scale systems and applications now used with domains.
Today's faster CPUs and I/O devices are capable of generating much higher I/O rates than older systems,
and service domains must be suitably provisioned to support the load.
Control domain resources suitable for a T5220 with 1GbE network cards will not be enough for a T5-8 or an M6-32!
A 10GbE network device driven at line speed can consume an entire CPU core, so add another core to drive that.
Within the domain you can use vmstat, mpstat, and prstat to see if there is pent-up
demand for CPU. Alternatively, issue ldm list or ldm list -l from the control domain.
Good news: you can dynamically add and remove CPUs to meet changing load
conditions, even for the control domain. You can do this manually or automatically
with the built-in policy-based resource manager. That's a Best Practice of its own,
especially if you have guest domains with peak and idle periods.
The same applies to memory. Again, the good news is that standard Solaris tools
can be used to see if a domain is low on memory, and memory can also be added to or
removed from a domain. Applications need the same amount of RAM to run
efficiently in a domain as they do on bare metal, so no guesswork or fudge factor
is required. Logical domains do not oversubscribe memory, which avoids problems
like unpredictable thrashing.
In general, add another core if ldm list shows that the control domain is busy.
Add more RAM if you are hosting lots of virtual devices or running agents,
management software, or applications in the control domain and vmstat -p
shows that you are short on memory. Both can be done
dynamically without an outage, as sketched below.
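As an illustration, here is a minimal sketch of checking for CPU and memory pressure
and resizing a domain on the fly; the domain name ldg1 and the sizes are hypothetical:

    # From the control domain: check utilization of all domains
    ldm list

    # Inside the busy domain: look for run-queue backlog and page scanning
    mpstat 5
    vmstat -p 5

    # Dynamically grow a constrained domain - no outage required
    ldm add-core 1 ldg1
    ldm add-memory 4G ldg1

    # Or let the policy-based resource manager adjust vCPUs automatically
    ldm add-policy vcpu-min=8 vcpu-max=24 util-lower=25 util-upper=70 \
        enable=yes name=ldg1-drm ldg1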
- Allocate domains on core boundaries - SPARC servers supporting logical
domains have multiple CPU cores with 8 CPU threads each.
(The exception is that Fujitsu M10 SPARC servers have 2 CPU threads per core.
The considerations are similar, just substitute "2" for "8" as needed.)
Avoid "split core"
situations in which CPU cores are shared by more than one domain (different domains
with CPU threads on the same core). This can reduce performance by causing "false
cache sharing" in which domains compete for a core's Level 1 cache. The impact on
performance is highly variable, depending on the domains' behavior.
Split-core situations are easily avoided by always assigning virtual CPUs in
multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain).
It is rarely good practice to give tiny allocations of 1 or 2
virtual CPUs, and definitely not for production workloads. If fine-grained CPU
allocation is needed for multiple applications, deploy them in zones within a
logical domain for sub-core resource control.
The best method is to use the whole-core constraint to assign CPU resources
in increments of entire cores (ldm set-core 1 mydomain or ldm add-core 3 mydomain;
see the sketch after this item). The whole-core constraint requires that a domain
be given its own cores, or the bind operation will fail.
This prevents unnoticed sub-optimal configurations, and also enables the
critical thread optimization discussed below under Single thread CPU performance.
In most cases the logical domain manager avoids split-core situations even if
you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to
allocate different cores to different domains even when partial core allocations
are used. It is not always possible, though, so the best practice is to allocate
entire cores.
For a slightly lengthier writeup, see Best
Practices - Core allocation.
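To make the whole-core practice concrete, here is a minimal sketch using a
hypothetical domain ldg1:

    # Constrain the domain to whole cores and give it two of them
    ldm set-core 2 ldg1

    # Verify which physical cores the domain now owns
    ldm list -o core ldg1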
- Use Solaris 11 in the control and service domains - Solaris 11 contains
functional and performance improvements over Solaris 10 (some are mentioned
below), and it is where future enhancements will be made. It is also required for
using Oracle VM Manager with SPARC. Guest domains can be a mixture of Solaris 10
and Solaris 11, so there is no problem doing "mix and match" regardless of which
version of Solaris is used in the control domain. It is a best practice to deploy
Solaris 11 in the control domain even if you haven't upgraded the domains running
applications.
- NUMA latency - Servers with more than one CPU socket, such as a T4-4, have
non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory
access from CPUs on the same socket has lower latency than "remote". This can have
an effect on applications, especially those with large memory footprints that do
not fit in cache, or are otherwise sensitive to memory latency.
Starting with release 3.0, the logical domains manager attempts to bind domains
to CPU cores and RAM locations on the same CPU socket, making all memory
references local. If this is not possible because of the domain's size or prior
core assignments, the domain manager tries to distribute CPU core and RAM equally
across sockets to prevent an unbalanced
configuration. This optimization is automatically done at domain bind time, so
subsequent reallocation of CPUs and memory may not be optimal. Keep in mind
that this does not apply to single-board servers, like a T4-1. In many cases, the best
practice is to do nothing special.
To further reduce the likelihood of NUMA latency, size domains so they don't
unnecessarily span multiple sockets. This is unavoidable for very large domains
that need more CPU cores or RAM than are available on a single socket, of course.
If you must control this for the most stringent performance requirements, you
can use "named resources" to allocate specific CPU and memory resources to the
domain, using commands like ldm add-core cid=3 ldm1 and
ldm add-mem mblock=PA-start:size ldm1 (see the sketch below).
This technique is successfully used in
the SPARC SuperCluster engineered system, which is rigorously tested
on a fixed number of configurations. It should be avoided in general-purpose
environments unless you are certain of your requirements and configuration, because
it requires model-specific knowledge of CPU and memory topology and increases
administrative overhead.
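For illustration only, here is a minimal sketch of named-resource binding; the core
IDs and physical address are hypothetical and entirely model-specific, so verify your
machine's topology before attempting anything like this:

    # Bind two specific physical cores (by core ID) to the domain
    ldm add-core cid=0,1 ldg1

    # Bind a specific physical memory block (start address and size are examples)
    ldm add-memory mblock=0x200000000:16G ldg1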
- Single thread CPU performance - Starting with the T4 processor, SPARC
servers can use a critical threading mode that delivers the highest single thread performance.
This mode uses out-of-order (OOO) execution and dedicates all of a core's pipeline and cache resources to a software thread.
Depending on the application, this can be several times faster than in the normal "throughput mode".
Solaris will generally detect threads that will benefit from this mode and "do the right thing"
with little or no administrative effort, whether in a domain or not. To explicitly set this for an
application, set its scheduling class to FX with a priority of 60 or more (a sketch follows this item).
Several Oracle applications, like Oracle Database, automatically leverage this capability to get performance
benefits not available on other platforms, as described in the section "Optimization #2: Critical Threads" in How Oracle Solaris Makes Oracle Database Fast. That's a striking example of the synergy of the combined software and hardware stack.
An excellent writeup can be found in Critical Threads Optimization
in the Observatory blog.
This doesn't require setup at the logical domain level other than to use whole-core allocation, and to
provide enough CPU cores so Solaris can dedicate a core to its critical applications.
Consider that a domain with one full core or less cannot dedicate a core to 1 CPU thread, as it has other threads to dispatch.
The chances of having enough cores to provide dedicated resources to critical threads get better as more cores are added to the
domain, and this works best in domains with 4 or more cores. Other than that, there is little you need to do to enable this
powerful capability of SPARC systems (tip of the hat to Bob Netherton for enlightening me on this area).
Mentioned for completeness' sake: there is also a deprecated
command to control this at the domain level by
using ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary
and should not be done.
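If you do want to mark an application as critical explicitly, the FX-class setting
mentioned above can be applied with priocntl; the process ID here is hypothetical:

    # Move process 1234 into the fixed-priority (FX) class at priority 60
    priocntl -s -c FX -m 60 -p 60 -i pid 1234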
- Live Migration - Live migration is CPU-intensive in the control domain of
the source (sending) host. Assign at least 1 core (8 vCPUs) to the control
domain in all cases, but an additional core will speed migration
and reduce suspend time. The core can be added just before starting migration and
removed afterwards (see the sketch below). If the machine is older than a T4, add
crypto accelerators to the control domains. No such step is needed on later machines.
Live migration also adds CPU load in the domain being migrated, so it's best to
perform migrations during low-activity periods. Guests that heavily modify their
memory take more time to migrate since memory contents have to be retransmitted,
possibly several times. The overhead of tracking changed pages also increases guest CPU
utilization.
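Here is a minimal sketch of the add-migrate-remove pattern, with hypothetical domain
and host names:

    # Temporarily add a core to the source control domain to speed migration
    ldm add-core 1 primary

    # Migrate the guest to the target host
    ldm migrate-domain ldg1 root@target-host

    # Remove the extra core once migration completes
    ldm remove-core 1 primary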
- Network I/O - Configure aggregates, use multiple network links,
use jumbo frames, and adjust TCP windows and other system settings the same way and for the
same reasons as you would in a non-virtual environment.
Use RxDring support to substantially reduce network latency and CPU utilization.
To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for
each of the involved domains. The domains must run Solaris 11 or Solaris 10 update 10
or later, and the involved domains (including the control domain) require a
reboot for the change to take effect. This also requires 4MB of RAM per guest.
If you are using a Solaris 10 control or service domain for virtual network I/O, it is important to plumb
the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native
NIC or aggr interface is plumbed, there can be a performance impact, since each packet may be duplicated to provide
a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw.
The vsw doesn't need to be plumbed either, unless the guest domains need to communicate with the service domain.
This isn't an issue for Solaris 11 - another reason to use it in the service domain.
(Thanks to Raghuram for the great tip.)
As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization
(SR-IOV) to provide native-level network I/O performance (see the sketch below). With physical I/O,
there is no virtualization overhead at all, which improves bandwidth and latency and eliminates load
in the service domain. These methods currently have two main limitations:
they cannot be used in conjunction with live migration, and they introduce a dependency on the domain
owning the bus containing the physical device; but they provide superior performance.
SR-IOV is described in an excellent blog article by Raghuram Kothakota.
For the ultimate performance for large application or database domains, you can use a PCIe root complex domain for
completely native performance for network and any other devices on the bus.
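To make both options concrete, here is a minimal sketch; the domain name and the
SR-IOV device paths are hypothetical and platform-specific:

    # Enable extended mapin space (RxDring support); requires a domain reboot
    ldm set-domain extended-mapin-space=on ldg1

    # SR-IOV alternative: create a virtual function on a physical function,
    # then assign it to the guest (older releases may require a delayed
    # reconfiguration or reboot of the root domain)
    ldm create-vf /SYS/MB/NET0/IOVNET.PF0
    ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF0 ldg1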
- Disk I/O - For best performance, use a whole-disk backend (a LUN or full
disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing
(just as you would do in a non-virtual environment).
Flat files in a file system are convenient and easy to set up as backends, but offer lower performance.
Starting with Oracle VM Server for SPARC 3.1.1, you can also use SR-IOV for Fibre Channel devices,
with the same benefits as with networking: native I/O performance.
For completely native performance for all devices, use a PCIe root complex domain and exclusively use physical I/O.
ZFS can also be used for disk backends.
This provides flexibility and useful features (clones, snapshots, compression) but can
impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration,
because a zpool can be mounted to only one host at a time. When using ZFS
backends for virtual disk, use a zvol rather than a flat file - it performs much
better (see the sketch below). Also, make sure that the ZFS recordsize for the ZFS dataset
matches the application (again, just as in a non-virtual environment). This avoids
read-modify-write cycles that inflate I/O counts and overhead. The default of
128K is not optimal for small random I/O.
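As a minimal sketch of a zvol backend - pool, volume, and domain names are
hypothetical, and the virtual disk server primary-vds0 is assumed to already exist.
Note that a zvol's block size is fixed at creation time with volblocksize, the zvol
analogue of recordsize:

    # Create a 20GB zvol with an 8K block size for a small-random-I/O workload
    zfs create -p -V 20g -o volblocksize=8k rpool/vdisks/ldg1-disk0

    # Export it through the virtual disk server and attach it to the guest
    ldm add-vdsdev /dev/zvol/dsk/rpool/vdisks/ldg1-disk0 ldg1-disk0@primary-vds0
    ldm add-vdisk vdisk0 ldg1-disk0@primary-vds0 ldg1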
- Networked disk on NFS and iSCSI -
NFS and iSCSI can also perform quite well if an appropriately fast network is used.
Apply the same network tuning you would use for non-virtual applications.
For NFS, specify mount options to disable atime, use hard mounts, and set large
read and write sizes (see the sketch below).
If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage
Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla"
ZFS Intent Logs (ZIL) to speed up synchronous writes.
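For example, a hypothetical NFS mount for a share holding virtual disk backends might
look like this (server, path, and sizes are illustrative; check mount_nfs(1M) for the
options your Solaris release supports, including the one that disables atime updates):

    # Hard mount with large read and write transfer sizes
    mount -F nfs -o hard,rsize=1048576,wsize=1048576 nfs-server:/export/ldoms /ldoms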
By design, logical domains don't have a lot of "tuning knobs", and many tuning
practices you would do for Solaris in a non-domained environment apply equally when
domains are used. However, there are configuration best practices and tuning steps you
can use to improve performance. This blog note itemizes some of the most effective (and
least exotic) performance best practices.