Oracle don’t include this in the marketing material – SPARC T series servers have dismal single thread performance

I don’t do New Year’s resolutions. Having said that, I am going to blog more this year. In my day job I do a lot of research, and much of the time, months later, the information is forgotten, not recorded, or lost in a black hole. If I am lucky I have written it down or stored a bookmark somewhere. So without further ado, today’s post is about making you aware that you need to be careful when migrating legacy Solaris workloads to the T-series family of servers. This is especially relevant today, maybe more than ever, in the age of virtualisation and clouds. At least once or twice a year I come across a situation that could have been avoided: a poorly written or poorly tuned application placed on an Oracle SPARC T-series server.

A Brief History

In 2002 Sun Microsystems acquired a company called Afara Websystems, which was developing an innovative, highly multi-threaded processor tuned for highly multi-threaded applications. The processor was called Niagara, and Afara had some ex-Sun employees on staff. The idea was to build a platform that would run Java and the web screaming FAST by processing large numbers of execution threads simultaneously. In conventional processors, the biggest source of latency during execution is fetching data from main memory. With a large number of hardware threads, the thread requesting main memory can be idled while the request is in progress, and the core can execute another ready-to-run hardware thread. Building the T-series platform was about re-thinking the way processors had evolved up until that point and stripping the layers back to the basics to reduce latency during instruction execution.

Web traffic generally matches the multi-threaded profile that suits T-series hardware. Java can, or not, depending on which developer you have writing your code. Java application servers can match this profile as well, as long as you don’t have lock contention; if you do, all bets are off and your application slows to a crawl (more on that later). Just remember that, more often than not, a legacy Java app server has not been written with optimised multi-threading in mind. That’s been my experience anyway.
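The latency-hiding idea described above can be put into rough numbers. What follows is a back-of-envelope sketch, not anything taken from Sun's documentation. It assumes every hardware thread alternates between a burst of work and a stall on main memory, and that the core switches to another ready thread at zero cost. The function name and the cycle counts are made up purely for illustration.

```python
def core_utilisation(hw_threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles a core spends on useful work in a toy model.

    Assumptions (all hypothetical): each thread repeats `compute_cycles`
    of work followed by `stall_cycles` waiting on main memory, and
    switching between ready threads costs nothing.
    """
    if hw_threads < 1 or compute_cycles < 1 or stall_cycles < 0:
        raise ValueError("need at least one thread and one compute cycle")
    # Share of its own time that a single thread is actually runnable.
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # A core can never be more than 100% busy.
    return min(1.0, hw_threads * per_thread)


# Made-up workload: 25 cycles of work, then a 75 cycle memory stall.
for n in (1, 2, 4, 8):
    print(f"{n} thread(s) per core: {core_utilisation(n, 25, 75):.0%} busy")
```

In this toy model a lone thread leaves the core idle three quarters of the time, while four threads are enough to cover the stall completely. The flip side is that the lone thread itself never runs any faster, however many hardware threads sit unused beside it.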

Without going into too much detail, the key points of the T1 processor, as it was later called, were:

  • optimised for speed by keeping each core’s instruction pipeline as busy as possible.
  • no out-of-order execution of instructions.
  • a single processor with up to 8 cores.
  • clock speeds up to 1.4GHz.
  • 4 threads per core, giving a maximum concurrency of 32 threads.
  • a Hyper-Privileged execution mode (Sun’s entry into virtualisation with LDOMs).
  • single Floating Point Unit (FPU) shared by all cores.
  • single cryptographic unit shared by all cores.
  • Max 32GB memory.
  • Shared L2 Cache 3MB.
  • 8kB primary data cache (per core).
  • 16kB primary instruction cache (per core).
  • low power (72W).

The design of the T1 processor was open-sourced in 2006 as the OpenSPARC project.

In October 2007 came the T2:

  • 8 threads per core, doubling maximum concurrency to 64 threads
  • runs at speeds up to 1.6GHz (slightly higher)
  • FPU per core, rather than per processor (significant improvement on FP operations)
  • Dual 10Gb Ethernet integrated onto the chip
  • 1 crypto unit per core
  • Shared L2 Cache increased from 3MB to 4MB
  • power consumption up to ~95W due to the extra integration on the chip
  • improved instruction pre-fetching to improve single thread performance
  • 2 integer Arithmetic Logic Units (ALUs) per core, one for each group of 4 threads, up from one per core on the T1, which increased throughput of integer operations

In 2008 the T2 Plus was released, which allowed up to 4 sockets of T2s in one server (total concurrency 64*4=256 threads).

In September 2010 the T3 launched:

  • 16 cores still with 8 threads per core
  • Shared L2 Cache increased from 4MB to 6MB
  • Primary instruction cache still 16kB (per core)
  • Primary data cache still 8kB (per core)
  • Came in 4 server variants with 1, 2 or 4 physical processors running at 1.65GHz (almost no change from the T2)
  • Up to 512GB memory depending on the T3 server model

Across the first three generations, the chip did not change much. Mostly incremental changes, a few more cores, a few more chips, some extra buses, but nothing spectacular.
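To make the "incremental" point concrete, the thread counts quoted in the lists above can be multiplied out. This is only a sketch of that arithmetic: the core, thread and socket figures are the ones given in this post, while the dictionary and function names are just illustrative.

```python
# (cores per chip, threads per core, max sockets) as quoted in this post.
GENERATIONS = {
    "T1": (8, 4, 1),
    "T2": (8, 8, 1),
    "T2 Plus": (8, 8, 4),
    "T3": (16, 8, 4),
}


def hardware_threads(cores: int, threads_per_core: int, sockets: int = 1) -> int:
    """Maximum number of hardware threads a fully populated server exposes."""
    return cores * threads_per_core * sockets


def thread_summary() -> dict:
    """Map each generation to its maximum hardware thread count."""
    return {name: hardware_threads(*spec) for name, spec in GENERATIONS.items()}


for name, total in thread_summary().items():
    print(f"{name}: up to {total} hardware threads")
```

The totals climb from 32 to 512 across three generations, while the clock speed only moves from 1.4GHz to 1.65GHz. All of the growth is in parallelism, none of it in how fast any one thread runs.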

One thing that is constant across them all is extremely poor single thread performance. I cannot stress this enough!! Oracle don’t include this in the marketing material. A combination of a slow clock speed, no L3 cache, no out-of-order execution of instructions, and poorly written or poorly tuned applications makes for pretty dismal performance. These same applications run perfectly fine on the circa 2004-2006 UltraSPARC IV/IV+ line of processors that power the multiprocessor V480, V490, V880, V890 and E25K platforms! Your single threaded app is happy on a Sun Fire V890, but when migrated to a new and shiny T3-4 it runs like a dog, sometimes up to 10x slower. You probably won’t be impressed. Even more frustrating is that someone in the business has just made a significant investment in newer hardware with an expectation of better performance.

On a T5220 (T2 based) presenting 32 hardware threads, the litmus test was to check CPU utilisation while your workload was running. If it sat at a constant 3.125% and never went any higher, chances are you had a single thread problem: one hardware thread was running at 100% (1/32 * 100 = 3.125). With all 64 threads of an 8-core T2 visible, the same symptom shows up at 1.5625%.
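The litmus test above is simple enough to write down. Here is a minimal sketch, assuming you already have a list of system-wide CPU utilisation percentages, however you collected them (vmstat or sar output, for example). The function names and the tolerance value are hypothetical, not part of any Solaris tool.

```python
def single_thread_ceiling(hw_threads: int) -> float:
    """System-wide CPU % seen when exactly one hardware thread is 100% busy."""
    if hw_threads < 1:
        raise ValueError("hw_threads must be at least 1")
    return 100.0 / hw_threads


def looks_single_threaded(samples: list, hw_threads: int, tolerance: float = 0.25) -> bool:
    """True if every sample sits at the one-thread ceiling.

    `tolerance` is in percentage points and is an arbitrary choice;
    an empty sample list is treated as "no evidence" and returns False.
    """
    ceiling = single_thread_ceiling(hw_threads)
    return bool(samples) and all(abs(s - ceiling) <= tolerance for s in samples)


print(single_thread_ceiling(32))   # 32 hardware threads
print(single_thread_ceiling(64))   # 64 hardware threads
print(looks_single_threaded([3.1, 3.2, 3.125], hw_threads=32))
```

The more hardware threads the box exposes, the smaller that ceiling gets, so a workload pinned at a tiny but rock-steady utilisation figure is the thing to look for, not a busy-looking server.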

Single Thread Applications Live!

The T4 was released in late 2011:

  • Processor speed jumped to 2.85 or 3GHz
  • Integer and floating point pipelines more efficient
  • Primary data cache doubled to 16kB (per core)
  • L2 cache is now 128kB per core
  • new 4MB L3 cache shared by all 8 cores
  • first T-series chip that performs out of order execution
  • 8 cores per chip (down from 16)
  • still 8 threads per core
  • critical thread mode

This significantly improved single thread execution (Oracle quote a 5x single thread performance increase). The specs above really make it clear why it is faster for single threads:

  • much faster clock speed
  • extra on core cache + the new L3 shared cache
  • improved instruction pipelines
  • and the killer – critical thread mode.

Critical thread mode effectively means that if a core detects it is executing a single thread, all the resources that would otherwise be used to handle other threads are directed at helping that single thread execute as fast as possible.

The T4 is the first T-series platform on which you can do serious consolidation of applications from legacy hardware. Using Oracle VM Server for SPARC (OVM, previously known as LDOMs) gives you the flexibility to carve up a T4-4 (4 processors) to host legacy workloads. As all T-series chips only support Solaris 10 (and now Solaris 11), if you want to consolidate Solaris 8 or 9 workloads you need branded zones within an OVM.

Conclusion

The underlying architecture of the T-series has not changed much in the roughly 7 years since its inception. Over the last few years I have seen customers consolidate existing workloads onto these platforms and run into many unexpected performance problems. Mostly these have been resolved by optimising database queries or re-architecting applications to work more efficiently under the T-series multi-thread model. With the T4 and upcoming T5, things should improve, and sites that have large legacy Sun/Solaris footprints will finally be able to consolidate without having to spend a bomb on M-series servers. If, as is usually the case, you want to consolidate and get power saving benefits, the T4 processor or higher fits the bill. The T5 is expected to bring back 16 cores per chip, as per the T3 architecture, but with the T4 single thread speed improvements.

2 comments
  1. I always felt that the T Series mostly solved a problem that no longer exists. It was an amazing feat of engineering to get an E10K’s worth of web server performance onto a 1U box with the original. Unfortunately (for Sun) few organisations have that much web serving to do (and selling a $2500 box to do the job of a $1M one isn’t great business). The T series hasn’t been great at handling other workloads – especially anything that needs an FPU. Perhaps that’s starting to change, but Intel’s relentless tick-tocking to make Moore’s Law stay true is hard to keep up with.

    • gillardm said:

      Sorry for late reply. ;-)
      Yeah totally agree. At the time of design there was a need, but by the time they came to market, things had moved on. Also, T4s will probably be discontinued when the T5s come out. Short sales periods (a little over 12 months) cause potential issues with IT roadmaps/strategy, e.g. if you want to augment a RAC cluster and cannot buy a T4 based server anymore.
