Company: Platform
Customer: AMD
Submitted by: MCC International
Date: August 2000

The US-based AMD team, who made the world’s fastest PC processor a reality, has announced that Platform Computing’s resource management solution LSF has been instrumental in the creation of one of the world’s most sophisticated engineering achievements: the development of the 1GHz AMD Athlon processor.

Announced on March 6th 2000, the 1GHz Athlon broke the PC microprocessor speed record for the entire computing industry and is considered to be one of the greatest achievements in PC processor design of all time. According to AMD, one of the keys to their success was the use of LSF.

“It is hard to say how Platform LSF didn’t help this project,” said Steve Baugh, systems administrator/LSF admin, AMD. “There are literally thousands of components and billions of tests that were run to create the 1GHz chip.
This is one of the most complicated machines ever designed. LSF enabled the design teams to run the machines in AMD’s compute clusters as a single machine. By running computing tasks which were virtually impossible on smaller systems across the distributed processors in their (RedHat) Linux ‘supercluster’ environment, the team was able to get access to the computing resources needed to accomplish the historic project. LSF provided huge benefits; I have never seen anything like it.”

LSF was also a key factor in the breakthrough development of the AMD K6 chip, in which the use of LSF was touted to have slashed the design-to-manufacture time by as much as six months.
“The percentage of the processing power across the environment we have been able to harness with LSF has been unprecedented,” said Baugh. “We are now running at 90 per cent and above utilisation, which is really pushing the outer limits of our design hardware and computing resources.”

A number of years ago, AMD employed what was then the traditional design infrastructure of a workstation on every designer’s desk – a set-up still common in some semiconductor design environments today. What they discovered was that the processing power of the compute infrastructure was going idle the vast majority of the time. “Years ago, you’d look out across the environment and see little-to-no load on the systems whatsoever,” added Baugh.

Today, the racks of Linux systems, powered by AMD’s own high-powered processors, along with a variety of other Unix-based systems, are able to run in tandem on demanding compute jobs, thanks to the use of LSF.
By marshalling the compute power needed to run leading design simulation and verification applications across the cluster, LSF was able to ensure, in the words of Baugh, that “the right person got access to the right machine at the right time.”

LSF creates a virtual queue that dispatches jobs and matches them to the correct computing resources. A challenging job would be matched with the requisite amount of processing power and memory. The functionality extends to the management of software licenses and network resources. “Designers are able to submit their work to LSF for completion,” said Baugh. The result is that the work gets the right resources to get done quickly, with no questions asked.
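
The matching described above can be illustrated with a toy first-fit dispatcher. This is only a sketch of the general idea, not LSF's actual scheduling algorithm or API; the `Host`, `Job` and `dispatch` names, and the host and job figures in the usage note, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Host:
    """A machine in the cluster with its currently free resources (hypothetical model)."""
    name: str
    free_cpus: int
    free_mem_mb: int


@dataclass
class Job:
    """A queued job and the resources it asks for (hypothetical model)."""
    name: str
    cpus: int
    mem_mb: int


def dispatch(queue: List[Job], hosts: List[Host]) -> Dict[str, Optional[str]]:
    """Place each job, in queue order, on the first host with enough free CPUs and memory.

    Returns a mapping of job name to host name, or None if no host can
    currently satisfy the job (it would stay pending in a real system).
    """
    placements: Dict[str, Optional[str]] = {}
    for job in queue:
        placements[job.name] = None
        for host in hosts:
            if host.free_cpus >= job.cpus and host.free_mem_mb >= job.mem_mb:
                # Reserve the resources so later jobs see the reduced capacity.
                host.free_cpus -= job.cpus
                host.free_mem_mb -= job.mem_mb
                placements[job.name] = host.name
                break
    return placements
```

For instance, with a small desktop host and a larger rack host, a two-CPU verification job lands on the rack machine, a one-CPU job fits on the desktop, and a job too large for either stays unplaced until resources free up.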
Importantly, the LSF solution ensured mission-critical performance of the Linux environment. AMD, an early pioneer with Linux systems, had confidence that the Linux environment worked reliably and in close concert with existing Unix systems as part of the same cluster if required. Thanks to Platform’s early and ongoing support for Linux as well as extensive multi-platform support, this was not an issue for the organisation.

Throughout the critical design verification process, LSF was vital in administering and running the large-scale, mainly Linux-based supercluster used to perform the complex tests on the design. “Of all the stages in the design process, the verification is the most time-consuming,” said Baugh. “LSF was instrumental.
By tapping and running the processors in the machines to full performance, the teams were able to achieve the previously unachievable levels of computing power needed to accomplish the complex tests.”

The 1GHz Athlon chip is a PC processor capable of running at one billion clock cycles per second. It makes use of AMD’s 0.18-micron, six-layer metal process technology, and has approximately 22 million transistors. The sheer magnitude of the job of verifying a design of this complexity is, according to AMD, ‘mind-boggling.’

“I can say without exaggeration that we performed literally billions of tests. I don’t know how you could have tested this more. It really blows my mind. Not only were we able to complete the tests more quickly but we were able to increase the number of tests by a factor of ten. …So not only were we running better and faster but we were able to run more. That’s unprecedented,” said Baugh.

With the variety and sheer scale of work that needed to be accomplished, high availability was a daily concern. A no-compromise, no-downtime environment was key to the project’s success, according to Clive Dawson, manager of CAD Systems Engineering, AMD. “With designers submitting so many jobs to the clusters, that represents a lot of productivity that needed to be guaranteed and preserved.”

LSF’s failover and fault-tolerant features, such as the ability to recover the job queues after a failure, were critical. “We’ve really seen miracles happen with LSF.
Once, when work resumed after a building-wide power failure, I saw all of the jobs restart automatically, without any of the engineers needing to resubmit their work. That’s really something to see,” explained Dawson. “The joy was that they didn’t have to be resubmitted, and none of the jobs were lost. LSF just picked up where it left off. That’s so valuable in preserving the work of the engineers! For us that translates directly into dollars.”

Today, downtime is the exception, not the rule, for AMD. “We used to experience downtime where designers had to head home. With LSF, even running everything – the network, the systems and applications – at full tilt, we have had really limited experience with downtime.”

This is because of one key benefit that LSF provides the teams, something Baugh calls the ‘canary in the coalmine.’ “LSF is able to help us predict when failures will occur across the network.
Simply by having a single system view of the entire cluster and knowing the characteristics of each aspect of the resource pool, you can actually predict and head off troubles before they start.”

Most importantly, the sheer scale of work was something that couldn’t have been imagined in AMD’s previous environment, according to Dawson. “It has really changed the entire way in which we work. We have been able to harness the full power of not only the background compute servers, but of all our desktop systems as well. LSF can detect whenever a workstation is idle (if the engineer is at lunch or in a meeting, for example) and immediately press it into service.
Upon the engineer’s return, LSF will detect the need for interactive response time and divert the background job to another system. This is where the competitive advantage really kicks in.”

Today, LSF is used throughout AMD’s Austin, Texas facility, in which the 1GHz Athlon was designed and manufactured. In an industry where speed is king and time-to-market pressures are intense, Dawson and Baugh credit LSF with helping them revolutionise both the way AMD works and, along the way, the entire microprocessor market.