A couple of weeks ago AMD introduced its long-awaited 64-bit processors for the desktop market, Athlon 64 and Athlon 64 FX. At that time we unfortunately had no opportunity to look at them as closely as we did with Intel’s new flagship, the Pentium 4 3.2 GHz Extreme Edition. That processor is intended as Intel’s “temporary” reply to Athlon 64 and proved to deliver impressive performance with its huge 2 MB L3 cache.
However, now it is time for AMD to show what it has got: Athlon 64 3200+…

Unfortunately we have not been able to test AMD’s enthusiast model, Athlon 64 FX, because of the severe lack of mainboards supporting this processor. We will try to get our hands on one of these mainboards, but Athlon 64 is probably the most interesting choice for most consumers today.
Athlon 64 FX currently has a number of drawbacks, and the price tag for a complete system is the biggest one. We will come back to this later in the review, but the finer details will have to wait until we have an Athlon 64 FX in our test lab.

Anyhow, the topic is Athlon 64 3200+, which is based on a brand new processor architecture that we will take a closer look at.


For several years AMD has had huge success with its K7 architecture, which it introduced along with the launch of the AMD Athlon CPUs. The past year, however, has been a very hard battle for AMD against Intel and its now effective NetBurst architecture. The Pentium 4 CPUs have had a strong grip on the desktop market, and despite the introduction of the Barton core AMD has found itself left behind by the chip giant.

AMD has long been in dire need of its endlessly postponed K8 architecture, and this summer it finally turned up, albeit only for the server market. The introduction was neither a success nor a flop; if Intel has great influence on the desktop market, that is nothing compared to the server market, where AMD has nowhere near the same experience or trust.
AMD Opteron (the name of the server model of Hammer) has found a considerable number of supporters, but many companies are waiting to see what happens, since they do not want to take any risks this early.

Finally, it is time for us normal consumers to lay our hands on the eagerly awaited AMD CPU for the desktop market. Before we dive deep into the K8 architecture we have to mention the two models of the architecture that are available now. Hammer has always been the common name for the K8 architecture, but some of you might recall the names ClawHammer and SledgeHammer. These are the code names for the two cores AMD has developed for two different markets: ClawHammer for the desktop market and SledgeHammer for the server market.

Accordingly, the original plan was that only the server CPU, Opteron, would be based on the SledgeHammer core, while the desktop CPU, Athlon 64, would use the ClawHammer core. Now that we have seen AMD’s launch of the Athlon 64 we know that this is not the case: AMD also chose to launch an Athlon 64 CPU, the Athlon 64 FX, based on its server core, SledgeHammer.
One might then ask what the differences between these cores really are, and we are going to try to answer that here.

| | ClawHammer (Athlon 64) | SledgeHammer (Opteron, Athlon 64 FX) | Barton (Athlon XP) |
|---|---|---|---|
| Architecture | K8 | K8 | K7 |
| Interface | Socket 754 | Socket 940 | Socket A (462) |
| Number of transistors | 105.9 million | 105.9 million | 54.3 million |
| Manufacturing process | 0.13 micron, SOI | 0.13 micron, SOI | 0.13 micron |
| Die area | 193 mm² | 193 mm² | 101 mm² |
| System bus | HyperTransport | HyperTransport | Front Side Bus |
| 32-bit instruction support | Yes | Yes | Yes |
| 64-bit instruction support | Yes, AMD64 tech. | Yes, AMD64 tech. | No |
| Integrated memory controller | Single-channel, 64-bit | Dual-channel, 128-bit | No, located on the mainboard |
| Total CPU-to-system bandwidth | HT: 6.4 GB/s @ 1.6 GHz; MCT: 3.2 GB/s @ 400 MHz; total: 9.6 GB/s | HT: 6.4 GB/s @ 1.6 GHz; MCT: 6.4 GB/s @ 400 MHz; total: 12.8 GB/s | Total: 3.2 GB/s @ 400 MHz |
| Integrated north bridge | Yes, 128-bit wide data bus @ CPU clock frequency | Yes, 128-bit wide data bus @ CPU clock frequency | No, located on the mainboard, 64-bit wide data bus @ 200 MHz |
| Pipeline length | 12 | 12 | 10 |
| Integrated cache memory | L1: 128 KB*; L2: 1024 KB exclusive; total: 1152 KB | L1: 128 KB*; L2: 1024 KB exclusive; total: 1152 KB | L1: 128 KB*; L2: 512 KB exclusive; total: 640 KB |
| 3D and multimedia instructions | 3DNow!, SSE/SSE2 | 3DNow!, SSE/SSE2 | 3DNow! |

* The L1 cache consists of a 64 KB data cache and a 64 KB instruction cache.

Here one can see that there are very few differences between ClawHammer and SledgeHammer. The difference in CPU interface follows directly from the differences in the architecture of the two cores. The reason SledgeHammer has about 25% more pins in its CPU interface is the two memory controllers integrated in the core. ClawHammer is equipped with only one integrated memory controller and thus does not need an equally wide data bus to the mainboard.
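The memory bandwidth figures in the table can be reproduced with simple arithmetic: bus width in bytes multiplied by the number of transfers per second. The short Python sketch below does this for the single-channel and dual-channel memory controllers and adds the HyperTransport figure, which is taken directly from the table rather than derived. The function and variable names are our own and purely illustrative.

```python
def memory_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Peak bandwidth in GB/s for a bus of the given width and transfer rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9

HYPERTRANSPORT_GB_S = 6.4  # taken as given from the table above

clawhammer_mem = memory_bandwidth_gb_s(64, 400e6)     # single channel
sledgehammer_mem = memory_bandwidth_gb_s(128, 400e6)  # dual channel

print(round(clawhammer_mem, 1))                          # 3.2
print(round(sledgehammer_mem, 1))                        # 6.4
print(round(HYPERTRANSPORT_GB_S + clawhammer_mem, 1))    # 9.6
print(round(HYPERTRANSPORT_GB_S + sledgehammer_mem, 1))  # 12.8
```

These are theoretical peak values; real-world throughput is lower.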

The integrated memory controllers are one of the most interesting novelties of the Hammer architecture and something we will take a closer look at in this review. But before we continue with this and more about the CPU’s insides, we will check what the Athlon 64 3200+ looks like physically.


With the introduction of a new CPU architecture it is no surprise that not only the internal differences are big but also the physical ones. Athlon 64 and Athlon XP are two completely different CPUs in terms of packaging and physical appearance. Even if they have several similarities internally, it is undeniably difficult to tell from the outside. From the top, Athlon 64 is very similar to Intel’s Pentium 4 CPUs, as shown below.
What creates the resemblance to Intel’s flagship is the long-wanted heatspreader on AMD’s CPUs. The heatspreader is the metal plate that can be mistaken for the CPU’s core. The core itself sits beneath this plate and is thus not directly exposed. The heatspreader both spreads the CPU’s heat over a larger area and protects the core against physical damage. The latter may be a godsend for retailers, who for several years have had to cope with RMA’d CPUs damaged during faulty installation of CPU coolers and the like. Killing an Athlon 64 CPU by installing the heatsink the wrong way is more or less impossible.
The differences appear when we turn the CPU over and look at its pins, of which the Athlon 64 has many more, 276 more to be exact. Intel’s Pentium 4 uses 478 pins while Athlon 64 uses a whole 754, although this is nothing compared to Athlon 64 FX, which uses 940 pins (!).

The number of pins brings us to AMD’s new socket interface for the Athlon 64, which goes by the name Socket 754. AMD has thus abandoned the symbolic naming used for Athlon XP’s socket, Socket A.

 

Just a month or so after the official launch of Athlon 64, every major chipset manufacturer has jumped on the Socket 754 train: nVidia, VIA, SiS and ALi all have Socket 754 chipsets in their product lines.

Now we know a little more about the Athlon 64 CPU’s exterior, but what does the core look like on the inside, you may ask? Like this:


The Athlon 64/Athlon 64 FX core


Selected 64/Athlon 64 FX core

So this is what the Athlon 64 core looks like under heavy magnification, and since this picture applies to both Athlon 64 and Athlon 64 FX we can state that the cores are physically identical. In other words, Athlon 64 also has two integrated memory controllers, but only one of them is activated and connected to the rest of the system. This is something we suspected as soon as the first specifications of the Athlon 64 and Athlon 64 FX appeared, since both CPUs have exactly the same number of transistors.

It is not hard to understand the difficulty of integrating large amounts of cache in a CPU core when you see these images. The L2 cache on the Hammer core occupies more than 50% of the total area and increases the manufacturing costs. Unfortunately we have no pictures of the Pentium 4 Extreme Edition’s core, but with its 2 MB L3 cache we guess that the cache occupies a great deal of the area.

The L1 cache is split into two separate caches, a data cache and an instruction cache. As shown, the L1 cache is the one closest to the CPU’s execution units and is thus the first stop for the CPU’s data storage (disregarding the CPU’s registers, Load/Store).

Also notice AMD’s new CPU bus, HyperTransport, which is in direct contact with all the important parts of the CPU to best handle the data flow.

Below is a screenshot from the CPU identification program WCPUID.

We will go into more advanced descriptions of parts of the CPU later in the review. But first we shall examine one of the Hammer architecture’s biggest PR selling points, the 64-bit CPU architecture.


Many of you have probably seen “x86” mentioned in connection with the PC, but not everyone may know what it stands for. The term was coined back in the 80’s and originates from Intel’s earlier CPU lines, whose model names ended in 86, e.g. 386 and 486.
All desktop CPUs from Intel and AMD over the last 20 years have been built on the x86 ISA (Instruction Set Architecture), which is striking in itself since a lot has happened to CPUs in that time. The performance of today’s CPUs differs enormously from what we saw 20 years ago.
The question is then how it has been possible to keep the same foundation throughout the years and, maybe more importantly, why?

All CPUs work by executing instructions, which are the commands that the computer’s programs send out. The instructions basically consist only of ones and zeros (binary numbers), which the CPU works with to execute the commands it receives, for example adding two numbers.

Since a CPU (with today’s technology) cannot take any initiative of its own, you must first “teach” the CPU what it is supposed to do and how to do it. This is done by writing specific programs for every task the computer is to perform.
But since the CPU only understands its binary machine code, it is a nearly impossible task for a programmer to write advanced programs directly in the CPU’s language.
Another problem with writing programs directly for a specific CPU is that you cannot use these programs on other CPUs that are not built identically, and everything called backward compatibility is thrown away.
Every time a new CPU was launched you would need to start over by rewriting every program and adapting it to the new CPU’s hardware.

This was understood early on (at the end of the 1970’s) and had to be solved in some way. The solution was, in the end, the x86 ISA.

An Instruction Set Architecture (ISA) works like an adapter: at one end it is paired with a specific piece of hardware, and at the other end we have the software. By not changing the “contact surface” of the part which communicates with the software, but only the end which communicates with the CPU, one can keep compatibility between different CPUs and older software.
The ISA “translates” the complicated machine code into specific, fixed instructions which programmers can then access and use through their programming languages.
Thus, by never changing or deleting the basic instructions in the ISA, one can keep backward compatibility.

Yet this locks you to the foundations of the ISA, and even if the x86 ISA is well thought through, a lot has happened in two decades and programmers are having a harder and harder time wrestling with the architecture’s limitations.

Even if the absolute base cannot be changed if you want to keep backward compatibility, one is free to develop the ISA and add new functions to it. This is something that Intel in particular has taken advantage of, and over the years it has extended the ISA on several occasions.
Among these extensions we see for example the MMX and SSE/SSE2 instructions, and several smaller additions of instructions for which the x86 architecture has gained support.

In fact the x86 architecture was originally a 16-bit one, but with the launch of Intel’s 386 CPUs it moved to a 32-bit model.

AMD is now the next player to increase the CPU’s internal width, from 32 bits to 64 bits, with its x86-64 architecture.

To understand what it means to increase the CPU’s internal width, and why one does it, we shall move on with some basic information about how the CPU works.


Let us begin by explaining what “bit” stands for. The answer is found in the binary number system, since “bit” is short for “binary digit”.

Without going too deep into binary arithmetic we will try to explain how it works. In our ordinary decimal system a single digit can have one of ten different values, from 0 to 9. To write a number greater than 9 we have to use more digits, for example a 1 and a 3 to make the number 13.
A binary digit can only have two different values, 1 or 0. For those who are used to the decimal system this is very impractical and difficult to grasp, but it has several advantages for digital information.
In the case of the CPU an electric impulse means either 1 or 0 (1 = current, 0 = no current), which of course is easier than, for example, having different voltages for the ten different values of a decimal digit.

When you express whole numbers with binary digits the result becomes very confusing for humans, who usually only use the decimal system.
The number 38 written in binary is 100110. This is because the value of every digit position in a binary number doubles, starting with 1 from the right. Below we have two examples, the numbers 38 and 73, and how they are written in binary form.

| Decimal number = 38 | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|
| Binary number = 100110 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |

The positions 64 and 128 are not part of the number but are included as a reference for how the digit values increase.

| Decimal number = 73 | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|
| Binary number = 1001001 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |

The position 128 is not part of the number but is included as a reference for how the digit values increase.

The binary digit works as a Yes/No switch, so to speak: if the binary digit is a zero the position contributes nothing, but if it is a one it contributes the corresponding decimal value. Then you simply add up the decimal values which the binary number’s 1’s represent; in the case of 38 that is 2+4+32.
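For readers who prefer to see this as code, here is a minimal Python sketch of the same procedure: walk through the binary digits from the right, double the place value at each step, and add up the place values where the digit is a 1. The function names are our own invention for this illustration.

```python
def binary_to_decimal(bits):
    """Add up the place values (1, 2, 4, 8, ...) wherever the digit is '1'."""
    total = 0
    place_value = 1
    for digit in reversed(bits):   # start with the rightmost digit
        if digit == "1":
            total += place_value
        place_value *= 2           # each position is worth twice the previous one
    return total

def contributing_values(bits):
    """List the place values that are switched on in the binary number."""
    return [2 ** i for i, digit in enumerate(reversed(bits)) if digit == "1"]

print(binary_to_decimal("100110"))     # 38
print(contributing_values("100110"))   # [2, 4, 32]
print(binary_to_decimal("1001001"))    # 73
```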

The bit is therefore the smallest unit of information the computer works with, and it relates to the other size names as follows.

| Name | Abbreviation and value |
|---|---|
| bit | 1 b |
| Byte | 1 B = 8 b |
| Kilobyte | 1 KB = 1024 B |
| Megabyte | 1 MB = 1000 KB |
| Gigabyte | 1 GB = 1000 MB |
| Terabyte | 1 TB = 1000 GB |
| Petabyte | 1 PB = 1000 TB |
| Exabyte | 1 EB = 1000 PB |

Since 1 MB actually corresponds to exactly 1,048,576 bytes the above values are somewhat inaccurate, but they are the ones often used in this context.

It is important to know that the abbreviation of bit is always written with a lower-case “b” while byte is written with a capital “B”. Therefore 1 Mbit is the same as ~0.125 MB (Mbit is often used to measure network bandwidth, among other things).
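The two points above, the exact size of a megabyte and the bit/byte distinction, are easy to check with a few lines of Python. This is only a sketch with names we made up; the 100 Mbit figure in the last line is just an example value for a network link, not something measured in this review.

```python
BITS_PER_BYTE = 8
KB = 1024          # bytes in a kilobyte (binary convention)
MB = 1024 * KB     # bytes in a megabyte (binary convention)

def megabits_to_megabytes(megabits):
    """Convert a size or rate given in Mbit to MB."""
    return megabits / BITS_PER_BYTE

print(MB)                          # 1048576
print(megabits_to_megabytes(1))    # 0.125
print(megabits_to_megabytes(100))  # 12.5
```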

With this as background we can say, simply but correctly, that a 32-bit CPU can hold binary numbers with up to 32 digits in its GPRs (general-purpose registers), while a 64-bit CPU can hold numbers with up to 64 digits and process them in one clock cycle. We will take a closer look at the function of the CPU’s GPRs later in the review, but in short they are the CPU’s primary storage, through which all information must pass.

Since every additional binary digit doubles the range, a 64-bit architecture gives an enormous increase in representable integers, from 4.3e9 (2^32) to 1.8e19 (2^64).
By being able to handle more and much larger integers you can use more advanced structures in your programs, ones which exceed 32 bits in size.

But how does the 64-bit architecture that AMD has implemented in the Hammer core work?
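To put numbers on this, the following Python sketch prints the two ranges and shows what happens when a sum no longer fits in a register of a given width. The `wrap` helper is our own simplified model of a fixed-width register (it simply discards the bits that do not fit); it is not meant to describe how any particular CPU reports overflow.

```python
def wrap(value, width_bits):
    """Keep only the lowest width_bits bits, like a fixed-width register would."""
    return value & (2 ** width_bits - 1)

print(2 ** 32)   # 4294967296
print(2 ** 64)   # 18446744073709551616

a = 3_000_000_000
b = 2_000_000_000
print(wrap(a + b, 32))   # 705032704  (the true sum does not fit in 32 bits)
print(wrap(a + b, 64))   # 5000000000 (fits comfortably in 64 bits)
```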


Those who have followed AMD’s Hammer architecture for some time have probably come across the term x86-64 a number of times. x86-64, which stood for AMD’s 64-bit architecture, has lately been replaced by AMD64, which seems to be a cherished concept for AMD since the company wants to get rid of all the development names such as K8, x86-64 and also Hammer in favor of the new one, AMD64.

We will continue to use the different names for a while, since it is simply difficult to get rid of them, but x86-64 is now behind us and we will instead call AMD’s 64-bit architecture AMD64.

As the title of this section suggests, AMD has staked a lot on backward compatibility during the development of the AMD64 core (okay, we are beginning to soften as well...). It has been clear all along that AMD64 would be an architecture that is future-proof while still keeping the 32-bit market on its side.
The major change in a 64-bit architecture is, as we mentioned earlier, the internal “width” of the data paths in the CPU, which in turn makes it possible for the CPU to use more advanced instructions and, not least, to address more memory in the system. We will look closely at this soon, but first an overview of the changes AMD has made to the x86 ISA.

From the diagram above we can see that the general-purpose registers have been widened to 64 bits but also increased in number. In addition to the x86 architecture’s 8 GPRs there are now 8 more GPRs available to programmers. The number of SIMD registers (single instruction, multiple data), which are used for SSE/SSE2 code, has also been increased to 16 from the earlier 8 in the x86 architecture. SSE2 support is also new for Hammer and AMD, which means that all of Intel’s specific floating-point instructions are now supported by AMD as well.

Now the big question is how AMD has managed to implement all these widened and added registers for 64-bit operations without losing backward compatibility. The answer is by using several different CPU modes, which we shall try to explain briefly here.

Legacy Mode is inherited from the earlier 32-bit CPUs of the x86 ISA and has three sub-modes. These are used by 32-bit operating systems, and when the Hammer CPUs run in this environment they behave no differently from their predecessor K7 (Athlon XP). All the additions and extensions to the CPU’s registers are simply disabled in Legacy Mode.
It can be compared to a CD writer which has to lower its speed when the CD-R cannot cope with the speed the writer can actually manage. Thus in a 32-bit operating system you have no use for the AMD64 architecture; obvious as it may seem, it is worth mentioning again.

When we move over to 64-bit operating systems we get more use out of AMD64 and its new CPU mode, Long Mode. Long Mode is also split into sub-modes, two to be exact: 64-bit Mode and Compatibility Mode.
The latter is the mode used in 64-bit operating systems to run 32-bit applications. In this mode, too, the AMD64 architecture’s extended features are switched off.
It is not until we enter 64-bit Mode that the fun starts, since here you have access to both the wider 64-bit registers and the increased number of SSE/SSE2 registers.
There is a small detail in the table above worth mentioning, and that is “Operand Size”, which is actually set to 32 bits even in 64-bit Mode, despite the AMD64 architecture’s support for 64-bit instructions. The reason is simply that 64-bit integer operations are not needed very often. And since 64-bit operands both take more space in the cache memory and put a heavier load on the memory bandwidth, simply because of their size, there is no point in using them without reason.
By making 32 bits the default size for integer instructions in 64-bit Mode as well, you save both memory bandwidth and cache memory. When 64-bit operations are needed the programmer has to add a prefix to the instruction which overrides the default and allows 64-bit operands. This prefix goes by the name REX.
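To make the REX prefix a little more concrete, here is a small Python sketch that builds the prefix byte and puts it in front of an ordinary 32-bit add instruction. The encoding details (a fixed upper half of 0100 followed by the four flag bits W, R, X and B, where W selects 64-bit operand size) are not covered in this review; we take them from AMD’s published instruction format. The helper function is our own illustration, not a real assembler.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte: fixed upper bits 0100, then the W, R, X and B flags."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

# Machine code for "add eax, ebx": the default 32-bit operand size, no prefix.
add_32bit = bytes([0x01, 0xD8])

# The same opcode with a REX prefix that has W set becomes "add rax, rbx".
add_64bit = bytes([rex_prefix(w=1)]) + add_32bit

print(add_32bit.hex())   # 01d8
print(add_64bit.hex())   # 4801d8
```

The 64-bit version is one byte longer, which illustrates why it makes sense to keep 32 bits as the default.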

Thus for the majority of those who buy the Athlon 64/FX today the AMD64 architecture will be of no practical use, which is very sad. But we shall soon take a closer look at how one can make more use of AMD’s 64-bit architecture, and of a 64-bit architecture in general. First, though, we will talk a little about Intel’s own 64-bit architecture, IA-64.

AMD64 + Intel IA-64 = false

Although AMD has received a lot of media coverage for its 64-bit architecture, Intel has actually had its own 64-bit architecture for several years. But it is not the 64-bit architecture as such that has earned AMD all the attention; it is the fact, mentioned several times in this review, that AMD offers 64-bit technology with full backward compatibility with the vast range of 32-bit software which has appeared over the last two decades.

How AMD has managed to accomplish this we have already explained, and its aim for backward compatibility is directly related to the market the Hammer architecture targets. Intel developed its IA-64 architecture (Itanium) as a pure server/workstation platform, with cost as a secondary concern.
AMD has, as far as we know, always aimed at launching the Hammer CPU for both the desktop and the server market (Athlon 64, Opteron), which has made the choice very simple. To succeed on the desktop market the Hammer architecture simply has to give full support for the x86 ISA and its 32-bit software.

Intel thus chose to start from a clean slate with a completely new ISA, IA-64, which has no “real” backward compatibility with today’s 32-bit software. Intel has admittedly developed emulation to be able to run 32-bit software on its IA-64 platform as well, but that involves a real performance loss which the AMD64 architecture does not suffer from.

Without getting too deep into the IA-64 architecture, it is clear that it does not have the same wide market as AMD64, neither today nor in the near future. One should also remember that AMD64 and IA-64 are two different ISAs, which means that the two platforms use different operating systems and software even in 64-bit mode. This is also the reason why Microsoft has had to develop a new 64-bit version of Windows, since its current 64-bit Windows platform only supports Itanium.


AMD is far from the first or the only company to provide a 64-bit processor. We’ve got SPARC, the PowerPC G5 and of course Intel’s Itanium, among others. Intel and AMD have, as we mentioned earlier, chosen separate paths when designing their processors, and therefore we thought we’d dedicate a page to those differences and the companies’ approaches to the problem.

We could sum up the differences simply like this: AMD continues to build on the well-known x86 architecture while Intel started over from scratch, developing a completely new architecture. Which approach to prefer is hard to say, but one thing speaks for AMD and that’s its superior support for legacy 32-bit applications and operating systems. The IA-64 CPU is a completely new design to which x86 compatibility was added, whereas the Hammer is an x86 CPU with added 64-bit support, so to speak.

The advantage of AMD’s path is obviously the complete support for the current x86 platform that we’re all so used to. All the games, programs and operating systems you’ve used with your old processor will work flawlessly on an Athlon 64. With the IA-64 Itanium from Intel things get trickier as the processor, as mentioned earlier, can only handle 32-bit code through a compatibility layer, which means a loss in performance since it can only be done through emulation. With the first Itanium processor this loss was so great that some called it unusable, but with the Itanium 2 the performance is at least decent.

It’s not just the technology that differs but also the target market. Itanium is aimed at the server segment and high-performance workstations, while the Athlon 64/Opteron aims to meet the demands of gamers, server administrators, 3D developers and all sorts of other users.

As we approach the technical parts we discover that AMD uses a longer pipeline (12 stages) than Intel (10 stages) for its 64-bit processors. That’s the opposite of what we see when comparing their 32-bit chips. Furthermore, both chips are equipped with nine execution units, but their implementations differ in that AMD has to support three different addressing modes while Intel makes do with one. This results in AMD needing three units (pure 32-bit, mixed 64-bit and pure 64-bit) for address generation, whereas Intel doesn’t need a single one as it is always in the same addressing mode. To express this in layman’s terms, Intel’s CPU can concentrate better.
AMD and Intel also made different choices when it came to floating-point versus integer performance. Here, AMD has the upper hand in floating-point performance while Intel takes the lead in integer calculations.

One big difference is the number of registers in each of the processors. AMD’s 16 GPRs sound pretty meager compared to Intel’s 128. On the other hand, AMD only needs one clock cycle to reach a register while Intel needs two. Whether 16 is too few or 128 is too many is up to the experts to decide, and I suppose there will be a lot of opinions on that. In general one might say that something like 32-64 registers with one clock cycle of latency would be the optimal and also a reasonable solution.

Other differences are of course the advantages AMD gains through its integrated memory controller (especially beneficial when using multiple processors, as each one gets its own dedicated memory channel). The other benefit is AMD’s HyperTransport, something Intel simply has to do without. More information on these features will be provided later in the review.

We don’t intend to choose a winner between the two processors. Both architectures have, as we’ve shown, their pros and cons. What weighs the most depends solely on what you want to use the machine for. One thing is obvious, though: AMD aimed for an all-round solution while Intel went for a more specialized one.

If it’s not obvious by now, this might need spelling out: Intel Itanium can’t run x86-64 code and the AMD Hammer can’t run IA-64 code.


Perhaps the biggest benefit of a 64-bit architecture is the enormous increase in memory addressing compared to a 32-bit architecture. A 32-bit architecture can address up to 4 gigabytes (2^32 bytes) while a 64-bit architecture can theoretically address about 18 million terabytes (2^64 bytes).
The actual memory addressing of the x86-64 architecture is not quite as dizzying, but it is sufficient by a good margin, we might add.

| | Hammer (Opteron, Athlon 64 FX, Athlon 64) | NetBurst (Pentium 4, Pentium 4 C, Pentium 4 EE) | Barton (Athlon XP) |
|---|---|---|---|
| Max. physical memory addressing | 1024 GB flat (40 bit) | 64 GB PSE* (36 bit) | 4 GB (32 bit) |
| Max. virtual memory addressing | 256 TB (48 bit) | 4 GB (32 bit) | 4 GB (32 bit) |
| Number of transistors | 105.9 million | 105.9 million | 54.3 million |

* PSE = Page Size Extension

Physical memory addressing weighs in at 40 bits and 1024 GB of memory, while virtual memory addressing is 48 bits and supports 256 TB. In other words, light years ahead of earlier x86 CPUs. The Pentium 4 CPUs (and earlier Pentium models) can also address more than 4 GB of physical memory (RAM) through their PSE technique, but it is not “real” 36-bit addressing and it demands both special hardware and software.
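All of the limits above follow directly from the number of address bits: n bits can address 2^n bytes. The Python sketch below, with names of our own choosing, recomputes the figures using binary units (1 GB = 2^30 bytes, 1 TB = 2^40 bytes).

```python
GB = 2 ** 30   # bytes in a gigabyte (binary convention)
TB = 2 ** 40   # bytes in a terabyte (binary convention)

def addressable_bytes(address_bits):
    """Number of distinct byte addresses that fit in the given number of bits."""
    return 2 ** address_bits

print(addressable_bytes(32) // GB)   # 4         (32-bit x86)
print(addressable_bytes(36) // GB)   # 64        (Pentium 4 with PSE)
print(addressable_bytes(40) // GB)   # 1024      (Hammer, physical)
print(addressable_bytes(48) // TB)   # 256       (Hammer, virtual)
print(addressable_bytes(64) // TB)   # 16777216  (theoretical 64-bit limit)
```

The last line shows that the theoretical 64-bit limit is about 16.8 million binary terabytes, or roughly 18 million terabytes if you count in decimal units.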

The Hammer architecture also demands new software, first of all a 64-bit operating system, since its 64-bit mode cannot be activated in a 32-bit operating system, as we explained earlier.

Today there are not many ordinary consumers who need more than 4 GB of RAM, or who have the money to spend on it. But in the server and workstation market things are different. There, database servers and similar machines passed the 4 GB limit long ago. Here a Hammer would have a great advantage over its x86 brothers, but the heavier server/workstation market is another story, one where pure 64-bit CPUs rule (SPARC, Power4, Alpha etc.), although in an entirely different price range.

If we are going to be completely honest, only servers and workstations have any use for the 64-bit architecture today, since there is no mainstream OS with 64-bit support, though Windows XP 64 is under development. Of course there is 64-bit support in, for example, Linux, but even if the number of Linux users is growing, only a very small part of the consumer market has jumped on the Linux train.
It is not only the lack of software that lies behind this conclusion, though, since neither complex 64-bit instructions nor more than 4 GB of RAM is something that we ordinary consumers have any great need for. But we should keep in mind that software developers have so far had no reason to develop 64-bit compatible/optimized software for desktop computers. With AMD’s launch of Hammer and Windows XP 64-bit Edition we hope that this will speed up, and it is quite plausible that in the near future you will get more use out of the 64-bit architecture on the desktop as well.

There is also one part of the AMD64 architecture that can give noticeable performance increases in more or less all 64-bit software: the 8 extra general-purpose registers (GPRs) which become available in 64-bit environments.

We have mentioned earlier in the review that the registers are the storage closest to the CPU’s functional units (ALU, FPU etc., which perform mathematical and logical operations on data). This also means that they are the fastest “storage place” for data in and near the CPU.

As we saw on the previous page, in the diagram of the AMD64 architecture’s registers, today’s x86 CPUs only have 8 GPRs in the core. Even though these 8 registers are the only ones available to programmers and compilers, today’s CPUs use several hidden registers which are controlled by the CPU’s hardware. In this way more information can be stored close to the units which perform operations on data (the ALU is one of these “execution units”).
But these hidden registers cannot be counted on when one writes or compiles software for the CPU. By increasing the number of registers AMD has given programmers and compilers better conditions for optimizing performance on the Hammer CPUs.

With this in mind we move on to more information about registers and cache memory, which make up the CPU’s important storage hub.


For you who have already read our review of the Intel Pentium 4 Extreme Edition this is a familiar area. There we went through the cache memory’s function and what one will benefit from using a larger cache. Some of that information will be repeated here but most of it is new and we will also bring up the CPU’s registers.

As we mentioned earlier the CPU’s registers (also named Load/Store) is the fastest storage that the CPU has access to and these are used to temporarily store the data/information that the execution units (those which execute the CPU instructions) need to calculate for example a mathematical calculation. As an example we can take the following.
An instruction is sent to the CPU’s ALU (arithmetic-logic unit) and decides that a mathematical calculation shall be done on two numbers, to make it simple we add two numbers. These two numbers must be collected from the CPU’s register and when the calculation is done the result shall also be sent to the register. Every register can only contain a single number and in a 32bit register the number can be of maximum 32 binary digits, and 64 binary digits in a 64bit register.

This simple operation occupies 3 of the CPU’s 8 registers, and with more complicated calculations it does not take long before the CPU runs out of register space. The hidden registers we mentioned on the previous page get around much of this problem, through a technique called register renaming. But the ideal is of course more ”real” registers in the CPU, to minimize the need to read data from the cache or RAM (where data that does not fit in the registers ends up), which is very time-consuming.

Doubling the number of registers visible to programmers and compilers thus has clear benefits, and although the number of registers is fixed by the CPU’s ISA, AMD does not need to worry about this. All software must in any case be rewritten or recompiled for the AMD64 architecture, and only then do the 8 extra registers become available.

Of course, the CPU’s registers are nowhere near enough to store all the data in use. This is where RAM and cache enter the picture. In the following example we use the ALU as the execution unit (it could equally be the FPU).
The ALU needs data at hand to carry out its tasks, and for this purpose we have RAM (and ultimately the hard drive, which we leave out of this example), which is large enough to hold most of the data the ALU uses frequently. Compared with the CPU, however, RAM is hopelessly slow: fetching data from RAM takes many clock cycles, which wastes valuable CPU power. To minimize the need to fetch data from RAM, a middleman is used, which in today’s CPUs is usually split in two. We are of course talking about cache memory, Level 1 and Level 2.
The cache can hold much more data than the CPU’s registers, and because it is physically integrated in the CPU core its access time is much lower than that of RAM. The cache is not in direct contact with the ALU, so it is not as quickly available as the registers, but the difference is nowhere near as large as between registers and RAM. The Level 1 cache is closest to the ALU and therefore has a lower access time than the Level 2 cache. A simple picture of the relation between the different ”storage stations” is shown above.
The CPU handles the different stations according to a simple rule: the most-used data goes in the fastest available storage station. Priority then follows from this basic rule, but CPUs often use an ”inclusive” cache, which means that all data in the L1 cache is also stored in the L2 cache.
AMD is an exception: it uses an ”exclusive” cache model in which the L1 and L2 caches hold different data, which increases the effective cache size compared with an inclusive model, where the same data has to be stored twice.
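The difference between the two cache models can be sketched with a little arithmetic. The 1 MB L2 size comes from this review; the 128 KB L1 size is an assumption made purely for illustration, and the function is a simplified model rather than a description of how any particular CPU manages its cache lines.

```python
def effective_cache_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Amount of unique data an L1+L2 pair can hold.

    Inclusive: every L1 line is duplicated in L2, so L2 sets the limit
    (this simple model assumes L2 is at least as large as L1).
    Exclusive: L1 and L2 hold different lines, so the sizes add up.
    """
    return l1_kb + l2_kb if exclusive else l2_kb


L1_KB = 128   # assumed L1 size, for illustration only
L2_KB = 1024  # 1 MB L2, as on the Athlon 64 3200+

print("inclusive:", effective_cache_kb(L1_KB, L2_KB, exclusive=False), "KB")
print("exclusive:", effective_cache_kb(L1_KB, L2_KB, exclusive=True), "KB")
```

The smaller the L2 cache is relative to L1, the more the exclusive model pays off, since the duplicated L1 contents make up a larger share of the total.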

In short, the storage areas get smaller as they get faster. The reason is of course cost, since extremely fast cache memory is expensive both to manufacture and to implement (the number of registers is fixed by the CPU’s ISA). When designing a CPU you therefore have to weigh price against performance gain to arrive at a realistic, usable amount of cache. This is a balance Intel has bent somewhat in the design of the Pentium 4 Extreme Edition, where the performance gain from the L3 cache is noticeable but hardly worth its price for an ordinary consumer.

The differences in cache structure between the K7 and K8 architectures are actually not that big. The most noticeable is the greatly increased L2 cache, which is 1 MB on the K8 architecture, while the K7 architecture offers at most 512 KB (Barton).
The L1-L2 cache bus has also been improved in the K8 architecture: by using dual 64-bit buses between the caches, AMD has both increased the bandwidth and reduced the latency compared with the K7 architecture (which has a single 64-bit bus).


Cache information – WCPUID
Cache information – CPU-Z

The conclusion to draw from what we have discussed here is that CPU manufacturers put a great deal of energy into speeding up the storage for the data the CPU works on. Today’s CPUs are simply too fast to be supplied easily and efficiently with the data they need.
Since all the data-fetching steps shown in the picture above are used more or less frequently, it is very important to optimize every one of these ”storage stations”, as we call them. A chain is no stronger than its weakest link, as you know.

AMD has therefore reworked the memory controller, the next link in the chain after the cache, and they have shown no mercy.


To minimize the painfully high latency of RAM, AMD chose to integrate the entire memory controller in the Hammer CPU core. We can see exactly how it has been implemented by once again taking a closer look at the core.

With the memory controller integrated directly in the CPU core, it also runs at the same clock frequency as the CPU itself. But no RAM comes even close to the clock frequency of a CPU, which means the clock for the memory has to be divided down.
This is done with a divider, which in the memory controller’s case can only be an integer. AMD has chosen a nominal base frequency of 200 MHz, which means that the clock frequency of their CPUs increases in 200 MHz steps.
Because the memory clock is derived from the CPU’s clock frequency, this can produce somewhat odd memory clock frequencies.
This does not affect the Athlon 64, however, since the CPU supports DDR400. Because DDR400 runs at a real memory clock of 200 MHz, there is always an integer divider that fits the CPU’s clock frequency. But on the Opteron platform, which still lacks official DDR400 support, the memory frequencies can deviate from their specifications as shown in the following table.

Actual memory clock frequency (divider) per memory standard:

CPU clock frequency | DDR266 | DDR333 | DDR400
1600 MHz | 133 MHz (12) | 160 MHz (10) | 200 MHz (8)
1800 MHz | 129 MHz (14) | 164 MHz (11) | 200 MHz (9)
2000 MHz | 133 MHz (15) | 166 MHz (12) | 200 MHz (10)
2200 MHz | 129 MHz (17) | 157 MHz (14) | 200 MHz (11)

As can be seen, there should be no problems with memory frequencies on the Athlon 64 / 64 FX, which support DDR400 memory. But if and when AMD launches a CPU with DDR500 support the situation will be different, since 250 MHz does not divide evenly into CPU clocks built on AMD’s 200 MHz base frequency. Note that we say ”launches a CPU with DDR500 support”: since the memory controller is integrated in the CPU, the entire CPU has to be updated for a new memory standard, not just a mainboard chipset as we are used to.
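The divider logic can be sketched in a few lines of Python. This is a sketch under one assumption of ours: that the controller picks the smallest integer divider that keeps the memory at or below its nominal clock. That rule reproduces the memory clocks in the table above, but AMD’s actual selection logic is not described in this review. `Fraction` is used so that 133.33 MHz and 166.67 MHz are handled exactly.

```python
from fractions import Fraction
from math import ceil

# Nominal memory clocks in MHz (DDR memory transfers data twice per clock).
NOMINAL_MHZ = {
    "DDR266": Fraction(400, 3),  # 133.33 MHz
    "DDR333": Fraction(500, 3),  # 166.67 MHz
    "DDR400": Fraction(200),
    "DDR500": Fraction(250),     # hypothetical future support
}


def memory_clock(cpu_mhz: int, standard: str) -> tuple[int, float]:
    """Return (divider, actual memory clock in MHz) for a CPU clock.

    Assumption: smallest integer divider that does not overclock the memory.
    """
    divider = ceil(Fraction(cpu_mhz) / NOMINAL_MHZ[standard])
    return divider, cpu_mhz / divider


for cpu in (1600, 1800, 2000, 2200):
    cells = []
    for standard in NOMINAL_MHZ:
        divider, actual = memory_clock(cpu, standard)
        cells.append(f"{standard}: {actual:.0f} MHz (/{divider})")
    print(f"{cpu} MHz -> " + ", ".join(cells))
```

Running it also shows the DDR500 case from the paragraph above: only CPU clocks that are multiples of 250 MHz reach the full memory clock, while the others end up below it.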

Even if memory bandwidth does not increase with the integrated memory controller, latencies will be noticeably lower thanks to the controller’s high clock frequency, and hopefully we will see this in the performance results.

It is now time to take a closer look at the bus that ties all the parts of the Hammer CPUs together: HyperTransport.


In 1997 the HyperTransport Consortium was started to develop a system bus architecture, and out of it came the HyperTransport bus. A couple of hundred companies are involved in the consortium, and AMD is not the first to use the technology. nVidia uses a version of HyperTransport in its nForce2 chipset to link the north and south bridges. Now AMD has gone one step further by sharply increasing the bandwidth of the HyperTransport bus and implementing it directly in the Hammer CPUs.
This means that AMD’s Hammer architecture does not use an ordinary FSB (front-side bus) as most of today’s CPUs do. Instead, the HyperTransport bus is the CPU’s data path to the mainboard chipset and the computer’s other components.

One difference between the HyperTransport bus and the ordinary CPU buses we see today is that the HT bus is much ”narrower” than its FSB competitors. The HT bus can be up to 32 bits wide, but in the Hammer architecture it is at most 16 bits. By width we mean how much data the bus can transfer per clock cycle, just as with the internal width of the CPU’s data path.

Architecture | Hammer (Opteron, Athlon 64 FX, Athlon 64) | NetBurst (Pentium 4, Pentium 4 C, Pentium 4 EE) | Barton (Athlon XP)
CPU bus | HyperTransport | FSB | FSB
Clock frequency on the CPU bus | 1.6 GHz | 800 MHz | 400 MHz
Bus width (bits per clock cycle) | 16 | 64 | 64
Bandwidth | 6.4 GB/s | 6.4 GB/s | 3.2 GB/s

The table above shows that HT makes up for its narrow data width with a high clock frequency. As shown, today’s HT link has an effective clock frequency of 1.6 GHz. In reality the clock runs at 800 MHz, but HT uses DDR signaling, which doubles the effective clock frequency. The HT bus’s 800 MHz comes from the bus’s base frequency of 200 MHz and a multiplier of 4. In the mainboard’s BIOS this is normally shown as the LDT multiplier (Lightning Data Transport, the older name for HyperTransport).
If the equation for HyperTransport’s bandwidth does not seem to add up (1600 x 16 / 8* = 3.2 GB/s?), it is because a HyperTransport link consists of two sub-links at 3.2 GB/s each (Tx sub-link = transmit, Rx sub-link = receive).

The Opteron actually has 3 HyperTransport links: 1 for the CPU bus and I/O, while the other two are for multi-CPU configurations. But since the Athlon 64 is a single-CPU part, only one HyperTransport link is active in this CPU.
* We divide by 8 to get the value in bytes instead of bits.
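The bandwidth figures in the table can be checked with a short calculation. The sketch below uses only numbers given in this review (base frequency, multiplier, bus widths); the function name and its parameters are our own.

```python
def bus_bandwidth_gbs(transfers_mhz: float, width_bits: int, directions: int = 1) -> float:
    """Peak bandwidth in GB/s: transfers per second x bytes per transfer x directions."""
    bytes_per_transfer = width_bits / 8  # divide by 8 to go from bits to bytes
    return transfers_mhz * 1e6 * bytes_per_transfer * directions / 1e9


# HyperTransport on Hammer: 200 MHz base x multiplier 4 x 2 (DDR) = 1600 MHz effective.
ht_effective_mhz = 200 * 4 * 2

print("HT 16-bit, one sub-link:", bus_bandwidth_gbs(ht_effective_mhz, 16), "GB/s")
print("HT 16-bit, Tx + Rx:     ", bus_bandwidth_gbs(ht_effective_mhz, 16, directions=2), "GB/s")
print("HT 32-bit, Tx + Rx:     ", bus_bandwidth_gbs(ht_effective_mhz, 32, directions=2), "GB/s")
print("Pentium 4 FSB (800 MHz):", bus_bandwidth_gbs(800, 64), "GB/s")
print("Barton FSB (400 MHz):   ", bus_bandwidth_gbs(400, 64), "GB/s")
```

One thing worth keeping in mind when comparing the results: the 6.4 GB/s for HyperTransport is the sum of two one-way sub-links, whereas an FSB shares its bandwidth between both directions.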


Above is an example with a 32-bit data width, where the maximum bandwidth is 12.8 GB/s.

We can see how HyperTransport is implemented in the AMD-8000 chipset on the image below.


AMD’s implementation of HyperTransport in the AMD-8000 chipset

An ordinary mainboard chipset usually consists of two chips, the north bridge and the south bridge, as we mentioned earlier. In such a system the north bridge ”speaks” with the CPU over the CPU bus (FSB), whose bandwidth depends on the CPU you use (see the table at the top of the page). The north bridge also contains both the memory controller and the AGP port.
By integrating the memory controller in the CPU, AMD has left only AGP for the ”north bridge”, and instead of bridges AMD has moved to ”tunnels”. These HyperTransport tunnels are taps on the HT link between the CPU and the final I/O hub (in/out), and can provide support for PCI-X, for example, or, on desktop models, the AGP graphics interface.

The HT tunnel acts as a middleman between the AGP bus and the HT bus, since the AGP bus cannot be connected directly to the HyperTransport link. The link coming into the tunnel is 16 bits wide in both directions, but the link out of the tunnel, to the I/O hub, is only 8 bits wide. This gives the AMD-8000 chipset’s I/O hub a maximum bandwidth of 1.6 GB/s, which stands up very well against its competitors.

It is still up to the chipset manufacturers both to implement the HT bus towards the CPU and to decide whether to use the HT bus at all between the ”north and south bridges”. The HT bus has many advantages: high performance, good backward and forward compatibility through its HT tunnels, lower production costs thanks to its relatively narrow data width, and support for up to 32 devices. Despite this, not all manufacturers make full use of the technology.

Chipset | AMD-8000 | ALi M1687 | NVIDIA nForce3 150 | VIA K8T800 | SiS 755FX
CPU bus | HyperTransport | HyperTransport | HyperTransport | HyperTransport | HyperTransport
Clock of the CPU bus | 1.6 GHz | 1.6 GHz | 1.2 GHz | 1.6 GHz | 1.6 GHz
CPU bus width | 16b/16b | 16b/16b | 16b/8b | 16b/16b | 16b/16b
Bandwidth | 6.4 GB/s | 6.4 GB/s | 3.6 GB/s | 6.4 GB/s | 6.4 GB/s
I/O bus | HyperTransport | HyperTransport | HyperTransport | VIA V-Link | SiS MuTIOL 1G
I/O bandwidth | 1.6 GB/s | 1.6 GB/s | 3.6 GB/s | 533 MB/s | 1.0 GB/s

The chipset that stands out somewhat from the crowd is nVidia’s nForce3, which has little more than half the CPU bus bandwidth of the others but a much higher I/O bandwidth. The reason for the lower CPU bus bandwidth is that nVidia has chosen a lower clock frequency for the HyperTransport bus (600 MHz, 1.2 GHz effective) and at the same time reduced the width of the Rx sub-link to 8 bits. But since nVidia has put all the components in one chip, it does not need an extra HT link to an external I/O hub (south bridge). The I/O hub therefore has the same bandwidth available as the CPU, which of course is positive. The question, however, is whether this can compensate for the CPU’s loss in bandwidth.
We are somewhat doubtful on this point, but we aim to get the facts in upcoming Athlon 64 mainboard reviews.
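The 3.6 GB/s figure for the nForce3 150 follows from its asymmetric link. A minimal sketch, using only the clock and widths stated above (the function and parameter names are our own):

```python
def ht_link_gbs(clock_mhz: float, tx_bits: int, rx_bits: int) -> float:
    """Total peak bandwidth of an HT link whose two sub-links may differ in width."""
    transfers_per_second = clock_mhz * 2 * 1e6  # DDR: two transfers per clock
    return transfers_per_second * (tx_bits + rx_bits) / 8 / 1e9


nforce3_150 = ht_link_gbs(600, tx_bits=16, rx_bits=8)
others = ht_link_gbs(800, tx_bits=16, rx_bits=16)

print(f"nForce3 150:        {nforce3_150:.1f} GB/s")
print(f"16b/16b at 800 MHz: {others:.1f} GB/s")
print(f"ratio:              {nforce3_150 / others:.0%}")
```

Split per direction, the nForce3 150 link offers 2.4 GB/s on the 16-bit sub-link and 1.2 GB/s on the 8-bit one, against 3.2 GB/s each way for the other chipsets.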

Now it is time to go from theory to practice, but first a short lesson on the VIA K8T800 chipset, which is our test platform in this review.


VIA was one of the quickest to jump on the AMD64 architecture with its K8T800 chipset. The chipset exists in two versions and, not entirely unexpectedly, there is one model for performance PCs and one for servers and workstations.
That is also exactly what they are called: ’K8T800 Performance PC’ and ’K8T800 Server/Workstation’. The differences between the two are minimal: the server model supports Opteron/Athlon 64 FX with the Socket 940 interface and also a dual PCI-X bus through the VIA VPX2. The performance PC model supports Socket 754 and has no PCI-X compatibility; that’s it.

Among VIA’s selling points we see Hyper8 everywhere, which is simply VIA’s name for its 16-bit/1.6 GHz HyperTransport bus. As we saw on the previous page, VIA is not alone in this on the market, since only nVidia has not yet followed suit (the nForce3 250 is rumored to be a 16-bit/1.6 GHz design).

We can see the VIA K8T800 chipset’s layout in the block diagram above. Among its features are native SATA RAID, Gigabit Ethernet and the first integrated 7.1-channel sound circuit, the VIA Envy24PT, which is supposed to be a strong challenger to today’s available solutions, even external ones.
More in depth information about the VIA K8T800 chipset can be found here.

The VIA K8T800’s biggest Achilles heel is its embarrassingly low bandwidth between the north and south bridges. As we saw on the previous page, the VIA V-Link tops out at 533 MB/s, which is not much to brag about. High I/O bandwidth is positive and future-proof, but it is far from necessary for most users. For those who use Gigabit Ethernet and several fast hard drives, for example in a RAID configuration, the bandwidth may however be of much use.
With VIA’s major focus on SATA RAID it would have been nice to see a little more bandwidth to play with. Whether we will run into any bandwidth problems is doubtful, but those who know they will load the I/O bus heavily may want to take this into account.
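To get a feel for when 533 MB/s could become a limit, here is a rough worst-case budget in Python. The V-Link figure is from the table on the previous page; the per-device numbers are interface peak rates (first-generation SATA and one direction of Gigabit Ethernet), which we assume as an upper bound. Real drives sustain far less, so this is a ceiling, not a prediction.

```python
V_LINK_MBS = 533            # north-south bridge link, from the chipset table
SATA_PORT_MBS = 150         # SATA 1.5 Gbit/s interface peak (assumed upper bound)
GIGABIT_ETHERNET_MBS = 125  # 1000 Mbit/s / 8, one direction (assumed upper bound)


def io_load_mbs(sata_drives: int, ethernet_ports: int = 1) -> int:
    """Worst-case simultaneous load on the north-south link in MB/s."""
    return sata_drives * SATA_PORT_MBS + ethernet_ports * GIGABIT_ETHERNET_MBS


for drives in range(1, 5):
    load = io_load_mbs(drives)
    verdict = "fits" if load <= V_LINK_MBS else "exceeds V-Link"
    print(f"{drives} SATA drive(s) + Gigabit Ethernet: {load} MB/s ({verdict})")
```

Under these assumptions the link only saturates with three or more drives streaming at full interface speed alongside the network, which matches the view that few users will hit the limit.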

The Albatron K8X800 Pro II is the VIA K8T800 board which we have used in the test rig but we will not go into details about this mainboard now. We’ll save it for coming mainboard reviews.


Test system

AMD Athlon64
CPU: Athlon64 3200+ (Hammer, 1.6 GHz HT, 1 MB L2 cache)
Mainboard: Albatron K8X800 Pro II (VIA K8T800)

AMD AthlonXP
CPU: AthlonXP 3200+ (Barton, 400 MHz FSB, 512 KB L2 cache)
Mainboard: ABIT NF7-S v2.0 (NVIDIA nForce2)

Intel Pentium 4
CPU: Intel Pentium 4 3.2 GHz Extreme Edition (800 MHz FSB, 512 KB L2 cache + 2 MB L3 cache)
CPU: Intel Pentium 4 2.8 GHz (800 MHz FSB, 512 KB L2 cache)
Mainboard: ABIT IC7-MAX3 (Intel 875P)

Other hardware used in all the test systems
RAM: 2 x 256 MB GeIL PC3500 Ultra CL2
Graphics card: nVidia GeForce FX5900 Ultra @ 475/950 MHz
Hard drive: 120 GB S-ATA Seagate Barracuda V

Software
Operating system: Windows XP Professional SP1
Resolution: 1024x768x32bit, 90 Hz
Drivers: nVidia Detonator 51.75, DirectX 9.0b, Intel Chipset Driver 5.0.2

Benchmarking software
SiSoftware Sandra 2003
Aquamark 3
Comanche 4 Demo benchmark
Unreal Tournament 2003 demo v.2206
Quake 3: Arena v1.32
3DMark2001 SE 330
3DMark03
Audioactive Production Studio 2.04j (Fraunhofer II encoder)
WinAce v2.20
PCMark 2002 Pro
SPECviewperf 7.1
WCPUID
DivX 5.1

The table above says most of what there is to say about our test rig. To really test the new Athlon64 CPU we have the top models from both Intel’s Pentium 4 series and AMD’s own AthlonXP series. In Intel’s case this is a Pentium 4 3.2 GHz Extreme Edition, but we also have an ”ordinary” 2.8 GHz C and a 2.8 GHz EE model. The latter is really the 3.2 GHz model clocked down via its multiplier.
On the AMD side we have the AthlonXP 3200+, fighting for the honor of the K7 architecture.
The only one missing is the Athlon64 FX-51, which has proved to be a shy CPU in this oblong country, but hopefully we will be able to take a closer look at the Athlon64 FX shortly.

Before we begin our ”benchmark bonanza”, a couple of words about the test settings.
The memory used in all tests with 200 MHz FSB (DDR400) was 2 x 256 MB GeIL PC3500 CL2, run at 1:1 with 2-2-5-2 timings. On the Athlon64 platform we could only use 2-2-6-2 timings, since those were the most aggressive memory timings the board allowed.
Whether this affects the test results we cannot yet say, but we intend to find out in a forthcoming memory article, where we will test more VIA K8T800 boards as well as mainboards based on the nForce3 150 chipset. With a little luck we may also include nVidia’s nForce3 250 chipset, samples of which are due next month.
We are also saving our overclocking tests for that article, since our test mainboard did not come with the bracket used to mount the CPU cooler. We received this bracket neither with the cooler nor with the CPU, so we had to improvise the mounting. For ordinary tests it worked fine, but it was not good enough for overclocking, and we did not want to risk the health of the CPU.
During our tests the CPU temperature peaked at 46 degrees C and idled at 32 degrees C, which must be considered good under the circumstances. This was measured with the internal temperature diode, however, so we will look closer at this as well in future articles.

For the 3D tests, all runs were done with V-Sync, FSAA and anisotropic filtering off. Aquamark 3 is a small exception, since it runs with 4x anisotropic filtering in its standard settings.

Let the party begin. First out are the memory tests.


We begin with some memory tests, with the help of SiSoft Sandra 2003. First out is the buffered memory benchmark.

SiSoft Sandra Memory Bandwidth, Buffered, Int (MB/s)
P4 2.8 GHz: 4952
P4 EE 3.2 GHz: 4912
P4 EE 2.8 GHz: 4897
Athlon64 3200+: 3005
AthlonXP 3200+: 2995

SiSoft Sandra Memory Bandwidth, Buffered, Float (MB/s)
P4 EE 3.2 GHz: 4911
P4 2.8 GHz: 4897
P4 EE 2.8 GHz: 4890
Athlon64 3200+: 3003
AthlonXP 3200+: 2832

As shown, the Athlon CPUs have no chance of keeping up with the Pentium 4 platform in this area, but there are explanations. The AthlonXP 3200+ admittedly has a high theoretical memory bandwidth (with the nForce2 chipset’s DualDDR), but it is held back by its relatively low-clocked CPU bus. The Athlon64 has in theory half the memory bandwidth of its predecessor, since it has only a single-channel memory controller, but thanks to a very efficient and fast controller it can keep up with the AthlonXP.
In this test the results were not noticeably affected by the memory timings, because the Hammer architecture’s integrated memory controller does not benefit much from low access times.
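How efficient each memory subsystem is can be estimated by comparing the measured figures with the theoretical peak of DDR400. The sketch below makes two assumptions of ours: that Sandra reports MB as 10^6 bytes, and that the Pentium 4 platform (Intel 875P) runs its DDR400 in dual-channel mode. The measured values are the buffered integer results from the chart above.

```python
def ddr_peak_mbs(mem_clock_mhz: int, channels: int, width_bits: int = 64) -> int:
    """Theoretical peak in MB/s: two transfers per clock x bytes per transfer x channels."""
    return mem_clock_mhz * 2 * (width_bits // 8) * channels


def efficiency(measured_mbs: float, channels: int, mem_clock_mhz: int = 200) -> float:
    """Measured bandwidth as a fraction of the theoretical DDR peak."""
    return measured_mbs / ddr_peak_mbs(mem_clock_mhz, channels)


# (measured MB/s, number of DDR400 channels); the channel counts are assumptions.
measured = {
    "P4 2.8 GHz": (4952, 2),
    "Athlon64 3200+": (3005, 1),
    "AthlonXP 3200+": (2995, 2),
}

for cpu, (mbs, channels) in measured.items():
    peak = ddr_peak_mbs(200, channels)
    print(f"{cpu}: {mbs} of {peak} MB/s = {efficiency(mbs, channels):.0%}")
```

Under these assumptions the Athlon64 gets very close to the peak of its single channel, while the AthlonXP uses less than half of what its two channels could deliver, which fits the explanation that its CPU bus is the limiting factor.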

SiSoft Sandra Memory Bandwidth, Unbuffered, Int (MB/s)
P4 2.8 GHz: 2711
P4 EE 3.2 GHz: 2693
P4 EE 2.8 GHz: 2680
Athlon64 3200+: 1823
AthlonXP 3200+: 1382

SiSoft Sandra Memory Bandwidth, Unbuffered, Float (MB/s)
P4 EE 3.2 GHz: 2836
P4 EE 2.8 GHz: 2757
P4 2.8 GHz: 2728
Athlon64 3200+: 1841
AthlonXP 3200+: 1498

In the unbuffered test the results are more affected by memory timings and fast access times. Here we also see the Athlon64 pull past its predecessor by a good margin despite its lower theoretical bandwidth. There is nothing to be done about the Pentium 4 platform, with its high memory bandwidth and fast CPU bus. We would gladly have seen the Athlon64 FX in this test, though, since its dual-channel memory controller would have made things a bit more difficult for Intel.

In our coming memory/chipset article we will look closer at latency tests to put the Athlon64 CPU’s integrated memory controller to the test, but now we move on to some game tests.


First out among the game tests is Aquamark 3, which is one of the few DirectX 9.0 tests.

Aquamark 3, Default run
P4 EE 3.2 GHz: 48017
Athlon64 3200+: 46250
P4 EE 2.8 GHz: 46015
P4 2.8 GHz: 45673
AthlonXP 3200+: 45318

The results are quite even overall in this test, which loads the graphics card heavily, something that is reflected in the results. The Athlon64 manages to squeeze in between the two Pentium 4 EE CPUs in second place.

Comanche 4, 1024×768 (fps)
P4 EE 3.2 GHz: 75.67
P4 EE 2.8 GHz: 67.66
Athlon64 3200+: 61.6
P4 2.8 GHz: 59.21
AthlonXP 3200+: 58

Comanche 4 has always been a test with cruel system demands, where high CPU clock frequency and memory bandwidth are important ingredients. Despite its efficient memory controller and larger cache, the Athlon 64 is beaten by a painful margin by both Pentium 4 EE models. It is clear that Comanche 4 loves cache, and the Athlon64’s 1 MB L2 cache is not enough in this case. The Athlon64’s memory bandwidth, which is low in this context, is probably also a major factor, since there is actually not much difference between the Athlon64 and the AthlonXP in this test.

UT2003 BotMatch demo, 1024×768, Default run (fps)
Athlon64 3200+: 103.2
P4 EE 3.2 GHz: 96.5
AthlonXP 3200+: 87.3
P4 EE 2.8 GHz: 86.1
P4 2.8 GHz: 79.9

Unreal Tournament 2003 is also very hungry for memory bandwidth, but judging from the results it is low latency that is the favorite recipe. The Athlon CPUs have admittedly always been strong in UT2003, but here the Athlon64 takes a clear victory over Intel’s flagship.

Quake 3, 1024×768, Demo Four, MaxQ (fps)
P4 EE 3.2 GHz: 457.5
P4 EE 2.8 GHz: 420
Athlon64 3200+: 395.2
P4 2.8 GHz: 370
AthlonXP 3200+: 342

Perhaps the most cache-hungry test in our suite is the popular Quake 3. Unlike UT2003, this is a test in which Intel has always done well, and the Extreme Edition pulverizes the competition, even the Athlon64 3200+. Here too it would be interesting to see what AMD’s Athlon 64 FX could do with its high memory bandwidth.

Now it is time for some synthetic tests, in other words 3DMark.

3DMark2001, Default run
P4 EE 3.2 GHz: 19635
Athlon64 3200+: 18711
P4 EE 2.8 GHz: 18222
AthlonXP 3200+: 17217
P4 2.8 GHz: 16797

In our review of the Pentium 4 Extreme Edition, 3DMark2001 was another test that responded very well to the added L3 cache, so the outcome of this test is hardly a surprise: the Pentium 4 3.2 GHz EE at the top with the Athlon64 3200+ chasing in second place, though fewer than 1000 points separate them. We also see that the Athlon64 3200+, despite a lower clock frequency, gives its predecessor the AthlonXP 3200+ a sound beating. This is simply the integrated memory controller and the larger cache making themselves useful, but unfortunately some memory bandwidth is still missing to knock Intel off the throne.

3DMark03, Default run
P4 EE 3.2 GHz: 6871
Athlon64 3200+: 6753
P4 EE 2.8 GHz: 6651
P4 2.8 GHz: 6526
AthlonXP 3200+: 5225

Just like Aquamark 3, the other DirectX 9.0 test, 3DMark03 is heavily dependent on the graphics card, and the only CPU that cannot keep up all the way is the AthlonXP 3200+.
In other words, 3DMark03 is ridiculously graphics card limited and hardly suitable as a CPU test. It is still important to show this, since coming games will behave more and more like this as the graphics card’s pixel and vertex shaders account for most of the performance limitations in the future. Futuremark has, however, been kind enough to build two CPU tests into 3DMark03, so let us have a look at the results from these:

3DMark03, CPU Test 1 (fps)
P4 EE 3.2 GHz: 105.9
Athlon64 3200+: 104.1
P4 EE 2.8 GHz: 92.8
P4 2.8 GHz: 80.1
AthlonXP 3200+: 79.9

3DMark03, CPU Test 2 (fps)
P4 EE 3.2 GHz: 16.7
P4 EE 2.8 GHz: 15
Athlon64 3200+: 14.5
P4 2.8 GHz: 14.1
AthlonXP 3200+: 13.4

The Athlon64 3200+ keeps up well in the CPU tests, especially in Test 1, where the difference in favor of the 3.2 GHz EE is very small.

Let’s move on to some everyday applications.


Audio and video encoding has always been one of Intel’s strong sides, much thanks to its SSE/SSE2 instructions, which with the right code can greatly improve performance in multimedia applications. AMD now also supports SSE and SSE2 instructions in its Athlon64 CPUs, so it will be interesting to see what results we get in these tests. First out we have a video clip in MPEG-1 format which we re-encode to DivX 5.1 at 780 kbps. We chose not to encode the audio in this test.

Video encoding, DivX 5.1, 780 kbps (fps)
P4 EE 3.2 GHz: 40
Athlon64 3200+: 37.5
P4 EE 2.8 GHz: 35.4
P4 2.8 GHz: 33.5
AthlonXP 3200+: 31

Again we see that cache makes a difference to performance, and the Athlon64 3200+ is passed by Intel’s flagship. Performance is much lower still on its predecessor, the AthlonXP 3200+.

Audio encoding, MP3, 192 kbps (sec)
P4 EE 3.2 GHz: 189
Athlon64 3200+: 205
AthlonXP 3200+: 213
P4 EE 2.8 GHz: 216
P4 2.8 GHz: 217

To test how well the CPU handles audio encoding, we take a 200 MB wave file and turn it into an MP3 at 192 kbps. This test puts a lot of focus on pure CPU speed, and Intel’s 3.2 GHz EE stands as the clear winner. It is worth noting that the Athlon64 3200+, despite its lower clock frequency, again leaves the AthlonXP 3200+ behind by a noticeable margin.

Next we take the same wave file but instead compress it at the highest compression setting with WinAce.

WinAce, Highest compression (sec)
P4 EE 3.2 GHz: 152
Athlon64 3200+: 161
P4 EE 2.8 GHz: 164
P4 2.8 GHz: 187
AthlonXP 3200+: 187

WinAce, and file compression in general, relies heavily on the CPU’s ability to store large amounts of data, which puts a lot of pressure on the cache and the system memory. With help from its cache and high bandwidth the Pentium 4 EE comes out of this battle victorious, but the Athlon64 shows very good potential.

PCMark2002, CPU Score
P4 EE 3.2 GHz: 7842
P4 EE 2.8 GHz: 6963
P4 2.8 GHz: 6844
AthlonXP 3200+: 6839
Athlon64 3200+: 6469

PCMark2002, Memory Score
P4 EE 3.2 GHz: 14092
P4 EE 2.8 GHz: 12856
P4 2.8 GHz: 9250
Athlon64 3200+: 8910
AthlonXP 3200+: 6749

Above we find some synthetic tests from PCMark2002, which measures general ”office performance”. We would actually have preferred to use Sysmark here, but we are not keen on paying that much for a synthetic test when there are free ones. As we can see, the CPU test depends almost exclusively on the number of MHz. The memory test, however, shows completely different results.

The memory test in PCMark2002 could almost be left out, since it undoubtedly pays too much attention to cache, which means the EE models’ enormous L3 cache makes them unbeatable. But we included it this time to see the effect of the Athlon64 CPU’s larger L2 cache. As expected it gave a nice boost in this test, but memory bandwidth is still a bottleneck, so even Intel’s ”ordinary” Pentium 4 2.8 GHz performs somewhat better here.

Let us move on to some more workstation-related tests in the form of SPECviewperf 7.1.


The advanced 3D tests are performed using SPECviewperf 7.1. For those of you not familiar with SPECviewperf, it is a test program containing 3D renderings from 6 different professional 3D programs available on the market today. The tests performed are as follows:

  • These tests are run in windowed mode and are therefore affected by the system’s desktop resolution. During our tests we used a resolution of 1024x768 at 32-bit color and a refresh rate of 90 Hz.

The last time we compared CPUs in SPECviewperf was in our Pentium 4 EE review, where Intel ran over the AthlonXP 3200+. We knew from the beginning that Intel was very strong in these tests, which is proved to us once more. But the Athlon64 3200+ still puts AMD in a better position and does really well in some tests, even ones where the AthlonXP 3200+ falls behind.
3D rendering measures the performance of the whole system; apart from the graphics card, pure CPU power combined with memory bandwidth is very important.
The Athlon64 3200+ performs much better than the AthlonXP 3200+, but the Intel Pentium 4 3.2 GHz EE is the victor.

Summary of the performance
In general we have seen that the AMD Athlon64 3200+ is without any doubt a faster processor than the AthlonXP 3200+. Even if the differences are not always astounding, they are still noticeable. The tests where the Athlon64 3200+ has difficulty competing are (as expected) those where memory bandwidth becomes a crucial factor. In today’s applications high memory bandwidth is almost always needed, and we therefore have a feeling that the Athlon 64 FX can deliver a performance gain over the AthlonXP 3200+ in more or less every program.
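To put numbers on the generational gain, the sketch below computes how much faster the Athlon64 3200+ is than the AthlonXP 3200+ in a selection of the tests in this review. The figures are copied from the charts on the previous pages; the selection of tests and the helper function are our own, and timed tests (where lower is better) are inverted so that a positive number always means the Athlon64 is ahead.

```python
def gain_percent(new: float, old: float, higher_is_better: bool = True) -> float:
    """Performance gain of `new` over `old` in percent (positive = new is faster)."""
    ratio = new / old if higher_is_better else old / new
    return (ratio - 1) * 100


# test name: (Athlon64 3200+, AthlonXP 3200+, higher_is_better)
results = {
    "Sandra unbuffered int (MB/s)": (1823, 1382, True),
    "Comanche 4 (fps)": (61.6, 58, True),
    "UT2003 BotMatch (fps)": (103.2, 87.3, True),
    "Quake 3 (fps)": (395.2, 342, True),
    "3DMark2001": (18711, 17217, True),
    "DivX 5.1 (fps)": (37.5, 31, True),
    "MP3 encoding (sec)": (205, 213, False),
    "WinAce (sec)": (161, 187, False),
    "PCMark2002 CPU": (6469, 6839, True),
}

for test, (athlon64, athlonxp, higher) in results.items():
    print(f"{test:30s} {gain_percent(athlon64, athlonxp, higher):+6.1f} %")
```

In this selection the Athlon64 comes out ahead in every test except the PCMark2002 CPU score, where the AthlonXP keeps a small lead.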

Compared with the Pentium 4 platform, the Athlon64 3200+ stands up very well from a performance point of view. Even though we have not been able to test its ”real” competitor, the Pentium 4 3.2 GHz C, we have a hunch that a fight between these two would be very interesting.
One CPU that does perform better than the Athlon 64 in this review is Intel’s new flagship, the Pentium 4 3.2 GHz Extreme Edition, a very fast CPU that not even the Athlon64 can master. Why this is rather uninteresting we will soon tell you.


When Intel released its NetBurst architecture the company was very clear about pointing out its future potential. That is not so strange, since the first Pentium 4 models, despite their high clock frequencies, had a very difficult time standing up to the Athlon and Pentium 3 CPUs of the day. Intel promised that the NetBurst architecture would show a whole new side once the architecture had matured and the clock frequencies had increased. Today we know that Intel was more right than many believed, since the Pentium 4 platform today, about 3 years after its release, is one of the market’s hottest PC platforms. With a more efficient manufacturing process and relatively small changes to the core to improve its IPC (loosely translated: performance per clock cycle), such as a larger cache and a faster processor bus, Intel has managed to keep the Pentium 4 CPUs at the top of the market for a long time.

    AMD, for their part, have often relied on a “here and now” approach, thinking perhaps less about the future and aiming squarely at the present. This approach has also proven very effective, and through pure raw power AMD have stayed clearly competitive with their K7 architecture (Athlon). Things have become more difficult for AMD, however, as the K7 architecture has shown more and more signs of ageing while the Pentium 4 has continued to develop.
    AMD's K8 architecture has been a subject of discussion for several years, and now that it finally arrives to take over from K7 in the top segment of the PC market we get something of a déjà vu.
    Just like NetBurst, the K8 architecture has several properties that point to very effective performance scaling. Among these we can of course mention support for 64bit software, the HyperTransport bus and an integrated memory controller.
    The 64bit architecture is of course a great feature for the future, but not many consumers will benefit from it today. Even though we have gone through the 64bit architecture in this review, we have refrained from publishing any 64bit performance tests. We are saving these for coming articles, when we have become more familiar with the CPU and Microsoft has come further with their 64bit version of Windows.

    The integrated memory controller of the Athlon64 also has great future potential, since it is not only the pure CPU power that increases as the clock frequency of the CPU goes up. Memory performance will also get an extra kick with every clock frequency update, and when AMD moves to 0.09 micron the K8 architecture's performance scaling will show its true face. Unfortunately we had no opportunity to test overclocking on any larger scale, but we promise that such tests will come in the near future, and then on several different mainboard platforms.

    It is clear that AMD have not bet their money on the future alone, since Athlon64 already holds up very well against the competition today. One proof of this is that Intel made a debatable new launch of their Pentium 4 CPUs in response to AMD's launch of Athlon64/64 FX. The Intel Pentium 4 Extreme Edition is a Pentium 4 3.2 GHz C on steroids, and with its enormous L3 cache Intel can still keep AMD behind them.
    But at what price, one may ask? Well, one of the highest prices in the history of the PC market.

    The Intel Pentium 4 3.2 GHz Extreme Edition can today be found (if you are lucky) in Sweden for about 840 Euros, while Athlon64 3200+ can be found for about 350 Euros.
    Even if Intel wins the performance round in the EE vs. A64 comparison, the two CPUs are not even in the same league when it comes to value for money. In our eyes the Intel Pentium 4 3.2 GHz Extreme Edition offers no value for money at all for an ordinary consumer.
    Athlon64 3200+ is actually a relatively expensive CPU too, but despite this Intel's Pentium 4 3.2 GHz Extreme Edition is about 2.4 times as expensive (!).
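
    As a quick sanity check of the price gap, the two approximate Swedish retail prices quoted above can be plugged into a few lines of Python. This is only a sketch: the constant and function names are our own, chosen for illustration, and the figures are the rounded prices from the text rather than exact quotes.

```python
# Approximate Swedish retail prices quoted in the review, in Euros.
PRICE_P4_EE_EUR = 840      # Intel Pentium 4 3.2 GHz Extreme Edition
PRICE_A64_3200_EUR = 350   # AMD Athlon64 3200+


def price_ratio(expensive: float, cheaper: float) -> float:
    """Return how many times more expensive the first item is than the second."""
    return expensive / cheaper


ratio = price_ratio(PRICE_P4_EE_EUR, PRICE_A64_3200_EUR)
premium = PRICE_P4_EE_EUR - PRICE_A64_3200_EUR

print(f"Price ratio: {ratio:.2f}x")
print(f"Premium for the Extreme Edition: {premium} Euros")
```

    With these figures the Extreme Edition costs 2.4 times as much as Athlon64 3200+, a premium of 490 Euros.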

    In our view AMD Athlon64 has already shown itself to be a worthy competitor on the CPU market, and it only looks set to get better. With more mature platforms and refined manufacturing technology, Athlon64 will be a great threat to Intel on the desktop market for a long time.

    That was all we had to say this time, but we promise that more Athlon64 articles will follow in the near future here at NordicHardware.
