The Science of AVX

What is Intel AVX Parallel Processing?

AVX stands for Advanced Vector Extensions. It is an extension to the x86 instruction set, introduced by Intel, which lets a single instruction operate on several values at once. This is a form of parallel processing.

When Intel first introduced it with Sandy Bridge in early 2011, it doubled the width of the vector registers from the 128 bits of the earlier SSE instructions to 256 bits. Since 2011 Intel has built this capability into most of its mainstream consumer processors.

With the advent of AVX1 the processor gained a set of 256-bit vector registers connected to the instruction logic. Each register is divided into eight lanes arranged in parallel, and each lane holds one 32-bit floating point number, so a single register can hold eight floats with a different value in each lane. AVX2 adds the same 256-bit capability for integer instructions. In both cases the floats and integers are the normal 32 bits in width, which is why eight of them fill the 256-bit register.

The beauty of this mechanism is that because all eight lanes are driven by the same instruction, any vector instruction executed by the processor is performed on all eight values at once instead of on just one value at a time.

That is how this kind of parallel processing works. It is commonly called SIMD, which stands for single instruction, multiple data.
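As a rough illustration of "one instruction, eight values", here is a minimal C++ sketch. The function name add_arrays is made up for this example. It uses the real AVX intrinsics _mm256_loadu_ps, _mm256_add_ps and _mm256_storeu_ps from <immintrin.h> when the compiler is targeting AVX, and falls back to an ordinary loop otherwise so that it builds anywhere.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Adds two float arrays, eight elements per step.
// Precondition (to keep the sketch short): n is a multiple of 8.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8) {
#if defined(__AVX__)
        // One 256-bit load per input, one add for all eight lanes, one store.
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
#else
        // Scalar fallback: the same work done one value at a time.
        for (std::size_t k = 0; k < 8; ++k) {
            out[i + k] = a[i + k] + b[i + k];
        }
#endif
    }
}
```

With AVX enabled in the build (for example -mavx on GCC or Clang, /arch:AVX on Visual Studio), the inner work for each group of eight floats is a single add instruction.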

Parallel processing of this kind should not be confused with multiple cores. Intel processors usually contain both technologies. A processor with multiple cores is simply a piece of hardware containing multiple independent processor units which are fabricated on the same substrate and share some common resources. In this sense a core is simply more of the same. It is not really a new design.

On the other hand, an architecture in which a single instruction operates on multiple data values at once represents a different kind of parallel processor and constitutes the foundation of a completely different programming model.

How does AVX impact speed in Fathom Vector?

Fathom Vector uses Intel AVX to process all eight detune channels in each synthesizer voice simultaneously.

This means that if you use Fathom Vector with Intel AVX enabled, the CPU load for any number of detune voices up to eight is the same as for two detune voices in Fathom Pro. You essentially get the extra detune voices at no additional CPU cost.

Fathom Vector processes all eight detune voices at once instead of in a loop with eight iterations.
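To make the idea concrete, here is a minimal sketch in portable C++. It is not Fathom's actual code: the names DetuneBank, set_frequencies and advance are invented for illustration, and the real product uses raw AVX instructions rather than a plain loop. The point is the data layout. The phase and the per-sample increment of all eight detune channels sit side by side, so one step applies the same operation to every channel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kDetuneChannels = 8;

// Hypothetical oscillator state: one slot per detune channel.
struct DetuneBank {
    std::array<float, kDetuneChannels> phase{};      // position in the cycle, 0 to 1
    std::array<float, kDetuneChannels> increment{};  // cycles advanced per sample
};

// Done once, for example when a preset is loaded.
// Assumes every frequency is below the sample rate.
void set_frequencies(DetuneBank& bank,
                     const std::array<float, kDetuneChannels>& hz,
                     float sample_rate) {
    assert(sample_rate > 0.0f);
    for (std::size_t i = 0; i < kDetuneChannels; ++i) {
        bank.increment[i] = hz[i] / sample_rate;
    }
}

// Done once per sample: the same add and wrap on all eight channels.
// With AVX this fixed-size loop corresponds to a handful of vector instructions.
void advance(DetuneBank& bank) {
    for (std::size_t i = 0; i < kDetuneChannels; ++i) {
        float p = bank.phase[i] + bank.increment[i];
        if (p >= 1.0f) {
            p -= 1.0f;  // wrap back into the 0 to 1 range
        }
        bank.phase[i] = p;
    }
}
```

Because the loop always covers all eight slots, the cost of advance is the same whether two or eight of the channels are audible.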

Intel AVX itself does not improve the speed of other operations such as filters and effects.

That being said, the Fathom Vector release also includes substantial improvements to the general performance and efficiency of Fathom’s processing code. So you will notice a slight improvement in CPU performance even if you do not have Intel AVX enabled.

Why haven’t I heard about this?

If you are a programmer you probably have, since it has been around a long time.

Many consumers are unaware of it because very few programmers really understand how to program it correctly, and therefore the capability rarely ends up in consumer software products.

Most programmers of course understand the theory, but designing software to use it correctly requires a completely different approach from ordinary sequential programming. This is because parallel processing is not a “speed sauce” which you can sprinkle over your code to make it faster. It can’t simply be added to existing code.

If you take the average computer program and try to load all the data in a typical function or algorithm into parallel registers and then execute those instructions and pull out the results, the total number of instructions required will almost certainly be equal to or greater than the number of instructions you had in the first place.

In fact your code could actually run slower.
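The sketch below shows why a retrofit can cost more than it saves. It is an illustrative example only, and the names Voice, VoiceBank, scale_gain_retrofit and scale_gain_bank are hypothetical. In the first function the data lives in ordinary per-voice structs, so each value has to be copied into a packed group of eight, processed, and copied back. In the second the data is already stored eight wide, and the packing and unpacking steps disappear.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 8;

// Conventional layout: each voice keeps its own fields together.
struct Voice {
    float frequency;
    float phase;
    float gain;
};

// Parallel layout: each field is stored as eight contiguous floats.
struct VoiceBank {
    std::array<float, kLanes> frequency{};
    std::array<float, kLanes> phase{};
    std::array<float, kLanes> gain{};
};

// Retrofit: 8 scalar loads to pack, one useful multiply step,
// then 8 scalar stores to unpack.
void scale_gain_retrofit(std::array<Voice, kLanes>& voices, float factor) {
    assert(factor >= 0.0f);  // gains are assumed non-negative in this sketch
    std::array<float, kLanes> packed{};
    for (std::size_t i = 0; i < kLanes; ++i) packed[i] = voices[i].gain;
    for (std::size_t i = 0; i < kLanes; ++i) packed[i] *= factor;
    for (std::size_t i = 0; i < kLanes; ++i) voices[i].gain = packed[i];
}

// Designed for it: the eight gains are already side by side,
// so the multiply step is all that remains.
void scale_gain_bank(VoiceBank& bank, float factor) {
    assert(factor >= 0.0f);
    for (std::size_t i = 0; i < kLanes; ++i) bank.gain[i] *= factor;
}
```

Both functions produce the same gains. The difference is how much data shuffling surrounds the one step that actually benefits from the eight-wide hardware.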

In addition, if you enable automatic vectorization in a compiler such as the one in Visual Studio, thinking that the compiler will create faster code for you automatically, more often than not you will be disappointed.

The Visual Studio compiler can’t redesign your code for you. If you try this option, very little if any of your code will in fact be converted to useful AVX instructions by the compiler. The only way to do vector processing correctly is to use the raw Intel instructions and convert your entire design to parallel code so that all the registers can be loaded and unloaded efficiently at certain critical junctures. This makes it possible to run massive portions of your logic in parallel instead of doing it in loops.

This is the only way to do it properly. It often requires a tremendous amount of work to take linear code and completely alter the design in order to take advantage of the Intel architecture.

If it is done correctly, the vectorized sections of a program using 256-bit AVX can run up to eight times faster, because each instruction performs eight 32-bit operations instead of one. This only works effectively if the entire design is planned out in advance so that large sections of your logic can be loaded into the registers beforehand.

If, on the other hand, you try to load the registers carelessly as they are needed you will produce a disaster that might actually run slower than it did before.
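How much of that gain shows up in the whole program depends on how much of the work actually runs eight wide. A common rule of thumb for this is Amdahl's law, sketched below. The function name estimated_speedup is made up for this example, and the formula deliberately ignores the cost of loading and unloading the registers, so real results will be lower.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction of the work
// (vector_fraction, between 0 and 1) runs `lanes` times faster
// and the rest runs at the original speed.
double estimated_speedup(double vector_fraction, double lanes) {
    assert(vector_fraction >= 0.0 && vector_fraction <= 1.0);
    assert(lanes >= 1.0);
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes);
}
```

For example, estimated_speedup(1.0, 8.0) gives the full factor of eight, but estimated_speedup(0.5, 8.0) gives less than 1.8, which is why the parallel sections need to cover most of the processing.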

Because of these complications, most companies outside of music software and gaming never convert their code to parallel processing and their customers never hear about it even though almost every computer they own has the capability.

Why is AVX important in music software?

A few years ago revolutionary companies like Spectrasonics and Lennar Digital figured out how to do real parallel processing in their code by recognizing that a software synthesizer has many operations which are truly parallel in character.

Examples include processing the exact same wave data on 32 voices at once, or running the same wave data through eight detune channels in each voice with a different frequency value in each channel.

These are perfect opportunities for parallel processing. However, it needs to be incorporated into the design itself since attempting to add it in afterwards to code which was not designed for it is often pointless.

Any series of operations which applies the same logic to different streams of data is a good candidate for AVX, provided that the data can be maintained in the parallel state throughout large sections of the design. If those sections are too small then the time required to load and unload the registers outweighs the speed gain and you are simply out of luck.
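Here is a small generic sketch of "same logic, different streams". It is not taken from Fathom, and the names Lanes and smooth_step are invented for the example. Each stream runs a simple smoothing recurrence in which every output depends on the previous output of the same stream, so it cannot be split up along time. The eight streams are independent of one another, though, so the same step can be applied across all eight at once while the state stays in its eight-wide form from one sample to the next.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Lanes = std::array<float, 8>;  // one value per stream

// One time step of y = y + coeff * (x - y), applied to eight
// independent streams. `state` holds y for each stream and stays
// in this eight-wide layout between calls.
void smooth_step(Lanes& state, const Lanes& input, float coeff) {
    assert(coeff >= 0.0f && coeff <= 1.0f);
    for (std::size_t i = 0; i < state.size(); ++i) {
        state[i] += coeff * (input[i] - state[i]);
    }
}
```

Calling smooth_step once per sample keeps all eight streams moving together without ever unpacking them.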

Companies like Lennar and Spectrasonics rewrote their code correctly and made their synthesizers run up to four times faster using four-wide SIMD (SSE), and then when AVX2 came out, up to eight times faster.

Fathom’s code is perfectly suited for parallel processing since it has eight detune channels for each polyphonic voice. This is the ideal situation for parallel processing. In this scenario the waveform and frequency data can be loaded into the eight parallel lanes of the 256-bit registers when a preset is loaded, and then each operation needed to run the detune voices can be executed by the processor on all eight channels at once.

The challenge, though, was that all of Fathom’s oscillator code needed to be completely rewritten, and there were thousands of lines of code. All the oscillators needed to be redesigned and the code changed from C++ to Intel AVX assembly language using the raw AVX instructions.

This is why Fathom Vector took over a year of full-time development to implement, including some of the other advanced features.

Fathom Vector makes it possible for you to add any of the oscillators, turn on AVX processing and crank up the number of detune voices from two to eight and there will be little or no change in processor load. This is because when AVX is enabled Fathom processes all eight detune channels at once instead of in a loop with eight iterations.