As I mentioned in my last article, branches suck on modern CPUs. Any excuse to remove excess branches is almost always a win.
The first step in getting rid of branches, is looking at your "C" code and finding all the places that code flow can change. A very common test is where a character is tested against uppercase and lowercase to find a string match.
int temp = pInput[0];
if (temp=='a' || temp=='A') {
DoSomething();
}
In that code fragment, two tests were performed, and two branches were generated, one for the 'a' test and another for the 'A' test. Now, what actually differs between the tests? A is hex 0x41 and a is hex 0x61. Only one bit is different, 0x20. So, if the value in temp could be altered so logically only the other 7 bits are tested for equality and the bit for 0x20 was considered a "don't care", both values could be tested for at the same time and only one branch is required.
int temp = pInput[0];
temp ^= 'A'; // Normalize so 'A' -> 0x00 and 'a' -> 0x20
if (!(temp&(~0x20))) { // Is everything but 0x20 zero (It's A or a if this test yields zero)
DoSomething();
}
With one additional logic instruction, a branch is removed increasing the chances of a successful branch prediction. This is crucial on lower end CPUs where the branch prediction unit has very limited branch prediction slots available. Also, since it's a single test for true or not true, it reduced branch randomization which plays havoc on any branch prediction unit, since they are all designed with consistent branch behavior in mind.
The first step in getting rid of branches, is looking at your "C" code and finding all the places that code flow can change. A very common test is where a character is tested against uppercase and lowercase to find a string match.
int temp = pInput[0];
if (temp=='a' || temp=='A') {
DoSomething();
}
In that code fragment, two tests were performed, and two branches were generated, one for the 'a' test and another for the 'A' test. Now, what actually differs between the tests? A is hex 0x41 and a is hex 0x61. Only one bit is different, 0x20. So, if the value in temp could be altered so logically only the other 7 bits are tested for equality and the bit for 0x20 was considered a "don't care", both values could be tested for at the same time and only one branch is required.
int temp = pInput[0];
temp ^= 'A'; // Normalize so 'A' -> 0x00 and 'a' -> 0x20
if (!(temp&(~0x20))) { // Is everything but 0x20 zero (It's A or a if this test yields zero)
DoSomething();
}
With one additional logic instruction, a branch is removed increasing the chances of a successful branch prediction. This is crucial on lower end CPUs where the branch prediction unit has very limited branch prediction slots available. Also, since it's a single test for true or not true, it reduced branch randomization which plays havoc on any branch prediction unit, since they are all designed with consistent branch behavior in mind.
- Mood:geeky
Once upon a time, branches were good. The way you knew your code was fast was by how few instructions it took to get the job done.
Not anymore.
Most modern CPUs use very deep pipelines, and as such, they favor long uninterrupted instruction streams that has no branching to be able to squeeze out the performance they offer.
A mispredicted branch can cost you anywhere from 6 to 40 machine cycles and large amounts of hardware exist to try to perform deep juju to predict the behavior of the code in an attempt to minimize the penaties.
Here are some code fragments, and ways in which they can be done in a branchless manner.
int abs(int iInputt)
{
return (iInputt <0) ? -iInputt : iInputt;
}
Normally, that generates a test for (iInputt<0) and branches over the -iInput if positive. If the test is always the same, branch prediction minimizes the penalty. However, there is a way of doing this without branches.
int abs(int iInput)
{
// Create a mask. 0 if positve. 0xFFFFFFFF if negative
int iMask = iInput>>((sizeof(int)*8)-1); // Usually 31 bits to shift
iInput = (iInput^iMask)-iMask; // Do nothing for positive, negate if negative
return iInput;
}
There, no branches. It works by taking the sign bit and creating a mask of either 0 or all ones (0xFFFFFFFF). The logical form of negation is ((i^0xFFFFFFFF)+1) = -i and if we take that formula and insert zero into it ((i^0)+0)=i it does nothing. Since we don't have a one, but we have a negative one the formula is changed to ((i^0xFFFFFFFF)-(-1) = -i or ((i^0)-0 = i, now it yields a branchless formula to perform abs(i) = ((i^mask)-mask).
Easy huh?
Not anymore.
Most modern CPUs use very deep pipelines, and as such, they favor long uninterrupted instruction streams that has no branching to be able to squeeze out the performance they offer.
A mispredicted branch can cost you anywhere from 6 to 40 machine cycles and large amounts of hardware exist to try to perform deep juju to predict the behavior of the code in an attempt to minimize the penaties.
Here are some code fragments, and ways in which they can be done in a branchless manner.
int abs(int iInputt)
{
return (iInputt <0) ? -iInputt : iInputt;
}
Normally, that generates a test for (iInputt<0) and branches over the -iInput if positive. If the test is always the same, branch prediction minimizes the penalty. However, there is a way of doing this without branches.
int abs(int iInput)
{
// Create a mask. 0 if positve. 0xFFFFFFFF if negative
int iMask = iInput>>((sizeof(int)*8)-1); // Usually 31 bits to shift
iInput = (iInput^iMask)-iMask; // Do nothing for positive, negate if negative
return iInput;
}
There, no branches. It works by taking the sign bit and creating a mask of either 0 or all ones (0xFFFFFFFF). The logical form of negation is ((i^0xFFFFFFFF)+1) = -i and if we take that formula and insert zero into it ((i^0)+0)=i it does nothing. Since we don't have a one, but we have a negative one the formula is changed to ((i^0xFFFFFFFF)-(-1) = -i or ((i^0)-0 = i, now it yields a branchless formula to perform abs(i) = ((i^mask)-mask).
Easy huh?
- Mood:geeky
- Music:Hare Hare Yukai-Hirano Aya & Chihara Minori &-Hare Hare Yukai
I use a Perforce server for maintaining my source code and it was a lifesaver. Over the last few weeks, I've been making extensive upgrades to Burgerlib, inserting hand tuned assembly here and there and general clean up. I had several unit tests to verify my work and it's been working out great.
Until last Wednesday...
I ran numerous tests on the game Aliens vs. Predator to make sure that nothing was broken and it was running flawlessly. So I switched over to the Mac PowerPC version and ran it and to my horror, I found the 3D rendering engine was drawing all the textures black. A quick look at the source history showed me that I had made hundreds of changes to the code and finding out why the gamma tables were all screwed up was going to be a nightmare.
Then Perforce came to the rescue. I did a "revert" and went back in time, grabbing a source tree revision, rebuilding and testing until I found that it was a change made about 6 weeks ago that broke AvP MacOS PowerPC. With the possible suspects narrowed down, I quickly traced it to an inline assembly gotcha that happens with the Metrowerks CodeWarrior compiler for PowerPC. Shortly, I found that my Wii version also suffered from the compiler "glitch".
Here's the errorneous code...
static Int32 BURGER_INLINE FromFloatFloor(register float fInput)
{
Int32 iResult;
fInput = fInput-0.5f;
register Int32 *pResult = &iResult;
asm { fctiw fInput,fInput; stfiwx fInput,r0,pResult }
return iResult;
}
In a controlled unit test, this code generated the assembly I was expecting, but in a gamma table creation function in Aliens Vs. Predator, the "return iResult" operation was optimized out and as such, it was filling the table with whatever was in register r3 at the time (In this case, it was always zero). Gamma = 0 means textures are all black.
The fix was to force the compiler to never get rid of iResult and here's where "volatile" has its uses.
static Int32 BURGER_INLINE FromFloatFloor(register float fInput)
{
volatile Int32 iResult;
fInput = fInput-0.5f;
register volatile Int32 *pResult = &iResult;
asm { fctiw fInput,fInput; stfiwx fInput,r0,pResult }
return iResult;
}
Adding the keyword fixed the code and I double check the Wii compiler and it solved the issue there too.
Lesson learned... No matter how many safeguards you put in your development process, something always gets through. Being able to "go back in time" through source reversion saved me hours, even days, of crazy coding detective work.
Until last Wednesday...
I ran numerous tests on the game Aliens vs. Predator to make sure that nothing was broken and it was running flawlessly. So I switched over to the Mac PowerPC version and ran it and to my horror, I found the 3D rendering engine was drawing all the textures black. A quick look at the source history showed me that I had made hundreds of changes to the code and finding out why the gamma tables were all screwed up was going to be a nightmare.
Then Perforce came to the rescue. I did a "revert" and went back in time, grabbing a source tree revision, rebuilding and testing until I found that it was a change made about 6 weeks ago that broke AvP MacOS PowerPC. With the possible suspects narrowed down, I quickly traced it to an inline assembly gotcha that happens with the Metrowerks CodeWarrior compiler for PowerPC. Shortly, I found that my Wii version also suffered from the compiler "glitch".
Here's the errorneous code...
static Int32 BURGER_INLINE FromFloatFloor(register float fInput)
{
Int32 iResult;
fInput = fInput-0.5f;
register Int32 *pResult = &iResult;
asm { fctiw fInput,fInput; stfiwx fInput,r0,pResult }
return iResult;
}
In a controlled unit test, this code generated the assembly I was expecting, but in a gamma table creation function in Aliens Vs. Predator, the "return iResult" operation was optimized out and as such, it was filling the table with whatever was in register r3 at the time (In this case, it was always zero). Gamma = 0 means textures are all black.
The fix was to force the compiler to never get rid of iResult and here's where "volatile" has its uses.
static Int32 BURGER_INLINE FromFloatFloor(register float fInput)
{
volatile Int32 iResult;
fInput = fInput-0.5f;
register volatile Int32 *pResult = &iResult;
asm { fctiw fInput,fInput; stfiwx fInput,r0,pResult }
return iResult;
}
Adding the keyword fixed the code and I double check the Wii compiler and it solved the issue there too.
Lesson learned... No matter how many safeguards you put in your development process, something always gets through. Being able to "go back in time" through source reversion saved me hours, even days, of crazy coding detective work.
- Mood:geeky
- Music:Hare Hare Yukai-Hirano Aya & Chihara Minori &-Hare Hare Yukai
