Creating a 3D Game Engine (Part 13)

I don’t have much time, so I will be brief. For the past few days I have been trying to optimize the engine. With the stress test you see above (around 13K cubes) I was only getting around 200 fps. That is only somewhat above my target of 120 fps, and with such a simple scene I was expecting more. So I got to hacking, fully realizing that premature optimization is evil… yeah, yeah. In any case, I needed to know whether the engine architecture was flawed in some way, and whether I was going down the wrong path. Through some crude debugging I found that my matrix multiply operation was the biggest performance sink. My fairly straightforward implementation was as follows.

Matrix4x4 Matrix4x4::multiply(const Matrix4x4& rhs){
	Matrix4x4 ret;

	ret.m11 = m11 * rhs.m11 + m12 * rhs.m21 + m13 * rhs.m31 + m14 * rhs.m41;
	ret.m12 = m11 * rhs.m12 + m12 * rhs.m22 + m13 * rhs.m32 + m14 * rhs.m42;
	ret.m13 = m11 * rhs.m13 + m12 * rhs.m23 + m13 * rhs.m33 + m14 * rhs.m43;
	ret.m14 = m11 * rhs.m14 + m12 * rhs.m24 + m13 * rhs.m34 + m14 * rhs.m44;

	ret.m21 = m21 * rhs.m11 + m22 * rhs.m21 + m23 * rhs.m31 + m24 * rhs.m41;
	ret.m22 = m21 * rhs.m12 + m22 * rhs.m22 + m23 * rhs.m32 + m24 * rhs.m42;
	ret.m23 = m21 * rhs.m13 + m22 * rhs.m23 + m23 * rhs.m33 + m24 * rhs.m43;
	ret.m24 = m21 * rhs.m14 + m22 * rhs.m24 + m23 * rhs.m34 + m24 * rhs.m44;

	ret.m31 = m31 * rhs.m11 + m32 * rhs.m21 + m33 * rhs.m31 + m34 * rhs.m41;
	ret.m32 = m31 * rhs.m12 + m32 * rhs.m22 + m33 * rhs.m32 + m34 * rhs.m42;
	ret.m33 = m31 * rhs.m13 + m32 * rhs.m23 + m33 * rhs.m33 + m34 * rhs.m43;
	ret.m34 = m31 * rhs.m14 + m32 * rhs.m24 + m33 * rhs.m34 + m34 * rhs.m44;

	ret.m41 = m41 * rhs.m11 + m42 * rhs.m21 + m43 * rhs.m31 + m44 * rhs.m41;
	ret.m42 = m41 * rhs.m12 + m42 * rhs.m22 + m43 * rhs.m32 + m44 * rhs.m42;
	ret.m43 = m41 * rhs.m13 + m42 * rhs.m23 + m43 * rhs.m33 + m44 * rhs.m43;
	ret.m44 = m41 * rhs.m14 + m42 * rhs.m24 + m43 * rhs.m34 + m44 * rhs.m44;

	return ret;
}

Feeling that this could be improved, I found some code on Stack Overflow that does the same operation using SSE intrinsics. I was initially considering coding it in assembly, but this looked like a cleaner solution and a little easier to understand (though, of course, nowhere near as cool as going “pedal to the metal” and writing assembly code). I was told this should be as fast as or faster than hand-written assembly anyhow. The new function is below.

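#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_load_ps, _mm_set1_ps, ...

// Computes out = lhs * rhs for 4x4 matrices stored as 16 floats (row-major).
// _mm_load_ps and _mm_store_ps require rhs and out to be 16-byte aligned.
void Matrix4x4::multiplySSE(float *lhs, float *rhs, float *out) {
	// Load the four rows of rhs once; they are reused for every output row.
	__m128 row1 = _mm_load_ps(&rhs[0]);
	__m128 row2 = _mm_load_ps(&rhs[4]);
	__m128 row3 = _mm_load_ps(&rhs[8]);
	__m128 row4 = _mm_load_ps(&rhs[12]);
	for (int i = 0; i < 4; i++) {
		// Broadcast each element of row i of lhs across all four lanes.
		__m128 brod1 = _mm_set1_ps(lhs[4 * i + 0]);
		__m128 brod2 = _mm_set1_ps(lhs[4 * i + 1]);
		__m128 brod3 = _mm_set1_ps(lhs[4 * i + 2]);
		__m128 brod4 = _mm_set1_ps(lhs[4 * i + 3]);
		// Output row i is the sum of the rhs rows weighted by those elements.
		__m128 row = _mm_add_ps(
			_mm_add_ps(
				_mm_mul_ps(brod1, row1),
				_mm_mul_ps(brod2, row2)),
			_mm_add_ps(
				_mm_mul_ps(brod3, row3),
				_mm_mul_ps(brod4, row4)));
		_mm_store_ps(&out[4 * i], row);
	}
}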
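
One thing to watch with this kind of code is that `_mm_load_ps` and `_mm_store_ps` need 16-byte-aligned pointers, so the float arrays have to be declared with something like `alignas(16)`. Below is a rough, self-contained sketch for sanity-checking an SSE multiply against a plain scalar loop. The free functions `multiplyScalar`, `multiplySSE` and `matricesMatch` are names made up for this illustration (they are not part of the engine), and the sketch assumes row-major storage and an x86 compiler with SSE support.

```cpp
#include <xmmintrin.h> // SSE intrinsics
#include <cmath>       // std::fabs

// Hypothetical reference version: plain row-major 4x4 multiply, out = lhs * rhs.
void multiplyScalar(const float* lhs, const float* rhs, float* out) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) {
                sum += lhs[4 * i + k] * rhs[4 * k + j];
            }
            out[4 * i + j] = sum;
        }
    }
}

// Free-function copy of the SSE approach shown above.
// rhs and out must be 16-byte aligned (aligned load/store are used).
void multiplySSE(const float* lhs, const float* rhs, float* out) {
    __m128 row1 = _mm_load_ps(&rhs[0]);
    __m128 row2 = _mm_load_ps(&rhs[4]);
    __m128 row3 = _mm_load_ps(&rhs[8]);
    __m128 row4 = _mm_load_ps(&rhs[12]);
    for (int i = 0; i < 4; i++) {
        __m128 brod1 = _mm_set1_ps(lhs[4 * i + 0]);
        __m128 brod2 = _mm_set1_ps(lhs[4 * i + 1]);
        __m128 brod3 = _mm_set1_ps(lhs[4 * i + 2]);
        __m128 brod4 = _mm_set1_ps(lhs[4 * i + 3]);
        __m128 row = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(brod1, row1), _mm_mul_ps(brod2, row2)),
            _mm_add_ps(_mm_mul_ps(brod3, row3), _mm_mul_ps(brod4, row4)));
        _mm_store_ps(&out[4 * i], row);
    }
}

// Hypothetical helper: element-wise comparison with a small absolute tolerance.
bool matricesMatch(const float* a, const float* b, float eps = 1e-5f) {
    for (int i = 0; i < 16; i++) {
        if (std::fabs(a[i] - b[i]) > eps) return false;
    }
    return true;
}
```

To use it, fill two `alignas(16) float[16]` arrays, run both functions into separate aligned output arrays, and check that `matricesMatch` returns true. The tolerance is there because the SSE version adds the four products in a different order than the scalar loop, so the last bits of the results can differ.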

To be honest, I was disappointed. There were some small gains, sure, but I was expecting a serious improvement. With the same 13K cube scene, I was now getting close to 225 fps. That is a little over a 10% improvement, which is something, but not what I wanted. So I got it into my head that I would try the DirectXMath library, and at least do some benchmarks to see how it compared. I mean, I really did want to stick with my custom math library, but not if it meant slow performance. I did a few quick test calculations, and the speed seemed nice. Sadly, the compiler probably optimized them away, so the numbers meant little (keep reading).
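
For what it's worth, one common way to keep a micro-benchmark honest is to make every result feed the next calculation and then write a final value somewhere the compiler cannot ignore, such as a `volatile` variable. The sketch below illustrates that idea under some assumptions: `multiply4x4`, `timeMultiply` and `g_sink` are hypothetical names invented for this example, it uses a plain scalar multiply rather than my SSE or DirectXMath code, and it is not the benchmark I actually ran.

```cpp
#include <chrono>  // std::chrono::steady_clock
#include <cmath>   // std::cos, std::sin
#include <cstring> // std::memcpy

// Hypothetical scalar multiply used as the code under test (row-major, out = lhs * rhs).
void multiply4x4(const float* lhs, const float* rhs, float* out) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) {
                sum += lhs[4 * i + k] * rhs[4 * k + j];
            }
            out[4 * i + j] = sum;
        }
    }
}

// Writing to a volatile is an observable side effect, so the work that
// produces the value cannot simply be discarded by the optimizer.
volatile float g_sink = 0.0f;

// Runs `iterations` chained multiplies and returns the elapsed time in seconds.
double timeMultiply(int iterations) {
    // Start from the identity and repeatedly multiply by a small Z rotation,
    // so values stay bounded and each step depends on the previous result.
    float acc[16] = {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};
    const float c = std::cos(0.01f);
    const float s = std::sin(0.01f);
    const float rot[16] = {c, -s, 0, 0,  s, c, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};
    float tmp[16];

    auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < iterations; n++) {
        multiply4x4(acc, rot, tmp);
        std::memcpy(acc, tmp, sizeof(acc));
    }
    auto end = std::chrono::steady_clock::now();

    // Reduce the final matrix to one number and publish it through the sink.
    float checksum = 0.0f;
    for (int i = 0; i < 16; i++) checksum += acc[i];
    g_sink = checksum;

    return std::chrono::duration<double>(end - start).count();
}
```

Calling `timeMultiply(1000000)` for each candidate implementation and comparing the returned times gives a rough relative number. Even so, a synthetic loop like this can behave very differently from the real renderer, so it is only a first check.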

So I spent the next few hours ripping out all of my math calls and adding in the DirectXMath classes and functions instead. Finally I got to the point of having some visuals on screen. What do I see? Slow frame rates. It was much worse than before. Barely even 100 fps. Unacceptable. I fixed up some more of the code and got it to an OK state. Even then, it was only running at around 200 fps, or about 10% worse than my own custom functions. How could this be? Well, in some sense I feel proud that my implementation fared well. But on the other hand, I just wasted an entire night for nothing. You live and you learn.