While getting models loaded was pretty exciting, I ended up dealing with major load times on the demo. Granted, my XML parsing code is probably slow as all hell, but I don’t think COLLADA is really designed for real-time engine use. With simple plane and cube shapes the loading wasn’t that bad, but with my soda can model (around 600 triangles) the loading was nearing 10 seconds (totally unacceptable). I can only imagine what would happen with a really complex model. Something had to be done.

So I decided to switch to a binary format containing only what I needed to pump into DirectX (the vertices, normals, UVs, and indices). I created a separate console application that would convert *.DAE files into my new binary format. Then I added engine support for loading the binary file instead of COLLADA. The gain was HUGE. Now when running the exe, there was no noticeable lag time at all. I guess I kind of knew I needed to do this at some point, but the wait times were too much to bear any longer. Glad to find a good solution.

Here are some snippets of code to show how to save variables as binary data:

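#include <fstream>
using namespace std;

float someValue = 0.12345f;
ofstream outputFile;
outputFile.open(L"output.bin", ios::out | ios::binary); // wide-string paths are an MSVC extension
outputFile.write(reinterpret_cast<const char*>(&someValue), sizeof(float)); // raw bytes of the float
outputFile.close();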

And then you can read this value later by doing:

float someValue;
ifstream inputFile;
inputFile.open(L"output.bin", ios::in | ios::binary);
inputFile.read(reinterpret_cast<char*>(&someValue), sizeof(float)); // fills someValue with the stored bytes
inputFile.close();

Actually not that difficult at all. The benefits are decreased loading time and also smaller file sizes. The cons are that you now have another step in the asset pipeline, and that the files are no longer human-readable. A fair trade I would say.
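
The same idea extends from a single float to a whole mesh. What follows is only a rough sketch, not the actual format used by the engine: the MeshData struct, the helper names, and the layout (a 32-bit count followed by the raw elements for each array) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical in-memory mesh; all names here are illustrative.
struct MeshData {
	std::vector<float> positions;        // x, y, z per vertex
	std::vector<float> normals;          // x, y, z per vertex
	std::vector<float> uvs;              // u, v per vertex
	std::vector<std::uint32_t> indices;  // three per triangle
};

// Write a 32-bit element count followed by the raw elements.
template <typename T>
void writeArray(std::ostream& out, const std::vector<T>& data){
	assert(data.size() <= 0xFFFFFFFFu); // the count must fit in 32 bits
	std::uint32_t count = static_cast<std::uint32_t>(data.size());
	out.write(reinterpret_cast<const char*>(&count), sizeof(count));
	out.write(reinterpret_cast<const char*>(data.data()),
		static_cast<std::streamsize>(count * sizeof(T)));
}

// Read an array written by writeArray; returns false on a short read.
template <typename T>
bool readArray(std::istream& in, std::vector<T>& data){
	std::uint32_t count = 0;
	if (!in.read(reinterpret_cast<char*>(&count), sizeof(count))) return false;
	data.resize(count);
	in.read(reinterpret_cast<char*>(data.data()),
		static_cast<std::streamsize>(count * sizeof(T)));
	return static_cast<bool>(in);
}

void saveMesh(std::ostream& out, const MeshData& mesh){
	writeArray(out, mesh.positions);
	writeArray(out, mesh.normals);
	writeArray(out, mesh.uvs);
	writeArray(out, mesh.indices);
}

bool loadMesh(std::istream& in, MeshData& mesh){
	return readArray(in, mesh.positions) && readArray(in, mesh.normals)
		&& readArray(in, mesh.uvs) && readArray(in, mesh.indices);
}

// Save to an in-memory stream and load it back, as a quick self-check.
bool roundTrips(const MeshData& mesh){
	std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
	saveMesh(buffer, mesh);
	MeshData copy;
	return loadMesh(buffer, copy) && copy.positions == mesh.positions
		&& copy.normals == mesh.normals && copy.uvs == mesh.uvs
		&& copy.indices == mesh.indices;
}
```

Passing an ofstream or ifstream opened with ios::binary works the same way as the in-memory stream. Keep in mind that a raw dump like this depends on the byte order and type sizes of the machine that wrote it, which is fine when the converter and the engine run on the same platform.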

engine zero coke can

What you see above is a custom model I made in 3ds Max, exported as a COLLADA *.dae file, and imported into my DirectX engine. I figured I’d start with something simple, like a soda can, and I plan to make a lot more models going forward. Although I hadn’t touched Max in years, I found it to be a comfortable experience and was able to put the model together in a few hours.

Now, actually getting that model into DirectX was a different story. First off, the COLLADA documentation is vast, but it fails to explain basic things about the format. The examples all make sense, but with a real model it becomes more complex. On top of that, their forum was a ghost town, and I found lots of people with the same basic questions I had whose threads sat with no replies for months (or years). That said, I was able to eventually figure it out through a lot of testing and trial and error. It really goes to show that you can build the best system in the world, but if the documentation is lacking and the community is thin, then it’s not worth jack.

To make matters worse, there was a small bug in my XML parsing code that was messing up the attributes. So some of the simple models I tried (a plane and a cube) worked, but the soda can didn’t. It ended up taking a while to track down this problem since Visual Studio was hanging if I tried to debug. It’s really scary to get to this point where you *need* the debugger desperately and it’s not there. While I thought it was crashing, it was actually just caught up in my slow parser, and when I waited for about 5 – 10 minutes it finally came back to life (and thankfully I only needed to get to that one breakpoint to see what the issue was).

Next up, I ran into some issues with the model orientation and texturing. Since 3ds Max uses a Z-up coordinate system and DirectX is Y-up, this needed some special care. I would have thought the COLLADA exporter would handle this, but apparently not. The fix is to swap the Y and Z positions of each vertex. This will affect the winding order as well, so if you don’t want your mesh to be inside-out, you also need to change the order of the indices when you create the index buffer. For example, a triangle of “0, 1, 2” will become “0, 2, 1”. Finally I had to negate the V parameter of the UV coordinates so that the texture looked proper.
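
The three fixes described above can be sketched in a few lines. This is just an illustration under assumptions: the Vertex struct and the function name are hypothetical, and the mesh is assumed to be an indexed triangle list. Normals are not included here, but they would need the same Y/Z swap as the positions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical vertex layout, for illustration only.
struct Vertex {
	float x, y, z; // position
	float u, v;    // texture coordinates
};

// Convert a Z-up mesh to Y-up in place.
void convertZUpToYUp(std::vector<Vertex>& vertices, std::vector<std::uint32_t>& indices){
	assert(indices.size() % 3 == 0); // expects a triangle list

	for (Vertex& vert : vertices){
		std::swap(vert.y, vert.z); // swap the Y and Z positions
		vert.v = -vert.v;          // negate V so the texture looks proper
	}

	// Swapping two axes mirrors the mesh, so reverse each triangle's
	// winding: (0, 1, 2) becomes (0, 2, 1).
	for (std::size_t i = 0; i + 2 < indices.size(); i += 3){
		std::swap(indices[i + 1], indices[i + 2]);
	}
}
```

Doing this once in the converter tool, rather than at load time, keeps the engine-side loader as a plain read of the buffers.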

All in all, I am pretty happy considering I wrote the importer basically from scratch. I would like to try some more complex models, but I will have to figure out what I want to build next. Since I am doing all this work myself, I’d like to use the engine to showcase my own artwork. I would rather not just download assets from the internet. Maybe I will build a refrigerator to put the soda in, or some more common products.

If you like what you read, post a comment and let me know how I’m doing. Cheers.

engine zero xml

Programmer art is great and all, but I’d really like to see some complex models inside the engine. Unfortunately, DirectX 11 does not include a built-in way to load in 3D models. As I’ve mentioned before, I am interested in using COLLADA as the import format. Since COLLADA is based on XML, I will need a way to load and parse XML files. While there are tons of XML parsing libraries out there, I decided to write my own. Why would I do that? A few reasons. First, I don’t want my engine to be encumbered with 3rd party licenses, forcing me to do things against my will. Secondly, I think it’s a great learning exercise to see how something like this is done. Lastly, it’s fun!

Sadly, I found very few resources on how you go about coding an XML parser from scratch. Of the code I did find (i.e. from open-source libraries), it was difficult for me to extract the algorithm from the code. There was one resource that did help somewhat, from ANTLR3, but it failed to provide the pseudo-code I was looking for. Even so, it was enough to get me started.

The basic procedure I followed was loading the XML file into a string, then iterating through the string and breaking the text up into tokens. Each of these tokens would hold the string representation and a label of what the token meant. Then, in a second pass, I would look through all the tokens and parse them into a tree structure. It actually ended up being easier than I initially thought, and I think I completed the whole thing in about 2 or 3 days just working a few hours in the evening. I’ll highlight some of the relevant code below.

I created a map with all the important XML syntax elements so I can break them up into tokens.

tokenMap["<?"] = XML_OPEN;
tokenMap["?>"] = XML_CLOSE;
tokenMap["</"] = TAG_END;
tokenMap["/>"] = TAG_EMPTY;
tokenMap["<"] = TAG_OPEN;
tokenMap[">"] = TAG_CLOSE;
tokenMap["=\""] = ATTRIB_EQUALS;
tokenMap["\""] = ATTRIB_QUOTE;
tokenMap[" "] = WHITE_SPACE;
tokenMap["\n"] = WHITE_LINE;
tokenMap["\t"] = WHITE_TAB;

First I load the XML file using the standard C++ stream libraries.

ifstream inputFile;
inputFile.open(fileName, ifstream::in);

stringstream inputStream;

// Testing the result of get() stops the loop before an EOF marker is appended.
char c;
while (inputFile.get(c)){
	inputStream << c;
}

string temp = inputStream.str();
char* data = const_cast<char*>(temp.c_str());

Next, I loop through all the characters in the data stream and find the tokens.

TokenList tokenize(char* doc){
	size_t docLen = strlen(doc);

	TokenList tokens;
	TokenMap::iterator it;

	string buffer = "";

	unsigned int i;

	for (i = 0; i < docLen; i++){
		bool found = false;
		for (it = tokenMap.begin(); it != tokenMap.end(); ++it){
			int tokenLen = strlen(it->first);
			if (compare(&doc[i], it->first, tokenLen)){
				// Flush any plain text collected before this syntax token.
				int textLen = strlen(buffer.c_str());
				if (textLen > 0){
					char* text = new char[textLen + 1]; // +1 for the null terminator
					strncpy_s(text, textLen + 1, buffer.c_str(), textLen);
					TokenMap token = { { text, GENERIC_TEXT } };
					tokens.push_back(token); // assumes TokenList is a vector of TokenMap
					buffer = "";
				}
				char* match = new char[tokenLen + 1];
				strncpy_s(match, tokenLen + 1, &doc[i], tokenLen);
				TokenMap token = { { match, it->second } };
				tokens.push_back(token);
				i += tokenLen - 1; // skip the rest of a multi-character token
				found = true;
				break; // stop scanning the map once a token has matched
			}
		}
		if (!found)	buffer.append(&doc[i], 1);
	}

	return tokens;
}

Finally I iterate through the list I just created and parse that into a node-based tree. I admit this part is a little ugly, but it seems to work so I’m OK with that. The idea is that I set the function into different states, and then parse the elements in the list differently depending on the state. For example, if I see a “<” token, then I go into attribute parsing, and then when I see a “>” I set it back to the default state. The logic is fairly simple, but there are a lot of if statements to wade through if you are trying to implement this yourself.

void parse(TokenList& list, XmlNode* parent){
	ParseType state = PARSE_ANY;
	XmlNode* node = new XmlNode();
	bool created = false;
	bool allWhite = true;
	int openTags = 0;
	string attribName = "";
	string attribValue = "";
	string valueBuffer = "";
	TokenList children;
	TokenList::iterator v;
	for (v = list.begin(); v != list.end(); ++v){
		TokenMap::iterator m;
		for (m = v->begin(); m != v->end(); ++m){
			if (state == PARSE_ANY){
				if (m->second == XML_OPEN){
					state = PARSE_XML_TYPE;
				} else if (m->second == TAG_OPEN){
					state = PARSE_TAG_NAME;
					created = true;
				} else if (m->second == ATTRIB_EQUALS){
					state = PARSE_ATTRIB_VALUE;
				} else if (m->second == GENERIC_TEXT || m->second == WHITE_SPACE || 
					m->second == WHITE_LINE || m->second == WHITE_TAB){
					valueBuffer.append(m->first, strlen(m->first));
					if (m->second == GENERIC_TEXT) allWhite = false;
				}
				// The last token closes out the text value of the parent.
				if (v + 1 == list.end()){
					if (!allWhite) parent->value = valueBuffer;
					valueBuffer = "";
				}
			} else if (state == PARSE_TAG_NAME){
				node->name = string(m->first);
				state = PARSE_ATTRIB_NAME;
			} else if (state == PARSE_ATTRIB_NAME){
				if (m->second == WHITE_SPACE) continue;
				if (m->second == ATTRIB_QUOTE) continue;
				if (m->second == TAG_EMPTY){
					state = PARSE_ANY;
				} else if (m->second == TAG_CLOSE){
					state = PARSE_TAG_CLOSE;
					openTags = 1;
				} else {
					attribName = string(m->first);
					state = PARSE_ANY;
				}
			} else if (state == PARSE_ATTRIB_VALUE){
				if (m->second == ATTRIB_QUOTE){
					// The closing quote ends the value and is not part of it.
					node->attributes[attribName] = string(attribValue);
					state = PARSE_ATTRIB_NAME;
					attribValue = "";
				} else {
					attribValue.append(m->first, strlen(m->first));
				}
			} else if (state == PARSE_TAG_CLOSE){
				// Track nesting depth until the matching end tag is reached.
				if (m->second == TAG_OPEN) openTags++;
				if (m->second == TAG_END) openTags--;
				if (m->second == TAG_EMPTY) openTags--;
				if (openTags > 0){
					children.push_back(*v); // assumed: queue nested tokens for the recursive call
				} else {
					parse(children, node);
					state = PARSE_TAG_END;
				}
			} else if (state == PARSE_TAG_END){
				if (m->second == TAG_CLOSE){
					node = new XmlNode();
					state = PARSE_ANY;
				}
			} else if (state == PARSE_XML_TYPE){
				if (m->second == XML_CLOSE){
					state = PARSE_ANY;
				}
			}
		}
	}
	if (created && node->name.length() > 0){
		// Keep the node: attach it to parent (the XmlNode child API is not shown here).
	} else {
		delete node;
	}
}

All in all not nearly as bad as I expected. Granted, the algorithm could be a little less hard-coded, but it’s a fairly straightforward implementation. I also loaded in a COLLADA *.DAE file, and I did not see any errors or problems. Within the next few days I hope to integrate this code into the engine and actually load up a 3D model. Surely there will be some hiccups, but I have faith this can be done soon.


This has got to be one of the more insane physics demos I’ve seen so far. Most physics engines handle the basic rigid bodies and such, but start to fall apart with more complex interactions (e.g. fluid and cloth simulations). With the demo shown above, from Nvidia, it seems these difficult problems have been solved. Cloth, fluid, smoke, and rigid or soft bodies, all interacting with each other? It looks great. The author, Mike Macklin, has even posted a pre-release of the SIGGRAPH paper explaining the technique here. I took a quick look, and I will be giving it some serious investigation soon. These types of complex physics interactions are exactly what I am trying to do myself. Hopefully the implementation will not be that difficult, but I have a feeling I have a long road ahead of me. Wish me luck!

OGRE 3D Instancing

After some more testing, it looks like OGRE is not the savior it seemed like yesterday. While the static geometry boosted frame-rates greatly, it’s only useful for, well, static objects. Meaning the models can’t move or animate. I did find another option, instancing, which initially looked promising. It allows rendering of large amounts of identical objects faster than just having them be individual. Sounds good.

The implementation seemed complex at first, but then I found the InstanceManager which simplified things a whole lot. However, after getting it working, I wasn’t as impressed with the performance. Rendering the same 13k cubes standing still, I was getting a little over 100 fps. Then when adding rotation animation to the cubes, the speed dropped down to around 33 fps. Certainly this is still better than the naive implementation, but still nowhere close to where I want it to be.

To be completely upfront, my computer is not a power-house. I’m still running a Core 2 Duo @ 3GHz and GTX 470’s in SLI. Getting a little old, I know, but still can play modern games like Titanfall or whatever. Maybe I’m expecting too much, don’t know at this point. I think I will just go back to development on my engine and worry about performance optimization later. Even so, this was still an interesting investigation at least.

OGRE 3D Static Cubes

Looks like I spoke too soon. While OGRE was getting pretty slow with the naive implementation, I was able to find some code on what they call StaticGeometry, which is a system to batch together lots of similar meshes that don’t move (great for my cube example project). With this feature added, the frame rate has sky-rocketed to over 2,600 fps. Most impressive. Keep in mind a blank DirectX window on my machine will get around 3,600 fps. So getting around 2,600 with over 13,000 cubes is very nice. That still doesn’t help me with my physics simulation, since static objects won’t cut it. But it does at least give me a good benchmark as to what is possible on my development hardware.

OGRE 3D Cubes

Seeing as performance has been on my mind recently, I tweaked the core render loop a bit and saw some reasonable gains. The one thing I realized is that most of the objects in the scene are static, and don’t need their combined transformed matrices recalculated every frame. I expected to see wild improvements after caching the values. What I received was a decent 50% gain. Not monster, but certainly substantial. Now the average framerates are in the upper 200’s to lower 300’s. More acceptable but still maybe not where I want it to be.
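
The caching idea boils down to a dirty flag: rebuild the combined matrix only when one of its inputs changes. Here is a minimal sketch of that pattern, not the engine’s actual code. The CachedTransform class, the array-based Matrix type, and the rebuilds counter are all stand-ins invented for this example, and the matrices use the row-vector layout with translation in the last row.

```cpp
#include <array>
#include <cassert>

using Matrix = std::array<float, 16>; // row-major 4x4 stand-in for the engine's matrix class

Matrix multiply(const Matrix& a, const Matrix& b){
	Matrix r{}; // value-initialized to all zeros
	for (int i = 0; i < 4; i++)
		for (int j = 0; j < 4; j++)
			for (int k = 0; k < 4; k++)
				r[4 * i + j] += a[4 * i + k] * b[4 * k + j];
	return r;
}

class CachedTransform {
public:
	void setPosition(float x, float y, float z){ px = x; py = y; pz = z; dirty = true; }
	void setScale(float s){ assert(s != 0.0f); scale = s; dirty = true; }

	// Returns the combined scale * translation matrix, rebuilding it
	// only if a setter was called since the last rebuild.
	const Matrix& matrix(){
		if (dirty){
			Matrix s = { scale,0,0,0, 0,scale,0,0, 0,0,scale,0, 0,0,0,1 };
			Matrix t = { 1,0,0,0, 0,1,0,0, 0,0,1,0, px,py,pz,1 };
			cached = multiply(s, t);
			dirty = false;
			++rebuilds;
		}
		return cached;
	}

	int rebuilds = 0; // counts rebuilds, only to make the caching visible
private:
	float px = 0.0f, py = 0.0f, pz = 0.0f, scale = 1.0f;
	Matrix cached{};
	bool dirty = true;
};
```

With this in place, a static object pays for the multiply once, and every later frame just reads the cached result. In a scene graph, a change to a parent would also have to mark its children dirty, which this sketch leaves out.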

Just as a sanity check I decided to recreate the same exact 13k cube scene in OGRE, a popular open-source 3D engine. To my surprise, performance fell to the floor. In OGRE I was only getting around 30-80 fps, while my custom engine was getting over 5 times the frames-per-second. So this makes me feel a whole lot better about the situation. I’d also like to do the same test in Unity and some other engines and see how they compare. As a quick test, though, I’m quite satisfied.

All things considered, I’d still like the performance of my engine to be a lot better. The reason I am even working on this is because I have an idea in mind that doesn’t seem feasible with current middleware. The core aspect is a robust physics simulation, and I expect to have tens (or hundreds) of thousands of objects animating simultaneously. Maybe the reason no one has done what I want is because current PC hardware and software is not up to the task. Maybe no one has tried. Not sure, but I want to make it happen. We’ll see soon enough.

Stress Test

I don’t have much time, so I will be brief. Basically for the past few days I have been trying to optimize the engine. With the stress test you see above (around 13K cubes) I was only getting around 200 fps. Just slightly above my target of 120 fps, and with such a simple scene I was expecting more. So I got to hacking, fully realizing that early optimization is evil… yeah, yeah. In any case, I needed to know if the engine architecture was flawed in some way, and if I was going down the wrong path. Through some crude debugging I found that my matrix multiply operation was causing the huge sink in performance. My somewhat straight-forward implementation was as follows.

Matrix4x4 Matrix4x4::multiply(const Matrix4x4& rhs){
	Matrix4x4 ret;

	ret.m11 = m11 * rhs.m11 + m12 * rhs.m21 + m13 * rhs.m31 + m14 * rhs.m41;
	ret.m12 = m11 * rhs.m12 + m12 * rhs.m22 + m13 * rhs.m32 + m14 * rhs.m42;
	ret.m13 = m11 * rhs.m13 + m12 * rhs.m23 + m13 * rhs.m33 + m14 * rhs.m43;
	ret.m14 = m11 * rhs.m14 + m12 * rhs.m24 + m13 * rhs.m34 + m14 * rhs.m44;

	ret.m21 = m21 * rhs.m11 + m22 * rhs.m21 + m23 * rhs.m31 + m24 * rhs.m41;
	ret.m22 = m21 * rhs.m12 + m22 * rhs.m22 + m23 * rhs.m32 + m24 * rhs.m42;
	ret.m23 = m21 * rhs.m13 + m22 * rhs.m23 + m23 * rhs.m33 + m24 * rhs.m43;
	ret.m24 = m21 * rhs.m14 + m22 * rhs.m24 + m23 * rhs.m34 + m24 * rhs.m44;

	ret.m31 = m31 * rhs.m11 + m32 * rhs.m21 + m33 * rhs.m31 + m34 * rhs.m41;
	ret.m32 = m31 * rhs.m12 + m32 * rhs.m22 + m33 * rhs.m32 + m34 * rhs.m42;
	ret.m33 = m31 * rhs.m13 + m32 * rhs.m23 + m33 * rhs.m33 + m34 * rhs.m43;
	ret.m34 = m31 * rhs.m14 + m32 * rhs.m24 + m33 * rhs.m34 + m34 * rhs.m44;

	ret.m41 = m41 * rhs.m11 + m42 * rhs.m21 + m43 * rhs.m31 + m44 * rhs.m41;
	ret.m42 = m41 * rhs.m12 + m42 * rhs.m22 + m43 * rhs.m32 + m44 * rhs.m42;
	ret.m43 = m41 * rhs.m13 + m42 * rhs.m23 + m43 * rhs.m33 + m44 * rhs.m43;
	ret.m44 = m41 * rhs.m14 + m42 * rhs.m24 + m43 * rhs.m34 + m44 * rhs.m44;

	return ret;
}

Feeling that this could be improved, I found some code on StackOverflow to do the same operation using SSE instructions. I was initially considering coding it in assembly, but this looked like a cleaner solution and a little easier to understand (though, of course, nowhere near as cool as getting “pedal to the metal” and writing assembly code). I was told this should be as fast or faster than assembly anyhow. The new function is below.

// Needs #include <xmmintrin.h> for the SSE intrinsics.
void Matrix4x4::multiplySSE(float *lhs, float *rhs, float *out) {
	// _mm_load_ps and _mm_store_ps require 16-byte aligned pointers.
	__m128 row1 = _mm_load_ps(&rhs[0]);
	__m128 row2 = _mm_load_ps(&rhs[4]);
	__m128 row3 = _mm_load_ps(&rhs[8]);
	__m128 row4 = _mm_load_ps(&rhs[12]);
	for (int i = 0; i < 4; i++) {
		// Broadcast each element of row i of lhs across a full register.
		__m128 brod1 = _mm_set1_ps(lhs[4 * i + 0]);
		__m128 brod2 = _mm_set1_ps(lhs[4 * i + 1]);
		__m128 brod3 = _mm_set1_ps(lhs[4 * i + 2]);
		__m128 brod4 = _mm_set1_ps(lhs[4 * i + 3]);
		__m128 row = _mm_add_ps(
			_mm_add_ps(
				_mm_mul_ps(brod1, row1),
				_mm_mul_ps(brod2, row2)),
			_mm_add_ps(
				_mm_mul_ps(brod3, row3),
				_mm_mul_ps(brod4, row4)));
		_mm_store_ps(&out[4 * i], row);
	}
}

To be honest, I was disappointed. There were some small gains, sure, but I was expecting a serious improvement. With the same 13K cube scene, I was now getting close to 225 fps. Over a 10% improvement is something, but not what I wanted. So I got it into my head that I would try the DirectXMath library, and at least do some benchmarks to see how it compared. I mean, I did really want to stick with my custom math library, but not if it meant slow performance. I did a few quick test calculations, and the speed seemed nice. Sadly the compiler probably optimized them out (keep reading).

So I spent the next few hours ripping out all of my math calls and adding in the DirectXMath classes and functions instead. Finally I got to a point of having some visuals on screen. What do I see? Slow frame rates. It was much worse than before. Barely even 100 fps. Unacceptable. I fixed up some more of the code and got it to an OK state. Even then, it was only running at around 200 fps, or about 10% worse than my own custom functions. How could this be? Well, in some sense I feel proud that my implementation fared well. But on the other hand, I just wasted an entire night for nothing. You live and you learn.
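
On the point about quick test calculations being optimized away: a micro-benchmark only measures something if the compiler cannot discard the results. One common way to guard against that, sketched here with hypothetical names (benchmark, BenchResult), is to feed every result into a volatile variable and vary the input on each iteration. This is an illustration of the general technique, not the test that was actually run.

```cpp
#include <cassert>
#include <chrono>

struct BenchResult {
	double seconds;  // wall-clock time for all iterations
	float checksum;  // accumulated results, kept so the work is observable
};

// Times `iterations` calls of fn, which takes a float and returns a float.
template <typename Fn>
BenchResult benchmark(Fn fn, int iterations){
	assert(iterations > 0);
	volatile float sink = 0.0f; // volatile accesses cannot be removed by the optimizer
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < iterations; i++){
		// The input changes every iteration so the call cannot be hoisted out of the loop.
		sink = sink + fn(static_cast<float>(i));
	}
	auto end = std::chrono::steady_clock::now();
	return { std::chrono::duration<double>(end - start).count(), sink };
}
```

Calling it with a lambda, such as benchmark([](float x){ return x * 0.5f; }, 1000000), returns the elapsed time along with the checksum. Even with this guard, an isolated loop says little about how a math library behaves inside a real render loop, which matches what happened above.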


Today I have gotten the camera system to a decent place, and made a simple free look demo. Most of the code had already been implemented, inside the vector and matrix classes, I just had to piece it together into a camera object. I also added a grid of cubes, to better see the camera working. Sadly these extra 200 cubes slowed down the performance by a good chunk. Previously I was getting around 3,500 FPS (with 3 cubes), now I’m only getting around 2,000 FPS (with around 220 cubes).
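
A side note on reading these numbers: FPS is not a linear scale, so it can help to convert to milliseconds per frame before judging a drop. The helper below is a trivial sketch with a made-up name (frameTimeMs).

```cpp
#include <cassert>

// Convert frames per second into milliseconds per frame.
double frameTimeMs(double fps){
	assert(fps > 0.0);
	return 1000.0 / fps;
}
```

At 3,500 FPS a frame takes about 0.29 ms, and at 2,000 FPS it takes 0.5 ms, so the extra cubes cost roughly 0.21 ms per frame in total.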

Granted the performance is bound to drop as more objects are added, but I think I can improve this a lot. At the moment, I am not doing any sort of culling, and when I get that working I feel like it would give a good boost. However, it’s probably not the highest thing on the list since I’m still getting reasonable frame-rates.

Coming up next I would like to fix the lighting system (currently it’s just a hacked on ambient/directional light) and I need to get a COLLADA model importer functioning. I’ll also have to pick up a 3D modelling program and make some better models to test with. I did try to learn Blender a bit, but I found it cumbersome. Gonna give 3DS Max a go again. Haven’t used it in many years, but I was at one point pretty comfortable with the app. I think my first model will be a soda can, as it’s something easy and recognizable. See ya next time!

Though the above video might not seem like an overly impressive jump from the last, there’s actually a ton of work behind it. The new additions include a node-based scene graph hierarchy, more robust math libraries, and keyboard control using DirectInput. Plus, I’ve tried to abstract as much as I can into modular classes and remove the hard-coded hacks I had in there. Finally I hid away the Windows stuff into its own class so clients just need to create a normal main() function and can launch the window from there (removing much of the nasty Win32 looking code from sight).

Here is an example of launching an empty window with the engine:

int main(){
	Engine& engine = Engine::get();
	engine.create(1280, 720);

	while (engine.loop()){
		if (engine.control.keyPressed(KEY_ESCAPE)) break;
	}

	return 0;
}

All in all a vast improvement even if the graphics aren’t too pretty yet (we’ll get there). Coming up next I want to build a 3rd person free camera to navigate around.