Traditionally, geometry instancing has been a rather tame concept: load a model into memory once, then render it multiple times in a single frame with different translations.
The idea of geometry instancing is to speed up the rendering of large numbers of similarly structured triangles. The weather renderer in my game engine uses instancing to display lots of snowflakes simultaneously:
Early Tech Demo of Platformer from Meds on Vimeo.
(Keep in mind this is running in a debug build, hence the low framerate.)
As time has gone on we've slowly discovered the dreaded performance bottleneck of actually submitting triangles to the GPU. Simply put, submitting geometry to the GPU over and over again, regardless of whether it's the same geometry or not, forces multiple sets of transforms, possible state changes and probably some more things I'm not smart enough to be talking to you about.
So, to get around the problem caused by submitting the same piece of geometry to the GPU multiple times, smart programmers decided: why not just submit it all in one big chunk?
This is what geometry instancing is today: if you have one thing and you want to render it lots of times, bunch all the copies up into the same vertex buffer and submit them to render in one go.
To prepare a model for instancing, what I do is:
1) Copy all its vertex normals, positions and UV data from the model's vertex buffer into a vector.
2) Create a new vertex declaration with the same structure as the model's vertex buffer but with an added index value, and make the new vertex buffer sixty times larger than the original model's vertex buffer.
In other words, if the vertex buffer I'm going to be instancing has a struct as follows:
```cpp
struct MESHVERT
{
    float x, y, z;    // Position
    float nx, ny, nz; // Normal
    float tu, tv;     // Texcoord
};
```
The instanced version will be:
```cpp
struct MESHVERTInstanced
{
    float x, y, z;    // Position
    float nx, ny, nz; // Normal
    float tu, tv;     // Texcoord
    float idx;        // index of the instance this vertex belongs to!
};
```
By giving each vertex an index value I can tell which instance I'm rendering in the vertex shader. This is important so I can apply different transformations to different instances.
3) Copy the vertex normals, positions and UV data into a larger vertex buffer, which is created using the new vertex declaration, and set the correct index values for them.
So if I instance a model with 40 vertices 60 times, the new vertex buffer will have 40*60 vertices. The index value (float idx) will be 0 for vertices 0 to 39 of the instanced buffer, 1 for vertices 40 to 79, 2 for vertices 80 to 119, and so on and so forth.
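Steps 1 to 3 can be sketched on the CPU side roughly like this. It's only a minimal sketch: `BuildInstancedVertices` is a made-up helper name, the data lives in plain `std::vector`s rather than real DirectX vertex buffers, and the two structs are repeated so the snippet stands on its own.

```cpp
#include <cstddef>
#include <vector>

struct MESHVERT {
    float x, y, z;    // Position
    float nx, ny, nz; // Normal
    float tu, tv;     // Texcoord
};

struct MESHVERTInstanced {
    float x, y, z;    // Position
    float nx, ny, nz; // Normal
    float tu, tv;     // Texcoord
    float idx;        // index of the instance this vertex belongs to
};

// Hypothetical helper (not from the engine): repeats the source vertices
// instanceCount times and tags every copy with its instance index.
std::vector<MESHVERTInstanced> BuildInstancedVertices(
    const std::vector<MESHVERT>& source, std::size_t instanceCount)
{
    std::vector<MESHVERTInstanced> out;
    out.reserve(source.size() * instanceCount);

    for (std::size_t instance = 0; instance < instanceCount; ++instance) {
        for (const MESHVERT& v : source) {
            MESHVERTInstanced iv = {
                v.x, v.y, v.z,
                v.nx, v.ny, v.nz,
                v.tu, v.tv,
                static_cast<float>(instance)
            };
            out.push_back(iv);
        }
    }
    return out;
}
```

Calling `BuildInstancedVertices(mesh, 60)` on a 40-vertex mesh gives 2400 vertices, which would then be copied into the real instanced vertex buffer.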
Now that we've created a nice large vertex buffer which is capable of rendering 60 instances of the same model in one go, how do we go about rendering it?
In the vertex shader we define a matrix array of size 60, and then on the CPU side we fill that matrix array with the translations. That way the model is rendered 60 times with sixty different translations, and all the renders come at once. In other words:
```
// In the shader:
float4x4 g_vInstanceTransforms[60];

// In the C++:
// (element type assumed to be D3DXMATRIX, which is what SetMatrixArray takes)
std::vector<D3DXMATRIX> vcTransforms;
/* fill up vcTransforms with values, if it's larger than 60 it's a waste */
m_pEffect->SetMatrixArray("g_vInstanceTransforms", &vcTransforms[0], 60);

// Back in the shader we have:
struct VertexShaderInput
{
    float4 Position : POSITION0;
    float3 Normal   : NORMAL0;
    float2 TexCoord : TEXCOORD0;
    float  idx      : TEXCOORD1; // the index value!
};

VertexShaderOutput VertexShaderFunction(VertexShaderInput input)
{
    // pick this instance's transform using the index stored in the vertex
    float4 worldPosition = mul(input.Position, g_vInstanceTransforms[(int)input.idx]);
    // ... the rest of the vertex shader (projection, outputs) goes here
}
```
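To make the lookup a bit more concrete, here's a rough CPU-side emulation of what the vertex shader does for each vertex. None of this is engine code: `Vec3`, `Mat4`, `Translation`, `TransformPoint` and `TransformInstanced` are all stand-ins I'm assuming for illustration, and the matrix uses the row-vector convention (point times matrix) to match `mul(input.Position, matrix)`.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Row-major 4x4 matrix, row-vector convention: result = [x y z 1] * M.
struct Mat4 { float m[4][4]; };

// Builds a pure translation matrix (translation lives in the last row).
Mat4 Translation(float tx, float ty, float tz)
{
    Mat4 r = {{
        {1.0f, 0.0f, 0.0f, 0.0f},
        {0.0f, 1.0f, 0.0f, 0.0f},
        {0.0f, 0.0f, 1.0f, 0.0f},
        {tx,   ty,   tz,   1.0f}
    }};
    return r;
}

// Transforms a point by a matrix, treating the point as [x y z 1].
Vec3 TransformPoint(const Vec3& p, const Mat4& t)
{
    Vec3 r;
    r.x = p.x * t.m[0][0] + p.y * t.m[1][0] + p.z * t.m[2][0] + t.m[3][0];
    r.y = p.x * t.m[0][1] + p.y * t.m[1][1] + p.z * t.m[2][1] + t.m[3][1];
    r.z = p.x * t.m[0][2] + p.y * t.m[1][2] + p.z * t.m[2][2] + t.m[3][2];
    return r;
}

struct InstancedVertex {
    Vec3  position;
    float idx; // instance index stored per vertex, as in the instanced buffer
};

// Emulates the shader: each vertex picks its transform by its own idx.
// transforms.at() throws if a vertex refers to a missing transform.
std::vector<Vec3> TransformInstanced(const std::vector<InstancedVertex>& verts,
                                     const std::vector<Mat4>& transforms)
{
    std::vector<Vec3> out;
    out.reserve(verts.size());
    for (const InstancedVertex& v : verts) {
        const std::size_t instance = static_cast<std::size_t>(v.idx);
        out.push_back(TransformPoint(v.position, transforms.at(instance)));
    }
    return out;
}
```

The point is just that every vertex carries enough information to find its own transform, so a single submission can place each copy of the model somewhere different.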
And there we have it.
Of course, we can change the number of triangles we submit from the vertex buffer to render fewer than 60 instances at a time, and we can render the same instance buffer multiple times if we so choose.
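As a rough sketch of how that could be planned out, assuming a 60-instance buffer like the one above: the helper below splits any number of instances into draws of at most 60, and works out how much of the instanced buffer each draw uses. `Batch` and `PlanBatches` are hypothetical names for illustration, not something from my engine.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Batch {
    std::size_t firstInstance; // index of the first transform used by this draw
    std::size_t instanceCount; // how many instances this draw covers
    std::size_t vertexCount;   // how many vertices of the instanced buffer to submit
};

// Hypothetical helper: splits totalInstances into draws of at most maxPerBatch.
std::vector<Batch> PlanBatches(std::size_t totalInstances,
                               std::size_t vertsPerInstance,
                               std::size_t maxPerBatch = 60)
{
    std::vector<Batch> batches;
    if (maxPerBatch == 0) {
        return batches; // nothing sensible to plan
    }
    for (std::size_t first = 0; first < totalInstances; first += maxPerBatch) {
        const std::size_t count = std::min(maxPerBatch, totalInstances - first);
        Batch b = { first, count, count * vertsPerInstance };
        batches.push_back(b);
    }
    return batches;
}
```

For 130 instances of a 40-vertex model this plans three draws: two full ones of 60 instances (2400 vertices each) and a last one of 10 instances (400 vertices). Each draw would upload its own slice of transforms before rendering.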
In terms of performance, I have found that this method of instancing is way fast. From dodgy performance tests with Fraps, which took about 30 seconds and not much thought, I found that rendering 60 instances of the same mesh was just two times slower than rendering the mesh once. Rendering the mesh sixty times uninstanced, with the same translations, was about four times slower than rendering the instance buffer once.
And to close off, here's a video of some early gameplay. Note that every projectile is instanced, so there isn't a huge performance hit, if there is one at all:
Early Gameplay Demo from Meds on Vimeo.
June 28th, 2010 at 4:45 am
Interesting to see a float as an index. Is that for efficiency, portability, simplicity?
June 28th, 2010 at 4:48 am
Hi Sam,
That might actually just be me being silly. I'm not sure if DirectX vertex buffers support shorts/ints; they probably do, but I was too lazy to find out.
Probably should have before making this post eh?
June 28th, 2010 at 6:34 pm
What you're really looking for is a platform-specific pbuffer: you buffer the vertex data over the bus once, and it then stays local in GPU memory to use whenever needed (pbuffers in Windows, and, in more recent GL, just static data buffers). By keeping the data in GPU memory you don't have to buffer it over every frame as in traditional vertex submission (which is where the bottleneck lies).
This works great for static geometry that doesn't change. Of course, it's trickier if your vertex data is ever-changing, like with an animated mesh.