glTF in Unity optimization - 1. Buffers, Accessors and Primitives
This is part 1 of a mini-series.
TL;DR:
- Reading the glTF specification carefully right away would have spared me performance issues in corner cases later on.
- How to get from 8 GB down to 0.034 GB of mesh memory usage. glTF may be a bit too generous and allows exporters to create valid, but troublesome data.
Observations
Though glTFast did a decent job on all of the assets in the official sample model repository, some users reported poor performance with certain scenes of theirs.
The investigation showed, that I was approaching loading (mesh) primitive data from the wrong side. Let's look at how a glTF scene is structured.
I'll give a brief overview of the glTF 2.0 specification. If you already know the intrinsic structure of glTFs, you might want to skip ahead to the solutions.
In this article I'll use words that can stand for a general concept and a specific glTF schema entity as well (like a mesh). I'll use capitalized words when I talk about the glTF schema entity in particular and lower-case when it's about the general idea/concept.
glTF scene structure
A glTF scene consists of a hierarchy of Nodes. Nodes have a transform (position, rotation and scale) and can have any number of child Nodes. They can also have a reference to a Mesh, which indicates that the Mesh's geometry is supposed to be rendered with the Nodes transformation. A Mesh can be references by multiple Nodes in a Scene.
glTF data structure
In glTF all geometry data is inside of Buffers, which are plain binary data. Several entities in the glTF's JSON describe how this data is structured and how it can be accessed.
The smallest unit of visible geometry is a Primitive. It consists of triangle indices, references to vertex attribute buffers via Accessors ( positions, normals, texture coordinates, colors and tangents) that those indices reference into and a material. A Primitive belongs to a Mesh, which can contain multiple Primitives.
The Primitive's references to vertex attribute buffers are actually references to Accessors. Accessors define the attribute's type (scalars, vectors of 2,3 or 4 dimensions, etc) and its components data type (byte, short, float). They also reference which BufferView contains the described vertex attributes. The BufferView is the final connection to the actual data inside the Buffers. BufferViews define in which Buffer, at what position (and in case of interleaved attributes, the byte-stride) the actual data is.
This structure gives DCC (digital content creation) tool architects a lot of possibilities and freedom to put 3D scenes into glTFs. It allows efficient re-use of Accessors and Meshes, so there's no need to make data redundant.
glTFast's behavior status quo
When I sketched out glTFast, I apparently was approaching things from the final scene structure's perspective and not from a data perspective. I can hear DOTS folks shaking their heads now 😉
glTFast iterates over all Primitives and then retrieves their data through Accessors and BufferViews from the Buffer. Retrieving means the raw byte values are transferred into C# data types / Unity data structures (like Vector3 arrays for positions) and converted into Unity's coordinate system's space.
This is simple and in many cases fast enough (or not slower), but as soon as multiple Primitives reference the same attributes or indices, those will be retrieved from the Buffer multiple times redundantly. This is bad for two reasons:
- It takes longer
- It wastes memory since more copies of the same data are created
I can't remember why I designed it this way (other than not being aware of this problem), but I probably thought unused Accessors will never get imported this way. A fair point, but a common/sane glTF files don't contain unused data in first place 😃
Analysis of the problem
I didn't want to repeat the mistake of jumping into coding before I understand the problem, so I started researching the glTF structure as well as Unity's Mesh API (the legacy and the new 2019.3 version), because it all comes down to bringing those two together efficiently.
I've had two bad-performing scenes on my hand which illustrate two different problems.
Case 1: Smart re-usage
The first case had Primitives of one Mesh reference the same vertex attributes, but with different materials (and indices). This is actually very common with assets. Triangles that share the Node's transform and even vertices, but have different materials. It's glTF's counterpart to Unity subMeshes and that's what they should end up being imported to.
Case 2: Let's go bonkers
The second case took the concept of re-using Accessors even further. The scene had ~400 Primitives, but only two very large vertex attribute arrays (with Accessors for positions, normals, UVs for each). One buffer counted ~234.000 vertices and the other one ~138.000. This is pretty much the worst-case scenario for glTFast and thus an excellent use-case.
When turned into Unity meshes, Primitives would occupy ~18 MB or ~13 MB each, but most of them consisted of only of a dozen or up to some hundred triangles. The hundreds of vertex data copies took their toll. Overall this 30 MB glTF binary file ate up over 7 GB of memory after import 😱
The official glTF Validator thinks this glTF is perfectly valid. Don McCurdy's glTF Viewer reports 97 million vertices!
.
That's roughly the number of Primitives multiplied by the vertex count. So either this viewer does duplicate all vertex attributes as well, or the number in the report is wrong. It loads it very fast though ( ~3 seconds on my laptop ).
My gut feeling tells me, this scene is not using glTFs ability to re-use accessors to the its benefit.
Finding a solution
The rationale of the analysis is that the ideal glTF importer retrieves the data once per Accessor instead of once per Primitive.
Proper sub meshes
I refactored the loading routine to the following, simplified steps
- Iterate over all Primitives
- Assign the Accessors that they reference their usage type (used for positions or normals or …)
- Iterate over all Accessors
- Now that we know how an Accessor is used, we can properly retrieve and convert its data
- This step is still fast, because threaded via C# Job System
- Wait for all Accessor data to be ready
- Iterate over all Meshes
- Iterate over a Mesh's Primitives and cluster them by identical vertex attribute usage
- Create one Unity mesh per Primitive cluster
- In case of multiple Primitives per cluster, create subMeshes for them
- Iterate over Scene's Nodes
- Create a GameObject/MeshFilter that references the corresponding Mesh.
As a consequence, Accessor data is retrieved only once in any case 🎉🎉🎉
This solution has a bit more pre-computation overhead compared to the old, per Primitive approach, benchmarks showed, that from a performance perspective this is negligible. Scenes that don't re-use vertex attributes might load a couple of milliseconds slower.
Scenes with similar Primitives per Mesh are faster and more memory efficient now. For example, this cube consists of one Mesh with six Primitives (one per side). Instead of importing the vertex data six times and creating six Unity meshes/GameObjects it now does everything only once.
.
These changes will soon be released in glTFast version 0.10.0.
Arbitrary accessors dilemma
Unfortunately, the problem case #2 cannot be resolved 😦
In Unity you render geometry via a Unity Mesh, which is a data structure that tightly bundles vertex attribute buffers and (sub) mesh indices. You cannot keep them separate, even with the improved new Mesh API from 2019.3 (Aras verified that). If Accessors are re-used across multiple Meshes their imported Unity data still gets duplicated.
If your rendering API or game engine gives you access to low level functionality like binding vertex attribute buffers manually, then there's no problem in loading glTF data "as-is".
Solution idea 1
Detect non-overlapping fragments of large vertex buffers and split them up at run-time.
Even if this can be done fairly efficient (which I doubt), it will worsen the computation time of the import. It also means more code to write, stabilize and maintain.
Solution idea 2
Create one Mesh per vertex attributes combination and use Graphics.DrawMesh instead of MeshRenderers for rendering.
In problem case #2 that would mean two giant Unity meshes with hundreds of sub-meshes and a scene hierarchy with custom/non-standard MonoBehaviours for rendering.
I presume this approach is faster than solution 1 at run-time and maybe even more efficient in terms of memory. Still, the non-standard components is what I dislike.
Finally: the "no" solution
This is a classic example of "garbage in, garbage out".
The vast majority of glTF files I came across (including the official sample models), have a sane structure that doesn't yield inefficiencies. As a matter of fact, when trying to produce worst-case example glTFs with the Blender glTF addon, I couldn't achieve it, since it created efficient files with separate vertex attribute buffers all the time.
So I came to the conclusion it's a content problem created by the glTF generator.
Workaround
Fortunately there are tools that optimize glTFs. I used the excellent gltfpack to optimize the named scene and look at the results.
glTFast 0.9 | time | memory |
---|---|---|
original | 16.3 sec | 8.000 GB |
gltfpack | 2.49 sec | 0.034 GB |
glTFast 0.10 | time | memory |
---|---|---|
original | 9.20 sec | 6.010 GB |
gltfpack | 2.53 sec | 0.034 GB |
Optimized glTFs make all the difference! Apart from being ~40 milliseconds (1.6%) slower on optimized content (I assume that's the additional overhead), overall glTFast 0.10 improved quite a bit.
Where to go with these problems
The next steps I eventually intend to make
- Figure out if other glTF importers have problems with this as well
- Certify that this is not an implementation problem of glTFast
- Let affected glTF generators' developers know that their output is troublesome
If this proves to be a generic problem there's gotta be a discussion about possible solutions
- Make the next iteration of the glTF specification more rigorous
- Make the glTF validators throw warnings (or even errors) that inform exporter developers about potential performance problems
Next up
This optimization step, although good and necessary for some scenes, made simple scenes a tiny bit slower. I'm tempted to optimize the nested structure of Coroutines to make up for this right now, but I know that there are lower hanging fruits in other topics first. This has to wait, maybe until after a switch to .NET 4 only and async/await.
Follow me on twitter or subscribe the feed to not miss updates on this topic.
If you liked this read, feel free to