COSC-3P98-Assigment-3/README.md

9.3 KiB

Introduction

  • Jacob wanted to see it.
  • freeglut is required.
    • install with sudo apt install freeglut3-dev
  • build with:
    • git clone --recursive https://github.com/Tri11Paragon/COSC-3P98-Assigment-3.git
    • cd COSC-3P98-Assigment-3/
    • mkdir build && cd build/
    • cmake -DCMAKE_BUILD_TYPE=Release ../
    • make -j 16
    • ./assign3
  • Windows supported (MSVC only), Linux is preferred.

Description

On my hardware, the simple particle fountain can reach 30k particles and features sorted transparency.

Extra Features

  • Ordered Alpha Blending

  • "Spray" Mode (12)

  • Textured Particles/Plane/Cube (16)

  • Particles with Different Textures (18)

  • Extra Feature - "Performance Mode" (23)

Missing Features

The random spin mode was left out intentionally for two reasons. One, I specifically designed the particle structure to fit in 32 bytes, half the width of a cache line. Two, the potential marks were not worth disturbing the particle data structure and further altering the particle system. There is likely little benefit to ensuring the particles fit nicely inside a cache line as most of the CPU time is spent on OpenGL API calls. See "Performance Mode" for more information.

Building

Caveats

The assignment makes use of a non-standard OpenGL extension during texture loading. "GL_TEXTURE_MAX_ANISOTROPY_EXT" should work on all modern Intel/AMD/Nvidia hardware, if it doesn't work on your hardware consider removing the line from texture.h and high_perf.cpp

Build Commands

    mkdir build && cd build
    cmake -DCMAKE_BUILD_TYPE=Release ../
    make -j 16
    ./assign3

Usage

Keybindings and usage instructions are printed at program startup.



Performance Mode

Design

The high-performance mode is the result of a weekend hack-a-ton where I wanted to see how easy it would be to implement a million particle+ renderer. If I had more time I would encapsulate the high_perf.cpp file into the particle system class, allowing for multiple *and customizable* particle systems. If you wish to change settings, most are constants in shaders/physics.comp or high_perf/high_perf.cpp. The rendering engine itself can handle around 20 million particles at about 60fps. With physics enabled, the engine can handle about 6 million particles but the renderer is fillrate limited. Solutions to increase the number of rendered particles are discussed in the future plans section. The new renderer makes use of "GL_POINTS" with a geometry shader to generate the vertices and features billboarding/texturing. A compute shader is used before rendering to update the particle positions and directions on the GPU. This way there is no need to copy the data to and from the graphics card.

Renderer

The legacy OpenGL renderer uses display lists to speed up the rendering of particles. Although this method is faster than using the same draw commands inline, it is highly limited by driver overhead. Modern GPUs are designed to process massive amounts of data all at once and benefit from reducing the amount of synchronization between the GPU and CPU. As mentioned earlier the current renderer uses a vertex buffer object to store all particle positions and directions in one giant array. It then uses those points to render all particles in a single draw call, thereby reducing driver overhead.

Rendering Pipeline

Vertex Shader

The vertex shader is purely used to pass through the particle position to the geometry shader. Since the vertex shader cannot output multiple vertices (not easily at least), we have to use a geometry shader.

Geometry Shader

The geometry shader uses the up and right vectors from the inverse view matrix to generate a quad facing the camera. It takes in the particle position and outputs a triangle strip. This is a highly efficient operation as there is dedicated hardware to handle this particular geometry shader case (1:4)[@amdprogram p. 9].

Fragment Shader

The fragment shader is run once per pixel and is responsible for texturing the particles. I use a texture array as it can be bound once before rendering, therefore particles do not need to be separated by texture. Using an array has the downside of every texture needing to be the same size, to solve this I resize the texture as it is loaded. Unfortunately, this will lead to some textures being distorted but the performance gain is worth it. The modern renderer is constrained by the lack of 'advanced' programming techniques, some of which are discussed in the future plans section.

Compute Shader

Compute shaders are very useful for embarrassingly parallel tasks like updating particles. The compute shader is a very simple (nearly 1:1) translation of the CPU version of the particle system's update function. It handles 'dead' particles by resetting them to the initial position/direction. As a result, particles are initialized with a random lifetime between 0 and the max lifetime to ensure even distribution. If you change the particle lifetime, please modify both constants!

Direction Offseting

Because generating random numbers on the GPU is hard (there is no dedicated hardware random number generator), I generate a random set of offsets at startup and upload these randoms to the GPU. The particle index is then used to access this buffer when the particle is reset; the result is a convincing distribution of particles. The large the number of particles the larger the offset buffer should be. Up to 6 million 8192 should be fine. If things look off consider increasing the value to some larger power of two. Make sure you update both constants here as well!

Usage

Building

Add "-DEXTRAS=ON" to the CMake command.

        mkdir build && cd build
        cmake -DCMAKE_BUILD_TYPE=Release -DEXTRAS=ON ../
        make -j 16
        ./assign3

Running

All particles exist from the beginning, which means all particles start at the initial position and slowly spread out. After starting the program but before moving around, you should press 'p' to allow the compute shader to run, once the particles spread out it is safe to move. The slow performance of all the particles in the same spot has to do with overdraw accessing and writing the same location of the depth texture (hard to do in parallel). Fillrate is a common issue with this particle renderer. See the future plans section for possible resolutions.

Future Plans

Unfortunately, because this is exam season, I do not have time to do anything more with this assignment. I plan to return to this project in the future, and below is a list of features I began looking into.

Lists

I would like to make it so all particles are not rendered all the time. add a dead/alive particle list. This would prevent the issue of all particles starting in the same place, the low performance that causes, and would be helpful in sorting.

Bitonic Sort

Professor Robson spent a good deal of time on this algorithm in the parallel computing class and most of the literature online suggests using this, including the famous GDC talk on optimized particle systems. The next step in implementing a good GPU-accelerated particle system is bitonic sorting, however as I got further into the algorithm it became clear that if I wanted to implement it myself, properly understanding the algorithm would take too much time.

Tiling

Tiling is a rendering technique that works by dividing the view frustum into small sections and then sorting / culling particles within. This way we can reduce the overall rendering load while staying on the GPU. This is an algorithm I've always wanted to implement but lack a solid understanding of. The GDC talk goes further into this and many online resources talk about light clustering. Again, if I had more time I would've learned it for this assignment as I think that would have been cool.

Hierarchical Depth Buffers

The idea is that by generating mipmaps of the depth buffer we can do broad phase culling of particles, thereby reducing the number of fine-grained (per pixel) accesses to the depth buffer. This is a micro-optimization as stated by Mike Turitzin, but "every ms counts".

Lighting

Since tiling particles comes from Forward+ rendering, it would make sense to implement a forward+ renderer.

Figures

20 million particles distributed in a 50x25x50 cube with loadmonitors

20 million particles on the newrenderer

6.4 million particles, fillrate (not compute)limited.

6.4 million particles, zoomed out, showing fillrate as the limitingfactor in speed.

References