"Perpetual Motion Engine" (Early Demo)... By Me! : D

Manufacturer: Scott Michaud

A new generation of Software Rendering Engines.

We have been busy with side projects, here at PC Perspective, over the last year. Ryan has nearly broken his back rating the frames. Ken, along with running the video equipment and "getting an education", developed a hardware switching device for Wirecast and XSplit.

My project, "Perpetual Motion Engine", has been researching and developing a GPU-accelerated software rendering engine. Now, to be clear, this is just in very early development for the moment. The point is not to draw beautiful scenes. Not yet. The point is to show what OpenGL and DirectX do and what limits are removed when you do the math directly.

Errata: BioShock uses a modified Unreal Engine 2.5, not 3.

In the above video:

  • I show the problems with graphics APIs such as DirectX and OpenGL.
  • I talk about what those APIs attempt to solve: finding color values for your monitor.
  • I discuss the advantages of boiling graphics problems down to general mathematics.
  • Finally, I prove the advantages of boiling graphics problems down to general mathematics.

I would recommend watching the video, first, before moving forward with the rest of the editorial. A few parts need to be seen for better understanding.


The whole purpose of graphics processing is to get a whole bunch of color values: one per pixel per frame of animation. Originally, every game engine was software rendered but there were certain tasks which took up the vast majority of CPU time. Graphics processors were invented to offload those difficult tasks to somewhere better suited. APIs, such as DirectX and OpenGL, were created to harness these computing devices.
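To make "one color value per pixel per frame" concrete, here is a minimal Python sketch. It is not code from the engine: the names `shade` and `render_frame` and the gradient they draw are assumptions chosen purely for illustration. A real renderer would swap in its own math for `shade` and run it in parallel on the GPU.

```python
def shade(x, y, width, height, frame):
    """Return an (r, g, b) tuple, each channel 0-255, for a single pixel.

    Hypothetical shading rule: red follows x, green follows y, and blue
    steps up with the frame number so the image changes over time.
    """
    r = (255 * x) // (width - 1) if width > 1 else 0
    g = (255 * y) // (height - 1) if height > 1 else 0
    b = (frame * 8) % 256
    return (r, g, b)


def render_frame(width, height, frame):
    """Evaluate shade() once per pixel and return rows of (r, g, b) tuples."""
    return [[shade(x, y, width, height, frame) for x in range(width)]
            for y in range(height)]


# A tiny 4x3 "screen" for frame number 2.
image = render_frame(4, 3, frame=2)
```

Every pixel is computed independently of its neighbors, which is what makes this kind of problem a natural fit for a massively parallel processor.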

Even then, at least for a while, the hardware accelerated engines (while much higher resolution) lost some features to gain that performance. A good example is Quake's "Fullbright" textures.

These graphics processing units (GPUs), due to the complexity of modern shaders and the interest of high performance computing (HPC) customers, are now massively parallel solvers of general math problems. Tim Sweeney of Epic Games has predicted the end of DirectX and OpenGL for quite some time although cost concerns dampened his projections. Back in 2008, Tim expected to write "100% of the rendering code" for their next generation in a GPU-accelerated "real programming language" such as CUDA. The next year, he predicted an order of magnitude increase in development costs. Unreal Engine 4 is based on DirectX and OpenGL.

But, as described in the video above, there are benefits to trailblazing.

"Perpetual Motion Engine" has a secret: it is written in Web Standards. The functional parts of the engine are developed in Javascript, WebGL, and WebCL; a few HTML, CSS, and SVG tricks have also been snuck in. Its platform is any compatible web browser, eventually without plugins or extensions.

Note: Due to a limitation with Nokia Research's WebCL prototype, textures are calculated in WebCL and pass through a 2D Canvas before arriving at WebGL. Nokia is unable, via an extension, to access WebGL buffers. Mozilla would need to add that functionality whenever WebCL is integrated into Firefox directly.

Another Note: Perpetual Motion Engine is designed to not require an internet connection. You could copy the game to a USB thumb drive or a directory on your hard drive and simply point the web browser to its index.html (Firefox will not even throw a security complaint as all files are contained in subdirectories). Of course, if it takes off, some games can be hosted on HTTP and, thus, require an internet connection... but that is their choice.

Eventually, the platform will not matter. Once you establish an acceptable WebCL platform and the drivers stabilize, your only barrier should be raw performance. The same code should run in both a future web browser on a high-end desktop and a future mobile browser on a tablet. Javascript and OpenCL will continue to advance and optimize as time goes on. Perpetual Motion Engine should only get faster and faster.

See why I chose that name?

The real benefit is developer control. Imagine the following four scenarios:

1) If you develop a primarily voxel-based game, like Minecraft, it might make sense to not convert your content to triangles and render them with a scanline algorithm; you might prefer to render voxels directly. (Update: Apparently point-in-triangle, actually just about the algorithm I used in the demo, is commonly used by GPUs to acquire pixels within a triangle and not scanline since like... the Nintendo DS.)

2) A video production house could wish to divide their compute workload across multiple Tesla, FirePro, or Xeon Phi coprocessors in a way not possible with SLI or CrossFire. For instance, they could deliberately render 6-or-more frames in advance and sacrifice input delay (in high quality mode) for the ability to harness a half-dozen fully-utilized GPUs.

3) A game could even combine rendering methods. Imagine a typical scene with a few reflective objects. The scene could be rendered, in the normal way, using the primary GPU; the reflective surfaces could, simultaneously, be raytraced to a separate render buffer using the (normally idle) APU or iGPU. The main image and its highlights could then be layered, like Photoshop, and composited together using any algorithm the developer wishes to implement (normal transparency, add, multiply, screen, etc.).

4) As mentioned earlier, a game could have gorgeous scenes comprised of hundreds of millions of triangles (or whatever). If the developer scales back the resolution, polygon count, and features to require some factor less performance, it will run on hardware which is about that factor slower. You could see the same game running on a high-end desktop and a tablet plugged into a TV, just much less pretty.
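The point-in-triangle test mentioned in scenario 1 can be sketched in a few lines. This is an illustration only, not the demo's code: the function names `edge` and `covered_pixels` are hypothetical, and sampling at pixel centers with inclusive edges is an assumption (real hardware applies stricter tie-breaking rules so shared edges are not drawn twice).

```python
import math


def edge(ax, ay, bx, by, px, py):
    """Signed area term: its sign says which side of edge A->B point P is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covered_pixels(v0, v1, v2):
    """Return (x, y) pixels whose centers lie inside triangle v0-v1-v2.

    Only pixels in the triangle's bounding box are tested. Each test is
    independent, so on a GPU they could all run in parallel.
    """
    min_x = math.floor(min(v0[0], v1[0], v2[0]))
    max_x = math.ceil(max(v0[0], v1[0], v2[0]))
    min_y = math.floor(min(v0[1], v1[1], v2[1]))
    max_y = math.ceil(max(v0[1], v1[1], v2[1]))

    covered = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(*v1, *v2, px, py)
            w1 = edge(*v2, *v0, px, py)
            w2 = edge(*v0, *v1, px, py)
            # Inside if all three signs agree; accepts either winding order.
            all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
            all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
            if all_non_negative or all_non_positive:
                covered.append((x, y))
    return covered


pixels = covered_pixels((0, 0), (4, 0), (0, 4))
```

The same three edge values, once normalized, double as barycentric weights for interpolating colors or texture coordinates across the triangle.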

This has been something I have thought about for quite some time. If you have any questions or comments, please, discuss below! I am still deciding where to go from here, from a project sense, but that is not getting in the way of my development. I believe this would be a good open source project, perhaps BSD-license game code and LGPL-license engine code, but it is still much too early for anything like that.

And, of course, I will keep our readers up to date as it (literally) develops.

October 2, 2013 | 02:19 PM - Posted by YTech

Not sure that I grasp much of the intended message.

I would suggest a little more cleaning and clarity in the presentation.

For the video, please adjust the audio to a constant level during editing.

Other than that, if those APIs are no longer supported or are removed in future products, how will gaming of older games work?

I would believe DOSBox has its own API for old VGA games. What about the first Lara Croft game, which I believe uses DirectX?

October 2, 2013 | 04:27 PM - Posted by Scott Michaud

Actually, the demo does use WebGL to some extent. WebCL calculates textures which WebGL draws to a 2-triangle rectangle covering the screen (it has to pass through a 2D canvas element, first, because Mozilla doesn't allow extensions that much control... that will be solved when Mozilla implements WebCL themselves).

This actually leads to some interesting effects. For instance, 2 WebCL instances can draw 2 Photoshop-like layers which WebGL can then, using a shader, blend together. For instance... my GeForce 670 draws the room... and my Intel HD 4000 draws the raytraced reflections only on one box. The two images -- one the scene and the other just the reflections -- could be merged, again, like Photoshop layers, into the final render buffer.
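As a rough illustration of that layer merge, here is a minimal per-pixel sketch in Python. It is not the demo's shader: the `blend` function, its `mode` names, and the use of floats in the 0.0-1.0 range are assumptions, and the formulas are the commonly used definitions of these blend modes.

```python
def blend(base, layer, mode="normal", alpha=1.0):
    """Blend one RGB pixel of `layer` over `base`.

    Channels are floats in 0.0-1.0. `alpha` is the layer's opacity:
    0.0 leaves `base` untouched, 1.0 applies the blend mode fully.
    """
    formulas = {
        "normal": lambda b, l: l,
        "add": lambda b, l: min(1.0, b + l),
        "multiply": lambda b, l: b * l,
        "screen": lambda b, l: 1.0 - (1.0 - b) * (1.0 - l),
    }
    mix = formulas[mode]
    return tuple(b + alpha * (mix(b, l) - b) for b, l in zip(base, layer))


scene_pixel = (0.5, 0.5, 0.5)       # e.g. from the layer with the room
reflection_pixel = (0.5, 0.0, 1.0)  # e.g. from the reflections-only layer

multiplied = blend(scene_pixel, reflection_pixel, "multiply")
screened = blend(scene_pixel, reflection_pixel, "screen")
added = blend(scene_pixel, reflection_pixel, "add")
half_normal = blend(scene_pixel, reflection_pixel, "normal", alpha=0.5)
```

Because the developer owns this step, any other formula could be dropped into the table without waiting on an API to expose it.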

Technically I could just draw it to a CPU-based canvas... but (after WebGL and WebCL are allowed to share resources) I would save a bit of time not needing to copy the textures FROM the GPU TO the CPU only to copy it BACK to the GPU to output to monitor. That might be a good idea in some cases, though, such as if you want to use 3 or 4 (even mismatched) GPUs.

October 2, 2013 | 02:29 PM - Posted by mvp324 (not verified)

So the PME could/should be considered an Engine like Unreal/Frostbite and so forth? I'm not that savvy with Engines/GPU API's and so forth, I dabbled with OpenGL in College for one semester, was fun but I didn't pursue it, as I don't like coding.

From what I gleaned from reading wiki and your article, an Engine uses an API to perform the calculations needed for each scene to display. Your PME uses various code to display the "program" via a web interface vice an executable.

If I recall correctly wasn't there another engine that was able to do what you accomplished (render in a web interface)? What is the main difference between that engine and your engine?

Could you use your PME as a test for video cards? Do you plan to introduce raytracing into your engine?

October 2, 2013 | 04:23 PM - Posted by Scott Michaud

Oh there's a lot of WebGL based engines... even a few WebCL ones but so far "photo-realistic non-realtime".

But yes, the intent is to make it a full gaming engine, programmed in Javascript with the 'big workloads' sent to the GPU. The workloads sent to the GPU are not like OpenGL calls, however. They are simplified C-based functions... very similar to pixel shaders or vertex shaders but do not operate on just vertices and fragments found by scanline operations.

For all WebCL knew, I could be calculating nuclear decay patterns for a university project. It does not need to know where the code gets inserted into the pipeline... it doesn't have a pipeline, I have one.

So that is the main difference between my demo and WebGL demos: they give WebGL geometry, a shader script to be run on every vertex, and a shader script to be run on every scanline-drawn pixel. My demo gives the GPU a calculation... it gives me an answer. Whatever that is? My job.

October 2, 2013 | 03:56 PM - Posted by Kyle Tuck (not verified)


October 2, 2013 | 04:16 PM - Posted by Scott Michaud

Lol! Allyn asked me the same thing when I showed him the preview.

No... unfortunately... one of the nerfed WRT54Gs. Served the purpose, never bothered to upgrade.

October 2, 2013 | 06:12 PM - Posted by Anonymous (not verified)

I think a fair amount of this article stems from some unfortunate ignorance. Here goes:

D3D and OpenGL interfaces of "drawing triangles" provide a pipeline to:
1) Hide latency: GPUs nowadays launch many jobs per triangle. When the fragment shader hits a stall (for example from a texture fetch), the GPU switches job to another fragment, hiding the latency of the fetch.
2) Maximize locality. GPUs store textures in a tiled (sometimes called swizzled) format where neighboring texels in both x and y directions are close to each other (in contrast, in linear format neighboring x is close but neighboring y is far away).
3) Massive parallelism. Because fragment shaders are not very branchy, a single execution decoder can run multiple fragments simultaneously. NVIDIA uses their "Single instruction multiple threads" whereas AMD and Intel use SIMD style. Typically primitives are rasterized in 2x2 or 4x4 blocks; neighboring fragments of the same primitive usually (like 90% of the time or so) take the exact same branch, thus SIMD and SIMT are not only feasible, but ideal.
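
The locality point in 2) can be shown with a small sketch. Actual GPU tiling layouts are vendor-specific, so this uses Morton (Z-order) indexing as a stand-in, which is an assumption and not any particular GPU's format; the function names are hypothetical.

```python
def linear_offset(x, y, width):
    """Row-major layout: moving one texel in y jumps a whole row ahead."""
    return y * width + x


def morton_offset(x, y, bits=16):
    """Z-order layout: interleave the bits of x and y into one offset.

    x takes the even bit positions and y the odd ones, so small steps
    in either direction tend to stay nearby in memory.
    """
    offset = 0
    for bit in range(bits):
        offset |= ((x >> bit) & 1) << (2 * bit)
        offset |= ((y >> bit) & 1) << (2 * bit + 1)
    return offset


width = 1024
# Distance in memory between texel (0, 0) and its neighbor below, (0, 1).
linear_step = linear_offset(0, 1, width) - linear_offset(0, 0, width)
morton_step = morton_offset(0, 1) - morton_offset(0, 0)
```

In the linear layout the vertical neighbor sits a full row (1024 texels) away, while in the Z-order layout it is two texels away, so a 2x2 block of fragments touches adjacent memory.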

Going further, standard bits of a 3D pipeline such as:
1) rasterization
2) depth and stencil tests (in particular depth compression to cut down on bandwidth)
3) clipping
4) w-divide
5) hardware clipping (both user clip planes and normalized device coordinate clip planes)

are all fixed function bits with dedicated hardware to do these jobs. Doing a "software" renderer with OpenCL (or CUDA since that has much better tools and features than OpenCL) is going to lag badly because of the lack of dedicated hardware, far less prediction of loads, and quite often core starvation from bandwidth issues.

On the subject of ray tracing: the nasty, icky problem of ray tracing is that changes in the scene, like from a door opening, force a massive update of tree structures. In order for the raytrace to be highly parallelized, that means lots of CPU driven updates for these structures and a repack to GPU to make a GPU ray tracer work. Worse: ray tracing is far from a silver bullet for realistic rendering. Surface scattering (without making zillions of rays on each surface hit) is not a pleasant thing to even try on a ray tracer.

But you know if you want to try to make a game with this that won't require the highest gfx card to render at 60hz with the similar visual quality from a rasterize based game running on a much lower end gfx card, go ahead. Plenty have already failed, what is one more, eh?

Even Intel at one point thought the ray tracing thing was the way to go, but we all know how well Larrabee turned out, eh? Going further back there was even an API for raytracing called OpenRT. That is essentially dead too.

There are ray-tracers hooked to GPU tightly now, for example OptiX, but comparing the visual-quality-performance ratio of that to a rasterized scene, well you get the idea.

October 3, 2013 | 04:07 PM - Posted by Anonymous (not verified)

Someone seems to have had a bad day here :)

October 2, 2013 | 07:30 PM - Posted by Florian (not verified)

The reason we use GPUs' built-in rasterizers, instead of what you seem to like doing in WebCL/OpenCL/CUDA/whatever, is not because it's infeasible to rasterize in software. Software rasterization isn't magic or anything, it's extremely simple to do, and sure, it's wonderfully flexible.

The reason is that it's about 50x slower to rasterize in software on a GPU than to use the built-in rasterizer. Of course, that's still better than rasterizing on the CPU, which is about 1000x slower.

But when you're talking about vertex and fragment throughput, you're basically taking a step back a decade performance-wise. So, that may work for some things, where you only use a tiny fraction of the performance anyway. Unfortunately, the biggest use case for more flexible programmable graphics is the high-end spectrum, and there you want to get every teensy tiny bit of performance: all those millions of triangles you can push through every frame, all those gigatexels of fillrate, all of it. And you wouldn't give up 20%, or half, let alone 98% of it for anything.

October 2, 2013 | 09:47 PM - Posted by Ryan Holtz (not verified)

The anon above brings up several significant issues with what you propose, and I agree with him that a lot of what you're doing seems to stem from a general lack of understanding of how GPU hardware works at a low level.

The really telling part was this:

"If you develop a primarily voxel-based game, like Minecraft, it might make sense to not convert your content to triangles and render them with a scanline algorithm"

GPUs have not rendered triangles using edge-stepping and scanlines for literally years. At this point it is far more efficient for GPUs to use a massively parallel point-in-polygon test across the screen-space bounding box of the primitive after it's been emitted from the vertex shader. Out of curiosity, was most of your knowledge on how GPUs and drivers work based on information from the mid to late 90's?

October 3, 2013 | 01:52 AM - Posted by Scott Michaud

Updated the post to fix.

October 3, 2013 | 04:31 PM - Posted by Anonymous (not verified)

Yeah, there was a greatly insightful series of articles on ryg's blog about tile based rasterization in software that touched this.

But on topic, my thoughts were these:

Yes, the "anon above" seems to be working in the field and have some points to make knowing what he's talking about.
Then again the article explains the path of thought that led to this, with Tim Sweeney as a source of argumentation. The author is apparently doing this now and betting Tim will be right in the future to get the gains of his work.
The anon above's arguments seem right, but do they make Tim Sweeney wrong? Only time will tell. I'm not sure they would have been handed to him in the same manner.

Besides, good thing or not, not only the performance and intrinsic qualities of some project decide its success...
Sometimes being the right thing at the right time is enough, and this one is not a bad idea in its cross-platform spirit, just in the same way that maybe javascript is not technically optimal but it's widespread and successful. I just thought, let's make the crazy hypothesis that, for one reason or another, maybe independent from the rightness of the initial vision or the pertinence of the goals, this project catches wind, and becomes supersuccessful. Yeah, maybe that's not the most likely but let's suppose.

Then I thought, the anon above, however right he was...
would look just wrong :)

People are allowed to try things.

October 3, 2013 | 04:20 PM - Posted by ezjohny

Scott, I read your article and got some of it; for the rest, I was up in the clouds. It looks like OpenCL is going to be the dominant one because developers are going to do more with it.

October 3, 2013 | 07:34 PM - Posted by Panta

Josh Rules ..


October 4, 2013 | 06:01 PM - Posted by Anonymous (not verified)

The same anon back here.

I think the bits of Tim Sweeney are taken slightly out of context; let's look at one of the articles:

Some of the bits he was commenting on, about using a pixel shader to do a computation, are/were a hack from before compute shaders, CUDA, or OpenCL. Additionally, some of what he mentioned was bypassing a GPU rasterizer, and in other portions he talks about bypassing the APIs. Both OpenGL and Direct3D have significant overhead for each draw call, state change, and so on. On the other hand, programming directly to the metal can dramatically increase performance at the cost of developer pain. Along those lines, a big issue coming about with AMD's APUs (especially the one in the PS4) is far more flexible memory management, the biggest being mostly seamless memory passing between CPU and GPU without copies eating bandwidth. The current APIs, both OpenGL and Direct3D, need some major love to take advantage of these new features (and for this to work, PCs need the main system RAM to be high-end RAM like GDDR5 instead of DDR3, and so on).

On the subject of Tim Sweeney's quotes, notice that the article is from DX10 and now we have DX11. If there are more recent quotes I'd be curious to see them.

However, at the end of the day, it comes down to this: the visual quality to performance ratio of rasterizers vs ray tracers is a no-brainer. Going further, a hardware implementation of a piece of functionality will always beat a software implementation by HUGE margins; rasterization is easier on many levels to do in hardware (there are serious challenges there too!) than anything else so far to present 3D scenes to a 2D framebuffer.

Also, what some fail to realize is that much of the extra pretty things that a ray tracer does well, like highly reflective spheres, are special cases that a ray tracer can highly optimize; change the mesh to a set of Bezier surfaces and the nice super optimization of sphere reflection is gone, and a ray tracer will then hunt for ray-triangle intersections (at least one can still apply bump mapping in ray tracers).

Always keep in mind what a ray tracer will bring over a rasterizer; the two favorites are reflections and shadows; the common complaint against a rasterizer is that it is harder to get the resolution correct for the image target that generates the reflections and shadows; but in truth that is partial bullsh*t; there are techniques and calculations that give a good handle on what the resolution could be. Going further, one can imagine a variable resolution render target where different regions of a buffer have different fragment densities. This is a relatively minor modification of current rasterizer techniques, but requires hardware support. That alone would greatly mute some of the arguments for ray tracing over rasterizers for reflections on non-flat surfaces (for flat surfaces one does not even need to render to an offscreen image first; one can do it directly to the framebuffer with a few tricks). Similar jazz applies to shadows too (namely, project the light viewport to the viewer's to see the required resolution at different parts of the screen).

As for OpenCL taking over the world: I would not bet on it. Compared to CUDA it is hideous. Going further, OpenGL 4 added compute shaders (in spite of there being a buffer sharing API with OpenCL). The upshot is that if a developer wants to go the OpenGL route (on desktop) then they do not need to bother with OpenCL at -all-.

October 6, 2013 | 04:30 AM - Posted by Anonymous (not verified)


While I appreciate your intent, there are some very good reasons for doing gpu-accelerated graphics via specialised ASICs with vendor provided drivers, and relatively high level APIs like DX and OpenGL.

I know it's just a very rudimentary tech demo, but you're using about 15% of a GTX670 to calculate some color transformations and rendering a single triangle. Your solution won't scale.

While you're correct that both DX and OpenGL have inefficiencies, the solution is either a better, more modern API (perhaps Mantle), or a new version of DX or OpenGL that breaks backward compatibility for a more efficient handling of API calls, or reducing the number of API calls needed, or both.

Nonetheless, what you're doing is interesting. I like these video tech-blogs.

October 7, 2013 | 06:24 PM - Posted by Scott Michaud

Performance was not a concern for the demo... and there were some *very* bad things I did (some of which I was forced to do... for now... due to WebCL not having access to WebGL's buffers in Nokia's implementation and so forth -- like copy the frame buffer several times to and from both memories).

Also, GPU load is a shockingly bad way to describe performance. You really need to profile execution time and stuff like that (ram usage, etc.).

Also, things start slow and get quicker... I would not immediately say it is unviable. Sure, it could be a gong show... but I don't think so. I think it could work.

Thanks for the compliments though! : D