A group of users were trying to implement a simple, terminal-based video game and found the performance under Windows Terminal to be entirely unsuitable for such a task. The performance issue can be replicated by repeatedly drawing a “rainbow” and measuring how many frames per second (FPS) we can achieve. The one below has 20 distinct colors and can be drawn at around 30 FPS on my Surface Book with an Intel i7-6700HQ CPU. However, if we draw the same rainbow with 21 or more distinct colors it would drop down to less than 10 FPS. This drop is consistent and doesn’t get worse even with thousands of distinct colors.
Initial investigation with Windows Performance Analyzer
Initially the culprit wasn’t immediately obvious of course. Does the performance drop because we’re misusing Direct2D or DirectWrite? Does our virtual terminal (VT) sequence parser have any issues with quickly processing colors? We usually begin any performance investigations with Windows Performance Analyzer (WPA). It requires us to create an “.etl” trace file, which we can be done using Windows Performance Recorder (WPR).
The “Flame by Process” view inside WPA is my personal favorite. In a flame graph each horizontal bar represents a specific function call. The widths of the bars correspond to the total CPU time spent within that function, including time spent in all functions it calls recursively. This makes it trivial to visually spot changes between two flame graphs of the same application, or to find outliers which are easily visible as overly wide bars.
To replicate this investigation, you’ll need to install Windows Terminal 1.12 as well as a tool, called rainbowbench. After compiling rainbowbench with cmake and your favorite compiler, you should run the commands rainbowbench 20 and rainbowbench 21 for at least 10 seconds each inside Windows Terminal. Make sure to have Windows Performance Recorder (WPR) running and recording a performance trace for you during that time. Afterwards you can open the .etl file with Windows Performance Analyzer (WPA). Within the menu bar you can tell WPA to “Load Symbols” for you.
On the left side in the image above we can see the CPU usage of our text rendering thread when it’s continuously redrawing the same 20 distinct colors and on the right side if we draw 21 colors instead. Thanks to the flame graph we can immediately spot that a drastic change in behavior inside Direct2D must have taken place and that the likely culprit is a function called TextLookupTableAtlas::Fill6x5ContrastRow inside Direct2D. An “atlas” in a graphical application is usually referring to a texture atlas and considering how Direct2D uses the GPU for rendering by default, this is likely code dealing with a texture atlas on the GPU. Luckily several tools already exist with which we can easily debug applications running on the GPU.
PIX and RenderDoc – Debug graphical performance issues with ease
While PIX offers support for packaged applications like Windows Terminal (which PIX refers to as “UWP”) and a large number of helpful metrics out of the box, I found it easier to generate the following visualizations using RenderDoc. Both applications work almost identical however, which makes it easy to switch between them.
Windows Terminal ships with a modern version of conhost.exe, called OpenConsole.exe, featuring several enhancements not present in conhost.exe, one of which are alternative rendering engines. You can find and run OpenConsole.exe from inside Windows Terminal’s application package, or from one of Terminal’s release archives. Afterwards you can create a DWORD key at HKEY_CURRENT_USERConsoleUseDx and assign it the value 0 to get the classic GDI text renderer, 1 for the standard Direct2D renderer and 2 for the new Direct3D engine which solves this issue. This trick is helpful for RenderDoc, which doesn’t support packaged applications like Windows Terminal.
Simply drag and drop an executable onto RenderDoc and “Launch” it. Afterwards snapshots can be “captured” and retroactively analyzed and debugged.
Opening a capture will show the draw commands Direct2D executed on the GPU on the left side. Selecting the “Texture Viewer” will initially yield nothing, but as it turns out certain events in the “Output” tab, like DrawIndexedInstanced, will seemingly present us with the state of our renderer in the middle of execution. Furthermore the “Input” tab contains a texture called “D2D Internal: Grayscale Lookup Table”:
The existence of such a “lookup table” seems highly relevant to our finding that anything over 20 colors slows down our application dramatically and seems related to our problematic TextLookupTableAtlas::Fill6x5ContrastRow function we found using WPA. What if the table’s size is limited? Simply scrolling through all events already confirms our suspicion. The table gets filled with new colors hundreds of times every frame, because it can’t fit 21 colors into a table that only fits 20:
If we limit our test application to 20 distinct colors, the table’s contents stay constant:
So, as it turns out our terminal is running into a corner case for Direct2D: It’s only optimized to handle up to 20 distinct colors in the general case (as of April 2022). Direct2D’s approach isn’t a coincidence either, as the use of a fixed lookup table for colorization reduces its computational complexity and power draw, especially on the older hardware it was originally written for. Additionally, most applications, websites, etc. stay below this limit and if they do happen to exceed it, more often than not, the text is static and doesn’t need to be redrawn 60 times a second. In comparison, it’s not uncommon to see a terminal-based application doing exactly that.
Solving the issue with more aggressive caching
The solution is trivial: We simply create our own, much larger lookup table and wrap it around Direct2D! Well, unfortunately we can’t just tell Direct2D to use our custom cache. In fact, relying on its rendering logic at all would be problematic here, since the maximum amount of performant colors would always remain finite. As such we’ll have to write our own custom text renderer after all.
Turning fonts and the glyphs they contain into actual rasterized images is generally very expensive, which is why implementing a “glyph cache” of sorts will be critical for performance. A primitive way to cache a glyph would be to simply draw it into a tiny texture when we first encounter it. Whenever we encounter it again, we can simply reference the cached texture instead. Just like Direct2D with its lookup table atlas for colorization we can use our own texture atlas for caching glyphs. Instead of drawing 1000 glyphs into 1000 tiny textures, we’ll simply allocate one huge texture and subdivide it into a grid of 1000 glyph cells.
Now let’s say we have a tiny terminal of just 6 by 2 cells and want to draw some colored “Hello, World!” text. We already know the first step is to build up a texture atlas of our glyphs:
After replacing the characters and their glyphs in our terminal with references into our texture atlas, we’re left with a “metadata buffer” that is the same size as the terminal and still stores the color information. The texture atlas contains only the deduplicated and uncolored rasterized glyph textures. But wait a minute… Can’t we just flip this around to get back the original input? And that’s exactly how our GPU shader works:
By writing a primitive pixel shader, we can copy our glyphs from the atlas texture to the display output directly on the GPU. If we ignore more advanced topics like gamma-correctness or ClearType, colorizing the glyphs is just a matter of multiplying our glyph’s alpha mask with any color we want them to be. And our metadata buffer stores both, the index of the glyph we want to copy for each grid cell and the color it’s supposed to be.
The exact performance benefit of this approach depends heavily on the hardware it runs on. Generally, however we found it to be at least on par with our previous Direct2D-based renderer while avoiding any limitations regarding glyph colorization.
We measured the performance with the following hardware:
CPU: AMD Ryzen 9 5950X
GPU: NVIDIA RTX 3080 FE
RAM: 64GB 3200 MHz CL16
Display: 3840×2160 @ 60 Hz
We’ve measured CPU and GPU usage based on the values shown in the Task Manager, as that’s what users will most likely check first when encountering performance issues. Additionally, the total GPU power draw was measured as it’s the best indicator for potential power savings, independent of frequency scaling, etc.
≤ 20 colors
≥ 21 colors
≤ 20 colors
≥ 21 colors
“DxEngine” is the internal name for our previous, Direct2D-based text renderer and “AtlasEngine” for the new renderer. According to these measurements the new renderer not just reduces CPU and GPU usage in general, but also makes it independent of the content that’s being drawn.
Direct2D implements text rendering using a built-in texture atlas to cache rasterized glyphs and a lookup-table for coloring those glyphs. The latter exists, because it reduces the computational cost of coloring glyphs, but unfortunately, as a trade-off, requires an upper limit of distinct colors it can hold at a time. Once you exceed this limit by drawing highly colored text, Direct2D is forced to remove some to make space for new ones, which can lead to an excessive amount of time being spent on updating the lookup-table, causing a steep drop in performance.
This issue isn’t much of a problem for most applications, since text is usually rather static, or doesn’t exceed the upper limit in the first place, but for Terminals we routinely see applications coloring their entire background in block characters, animating text at >60 FPS, etc., where this starts to be a problem.
Our new renderer is specifically written with modern hardware in mind and exclusively supports drawing monospace text in a rectangular grid. The former allows us to take advantage of today’s GPUs with their performant calculations, good support for conditionals and branches and relatively large memories. That way we can safely increase performance by caching more data and perform glyph colorization without lookup-tables, despite the added computational cost. And by only supporting rectangular grids of monospace text, we’re able to dramatically simplify the implementation, offsetting the added computational cost and matching or even exceeding the performance and efficiency of our previous Direct2D-based renderer.
The initial implementation can be seen in pull request #11623. The pull request is quite complex, but the most relevant parts can be found in the renderer/atlas subdirectory. The “parser”, or “CPU-side” part of the engine, can be found inside AtlasEngine.cpp as AtlasEngine::_flushBufferLine and the pixel shader, or “GPU-side”, in shader_ps.hlsl.
Since then, several improvements have been added. The current state at the time of writing can be found here. It includes a gamma-correct implementation of Direct2D’s and DirectWrite’s text blending algorithm inside the 3 files named “dwrite”, as well as an implementation for ClearType blending as a GPU shader. An independent demonstration for this can be seen in the dwrite-hlsl demo project.
The post Case Study: How many colors are too many colors for Windows Terminal? appeared first on Windows Command Line.