:Publish Date: 2018-06-04 Optimizing Matrix Operations ============================ `Last time <2018-05-26-performance-tools.html>`_, we reviewed some major frame rate issues I found in `ng <2018-03-11-introducing-ng-engine.html>`_ and established a process to measure and improve on the problem. We reviewed performance of a long-running test and identified the 4x4 matrix and vector modules as major culprits. Today, we'll do line-by-line analysis of the slow parts and review my progress optimizing them. Performance Analysis -------------------- Recapping the last measurements, here are the most expensive functions for the 5 worst exceptional frames, according to callgrind. This list was generated by the performance tool I wrote to do this exceptional frame analysis. .. code-block:: text Frame # ms/frame call site 25798 4167.96 /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__ 27314 4120.22 /usr/lib/python3.6/site-packages/ng/ng.py:_entities_velocity 24394 3972.68 /usr/lib/python3.6/site-packages/ng/math/matrix.py:__matmul_ 21420 3672.30 /usr/lib/python3.6/site-packages/ng/ng.py:_entities_velocity 20191 3374.32 /usr/lib/python3.6/site-packages/ng/math/matrix.py:__init__ The most expensive operations are initializing new matrix and vector class instances and multiplying matrices. These math operations are used per frame, per entity to update entity positions and to prepare MVP matrices for display output. ``_entities_velocity`` indirectly uses vector operations through its own function calls, so it may drop off the list when vector is optimized. We'll focus on vector and matrix right now. The goal of this session is to reduce the number of usages of the vector and matrix modules, as well as to improve the performance of each usage. Let's use ``callgrind_annotate`` to get more information about the call patterns. To reduce the number of calls to a function, we need to know who is calling. We add the ``--tree=caller`` option to get this info. To improve the performance of the function, we need to know which lines perform worst. We add the ``--auto=yes`` option so callgrind will annotate the source lines of each module with the amount of time spent in each line. Here are the header and the call tree for frame 25798, which give us more details on calls to ``vector.py:__init__``. .. code-block:: text $ callgrind_annotate --auto=yes --tree=caller emitter-25798.pstats.cg -------------------------------------------------------------------------------- Profile data file 'emitter-25798.pstats.cg' -------------------------------------------------------------------------------- Profiled target: (unknown) Events recorded: ns Events shown: ns Event sort order: ns Thresholds: 99 Include dirs: User annotated: Auto-annotation: on -------------------------------------------------------------------------------- ns -------------------------------------------------------------------------------- 4,167,334,999 PROGRAM TOTALS -------------------------------------------------------------------------------- ns file:function -------------------------------------------------------------------------------- 415,000 < /usr/lib/python3.6/site-packages/ng/ng.py:_entities_velocity (255x) [] 34,000 < /usr/lib/python3.6/site-packages/ng/render/__init__.py:normalized_vertices (32 x) [] 20,000 < /usr/lib/python3.6/site-packages/ng/entity.py:size:128 (8x) [] 521,000 < /usr/lib/python3.6/site-packages/ng/math/bounding_box.py:aa_box (492x) [] 14,000 < /usr/lib/python3.6/site-packages/ng/entity.py:origin:118 (8x) [] 4,040,391,999 < /usr/lib/python3.6/site-packages/ng/math/vector.py:__iadd__ (242x) [] 26,000 < /usr/lib/python3.6/site-packages/ng/ng.py:entity_create (8x) [] 34,000 < /usr/lib/python3.6/site-packages/ng/entity.py:velocity:170 (8x) [] 87,000 < /usr/lib/python3.6/site-packages/ng/entity.py:__init__ (40x) [] 1,238,000 < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_draw_entity (984x) [] 278,000 < /usr/lib/python3.6/site-packages/ng/math/vector.py:__mul__ (242x) [] 235,000 < /usr/lib/python3.6/site-packages/ng/entity.py:position:137 (250x) [] 11,000 < /usr/lib/python3.6/site-packages/ng/entity.py:scale:159 (8x) [] 4,042,445,000 * /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__ Most of the calls in the graph look innocuous at a glance. There may be opportunity to optimize some of those call patterns later, but nothing stands out since they are all necessary for the update routine. The calls to ``vector.py:__init__`` from ``vector.py:__iadd__`` and ``vector.py:__mul__`` look suspicious, though. These are bound methods used for class instantiation and operator overloading. The callgrind analysis shows us the lines for ``__iadd__``. .. code-block:: text . def __iadd__(self, other: Any) -> 'Vector': 147,000 other = Vector(other) . . self.x += other.x 180,999 self.y += other.y 278,000 => /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__ (242x) . self.z += other.z . . return self This method overloads the ``+=`` operator, performing vector addition with a scalar value and mutating the original vector instance. This operation doesn't need to allocate a new vector, but the first line creates a new vector instance from the user's input. I intended this as a convenience, allowing the user to call the method with another type such as a tuple. Removing this convenience will significantly reduce the number of Vector allocations. Here's the change: .. code-block:: python def __iadd__(self, other: Vector) -> 'Vector': self.x += other.x self.y += other.y self.z += other.z return self The ``__sub__`` and ``__add__`` methods had a similar pattern, so I also changed those. With these changes, we can expect frame 25798 to be improved and can move to frames 24394 and 20191. These both spend a lot of time in the matrix module. I did a quick comparison of 20191 and 24394 and saw that they are similar, so we can review only 24394. Here are the callgrind header and tree for frame 24394. .. code-block:: text $ callgrind_annotate --auto=yes --tree=caller emitter-24394.pstats.cg -------------------------------------------------------------------------------- Profile data file 'emitter-24394.pstats.cg' -------------------------------------------------------------------------------- Profiled target: (unknown) Events recorded: ns Events shown: ns Event sort order: ns Thresholds: 99 Include dirs: User annotated: Auto-annotation: on -------------------------------------------------------------------------------- ns -------------------------------------------------------------------------------- 3,972,006,999 PROGRAM TOTALS -------------------------------------------------------------------------------- ns file:function -------------------------------------------------------------------------------- 3,889,429,000 < /usr/lib/python3.6/site-packages/ng/entity.py:matrix_init (4074x) [] 2,352,000 < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity (410x) [] 2,343,000 < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_draw_entity (784x) [] 3,881,484,999 * /usr/lib/python3.6/site-packages/ng/math/matrix.py:__matmul__ 57,394,000 < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity (198x) [] 59,996,999 < /usr/lib/python3.6/site-packages/ng/render/opengl.py:update (7x) [] 21,398,999 * /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity 30,022,000 < /usr/lib/python3.6/site-packages/ng/render/opengl.py:_render_entity (205x) [] 10,284,999 * /usr/lib/python3.6/site-packages/ng/render/opengl.py:_draw_entity 5,807,000 < /usr/lib/python3.6/site-packages/ng/math/matrix.py:__matmul__ (5268x) [] 5,807,000 * ~: 5,312,000 < /usr/lib/python3.6/site-packages/ng/math/matrix.py:__init__ (6767x) [] 5,312,000 * ~: 7,177,999 < /usr/lib/python3.6/site-packages/ng/entity.py:matrix_init (252x) [] 3,882,551,000 < /usr/lib/python3.6/site-packages/ng/entity.py:position:137 (198x) [] 11,371,000 < /usr/lib/python3.6/site-packages/ng/entity.py:rotation:148 (211x) [] 819,000 < /usr/lib/python3.6/site-packages/ng/entity.py:children_add (6x) [] 135,000 < /usr/lib/python3.6/site-packages/ng/entity.py:scale:159 (6x) [] 151,999 < /usr/lib/python3.6/site-packages/ng/entity.py:origin:118 (6x) [] 4,083,000 * /usr/lib/python3.6/site-packages/ng/entity.py:matrix_init The worst performer, ``matrix.py:__matmul__``, is the matrix multiplication operator on the Matrix class. The engine makes heavy use of this operation while calculating entity model matrices and while preparing MVP matrices for display rendering. Again, there might be opportunity to improve caller efficiency, but let's see how we can optimize this method in the matrix module. Here are the line-by-line costs of the methods. .. code-block:: text . class Matrix(): 3,702,999 def __init__(self, values): 5,312,000 => ~: (6767x) 2,327,000 => ~: (6767x) . self._np_array = np.array(values) . . self.values = self._np_array.tolist() . . def _columns(self): . return len(self.values[0]) . 3,881,484,999 def __matmul__(self, other): 5,807,000 => ~: (5268x) 6,831,999 => /usr/lib/python3.6/site-packages/ng/math/matrix.py:__init__ (5268x) . return Matrix(np.matmul(self._np_array, other._np_array)) The only purpose of the ``Matrix`` class is to be a wrapper around the numpy array. The original ng Matrix class implemented matrix operations in Python. That was slow, so I previously changed it to use a numpy array for its underlying operations, leaving Matrix as a wrapper class. The price of the wrapper class is apparent; every time we create a new matrix or multiply two matrices, we incur the cost of a new Python class instance that serves only to create a new numpy array instance underneath it. Wrapping third party APIs is a good practice in certain languages and situations, allowing one to protect one's own APIs from changes to third party APIs. In this case, though, performance is the larger consideration. If the wrapper class were very important to the design, one potential optimization would be to rewrite the matrix class as a C library. In this case, numpy is ubiquitous in the Python computation space and carries a stable API, so I'm comfortable using it directly instead of wrapping it. I ditched the Matrix class entirely and ng now directly uses numpy arrays for all matrix operations. The Data -------- Here's the data from a new run measured after the changes. I reformatted it a bit to compare it more easily with the data from the last run. +-------------------------+---------+--------+ |Metric | Run 0 | Run 1 | +=========================+=========+========+ |Session length (s) | 2467.48 | 628.72 | +-------------------------+---------+--------+ |ms/frame (mean) | 87.80 | 64.17 | +-------------------------+---------+--------+ |ms/frame (median) | 79.32 | 63.10 | +-------------------------+---------+--------+ |ms/frame (std dev) | 75.96 | 15.68 | +-------------------------+---------+--------+ |Exceptional frames | 96 | 46 | |(variation multiple > 0.5| | | +-------------------------+---------+--------+ |- sec/exceptional frame | 25.70 | 13.67 | +-------------------------+---------+--------+ |- ms/frame (mean) | 798.30 | 208.43 | +-------------------------+---------+--------+ |- ms/frame (median) | 278.32 | 149.94 | +-------------------------+---------+--------+ |- ms/frame (std dev) | 1029.25 | 144.21 | +-------------------------+---------+--------+ |Active entities (mean) | 206.07 | 243.30 | +-------------------------+---------+--------+ |Active entities (median) | 205.00 | 247.00 | +-------------------------+---------+--------+ |Active entities (std dev)| 17.90 | 21.54 | +-------------------------+---------+--------+ The first thing to highlight is this run length is much shorter than the last. Because there will be several iterations of optimization, I decided to do shorter 10 minute runs for the intermediate iterations. The caveat is we can't directly compare the summary data from this run to the last; the decreased session length affects the averaged data. That's okay. Again, this will be a series of improvements. We don't need to do a direct session-to-session comparison to answer the questions I want to answer right now. - Are there still exceptional frames? - If so, how often? - If so, are the exceptions getting worse over time? The first two questions are answered by the summary data. Yes, there are still exceptional frames. They happen several times per minute on average. +-------------------------+---------+--------+ |Metric | Last Run|This Run| +=========================+=========+========+ |Exceptional frames | 96 | 46 | |(variation multiple > 0.5| | | +-------------------------+---------+--------+ |- sec/exceptional frame | 25.70 | 13.67 | +-------------------------+---------+--------+ Let's look at the graphs. .. figure:: /_static/ng/2018-06-03_frame_rate_runtime_f74cde-nparray-opt.png .. figure:: /_static/ng/2018-06-03_frame_rate_variation_exc_f74cde-nparray-opt.png .. figure:: /_static/ng/2018-06-03_frame_rate_variation_f74cde-nparray-opt.png The spikes in the graph indicate the same pattern as before. The frame-to-frame variation worsens over the length of the session. Here's the new short list of expensive calls. .. code-block:: text Frame # ms/frame call site 8980 581.74 /usr/lib/python3.6/site-packages/ng/math/bounding_box.py:aa_box 8232 526.43 /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__ 7645 494.95 /usr/lib/python3.6/site-packages/ng/math/bounding_box.py:aa_box 6999 478.29 /usr/lib/python3.6/site-packages/ng/math/vector.py:__init__ 6484 465.14 /usr/lib/python3.6/site-packages/OpenGL/wrapper.py:calculate_cArgs Finally, some good news. The matrix module has dropped off the list of worst offenders and numpy's array multiplication routines are not on the list. This suggests replacing the matrix class with numpy arrays improved things. The vector module is still there, indicating we might be able to improve it further. ``boundingbox_py.aa_box`` is new to the list. That function uses the vector module, so it's reasonable to expect that vector is the bottleneck for ``aabox``. An internal call in the `PyOpenGL `_ library is also new to the list this time. Next? ----- We could continue to optimize the vector module. However, I'm certain we can make improvements there. I have less knowledge and certainty around the PyOpenGL library, because it is third-party code. Also, since ng's OpenGL renderer is my first experience using OpenGL, I made some early design mistakes and already have some TODO items to fix there. For these reasons, PyOpenGL is the greater concern. I'll leave the vector module for now and focus on improving OpenGL usage next. Thanks for reading! Next time, we'll look at my OpenGL optimizations, and review measurements of another run.