Skip to content

Significantly decrease profiling overhead & update build process (4.0.0)#178

Merged
Erotemic merged 46 commits into
pyutils:mainfrom
Theelx:main
Nov 12, 2022
Merged

Significantly decrease profiling overhead & update build process (4.0.0)#178
Erotemic merged 46 commits into
pyutils:mainfrom
Theelx:main

Conversation

@Theelx
Copy link
Copy Markdown
Collaborator

@Theelx Theelx commented Nov 2, 2022

This is a continuation of #165, which was accidentally closed.

This PR is a large amalgamation of work I've been doing over the past few months to make line_profiler better. The main improvements are:

  • The python trace callback that is called with every line of profiled code executed has been rewritten to make more extensive use of C++, and to reduce python interaction. Cython can generate annotated HTML files which indicate more python interaction by making the line more yellow. An explanation for this can be found here. Generally, the more yellow the line is, the Python interaction there will be, so the slower the line/function will be. I have attached screenshots of the HTML files plus their raw originals for easy viewing.
  • The callback is now approximately 2-3x faster. I have noticed in my other projects that lines that used to say they took 0.7µs (microseconds) to execute now say they execute in 0.2-0.3µs. This is because line_profiler isn't able to completely get rid of its own overhead when tracking lines. The overhead has been reduced from approximately 0.4µs to <0.1µs on my laptop, running Ubuntu 20.04.5 on Linux 5.17.9, with an Intel i7-11800H CPU, with turbo-boost disabled, and when plugged into its charger. This translates into some of my example programs going from a 3.5x slowdown with profiling enabled, to a 1.8x slowdown.
  • The build system has been switched to pure Cython. Scikit-build's Cython build system is woefully outdated, and it doesn't support nearly as many options as Cython's native build system. In addition, the Cmake process is unintuitive for many Python contributors, and the C-extension compilation process for local development is undocumented and complex. With the new system, one can just run pip install . when inside the cloned directory, and the build and installation is automatically done for them. An extra anecdotal benefit is that it seems to install a bit faster.
  • The -i option has been added. This was introduced in an Add option to automatically output a file every n seconds #46, but rejected due to not being an asyncio-based approach. At the moment, profiling asyncio code continues to work in 4.0.0 according to tests of other async code of mine, but I'd appreciate it if you could test with any async code you deem relevant. In addition, basic benchmarks at the bottom of the post indicate no noticeable slowdown when using the -i 1 option, and it doesn't mess with the GIL. If this needs to be split into a separate PR, I can do that.
    This release would be 4.0.0 because in order to implement the C++ optimizations, I had to change how the code_map and last_time attributes on the LineProfiler object work. The attributes are still accessible from pure-Python code, but they contain different objects, so any code relying on specific behavior from those attributes may break. However, I don't think those attributes were ever meant to be relied upon by external code anyway.

The Cython code is a bit more complex now, but it should be managable. You'll notice I had to do some manipulation of the code objects of functions, because in order to avoid python interaction I couldn't store the function code objects in a Python dictionary, so I had to hash the function objects. This originally caused problems when there were two different functions with the exact same code, however that was fixed by making line_profiler add no-op instructions to any duplicate functions, so that they could be profiled as separate functions. This still

import asyncio
import math

@profile
async def testing(x):
    for i in range(x):
        y = math.pow(100, 150) + math.pow(100, 150)

asyncio.run(testing(7500000))

Above code without -i 1, 7.5m repetitions:
5.28s
5.30s
5.34s
5.31s

Above code with -i 1, 7.5m repetitions:
5.27s
5.32s
5.28s
5.30s

There is essentially no difference in performance with -i, as long as it's not done like every .001 seconds. Luckily, it only allows intervals in multiples of 1 second, and 0 seconds is the same as disabled, so there isn't really much chance of it breaking. The timings are approximately the same for both cases when removing asyncio.

@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 2, 2022

Codecov Report

Merging #178 (09fa869) into main (71db7da) will not change coverage.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #178   +/-   ##
=======================================
  Coverage   50.59%   50.59%           
=======================================
  Files           4        4           
  Lines         253      253           
  Branches       34       37    +3     
=======================================
  Hits          128      128           
  Misses        112      112           
  Partials       13       13           
Impacted Files Coverage Δ
line_profiler/line_profiler.py 52.24% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 71db7da...09fa869. Read the comment docs.

code = <object>py_frame.f_code
if code in self.code_map:
block_hash = hash(get_frame_code(py_frame))
code_hash = compute_line_hash(block_hash, py_frame.f_lineno)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a memory leak here? The get_frame_code function calls PyFrame_GetCode, which returns a PyCodeObject*, and that has its reference count decreased in the same function by Py_DECREF. However, the PyObject* returned by PyCode_GetCode is returned from that function, and (correct me if I'm wrong) I believe it returns a "strong reference" meaning that we need to manually call DECREF, but it never has it's reference count decreased later on. Is that an oversight, or am I misunderstanding PyCode_GetCode?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is absolutely correct. However, the key piece of this, which I should add a comment mentioning, is that Cython takes care of the DECREF for us. If you compile with the annotated HTML, it shows a bit of yellow at the hash, due to the DECREF. When I get to my computer, I'll add a comment in explaining that.

@Theelx
Copy link
Copy Markdown
Collaborator Author

Theelx commented Nov 7, 2022

@Erotemic Are there any other roadblocks to getting this merged?

@Erotemic
Copy link
Copy Markdown
Member

Erotemic commented Nov 7, 2022

I want to look into the discussion that happened on your fork just to make sure I understand everything. This is just waiting on me having time to do that. But I'll try to make some later tonight.

Comment thread README.rst Outdated
Comment thread line_profiler/CMakeLists.txt
for timing in timings.values():
self.assertEqual(timing.nhits, 1)

def test_gen_decorator(self):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I had a patch that fixed and re-enabled all of these tests. Maybe it got lost?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's possible. I don't recall the patch though, so I have no clue where to find it.

@Erotemic
Copy link
Copy Markdown
Member

This has been waiting long enough. I'm satisfied that there are no security vulnerabilities and that the code is stable. There may be niche regressions, but those users can pin to old versions or help this community driven project out. Thank you @Theelx for your patience, dedication, and impactful contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants