Backtrack meets Gecko Profiler

Backtrack is about to be a new performance tool, focused on revealing and solving scheduling and delay problems.  Those are big offenders of performance, very hard to track, and hidden from conventional profilers.

To find out how long and what all has to happen to reach a certain point – an objective, just add a simple instrumentation marker.  When hit during run, it’s added to a list you can then pick from and start tracing to its origin.  Backtrack follows from the selected objective back to the originating user input event that has started the whole processing chain.

The walk-back crosses runnables and their wait time in thread event queues, but also network requests and responses, any code specific queues such as DOM mutations, scheduled reflows or background JS parsing 1), monitor and condvar notifications, mutex acquirements 2), and disk I/O operations.

Visually the result is a single timeline – we can call it a critical path – revealing wait, network and CPU times as distinct intervals involved in reaching solely the selected objective.  Spotting mainly dispatch wait delays is then very easy.  The most important and new is that Backtrack tells you what other operations or events block (makes the critical path wait) and where from have been scheduled.  And more importantly, it recognizes which of them are (or are not) related to reaching the selected objective.  Those not related are then clear candidates for rescheduling.

To distinguish related and unrelated operations Backtrack captures all sub-tasks that are involved in reaching the selected objective.  Good example is the page first paint time – actually unsuppress of painting.  First paint is blocked by loading more than one resource, the HTML and head referenced CSS and JS.  These loads and their processing – the sub-tasks – happen in parallel and only completion of all of them unsuppresses the painting (said in a very simplified way, of course.)  Each such sub-task’s completion is marked with an added instrumentation.  That creates a list of sub-objectives that are then added to the whole picture.

Screen shot of how Backtrack is integrated to the Gecko Profiler Cleopatra web UI

Future improvements:

  • Backtrack could be used in our perfomance automation.  Except calculation of time between an objective and its input source event, it can also calculate CPU vs dispatch delays vs network response time.  It could also be able to filter out code paths clean of any outer jitter.
  • Indeed, networking has strong influence to load times.  Adding more detailed breakdown and analyzes how well we schedule and allocate network resources is one of the next steps.
  • Adding PCAPs or even let Backtrack capture network activity like Wireshark directly from inside Firefox and join it with the Gecko Profiler UI might help too.

The current state of Backtrack development is a work in active progress and is not yet available to users of Gecko Profiler.  There are patches for Gecko, but also for the Cleopatra UI and the Gecko Profiler Add-on.  The UI changes, where also the analyzes happens, are mostly prototype-like and need a clean up.  There are also problems with larger memory consumption and bigger chances to hit OOMs when processing the captured data with Backtrack captured markers.

1) code specific queues need to be manually instrumented
2) with ability to follow to the thread that was keeping the mutex for the time you were waiting to acquire it


New Gecko performance tool: Backtrack

Backtrack aims to show a complete code path flow from any point back to its source, crossing asynchronous callbacks, threads, processes, network requests, timers and any kind of implementation specific queuing plus capturing any I/O or mutex blockade.  The ‘critical flow execution path’ is put to a context of all the remaining concurrent execution flows.  It’s then easy to examine how the critical flow is blocked and delayed by concurrent tasks.

The work is tracked in this bug, where you also find patches and build instructions.  There is also an add-on that, in Backtrack enabled builds, allows you to view actual captured data.

Click the screenshot bellow to view an interactive previewIt’s capture of load of my blog main page till the first-paint notification (no e10s and no network predictor to demonstrate the capture capabilities.)


Backtrack combines*) Gecko Profiler and Task Tracer.

Gecko Profiler (PSP) provides instrumentation (already spread around the code base) to capture static call stacks.  I’ve enhanced the PSP instrumentation to also capture objects (i.e. 'this' pointer value) and added a simple base class to easily monitor object life time (classes must be instrumented.)

Task Tracer (TT) on the other hand provides a generic way to track back on runnables – but not on e.g. network poll results, network requests or implementation specific queues.  It was easy to add a hook into the TT code that connects the captured object inter-calls with information about runnables dispatch source and target.

The Backtrack experimental patch:

  • Captures object lifetime (simply add ProfilerTracked<class Derived> as a base class to track the object lifetime and class name automatically)
  • Annotates objects with resource names (e.g URI, host name) they work with at run-time
  • Connects stack and object information using the existing PROFILER_LABEL_FUNC instrumentation recording this pointer value automatically ; this way it collects calls between objects
  • Measures I/O and mutex wait time ; an object holding a lock can be easily found
  • Sticks receipt of a particular network response exactly to its actual request transmission (here I mainly mean HTTP but also applies to connect() and DNS resolution)
  • Joins network polling “ins” and “outs”
  • Binds code-specific queuing and dequeuing, like our DNS resolver, HTTP request queues.  Those are not just ‘dispatch and forget’ like nsIEventTarget and nsIRunnable but rather have priorities, complex dequeue conditions and may not end up dispatched to just a single thread.  These queues are very important from the resource scheduling point of view.


  • IPC support, i.e. cross also processes
  • Let the analyzes also mark anything ‘related’ for achieving a selected path end (e.g. my favorite first-paint time and all CSS loads involved)
  • Probably persist the captured raw logs and allow the analyzes be done offline

Disadvantages: just one – significant memory consumption.

*) The implementation is so far not deeply bound to SPS and TT memory data structures.  I do the capture my own – actually a third data collection, side by SPS and TT.  I’m still proving the concept this way but if found useful and bearable to land in this form as a temporary way of collecting the data, we can optimize and cleanup as a followup work.

New Firefox HTTP cache now enabled on Nightly builds

Yes, it’s on!  After a little bit more than a year of a development by me and Michal Novotný all bugs we could find in our labs, offices and homes were fixed.  The new cache back-end is now enabled on Firefox Nightly builds as of version 32 and should stay like that.

The old cache data are for now left on disk but we have handles to remove them automatically from users’ machines to not waste space since it’s now just a dead data.  This will happen after we confirm the new cache sticks on Nightlies.

The new HTTP cache back end has many improvements like request prioritization optimized for first-paint time, ahead of read data preloading to speed up large content load, delayed writes to not block first paint time, pool of most recently used response headers to allow 0ms decisions on reuse or re-validation of a cached payload, 0ms miss-time look-up via an index, smarter eviction policies using frecency algorithm, resilience to crashes and zero main thread hangs or jank.  Also it eats less memory, but this may be subject to change based on my manual measurements with my favorite microSD card which shows that keeping at least data of html, css and js files critical for rendering in memory may be wise.  More research to come.

Thanks to everyone helping with this effort.  Namely Joel Maher and Avi Halachmi for helping to chase down Talos regressions and JW Wang for helping to find cause of one particular hard to analyze test failure spike.  And also all early adopters who helped to find and fix bugs.  Thanks!


New preferences to play with:


Number of kBs we reserve for keeping recently loaded cache entries metadata (i.e. response headers etc.) for quick access and re-validation or reuse decisions.  By default this is at 250kB.
Number of data chunks we always preload ahead of read to speed up load of larger content like images.  Currently size of one chunk is 256kB, and by default we preload 4 chunks – i.e. 1MB of data in advance.


Load times compare:

Since these tests are pretty time consuming and usually not very precise, I was only testing with page 2 of my blog that links some 460 images.  Media storage devices available were: internal SSD, an SDHC card and a very slow microSD via a USB reader on a Windows 7 box.


[ complete page load time / first paint time ]

Cache versionFirst visitCold go to 1)Warm go to 2)Reload
cache v17.4s / 450ms880ms / 440ms510ms / 355ms5s / 430ms
cache v26.4s / 445ms610ms / 470ms470ms / 360ms5s / 440ms


Class 10 SDHC
Cache versionFirst visitCold go to 1)Warm go to 2)Reload
cache v17.4s / 635ms760ms / 480ms545ms / 365ms5s / 430ms
cache v26.4s / 485ms1.3s / 450ms530ms / 400ms*5.1s / 460ms*


Edit: I found one more place to optimize – preload of data sooner in case an entry has already been used during the browser session (bug 1013587).  We are winning around 100ms for both first paint and load times!  Also stddev of first-paint time is smaller (36) than before (80).  I have also measured more precisely the load time for non-patched cache v2 code.  It’s actually better.

Slow microSD
Cache versionFirst visitCold go to 1)Warm go to 2)Reload
cache v113s / 1.4s1.1s / 540ms560ms / 440ms5.1s / 430ms
cache v26.4s / 450ms1.7s / 450ms710ms / 540ms*5.4s / 470ms*
cache v2 (with bug 1013587)--615ms / 455ms*-

* We are not keeping any data in memory (bug 975367 and 986179) what seems to be too restrictive.  Some data memory caching will be needed.


“Jankiness” compare:

The way I have measured browser UI jank (those hangs when everything is frozen) was very simple: summing every stuck of the browser UI, taking more then 100ms, between pressing enter and end of the page load.


[ time of all UI thread events running for more then 100ms each during the page load ]

Cache versionFirst visitCold go to 1)Warm go to 2)Reload
cache v10ms600ms0ms0ms
cache v20ms0ms0ms0ms


Class 10 SDHC
Cache versionFirst visitCold go to 1)Warm go to 2)Reload
cache v1600ms600ms0ms0ms
cache v20ms0ms0ms0ms


Slow microSD
Cache versionFirst visitCold go to 1)Warm go to 2)Reload
cache v12500ms740ms0ms0ms
cache v20ms0ms0ms0ms


All load time values are medians, jank values averages, from at least 3 runs without extremes in attempt to lower the noise.


1) Open a new tab and navigate to a page right after the Firefox start.

2) Open a new tab and navigate to a page that has already been visited during the browser session.


Mozilla Firefox new HTTP cache is live!

mozilla firefox new http cache performance speed crash kill freeze


The new Firefox HTTP cache back-end that keeps the cache content after a crash or a kill and doesn’t cause any UI hangs – has landed!


It’s currently disabled by default but you can test it by installing Firefox Nightly (version 27) and enabling it. This applies to Firefox Mobile builds as well.  There is a preference that enables or disables the new cache, find it in about:config. You can switch it on and off any time you want, even during active browsing, there is no need to restart the browser to take the changes in effect:


  • 0 – disable, use the old crappy cache (files are stored under Cache directory in your profile) – now the default
  • 1 – enable, use the brand new HTTP cache (files are stored under cache2 directory in your profile)

Other new preferences that control the cache behavior:

No longer used any new specific preferences.  We are now backward compatible with the old cache preferences.


  • number of kBs that are preserved in RAM tops to keep the most used content in memory, so page loads speed up
  • on desktop this is now set to 50MB (i.e. 51’200kB)


There are still open bugs before we can turn this fully on.  The one significant is that we don’t automatically delete cache files when browser.cache.disk.capacity is overreached, so your disk can get flooded by heavy browsing.  But you still can delete the cache manually using Clear Recent History.

Enabling the new HTTP cache by default is planned for Q4/2013.  For Firefox Mobile it can even be sooner, since we are using the Android’s context cache directory that is automatically deleted when the storage gets out of space.  Hence, we don’t need to monitor the cache capacity our self on mobile.

Please report any bug you find during tests under Core :: Networking: Cache.


New Firefox HTTP cache backend – story continues

In my previous post I was writing about the new cache backend and some of the very first testing.

Now we’ve stepped further and there are significant improvements.  I was also able to test with more various hardware this time.

The most significant difference is a single I/O thread with relatively simple event prioritization.  Opening and reading urgent (render-blocking) files is done first, opening and reading less priority files after that and writing is performed as the last operation. This greatly improves performance when loading from non-warmed cache and also first paint time in many scenarios.

The numbers are much more precise then in the first post.  My measuring is more systematic and careful by now.  Also, I’ve merged gum with latest mozilla-central code few times and there are for sure some improvements too.

Here are the results, I’m using 50MB limit for keeping cached stuff in RAM.


[ complete page load time / first paint time ]

Old iMac with mechanical HDD
BackendFirst visitWarm go to 1)Cold go to 2)Reload
mozilla-central7.6s / 1.1s560ms / 570ms1.8s / 1.7s5.9s / 900ms
new back-end7.6s / 1.1s530ms / 540ms2.1s / 1.9s**6s / 720ms


Old Linux box with mechanical 'green' HDD
BackendFirst visitWarm go to 1)Cold go to 2)Reload
mozilla-central7.3s / 1.2s1.4s / 1.4s2.4s / 2.4s5.1s / 1.2s
new back-end7.3s/ 1.2s
or** 9+s / 3.5s
1.35s / 1.35s2.3s / 2.1s4.8s / 1.2s


Fast Windows 7 box with SSD
BackendFirst visitWarm go to 1)Cold go to 2)Reload
mozilla-central6.7s / 600ms235ms / 240ms530ms / 530ms4.7s / 540ms
new back-end6.7s / 600ms195ms / 200ms620ms / 620ms***4.7s / 540ms


Fast Windows 7 box and a slow microSD
BackendFirst visitWarm go to 1)Cold go to 2)Reload
mozilla-central13.5s / 6s600ms / 600ms1s / 1s7.3s / 1.2s
new back-end7.3s / 780ms
or** 13.7s / 1.1s
195ms / 200ms1.6 or 3.2s* / 460ms***4.8s / 530ms


To sum – most significant changes appear when using a really slow media.  For sure, first paint times greatly improves, not talking about the 10000% better UI responsiveness!  Still, space left for more optimizations.  We know what to do:

  • deliver data in larger chunks ; now we fetch only by 16kB blocks, hence larger files (e.g. images) load very slowly
  • think of interaction with upper levels by means of having some kind of an intelligent flood control


1) Open a new tab and navigate to a page when the cache is already pre-warmed, i.e. data are already fully in RAM.

2) Open a new tab and navigate to a page right after the Firefox start.

* I was testing with my blog home page.  There are few large images, ~750kB and ~600kB.  Delivering data to upper level consumers only by 16kB chunks causes this suffering.

** This is an interesting regression.  Sometimes with the new backend we delay first paint and overall load time.  Seems like the cache engine is ‘too good’ and opens the floodgate too much overwhelming the main thread event queue.  Needs more investigation.

*** Here it’s combination of flood fill of the main thread with image loads, slow image data load it self and fact, that in this case we first paint after all resources on the page loaded – that needs to change.  It’s also supported by fact that cold load first paint time is significantly faster on microSD then on SSD.  The slow card apparently simulates the flood control here for us.