New Firefox HTTP cache backend - story continues

In my previous post I wrote about the new cache backend and some of the very first testing.

Now we've stepped further and there are significant improvements.  I was also able to test with a wider variety of hardware this time.

The most significant difference is a single I/O thread with relatively simple event prioritization: opening and reading urgent (render-blocking) files is done first, opening and reading lower-priority files comes after that, and writing is performed as the last operation.  This greatly improves performance when loading from a non-warmed cache, and also first paint time in many scenarios.
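
A minimal sketch of that prioritization idea follows, assuming nothing about the actual back-end code; the class and event names are illustrative only, not the real Gecko implementation:

#include <functional>
#include <queue>
#include <vector>

// Priorities for work items handled by the single background I/O thread.
enum class IOPriority {
    OpenReadUrgent = 0,  // render-blocking resources go first
    OpenRead       = 1,  // all other opens/reads
    Write          = 2   // writes are performed last
};

struct CacheIOEvent {
    IOPriority priority;
    std::function<void()> run;
};

struct ByPriority {
    bool operator()(const CacheIOEvent& a, const CacheIOEvent& b) const {
        // std::priority_queue keeps the "largest" element on top, so invert
        // the comparison to pop the lowest (most urgent) value first.
        return a.priority > b.priority;
    }
};

class CacheIOThread {
public:
    void Dispatch(CacheIOEvent event) { mQueue.push(std::move(event)); }

    // Called repeatedly on the single background I/O thread.
    void ProcessPendingEvents() {
        while (!mQueue.empty()) {
            CacheIOEvent event = mQueue.top();
            mQueue.pop();
            event.run();  // open/read/write a cache file
        }
    }

private:
    std::priority_queue<CacheIOEvent, std::vector<CacheIOEvent>, ByPriority> mQueue;
};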

The numbers are much more precise than in the first post.  My measuring is more systematic and careful now.  Also, I've merged gum with the latest mozilla-central code a few times, and there are for sure some improvements from that too.

Here are the results; I'm using a 50MB limit for keeping cached content in RAM.

 

[ complete page load time / first paint time ]

Old iMac with mechanical HDD
| Backend         | First visit | Warm go to 1) | Cold go to 2)  | Reload       |
|-----------------|-------------|---------------|----------------|--------------|
| mozilla-central | 7.6s / 1.1s | 560ms / 570ms | 1.8s / 1.7s    | 5.9s / 900ms |
| new back-end    | 7.6s / 1.1s | 530ms / 540ms | 2.1s / 1.9s ** | 6s / 720ms   |

 

Old Linux box with mechanical 'green' HDD
| Backend         | First visit                  | Warm go to 1) | Cold go to 2) | Reload      |
|-----------------|------------------------------|---------------|---------------|-------------|
| mozilla-central | 7.3s / 1.2s                  | 1.4s / 1.4s   | 2.4s / 2.4s   | 5.1s / 1.2s |
| new back-end    | 7.3s / 1.2s or** 9+s / 3.5s  | 1.35s / 1.35s | 2.3s / 2.1s   | 4.8s / 1.2s |

 

Fast Windows 7 box with SSD
| Backend         | First visit  | Warm go to 1) | Cold go to 2)     | Reload       |
|-----------------|--------------|---------------|-------------------|--------------|
| mozilla-central | 6.7s / 600ms | 235ms / 240ms | 530ms / 530ms     | 4.7s / 540ms |
| new back-end    | 6.7s / 600ms | 195ms / 200ms | 620ms / 620ms *** | 4.7s / 540ms |

 

Fast Windows 7 box and a slow microSD
| Backend         | First visit                     | Warm go to 1) | Cold go to 2)           | Reload       |
|-----------------|---------------------------------|---------------|-------------------------|--------------|
| mozilla-central | 13.5s / 6s                      | 600ms / 600ms | 1s / 1s                 | 7.3s / 1.2s  |
| new back-end    | 7.3s / 780ms or** 13.7s / 1.1s  | 195ms / 200ms | 1.6 or 3.2s* / 460ms*** | 4.8s / 530ms |

 

To sum up: the most significant changes appear when using really slow media.  First paint times improve greatly, not to mention the 10000% better UI responsiveness!  Still, there is space left for more optimizations.  We know what to do:

  • deliver data in larger chunks; right now we fetch only 16kB blocks, so larger files (e.g. images) load very slowly - the sketch below illustrates the arithmetic
  • think about the interaction with upper levels, by means of some kind of intelligent flood control
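
To put a number on the first bullet: with the current 16kB delivery chunk, a single large image needs dozens of separate deliveries to its consumer.  The sketch below only does this arithmetic; the 256kB alternative is a hypothetical chunk size, not something implemented.

#include <cstdint>
#include <cstdio>

// How many separate deliveries a file of the given size needs when data is
// handed to the consumer in fixed-size chunks.
uint32_t DeliveriesNeeded(uint64_t fileSize, uint32_t chunkSize) {
    return static_cast<uint32_t>((fileSize + chunkSize - 1) / chunkSize);
}

int main() {
    const uint64_t imageSize = 750 * 1024;  // a ~750kB image from the test page
    std::printf("16kB chunks : %u deliveries\n", DeliveriesNeeded(imageSize, 16 * 1024));   // 47
    std::printf("256kB chunks: %u deliveries\n", DeliveriesNeeded(imageSize, 256 * 1024));  // 3
    return 0;
}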

 

1) Open a new tab and navigate to a page when the cache is already pre-warmed, i.e. data are already fully in RAM.

2) Open a new tab and navigate to a page right after the Firefox start.

* I was testing with my blog home page.  There are a few large images, ~750kB and ~600kB.  Delivering data to upper-level consumers only in 16kB chunks causes this suffering.

** This is an interesting regression.  Sometimes with the new backend we delay first paint and the overall load time.  It seems the cache engine is 'too good' and opens the floodgate too wide, overwhelming the main thread event queue.  This needs more investigation.

*** Here it's a combination of flooding the main thread with image loads, the slow image data load itself, and the fact that in this case we first paint only after all resources on the page have loaded - that needs to change.  It's also supported by the fact that cold-load first paint time is significantly faster on the microSD than on the SSD.  The slow card apparently simulates flood control for us here.

New Firefox HTTP cache backend, first impressions

After some two months of coding, Michal Novotný and I are getting close to a first stable-enough "private testing" build of the new and simplified HTTP cache back-end.

The two main goals we've met are:

  • Be resilient to crashes and process kills
  • Get rid of any UI hangs or freezes (a.k.a. janks)

We've abandoned the current disk format and use a separate file for each URL, however small it is.  Each file carries self-check hashes to verify it's correct, so no fsyncs are needed.  Everything is asynchronous or fully buffered.  There is a single background thread to do all I/O: opening, reading and writing.  On Android we write to the context cache directory, so the cached data are actually treated as a cache.
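
As an aside, here is a minimal sketch of the self-check idea - not the actual Gecko implementation; the hash function and the on-disk layout are illustrative assumptions only:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Simple FNV-1a hash used here only for illustration.
static uint32_t Fnv1a(const std::vector<char>& data) {
    uint32_t hash = 2166136261u;
    for (char c : data) {
        hash ^= static_cast<unsigned char>(c);
        hash *= 16777619u;
    }
    return hash;
}

// Write the entry data followed by its hash.  No fsync: if we crash in the
// middle, the hash simply won't match and the entry is thrown away.
bool WriteEntry(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    uint32_t hash = Fnv1a(data);
    out.write(data.data(), data.size());
    out.write(reinterpret_cast<const char*>(&hash), sizeof(hash));
    return out.good();
}

// Read the entry back and verify the trailing hash; a mismatch means the
// file is corrupted and the caller treats it as a cache miss.
bool ReadEntry(const std::string& path, std::vector<char>& data) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;
    std::streamsize size = in.tellg();
    if (size < static_cast<std::streamsize>(sizeof(uint32_t))) return false;
    data.resize(size - sizeof(uint32_t));
    uint32_t storedHash = 0;
    in.seekg(0);
    in.read(data.data(), data.size());
    in.read(reinterpret_cast<char*>(&storedHash), sizeof(storedHash));
    return in.good() && Fnv1a(data) == storedHash;
}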

I've performed some first tests using http://janbambas.cz/ as a test page.  As I write this post, there are some 460 images on it.  Testing was done on a relatively fast machine, but the important thing is to differentiate by storage speed.  I had two extremes available: an SSD and an old, slow-as-hell microSD card in a USB reader.

Testing with a microSD card:

First-visit load

| Backend                                                | Full load | First paint |
|--------------------------------------------------------|-----------|-------------|
| mozilla-central                                        | 16s       | 7s          |
| new back-end                                           | 12s       | 4.5s        |
| new back-end and separate threads for open/read/write | 10.5s     | 3.5s        |

Reload, already cached and warmed

| Backend                                                | Full load | First paint |
|--------------------------------------------------------|-----------|-------------|
| mozilla-central                                        | 7s        | 700ms       |
| new back-end                                           | 5.5s      | 500ms       |
| new back-end and separate thread for open/read/write  | 5.5s      | 500ms       |

Type URL and go, cached and warmed

| Backend                                                | Full load | First paint |
|--------------------------------------------------------|-----------|-------------|
| mozilla-central                                        | 900ms     | 900ms       |
| new back-end                                           | 400ms     | 400ms       |

Type URL and go, cached but not warmed

| Backend                                                   | Full load | First paint |
|-----------------------------------------------------------|-----------|-------------|
| mozilla-central                                           | 5s        | 4.5s        |
| new back-end                                              | ~28s      | 5-28s       |
| new back-end and separate threads for open/read/write *) | ~26s      | 5-26s       |

*) Here I'm getting unstable results.  I'm doing more testing with more concurrent open and read threads.  It seems there is not that much of an effect and the jitter in the time measurements is just noise.

I will report more on concurrent I/O threads in a different post later, since I find it quite an interesting space to explore.

Clearly the cold "type and go" test case shows that blockfiles are beating us here.  But the big difference is that the UI is completely jank-free with the new back-end!

Testing on an SSD disk:

The results are not that different for the current and the new back-end, only a small regression in warmed and cold "go to" test cases:

Type URL and go, cached and warmed

| Backend         | Full load | First paint |
|-----------------|-----------|-------------|
| mozilla-central | 220ms     | 230ms       |
| new back-end    | 310ms     | 320ms       |

Type URL and go, cached but not warmed

| Backend         | Full load | First paint |
|-----------------|-----------|-------------|
| mozilla-central | 600ms     | 600ms       |
| new back-end    | 1100ms    | 1100ms      |

Having multiple threads seems not to have any effect, as far as the precision of my measurements goes.

At this moment I am not sure what causes the regression in both "go to" cases on an SSD, but I believe it's just a question of some simple optimizations, like delivering more than just 4096 bytes per thread loop as we do now, or the fact that we don't cache redirects - a known bug right now.

Still here and want to test it yourself?  Test builds can be downloaded from the 'gum' project tree.  Disclaimer: the code is very, very experimental at this stage, so use at your own risk!

 

Firefox detailed event tracer - about:timeline

I always wanted to see all the little things that happen when loading a web page in Firefox.  As a Gecko developer I know well how it works, but actually seeing the interaction and chaining is not always obvious, unless you study crazy NSPR logs.  Hence, I started development of a tool, called the event tracer, to get a timeline showing the hidden guts.

An example screenshot will tell the story best - a trial run of a www.mozilla.org load:

firefox event timeline performance diagnostic tool

Planning improvements

At this time the work is not complete.  There is a lot more to do to make this an actually useful development tool, and my next steps are to hunt those requirements down.

I am using about:timeline to verify patches I review do what they intend to do.  It can also help find hidden bugs.

However, using about:timeline for discovering performance-suboptimal code paths has turned out not to be that simple.  Events are just spread all around, and the connections between them are not easily, if at all, discoverable.

Hence, this needs more thinking and work.

My first thought for improvement is to focus more on "the resource" and the single object dealing with it.  It might be better to show all the events happening on e.g. a single instance of an http channel or http transaction than to tortuously hunt for them somewhere in the graph.  There is a simple way to highlight and filter, but that is not enough for an analytical view.

Then, I'm missing a general way to easily recognize how things are chained together.  So, I'd like to link events that spring one from another (like an http channel creating an http transaction, then a connection, etc.) and present the timeline more like a Gantt chart, plus show a critical path or flow for any selected pass-through.

From inspecting the timeline it should be visible where the bottlenecks and long wait times worth fixing are.  At this time I don't have a completely clear plan for this, though.

 

Still here?  Cool :)  If you have any thoughts or ideas for how to use the data we collect and visualize as a valuable source for performance optimization surgery, please feel free to share them here.  For more on how the timeline data are produced, check below.

 

How it works

firefox event timeline performance diagnostic tool

The event track in the image above is produced with special code instrumentation.  To get, for instance, "net::http::transaction" traces, the following three places in the code have been instrumented (in blue):

 

1. WAIT - record the time when an http transaction is scheduled:

nsresult
nsHttpTransaction::Init(uint32_t caps,
                        nsHttpConnectionInfo *cinfo,
                        nsHttpRequestHead *requestHead,
                        nsIInputStream *requestBody,
                        bool requestBodyHasHeaders,
                        nsIEventTarget *target,
                        nsIInterfaceRequestor *callbacks,
                        nsITransportEventSink *eventsink,
                        nsIAsyncInputStream **responseBody)
{
    MOZ_EVENT_TRACER_COMPOUND_NAME(static_cast<nsAHttpTransaction*>(this),
                                   requestHead->PeekHeader(nsHttp::Host),
                                   requestHead->RequestURI().BeginReading());

    MOZ_EVENT_TRACER_WAIT(static_cast<nsAHttpTransaction*>(this),
                          "net::http::transaction");

 

2. EXEC - record the time when the transaction first comes to action; it is the time it gets a connection assigned and starts its communication with the server:

void
nsHttpTransaction::SetConnection(nsAHttpConnection *conn)
{
    NS_IF_RELEASE(mConnection);
    NS_IF_ADDREF(mConnection = conn);

    if (conn) {
        MOZ_EVENT_TRACER_EXEC(static_cast<nsAHttpTransaction*>(this),
                              "net::http::transaction");
    }

 

3. DONE - record the time when the transaction has finished its job by completing the response fetch:

nsHttpTransaction::Close(nsresult reason)
{
    LOG(("nsHttpTransaction::Close [this=%x reason=%x]\n", this, reason));

    ...

    MOZ_EVENT_TRACER_DONE(static_cast<nsAHttpTransaction*>(this),
                          "net::http::transaction");
}

The thread timeline where an event is finally displayed is the one for the thread the EXEC code has been called on.

The exact definition of the WAIT and EXEC phases is up to the developer for now.  For me, the WAIT phase is any time an operation is significantly blocked before it can be carried out; it's the time with the main performance impact, and the one we may be particularly interested in shortening.  A few examples (see the sketch after this list):

  • time spent in a thread's event queue - duration from the dispatch to the run
  • time spent waiting for an asynchronous callback such as reading from disk or network
  • time waiting for necessary resources, such as an established TCP connection before an object can proceed with its job
  • time spent waiting for acquisition of a lock or a monitor
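
As an illustration of the first bullet - tracing how long a runnable waits in a thread's event queue - here is a hypothetical instrumentation using the same macros as above.  The MyRunnable class and the header paths are assumptions, not code from the tree:

#include "nsThreadUtils.h"              // header paths are assumed
#include "mozilla/VisualEventTracer.h"

class MyRunnable : public nsRunnable {
public:
    MyRunnable() {
        MOZ_EVENT_TRACER_COMPOUND_NAME(this, "example", "my-runnable");
    }

    nsresult DispatchTo(nsIEventTarget* target) {
        // The event is now scheduled - start of the WAIT phase.
        MOZ_EVENT_TRACER_WAIT(this, "example::my-runnable");
        return target->Dispatch(this, NS_DISPATCH_NORMAL);
    }

    NS_IMETHOD Run() {
        // The event queue finally got to us - WAIT ends, EXEC begins.
        MOZ_EVENT_TRACER_EXEC(this, "example::my-runnable");

        DoTheActualWork();

        // The job is complete - end of the EXEC phase.
        MOZ_EVENT_TRACER_DONE(this, "example::my-runnable");
        return NS_OK;
    }

private:
    void DoTheActualWork() { /* the work being measured */ }
};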

How to bind a URL or any identifying info to an event

The following instrumentation is used (in red):

nsresult
nsHttpTransaction::Init(uint32_t caps,
                        nsHttpConnectionInfo *cinfo,
                        nsHttpRequestHead *requestHead,
                        nsIInputStream *requestBody,
                        bool requestBodyHasHeaders,
                        nsIEventTarget *target,
                        nsIInterfaceRequestor *callbacks,
                        nsITransportEventSink *eventsink,
                        nsIAsyncInputStream **responseBody)
{
    MOZ_EVENT_TRACER_COMPOUND_NAME(static_cast<nsAHttpTransaction*>(this),
                                   requestHead->PeekHeader(nsHttp::Host),
                                   requestHead->RequestURI().BeginReading());

    MOZ_EVENT_TRACER_WAIT(static_cast<nsAHttpTransaction*>(this),
                          "net::http::transaction");

 

Here the http transaction event is assigned the host + path of the resource it loads as its name.

The object's this pointer, which needs to be properly cast by the developer, is what sticks it all together.  This is the main difference from how usual profiling tools work: the event timeline provides a view of event chaining that crosses thread and method boundaries, not just a pure stack trace.

View details of a tracked event

Each event track is bound to e.g. the URL it deals with, where applicable.  You can inspect the URL (the associated resource) and more details when an event is clicked on:

firefox event timeline performance diagnostic tool event detail

Wait between is the track of the time the event spent waiting, i.e. the time the event has been scheduled and the time of the execution start, both relative to when the time tracking has been turned on.  The number in parentheses is simply the wait phase duration.

Execution is the track of the time spent executing, i.e. when the intended job itself started and when it actually finished.  The parenthesized number is how long the job execution took.

Posted from is the name of the thread the event has been scheduled on.

The Filter button is used to quickly filter this particular event plus its sister events.  How it works is described below.

The Zero time button is used to shift the "time 0.000" of the timeline to the start time of the inspected event, so that you can inspect the recorded timing of other events relative to this particular one.

The mxr link opens a search for the event name in the code.  This way you can quickly inspect how this event's timing is actually instrumented and collected, right in the code.

Filtering timeline events

You can filter events using two filtering functions:

  • By the type of an event (e.g. "net::http::transaction", "docshell::pageload" etc.)
  • By the name of a resource an event has been associated with (e.g. "www.mozilla.org", "/favicon.ico" etc.)

Filter by type - currently the following event types are implemented (instrumented).  You get this check-box list after the tracer run, when you click filter events at the top bar:

firefox event timeline performance diagnostic tool event filter

Each event has a namespace, e.g. for "net::http::transaction" the namespaces are "net::" and "net::http::".  You can turn the whole set of events in a namespace on or off easily.  Currently only the "net::" and "docshell::" top-level namespaces are worth mentioning.

Filtering by resource, i.e. usually the URL a particular event or set of events has been bound to, is also possible when you click filter resource:

firefox event timeline performance diagnostic tool resource filter

You can enter the filter string manually or use the provided autocomplete.  The by-resource filtering works as follows:

1. we inspect whether the string set as the filter is a substring of the event resource string
2. we inspect whether the event resource string is a substring of the string set as the filter

If one of these conditions passes, we display the event; otherwise we hide it.

This way when you enter "http://www.mozilla.org/favicon.ico" as a filter, you see the http channel and transaction for this resource load, as well as any DNS and TCP connection setups related to "www.mozilla.org".
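
Here is a sketch of that two-way substring test, written in C++ just for illustration (the function name is made up, and the about:timeline extension code itself is not shown):

#include <string>

// Returns true when an event should stay visible for the given filter,
// applying the two rules above.
bool MatchesResourceFilter(const std::string& eventResource,
                           const std::string& filter) {
    return eventResource.find(filter) != std::string::npos ||  // rule 1
           filter.find(eventResource) != std::string::npos;    // rule 2
}

// Example: the filter "http://www.mozilla.org/favicon.ico" keeps the http
// channel and transaction for that exact URL (rule 1) as well as DNS and
// TCP events named just "www.mozilla.org" (rule 2).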

 

How to use about:timeline yourself

  • Create an optimized build of Firefox with --enable-visual-event-tracer configure option
  • Install about:timeline extension
  • Run Firefox
  • Go to about:timeline
  • Press the orange [ Start ] button; you get a message that the trace is running
  • Proceed with your intended actions, e.g. a page load, and let it finish
  • Return to about:timeline tab and press the [ Stop ] button
  • Wait a little to get your events timeline details

 

So, here it is.  It's a work in progress, though.  I'll keep posting updates.

Firefox 23 now has faster localStorage implementation

Finally, the more efficient implementation of the localStorage DOM API has landed and sticks in Firefox 23, currently on the Nightly channel.  The first build containing the patches has id 20130416030901.

More about the motivation and details in my previous post on DOM storage code overhaul.

I want to thank all involved people, namely Vladan Djeric, Marco Bonardo and Olli Pettay for their help, ideas and mainly reviews of this large change.  Thanks!

 

 

Usage of QueryPerformanceCounter in Gecko

It's known to be slow and to have problems with unpredictable jumps backward and forward, but only on a minority of platforms.  Still, some protective measures must be taken.

I've observed really significant performance issues only on older systems running Windows XP (32-bit).

Problems causing the biggest pain are:

  • QueryPerformanceCounter may jump against the expected monotonicity when a thread switches processor cores (which happens quite often, by the way); a solution can be pinning one's thread to a single core.  This, though, happens on a minority of systems, like 64-bit processors running 64-bit Windows XP.
  • QueryPerformanceCounter may not exactly reflect time passed while a machine has been suspended, or simply may not increase as expected.
  • When called very often, e.g. in loops, it may even influence the performance of I/O operations.

A year or so ago I implemented a high-resolution timer to be used in the Mozilla Platform.  The code, as it is, is pretty complicated: it tries to calibrate QueryPerformanceCounter values against GetTickCount, which is known to be very stable and also fast to call, but which, depending on the system, increases only in intervals of no better than 15.6 milliseconds.  I chose to calibrate QueryPerformanceCounter with GetTickCount based on this comparison of Windows API timer functions.

Recently, I got an idea to simplify the QueryPerformanceCounter usage implementation quite a lot.  It also comes from a demand to prevent use of QueryPerformanceCounter for measuring timeouts longer than, say, 500ms.  That would just be a waste of resources; GetTickCount may perfectly fulfill the purpose for longer durations.

The new code does the following: when the Now() function is called, the result is a compound value keeping both the QueryPerformanceCounter and GetTickCount results.  When measuring the duration between two calls to Now(), I can check how much QueryPerformanceCounter differs from GetTickCount across the two timestamps.  If it differs too much, I simply use GetTickCount to calculate the duration.  If it differs too much and the interval between the two timestamps is very short, I can even conclude that QueryPerformanceCounter is not stable and consider not using it at all.

Calculation of duration between two timestamps using QueryPerformanceCounter (QPC) and GetTickCount (GTC):

  • jitter = ABS((QPC1 - GTC1) - (QPC2 - GTC2))
  • if jitter is less than 15.6 x 2, then calculate and return the duration as QPC1 - QPC2
  • jitter = jitter - 15.6 x 2
  • if jitter is more than 50ms per 1000ms of duration, then calculate and return the duration as GTC1 - GTC2
  • and, if ABS(GTC1 - GTC2) is less than 2000ms, stop using QueryPerformanceCounter at all - it misbehaves

15.6 ms is the result of GetSystemTimeAdjustment's timeIncrement.  2000ms is a duration considered "too short" for QPC to legitimately misbehave over.  The 50ms per 1000ms threshold is an empirically gathered number.  So far I know it is able to detect really badly behaving QueryPerformanceCounter implementations.
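
Below is a simplified sketch of that duration calculation; the real Gecko TimeStamp code differs in detail, and the names, as well as the assumption that QPC values are already converted to milliseconds, are mine.

#include <cmath>

static const double kGTCTickMs = 15.6;              // GetSystemTimeAdjustment increment
static const double kFailureThresholdMs = 50.0;     // allowed QPC drift per 1000ms
static const double kHardFailureDurationMs = 2000;  // "too short to misbehave" limit

static bool sQPCReliable = true;  // flipped off when QPC proves broken

struct Timestamp {
    double qpc;  // QueryPerformanceCounter reading, converted to ms
    double gtc;  // GetTickCount reading, ms
};

// Duration between two timestamps (timestamp 1 being the later one),
// following the rules listed above.
double Duration(const Timestamp& t1, const Timestamp& t2) {
    double gtcDuration = t1.gtc - t2.gtc;
    if (!sQPCReliable) {
        return gtcDuration;
    }

    // How much do the two clocks disagree about this interval?
    double jitter = std::fabs((t1.qpc - t1.gtc) - (t2.qpc - t2.gtc));

    // Disagreement within two GetTickCount ticks is expected noise.
    if (jitter <= 2 * kGTCTickMs) {
        return t1.qpc - t2.qpc;
    }
    jitter -= 2 * kGTCTickMs;

    // Allow 50ms of QPC drift per 1000ms of measured duration.
    if (jitter > kFailureThresholdMs * std::fabs(gtcDuration) / 1000.0) {
        // QPC misbehaved on an interval too short to excuse it: stop using it.
        if (std::fabs(gtcDuration) < kHardFailureDurationMs) {
            sQPCReliable = false;
        }
        return gtcDuration;
    }

    return t1.qpc - t2.qpc;
}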

This radically simplifies the current code and by its nature solves the suspend/wake-up issue.  System sleep time will always be correctly reflected by GetTickCount.  QueryPerformanceCounter inconsistency after waking the system up will simply be ignored.

This also allows a very simple implementation of NowLoRes(), just by not storing the QueryPerformanceCounter result in the timestamp value.