String parsing made simple with mozilla::Tokenizer

28th July, 2015




I can see FindChar, Substring, ToInteger and even atoi, strchr, strstr and sscanf craziness all over the Mozilla code base. There are, though, much better and, more importantly, safer ways to parse even a very simple input.

I wrote a parser class, with an API derived from lexical analyzers, that makes parsing simple inputs very easy. Just include mozilla/Tokenizer.h and use the class mozilla::Tokenizer. It implements a subset of the features of a lexical analyzer. It also nicely hides boundary checks of the input buffer from the consumer.

To describe the principle briefly: Tokenizer recognizes tokens like whole words, integers, white spaces and special characters.  The consumer never works directly with the string or its characters but only with pre-parsed parts (identified tokens) returned by this class.


There are two main methods of Tokenizer:

  • bool Next(Token& result);

If there is anything to read from the input at the current internal read position, including the EOF, it returns true and result is filled with a token type and an appropriate value, easily accessible via a simple variant-like API.  The internal read cursor is shifted to the start of the next token in the input before this method returns.

  • bool Check(const Token& tokenToTest);

If the token at the current internal read position is equal (by type and value) to what has been passed in the tokenToTest argument, true is returned and the internal read cursor is shifted to the next token.  Otherwise (the token is different from the expected one) false is returned and the read cursor is left unaffected.
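
To make the contract of these two methods concrete, here is a minimal, self-contained sketch of a cursor-based tokenizer with the same Next/Check behaviour. It is not the mozilla::Tokenizer implementation: the MiniTokenizer class, its Token struct and the simplified ASCII classification are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for mozilla::Tokenizer, for illustration only.
class MiniTokenizer {
public:
  enum Type { TOKEN_WORD, TOKEN_INTEGER, TOKEN_CHAR, TOKEN_WS, TOKEN_EOF };

  struct Token {
    Type type;
    std::string text;  // raw characters of the token; empty for TOKEN_EOF

    Token() : type(TOKEN_EOF) {}
    Token(Type aType, std::string aText) : type(aType), text(std::move(aText)) {}

    bool operator==(const Token& aOther) const {
      return type == aOther.type && text == aOther.text;
    }
  };

  explicit MiniTokenizer(std::string aInput) : mInput(std::move(aInput)) {}

  // Fills aResult with the token at the cursor and moves the cursor past it.
  // Returns false only after TOKEN_EOF has already been consumed.
  bool Next(Token& aResult) {
    if (mPastEnd) {
      return false;
    }
    aResult = Parse();
    Consume(aResult);
    return true;
  }

  // Moves the cursor only if the token at the cursor equals aExpected.
  bool Check(const Token& aExpected) {
    if (mPastEnd) {
      return false;
    }
    const Token found = Parse();
    if (!(found == aExpected)) {
      return false;  // cursor left untouched
    }
    Consume(found);
    return true;
  }

private:
  static bool IsAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
  static bool IsDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
  static bool IsSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }

  // Classifies the token starting at the cursor without moving the cursor.
  Token Parse() const {
    if (mCursor >= mInput.size()) {
      return Token();  // TOKEN_EOF
    }
    const char first = mInput[mCursor];
    std::size_t end = mCursor + 1;
    Type type = TOKEN_CHAR;
    if (IsAlpha(first)) {
      type = TOKEN_WORD;
      while (end < mInput.size() && IsAlpha(mInput[end])) ++end;
    } else if (IsDigit(first)) {
      type = TOKEN_INTEGER;
      while (end < mInput.size() && IsDigit(mInput[end])) ++end;
    } else if (IsSpace(first)) {
      type = TOKEN_WS;  // a single whitespace character per token
    }
    return Token(type, mInput.substr(mCursor, end - mCursor));
  }

  // Advances the cursor past aToken; consuming TOKEN_EOF ends the input.
  void Consume(const Token& aToken) {
    assert(mCursor + aToken.text.size() <= mInput.size());
    mCursor += aToken.text.size();
    if (aToken.type == TOKEN_EOF) {
      mPastEnd = true;
    }
  }

  std::string mInput;
  std::size_t mCursor = 0;
  bool mPastEnd = false;
};
```

Note how the consumer never indexes into the string: all boundary checks live in Parse(), and a failed Check() leaves the cursor where it was, so several alternatives can be tested one after another.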

A few usage examples:


  #include "mozilla/Tokenizer.h"

  mozilla::Tokenizer p(NS_LITERAL_CSTRING("Sample string 2015."));

Reading a single token, examining it

  mozilla::Tokenizer::Token t;
  bool read = p.Next(t);
  // read == true, we have read something and t has been filled
  // Following our example string...
  if (t.Type() == mozilla::Tokenizer::TOKEN_WORD) {
    t.AsString(); // returns "Sample"
  }

Checking on a token value and automatically skipping on a positive test

  if (!p.CheckChar('\x20')) {
    throw "I expect a space here!";
  }

  read = p.Next(t);
  // read == true
  t.Type() == mozilla::Tokenizer::TOKEN_WORD;
  t.AsString() == "string";

  if (!p.CheckWhite()) {
    throw "A white space is expected here!";
  }

Reading numbers

  read = p.Next(t);
  // read == true
  t.Type() == mozilla::Tokenizer::TOKEN_INTEGER;
  t.AsInteger() == 2015;

Reaching the end of the input

  read = p.Next(t);
  // read == true
  t.Type() == mozilla::Tokenizer::TOKEN_CHAR;
  t.AsChar() == '.';

  read = p.Next(t);
  // read == true
  t.Type() == mozilla::Tokenizer::TOKEN_EOF;

  read = p.Next(t);
  // read == false, we are behind the EOF
  // t is here undefined!

More features

To learn about the more advanced features of the Tokenizer, look at the well-documented Tokenizer.h file under xpcom/ds.  There are not that many of them, don’t be scared ;)

As a teaser, you can go through this more advanced example or check the gtest for Tokenizer:

#include "mozilla/Tokenizer.h"

using namespace mozilla;

  // A simple list of key:value pairs delimited by commas
  nsCString input("message:parse me,result:100");

  // Initialize the parser with an input string
  Tokenizer p(input);
  // A helper var keeping type and value of the token just read
  Tokenizer::Token t;

  // Loop over all tokens in the input
  while (p.Next(t)) {
    if (t.Type() == Tokenizer::TOKEN_WORD) {
      // A 'key' name found
      if (!p.CheckChar(':')) {
        // Must be followed by a colon
        return; // unexpected character
      }

      // Note that here the input read position is just after the colon
      // Now switch by the key string
      if (t.AsString() == "message") {
        // Start grabbing the value
        p.Record();
        // Loop until EOF or comma
        while (p.Next(t) && !t.Equals(Tokenizer::Token::Char(','))) {
          // Nothing to do, just skip the tokens of the value
        }
        // Claim the result
        nsAutoCString value;
        p.Claim(value);
        MOZ_ASSERT(value == "parse me");

        // We must revert the comma so that the code below recognizes the flow correctly
        p.Rollback();
      } else if (t.AsString() == "result") {
        if (!p.Next(t) || t.Type() != Tokenizer::TOKEN_INTEGER) {
          return; // expected a value and that value must be a number
        }

        // Get the value, here you know it's a valid number
        uint32_t number = t.AsInteger();
        MOZ_ASSERT(number == 100);
      } else {
        // Here t.AsString() is any key but 'message' or 'result', ready to be handled
      }

      // On comma we loop again
      if (p.CheckChar(',')) {
        // Note that now the read position is after the comma
        continue;
      }

      // No comma?  Then only EOF is allowed
      if (p.CheckEOF()) {
        // Cleanly parsed the string
        break;
      }
    }

    return; // The input is not properly formatted
  }


The Tokenizer currently works only with ASCII input, but it can easily be enhanced to also support any UTF-8/16 encoding or even specific code pages if needed.

TCPSocket.js/TCPServerSocket.js IPC mess captured

15th July, 2015

TCPSocket, implemented in JavaScript, is a DOM technology that gives web pages direct access to TCP network sockets.  We support both outgoing connections and listening for incoming connections via a server socket.

I’m constantly asked to review changes to this code in /dom/network, where TCPSocket et al. reside.  And I’m always lost in the mess of all the classes and objects involved in the IPC bridging.

As a side result of the last review request, I dived into the jungle and created a “UML” flow chart that helps me understand, and not forget again, the complicated IPC flow in both TCPSocket and TCPServerSocket.

Here they are.

TCPSocket IPC

TCPServerSocket IPC

Not perfect, I’m no UML expert, but I think one can understand them and get a picture that may be helpful ;)

Unseen astronomical phenomenon discovered

21st June, 2015

A new bright nebula found very close to a water surface.

Nebula or a wooden stick?

Who can recognize what this actually is? ;)

New Gecko performance tool: Backtrack

9th June, 2015

Backtrack aims to show a complete code path flow from any point back to its source, crossing asynchronous callbacks, threads, processes, network requests, timers and any kind of implementation-specific queuing, plus capturing any I/O or mutex blocking.  The ‘critical flow execution path’ is put into the context of all the remaining concurrent execution flows.  It’s then easy to examine how the critical flow is blocked and delayed by concurrent tasks.

The work is tracked in this bug, where you can also find patches and build instructions.  There is also an add-on that, in Backtrack-enabled builds, allows you to view the actual captured data.

Click the screenshot below to view an interactive preview.  It’s a capture of the load of my blog’s main page up to the first-paint notification (no e10s and no network predictor, to demonstrate the capture capabilities).


Backtrack combines*) Gecko Profiler and Task Tracer.

Gecko Profiler (SPS) provides instrumentation (already spread around the code base) to capture static call stacks.  I’ve enhanced the SPS instrumentation to also capture objects (i.e. the 'this' pointer value) and added a simple base class to easily monitor object lifetime (classes must be instrumented).

Task Tracer (TT), on the other hand, provides a generic way to trace runnables back to their origin, but not e.g. network poll results, network requests or implementation-specific queues.  It was easy to add a hook into the TT code that connects the captured object inter-calls with information about the dispatch source and target of runnables.

The Backtrack experimental patch:

  • Captures object lifetime (simply add ProfilerTracked<class Derived> as a base class to track the object lifetime and class name automatically)
  • Annotates objects with the resource names (e.g. URI, host name) they work with at run-time
  • Connects stack and object information using the existing PROFILER_LABEL_FUNC instrumentation, recording the this pointer value automatically; this way it collects calls between objects
  • Measures I/O and mutex wait time; an object holding a lock can be easily found
  • Ties the receipt of a particular network response exactly to its actual request transmission (here I mainly mean HTTP, but it also applies to connect() and DNS resolution)
  • Joins network polling “ins” and “outs”
  • Binds code-specific queuing and dequeuing, like our DNS resolver and HTTP request queues.  Those are not just ‘dispatch and forget’ like nsIEventTarget and nsIRunnable but rather have priorities and complex dequeue conditions, and may not end up dispatched to just a single thread.  These queues are very important from the resource scheduling point of view.
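
The first bullet relies on a base-class template.  As a rough illustration of how such a template can observe construction and destruction, here is a minimal CRTP sketch.  The names LifetimeTracked, LifetimeEvents and HttpRequest are hypothetical, the class name is supplied by hand here, and the real ProfilerTracked in the patch may work quite differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory event log standing in for a real capture buffer.
inline std::vector<std::string>& LifetimeEvents() {
  static std::vector<std::string> events;
  return events;
}

// CRTP base: Derived passes itself as the template argument, so the base
// can ask Derived for its class name without any virtual calls.
template <class Derived>
class LifetimeTracked {
public:
  // Number of currently living instances of Derived.
  static int LiveCount() { return sLive; }

protected:
  LifetimeTracked() { ++sLive; Log("create "); }
  LifetimeTracked(const LifetimeTracked&) { ++sLive; Log("create "); }
  ~LifetimeTracked() {
    assert(sLive > 0);
    --sLive;
    Log("destroy ");
  }

private:
  static void Log(const char* aEvent) {
    LifetimeEvents().push_back(std::string(aEvent) + Derived::ClassName());
  }

  static int sLive;
};

template <class Derived>
int LifetimeTracked<Derived>::sLive = 0;

// An instrumented class only needs the base class and a name.
class HttpRequest : public LifetimeTracked<HttpRequest> {
public:
  static const char* ClassName() { return "HttpRequest"; }
};
```

Creating and destroying one HttpRequest leaves the two entries "create HttpRequest" and "destroy HttpRequest" in the log, which is the kind of raw data a lifetime view can be built from.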


  • IPC support, i.e. also crossing processes
  • Let the analysis also mark anything ‘related’ to reaching a selected path end (e.g. my favorite first-paint time and all the CSS loads involved)
  • Probably persist the captured raw logs and allow the analysis to be done offline

Disadvantages: just one – significant memory consumption.

*) The implementation is so far not deeply bound to the SPS and TT in-memory data structures.  I do the capturing on my own, which is actually a third data collection alongside SPS and TT.  I’m still proving the concept this way, but if it’s found useful and bearable to land in this form as a temporary way of collecting the data, we can optimize and clean up as follow-up work.

Partial solar eclipse 2015

20th March, 2015

It was done very much by hand: the programmable remote shutter release failed and I was simply too lazy to shoot at precise intervals :) The alignment is not entirely accurate, but I like it anyway.

Tripod, Canon EOS 60D, Canon EF 200/2.8 L II, ND8 + Baader Astrosolar.  Each frame of the animation is made of roughly 6 to 10 RAW images @ ISO 100, 1/125s, F/4, no flat field.  Registax 6.


Just a photo…

23rd January, 2015

Misty and cloudy field

Firefox HTTP cache v1 API disabled

6th June, 2014

Recently we landed the new HTTP cache for Firefox (“cache2”) on mozilla-central.  It has been in nightly builds for a while now and seems very likely to stick on the tree and ship in Firefox 32.

Given the positive data we have so far, we’re taking another step today toward making the new cache official: we have disabled the old APIs for accessing the HTTP cache, so addons will now need to use the cache2 APIs. One important benefit of this is that the cache2 APIs are more efficient and never block on the main thread.  The other benefit is that the old cache APIs were no longer pointing at actual data (it’s in cache2) :)

This means that the following interfaces are now no longer supported:

  •   nsICache
  •   nsICacheService
  •   nsICacheSession
  •   nsICacheEntryDescriptor
  •   nsICacheListener
  •   nsICacheVisitor

(Note: for now nsICacheService can still be obtained: however, calling any of its methods will throw NS_ERROR_NOT_IMPLEMENTED.)

Access to previously stored cache sessions is no longer possible, and the update also causes a one-time deletion of old cache data from users’ disks.

Going forward addons must instead use the cache2 equivalents:

  •   nsICacheStorageService
  •   nsICacheStorage
  •   nsICacheEntry
  •   nsICacheStorageVisitor
  •   nsICacheEntryDoomCallback
  •   nsICacheEntryOpenCallback

Below are some examples of how to migrate code from the old to the new cache API.  See the new HTTP cache v2 documentation for more details.

The new cache2 implementation gets rid of some of the terrible features of the old cache (frequent total data loss, main thread jank during I/O), and significantly improves page load performance.  We apologize for the developer inconvenience of needing to upgrade to a new API, but we hope the performance benefits outweigh it in the long run.

Example of the cache v1 code (now obsolete) for opening a cache entry:

var cacheService = Components.classes[";1"]

var session = cacheService.createSession(

    onCacheEntryAvailable: function (entry, access, status) {
      // And here is the cache v1 entry

Example of the cache v2 code doing the same thing:

let {LoadContextInfo} = Components.utils.import(
  "resource://gre/modules/LoadContextInfo.jsm", {}
);
let {PrivateBrowsingUtils} = Components.utils.import(
  "resource://gre/modules/PrivateBrowsingUtils.jsm", {}
);

var cacheService = Components.classes[";1"]

var storage = cacheService.diskCacheStorage(
  // Note: make sure |window| is the window you want
    PrivateBrowsingUtils.privacyContextFromWindow(window, false)),

    onCacheEntryCheck: function (entry, appcache) {
      return Ci.nsICacheEntryOpenCallback.ENTRY_WANTED;
    onCacheEntryAvailable: function (entry, isnew, appcache, status) {
      // And here is the cache v2 entry


There are a lot of similarities.  Instead of a cache session we now have a cache storage with a similar meaning, to represent a distinct space in the whole cache storage; it’s just less generic than it was before so that it cannot be misused now.  There is now a mandatory argument when getting a storage: an nsILoadContextInfo object that distinguishes whether the cache entry belongs to a Private Browsing context, to an Anonymous load, or has an App ID.

(Credits to Jason Duell for help with this blog post)

NGC 7000, NGC 6974, IC 1318 and surroundings + IR

30th May, 2014

NGC 7000, NGC 6974, IC 1318

NGC 7000, NGC 6974, IC 1318 + Infrared


Two almost forgotten photos from a location south of Prague, taken last summer during the night of June 16th to 17th. A very short night: the sun had definitely set only shortly before eleven, and after two it started to dawn again. On the other hand, there were plenty of wild dogs and boars in the surrounding tall grass :)


The upper photo is visible light only; the lower one has a blue overlay from the IR band above 742nm. The quality is poor, since each one is based on just a single exposure, but I like it.


Canon 30D, MC mod
Canon EF 35mm/F2
HEQ5, this time aligned using the drift method
Astronomik CLS-CCD: 1x600s @ F4.0, ISO 1000
Astronomik ProPlanet IR 742: 1x300s @ F4.0, ISO 1000
0x Flat/Dark/Bias
Processed in CR and PS

Headless Fedora 20 and VNC with autologin

30th May, 2014

The “Oh no! Something has gone wrong” message is all you get when you VNC to Gnome 3 in Fedora 20 on a box with autologin and screen sharing (vino) enabled but without any physical monitor attached to any of the video outputs.  There is an error in /var/log/messages: ‘TypeError: this.primaryMonitor is undefined’ at /usr/share/gnome-shell/js/ui/layout.js:410.  I haven’t found an open Fedora bug for this.

You also cannot simply configure e.g. tiger-vnc because of two other bugs, one closed and one open, that prevent entering the password on the login screen, as if somebody were pressing the Enter key over and over.

I was not able to find a straightforward and simple fix until I hit this solution for Ubuntu and ported it to Fedora 20:

  • # yum install xorg-x11-drv-dummy
  • put this content into /etc/X11/xorg.conf (you will probably need to create the file):

Section "Monitor"
Identifier "Monitor0"
HorizSync 28.0-80.0
VertRefresh 48.0-75.0
Modeline "1280x800"  83.46  1280 1344 1480 1680  800 801 804 828 -HSync +Vsync
EndSection

Section "Device"
Identifier "Card0"
Option "NoDDC" "true"
Option "IgnoreEDID" "true"
Driver "dummy"
EndSection

Section "Screen"
DefaultDepth 24
Identifier "Screen0"
Device "Card0"
Monitor "Monitor0"
SubSection "Display"
Depth 24
Modes "1280x800"
EndSubSection
EndSection

You can then VNC to port :0 and you will be logged in directly without a need to enter the user password.  I suggest SSH tunneling.


New Firefox HTTP cache now enabled on Nightly builds

19th May, 2014

Yes, it’s on!  After a little more than a year of development by me and Michal Novotný, all the bugs we could find in our labs, offices and homes have been fixed.  The new cache back end is now enabled on Firefox Nightly builds as of version 32 and should stay that way.

The old cache data are left on disk for now, but we have the means to remove them automatically from users’ machines so as not to waste space, since it’s now just dead data.  This will happen after we confirm the new cache sticks on Nightlies.

The new HTTP cache back end has many improvements, like request prioritization optimized for first-paint time, read-ahead data preloading to speed up large content loads, delayed writes so as not to block first paint, a pool of the most recently used response headers allowing 0ms decisions on reuse or re-validation of a cached payload, 0ms miss-time look-up via an index, smarter eviction policies using a frecency algorithm, resilience to crashes, and zero main thread hangs or jank.  It also eats less memory, but this may be subject to change: my manual measurements with my favorite microSD card show that keeping in memory at least the data of the html, css and js files critical for rendering may be wise.  More research to come.

Thanks to everyone who helped with this effort, namely Joel Maher and Avi Halachmi for helping to chase down Talos regressions, and JW Wang for helping to find the cause of one particularly hard-to-analyze test failure spike.  And also to all the early adopters who helped to find and fix bugs.  Thanks!


New preferences to play with:


Number of kBs we reserve for keeping recently loaded cache entries metadata (i.e. response headers etc.) for quick access and re-validation or reuse decisions.  By default this is at 250kB.
Number of data chunks we always preload ahead of read to speed up load of larger content like images.  Currently size of one chunk is 256kB, and by default we preload 4 chunks – i.e. 1MB of data in advance.


Load time comparison:

Since these tests are pretty time consuming and usually not very precise, I was only testing with page 2 of my blog that links some 460 images.  Media storage devices available were: internal SSD, an SDHC card and a very slow microSD via a USB reader on a Windows 7 box.


[ complete page load time / first paint time ]

Cache version | First visit   | Cold go to 1) | Warm go to 2) | Reload
cache v1      | 7.4s / 450ms  | 880ms / 440ms | 510ms / 355ms | 5s / 430ms
cache v2      | 6.4s / 445ms  | 610ms / 470ms | 470ms / 360ms | 5s / 440ms


Class 10 SDHC
Cache version | First visit   | Cold go to 1) | Warm go to 2)  | Reload
cache v1      | 7.4s / 635ms  | 760ms / 480ms | 545ms / 365ms  | 5s / 430ms
cache v2      | 6.4s / 485ms  | 1.3s / 450ms  | 530ms / 400ms* | 5.1s / 460ms*


Edit: I found one more place to optimize: preloading data sooner in case an entry has already been used during the browser session (bug 1013587).  We gain around 100ms for both first paint and load times!  Also the stddev of the first-paint time is smaller (36) than before (80).  I have also measured the load time for the non-patched cache v2 code more precisely.  It’s actually better.

Slow microSD
Cache version               | First visit  | Cold go to 1) | Warm go to 2)  | Reload
cache v1                    | 13s / 1.4s   | 1.1s / 540ms  | 560ms / 440ms  | 5.1s / 430ms
cache v2                    | 6.4s / 450ms | 1.7s / 450ms  | 710ms / 540ms* | 5.4s / 470ms*
cache v2 (with bug 1013587) | -            | -             | 615ms / 455ms* | -

* We are not keeping any data in memory (bug 975367 and 986179), which seems to be too restrictive.  Some in-memory data caching will be needed.


“Jankiness” comparison:

The way I have measured browser UI jank (those hangs when everything is frozen) was very simple: summing up every stall of the browser UI taking more than 100ms between pressing enter and the end of the page load.
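
The measurement described above can be sketched as a simple filtered sum.  This is only an illustration: the function name TotalJankMs and the plain list of event durations are assumptions, not the tooling actually used for the measurements.

```cpp
#include <cassert>
#include <vector>

// Sums the durations (in milliseconds) of all UI thread events that took
// strictly more than aThresholdMs; shorter events are not counted as jank.
inline long TotalJankMs(const std::vector<long>& aEventDurationsMs,
                        long aThresholdMs = 100) {
  assert(aThresholdMs >= 0);
  long total = 0;
  for (long duration : aEventDurationsMs) {
    if (duration > aThresholdMs) {
      total += duration;
    }
  }
  return total;
}
```

For example, events taking 20, 350, 100 and 250 ms give a total of 600 ms, because only the 350 ms and 250 ms events exceed the threshold.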


[ total time of all UI thread events running for more than 100ms each during the page load ]

Cache version | First visit | Cold go to 1) | Warm go to 2) | Reload
cache v1      | 0ms         | 600ms         | 0ms           | 0ms
cache v2      | 0ms         | 0ms           | 0ms           | 0ms


Class 10 SDHC
Cache version | First visit | Cold go to 1) | Warm go to 2) | Reload
cache v1      | 600ms       | 600ms         | 0ms           | 0ms
cache v2      | 0ms         | 0ms           | 0ms           | 0ms


Slow microSD
Cache version | First visit | Cold go to 1) | Warm go to 2) | Reload
cache v1      | 2500ms      | 740ms         | 0ms           | 0ms
cache v2      | 0ms         | 0ms           | 0ms           | 0ms


All load time values are medians, jank values are averages, from at least 3 runs without extremes in an attempt to lower the noise.
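
For readers who want to aggregate their own runs the same way, here is a small sketch of both statistics.  It assumes that “without extremes” means dropping the single smallest and the single largest run, which is only one possible reading; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Median of the samples; for an even count, the mean of the two middle values.
inline double Median(std::vector<double> aSamples) {
  assert(!aSamples.empty());
  std::sort(aSamples.begin(), aSamples.end());
  const std::size_t mid = aSamples.size() / 2;
  return aSamples.size() % 2 == 1
             ? aSamples[mid]
             : (aSamples[mid - 1] + aSamples[mid]) / 2.0;
}

// Average after discarding the smallest and the largest sample
// (assumed meaning of "without extremes"); needs at least 3 samples.
inline double AverageWithoutExtremes(std::vector<double> aSamples) {
  assert(aSamples.size() >= 3);
  std::sort(aSamples.begin(), aSamples.end());
  const double sum =
      std::accumulate(aSamples.begin() + 1, aSamples.end() - 1, 0.0);
  return sum / static_cast<double>(aSamples.size() - 2);
}
```

With only three runs, dropping the extremes leaves a single value, so under this reading the average and the median coincide.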


1) Open a new tab and navigate to a page right after the Firefox start.

2) Open a new tab and navigate to a page that has already been visited during the browser session.

