Adventures in Traffic Analytics: Google Referrer Query Strings Debunked Part 2

Read part 1

The first thing I did was to grep through my logs to get all ved parameters. I then sorted them by frequency:

    842 0CC8QFjAA
    726 0CCoQFjAA
    718 0CCwQFjAA
    670 0CC0QFjAA
    602 0CCkQFjAA
        …
      2 0CAUQ_AUoAA
      2 0CAoQvwU
      1 1t:429,r:65,s:500,i:199
      1 0CPwCEBYwPw
      1 0CPwBENUCKAU
      1 0CPoBEBYwFQ

(output of cat veds.txt|sort|uniq -c|sort -rn)
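The same tally can be sketched in Python 3 instead of a shell pipeline. This is only an illustration: the function name `ved_counts` and the sample referrer URLs are made up here, and it assumes the referrers have already been pulled out of the log lines.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def ved_counts(referrers):
    """Count how often each ved value occurs in an iterable of referrer URLs."""
    counts = Counter()
    for url in referrers:
        query = urlsplit(url).query
        # parse_qs returns a list per key; a missing ved yields an empty list
        for ved in parse_qs(query).get("ved", []):
            counts[ved] += 1
    return counts

# Made-up referrers, only for demonstration
sample = [
    "http://www.google.com/url?sa=t&ved=0CC8QFjAA&q=x",
    "http://www.google.com/url?sa=t&ved=0CC8QFjAA",
    "http://www.google.com/url?ved=0CAoQvwU",
]

# Most frequent first, like sort|uniq -c|sort -rn
for ved, n in ved_counts(sample).most_common():
    print(f"{n:7d} {ved}")
```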

Interestingly enough, there was only one entry that started with a 1, while all other entries started with a 0. So I think we've got our first parameter:

The first character of the ved query parameter denotes whether the value is encoded as plain text or not.

Decoding the plain-text variant is trivial. What's left is to deduce what the abbreviated parameter names t, r, s and i stand for. I would bet my bottom dollar that r stands for result_number and s stands for start (the pagination parameter).
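Splitting the plain-text variant into its fields might look like the following Python 3 sketch. The function name `parse_plain_ved` is hypothetical, and it assumes that every value is an integer, which holds for the single sample in my logs but is not confirmed beyond that.

```python
def parse_plain_ved(ved):
    """Split a plain-text ved value such as '1t:429,r:65,s:500,i:199' into a dict."""
    if not ved.startswith("1"):
        raise ValueError("not a plain-text ved value")
    fields = {}
    # Drop the leading '1', then read comma-separated key:value pairs
    for pair in ved[1:].split(","):
        key, _, value = pair.partition(":")
        fields[key] = int(value)  # assumption: values are always integers
    return fields

print(parse_plain_ved("1t:429,r:65,s:500,i:199"))
# {'t': 429, 'r': 65, 's': 500, 'i': 199}
```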

Now let's move on to the more interesting-looking part: the ones that start with a 0. Anyone who has looked at a raw email will notice that CAUQ_AUoAA looks suspiciously like a variant of base64, namely the URL-safe one that RFC 4648 calls base64url. First of all, the padding equal signs at the end are missing (they are not strictly necessary, consume space and would complicate URL encoding); secondly, - and _ are used instead of + and / (again, because those would complicate URL encoding). Nevertheless, it's easy enough to decode in Python:

# s is the ved value without its leading "0"; pad it to a multiple of 4 characters
base64.urlsafe_b64decode(str(s) + '=' * (-len(s) % 4))

This will give you the raw bytes of the message.
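As a worked example, here is a small Python 3 sketch that strips the leading 0, restores the padding and decodes with the standard library's URL-safe decoder. The helper name `decode_ved` is made up for this illustration.

```python
import base64

def decode_ved(ved):
    """Return the raw bytes of a ved value that starts with '0'."""
    if not ved.startswith("0"):
        raise ValueError("not a base64-encoded ved value")
    payload = ved[1:]
    # base64 works on groups of 4 characters; add back the stripped '=' padding
    padding = "=" * (-len(payload) % 4)
    return base64.urlsafe_b64decode(payload + padding)

print(decode_ved("0CC8QFjAA").hex())    # 082f10163000
print(decode_ved("0CAUQ_AUoAA").hex())  # 080510fc052800
```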

To find out what those bytes actually mean, I dumped all the veds I have as binary and sorted them:

cat veds.txt|./bindump.py|sort|uniq

bindump.py can be found on GitHub
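For readers who just want to follow along, a rough stand-in for such a dump could look like the Python 3 sketch below. It is not the script from GitHub; the function name `to_binary_columns` and the output format (bytes as 8-digit binary, separated by |) are assumptions modelled on the excerpt that follows.

```python
import base64

def to_binary_columns(ved):
    """Render a '0'-prefixed ved value as binary bytes separated by '|'."""
    payload = ved[1:]  # drop the leading '0'
    raw = base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4))
    return "|".join(f"{byte:08b}" for byte in raw)

print(to_binary_columns("0CC8QFjAA"))
# 00001000|00101111|00010000|00010110|00110000|00000000
```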

An excerpt of the output is the following:

       a        b       c         d        e        f
1: 00001000|01111111|00010000|00010110|00110000|00001011
2: 00001000|01111111|00010000|00010110|00110000|00001110
3: 00001000|10000000|00000001|00010000|00010110|00110000|…
4: 00001000|10000000|00000001|00010000|00010110|00110000|…

There are a few interesting things to note here:

- The first byte is always 00001000.
- When the first bit (the most significant bit) of the second byte flips from 0 to 1, the byte that was in column c moves to column d (compare rows 2 and 3).

As you can see, there is a pattern emerging. The second bullet point means that this is a variable-length encoding of the values. It turns out that I've seen something like that before, and in retrospect it's rather obvious. But it's good to know that we're on the right track…
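One common scheme that behaves like this is a base-128 varint: the top bit of each byte says "more bytes follow", and the lower seven bits carry the value, least significant group first. Whether ved uses exactly this scheme is an assumption at this point; the Python 3 sketch below only shows that it would explain rows 2 and 3. The function name `read_varint` is made up.

```python
def read_varint(data, pos):
    """Read a base-128 varint from data starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        value |= (byte & 0x7F) << shift  # lower 7 bits carry the payload
        pos += 1
        if not byte & 0x80:              # top bit clear: this was the last byte
            return value, pos
        shift += 7

# First bytes of rows 2 and 3 from the excerpt above
row2 = bytes([0b00001000, 0b01111111, 0b00010000])
row3 = bytes([0b00001000, 0b10000000, 0b00000001, 0b00010000])

print(read_varint(row2, 1))  # (127, 2)
print(read_varint(row3, 1))  # (128, 3)
```

Under that assumption the second field simply counts from 127 to 128, and the 00010000 byte follows directly after it in both rows.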

Read how to decode the contents of the ved message.

Continue to Part 3

Monday, 19 August 2013
