Adventures in Traffic Analytics

Monday, 19 August 2013

Google Referrer Query Strings Debunked Part 4

In Part 3, we learned that the ved query parameter is actually protobuf-encoded and therefore represents a message. I also provided a little script that decodes the ved structure. What's left to do now is to deduce the proper names of the variables.

As a reminder, here is the output from the script from Part 3:

…
v1: 58
v2: 22
v6: 3

v1: 67
v2: 22
v6: 4
v7: 10

v1: 67
v2: 22
v6: 4
v7: 10

v1: 6
v2: 22
v5: 2
…

Parameters v6 and v7 are quite easy to deduce: they correspond to the parameters r and s of the plain text variant. My guess would be that r and s stand for [r]esult_position and [s]tart. Both are zero-based indices: start marks where the page begins (the pagination offset) and result_position is the number of the visited result on that page. You can calculate the absolute position simply via absolute_position = start + result_position.
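As a small illustration of that arithmetic, here is a minimal sketch. The function name is my own, and treating a missing start as 0 (first page) is an assumption based on the records above that carry v6 but no v7.

```python
def absolute_position(result_position, start=0):
    """Zero-based rank of the visited result across all result pages.

    Assumption: a ved without a start field refers to the first page,
    so start defaults to 0.
    """
    return start + result_position

# Values from the decoder output above: v6 (r) = 4, v7 (s) = 10.
print(absolute_position(result_position=4, start=10))  # 14

# Record with v6 = 3 and no v7.
print(absolute_position(result_position=3))  # 3
```

Since both indices are zero-based, a value of 14 would be the 15th result overall.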

v5 could be called sub_link_position. It is also a zero-based index and denotes the position of a related link.

Here are some screenshots to make the matter a little clearer:

sub_link_position in a search result

sub_link_position in related searches

Then we have two parameters that are called i and t in the plain form. My guess would be that t stands for type. On normal web searches this parameter is always 22; on image search it is 429.

The parameter i is an interesting one. It is monotonically increasing on each page of search results, and it changes depending on whether you're logged in or not. When searching for "grumpy cats" while logged into a Google account, i was 42 for the first result. When not logged in, it was 57 (at least in my test case).

Sure enough, the results are ordered differently depending on whether you're logged in. So maybe the i parameter is somehow related to individualised search? Let's say i is a value that denotes the relevance of the search result to the user. I'll call it index_boost. More research is needed here as well; please comment if you find out more.

That makes five parameters: index_boost, type, sub_link_position, result_position and start. As mentioned earlier, protobuf messages are key-value pairs. The key in this case is a positive (non-zero) integer. The parameters I've found so far have the field numbers 1, 2, 5, 6 and 7, and none of the veds I've encountered had any other field set. It is possible that Google deprecated earlier parameters of the ved message. Again, more research is needed here. It would be pretty awesome if some of you could provide huge dumps of veds to do more digging.
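To make the mapping concrete, here is a minimal sketch that renames decoded field numbers to the names guessed in this post. The dictionary and the helper name are hypothetical, and the names themselves are guesses, not anything Google has documented.

```python
# Field numbers observed so far, mapped to the names guessed in this post.
VED_FIELD_NAMES = {
    1: "index_boost",
    2: "type",
    5: "sub_link_position",
    6: "result_position",
    7: "start",
}

def name_fields(raw):
    """Rename {field_number: value} to {name: value}; unknown fields stay 'vN'."""
    return {VED_FIELD_NAMES.get(num, "v%d" % num): value
            for num, value in raw.items()}

print(name_fields({1: 67, 2: 22, 6: 4, 7: 10}))
# {'index_boost': 67, 'type': 22, 'result_position': 4, 'start': 10}
```

Fields that have not been seen yet (3, 4, 8 and up) keep their generic vN name, so new dumps would show them right away.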

I am also currently able to generate valid veds. Because I've compared the generated ones with given ones and found no difference, I am quite confident that I've captured all parameters (in my dataset).
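Generating a ved could look roughly like the following. This is a minimal sketch and not the code from the repository: the function names are made up, it only handles varint fields, and writing the fields in ascending field-number order is an assumption based on the veds in my logs.

```python
import base64

def encode_varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # msb set: more bytes follow
        else:
            out.append(low7)         # msb clear: last byte
            return bytes(out)

def encode_ved(fields):
    """Build a base64-style ved from {field_number: value} (varint fields only)."""
    raw = bytearray()
    for number in sorted(fields):
        raw += encode_varint(number << 3)  # key: field number, wire type 0
        raw += encode_varint(fields[number])
    encoded = base64.urlsafe_b64encode(bytes(raw)).decode("ascii")
    return "0" + encoded.rstrip("=")       # leading 0 marks the base64 variant

print(encode_ved({1: 47, 2: 22, 6: 0}))  # 0CC8QFjAA
```

The result matches the most frequent ved from the log excerpt in Part 2.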

You're invited to try out the online demo of the decoder or pull the source from GitHub. Comments and pull requests are highly welcome. Please keep me posted if you find out something new.

That's all for now
-- Benjamin

Google Referrer Query Strings Debunked Part 3

Read Part 2

In the last part we learned that there are two types of encoding for the ved parameter: plain and base64. We also found out that a variable-length encoding is used to store the ved parameter.

As mentioned, I've seen something like that before. Ever heard of protobuf? A library developed by Google to encode protocol messages?

[…]Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

(quoted from The Protobuf manual)
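The quoted rule translates into only a few lines of code. Here is a minimal sketch of a varint reader (the function name is my own), fed with the bytes from the binary dump in Part 2:

```python
def decode_varint(data, pos=0):
    """Decode one varint from data starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # lower 7 bits, least significant group first
        if not byte & 0x80:              # msb clear: this was the last byte
            return value, pos
        shift += 7

print(decode_varint(bytes([0b01111111])))              # (127, 1)
print(decode_varint(bytes([0b10000000, 0b00000001])))  # (128, 2)
```

Going from 127 to 128 is exactly the point where the value no longer fits into seven bits and a second byte appears.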

Bingo! After reading the document, it becomes clear that the ved parameters are protobuf-encoded.

Protobuf messages are actually key-value pairs. This means that Google can easily break the regex tricks some people are doing in Google Analytics just by shuffling the order of the key-value pairs.
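In the protobuf wire format, each key packs the field number and a wire type into one varint: the lowest three bits are the wire type and the remaining bits are the field number. A minimal sketch (the helper name is hypothetical) applied to key bytes for the field numbers 1, 2, 5, 6 and 7:

```python
def split_key(key):
    """Split a protobuf field key into (field_number, wire_type)."""
    return key >> 3, key & 0x07

for key in (0x08, 0x10, 0x28, 0x30, 0x38):
    print(hex(key), split_key(key))
# 0x8 (1, 0)
# 0x10 (2, 0)
# 0x28 (5, 0)
# 0x30 (6, 0)
# 0x38 (7, 0)
```

Wire type 0 means "varint". Because every value carries its own key, a decoder that reads keys does not care about the order of the pairs, while a regex that expects fixed byte positions does.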

This also explains why the ved parameter varies in length.

When you read through the protobuf manual, you even find examples that share a lot with what you'd expect to find inside the ved parameter. When I found this example, I imagined the Google engineers laughing their asses off when reading the articles that suggested using regex to decode the oh-so-secret ved parameter. Don't get me wrong, you guys did great work in analysing the data, but it's still worth a laugh.

You should note that a lot of the information and tracking that's done is already available via Google Webmaster Tools (average position, etc.).

It's all there; all that's left to do is to craft a .proto file that resembles the structure of the ved. At this point, I'd like to coin the term VisitEventDescriptor or VisibleElementDescriptor for the ved structure. I think they both would fit nicely.

I hacked up a little .proto file that assumes that all values are integers (declared as repeated int64 fields). I then began copy-pasting the ved parameter from the search result pages. It is also noteworthy that the parameters for image search are not protobuf-encoded. The abbreviations give some clue about what the values could actually mean.

Here is the prototype .proto file:

package test;

message Ved
{
        repeated int64 v1  = 1;
        repeated int64 v2  = 2;
        repeated int64 v3  = 3;
        repeated int64 v4  = 4;
        repeated int64 v5  = 5;
        repeated int64 v6  = 6; 
        repeated int64 v7  = 7; 
        repeated int64 v8  = 8; 
        repeated int64 v9  = 9;
        repeated int64 v10 = 10;
}

You can compile this .proto file via

protoc -I. --python_out=proto ved.proto

(Note: you have to install the protobuf command line tools first: apt-get install protobuf-compiler python-protobuf on Debian/Ubuntu will do the trick)

Here is a Python script that'll decode veds from stdin.
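The script itself is not reproduced in this text. As a stand-in, here is a minimal sketch that parses the varints by hand instead of using the module generated by protoc. All function names are my own, it only understands varint fields, and it skips anything that is not a base64-style ved.

```python
#!/usr/bin/env python3
import base64
import sys

def read_varint(data, pos):
    """Read one varint from data at pos; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def decode_ved(ved):
    """Decode a base64-style ved (leading '0') into {field_number: value}."""
    if not ved.startswith("0"):
        raise ValueError("not a base64-encoded ved: %r" % ved)
    payload = ved[1:]
    raw = base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4))
    fields, pos = {}, 0
    while pos < len(raw):
        key, pos = read_varint(raw, pos)
        if key & 0x07:
            raise ValueError("unexpected wire type %d" % (key & 0x07))
        value, pos = read_varint(raw, pos)
        fields[key >> 3] = value
    return fields

def format_ved(ved):
    """Render a ved as 'vN: value' lines, one per field."""
    return "\n".join("v%d: %d" % item for item in sorted(decode_ved(ved).items()))

if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            print(format_ved(line))
            print()
        except (ValueError, IndexError) as err:
            print("skipping %r: %s" % (line, err), file=sys.stderr)
```

The output uses the same vN: value layout as the listing that follows, with a blank line between veds.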

Running "cat veds.txt|./decode_veds.py" might yield the following output:

…
v1: 58
v2: 22
v6: 3

v1: 67
v2: 22
v6: 4
v7: 10

v1: 6
v2: 22
v5: 2
…

Now let's play a little Sherlock Holmes and find out what information those values expose and what the names of the variables could be.

Continue to part 4

Google Referrer Query Strings Debunked Part 2

Read part 1

The first thing I did was to grep through my logs to get all ved parameters. I then sorted them by frequency:

    842 0CC8QFjAA
    726 0CCoQFjAA
    718 0CCwQFjAA
    670 0CC0QFjAA
    602 0CCkQFjAA
        …
      2 0CAUQ_AUoAA
      2 0CAoQvwU
      1 1t:429,r:65,s:500,i:199
      1 0CPwCEBYwPw
      1 0CPwBENUCKAU
      1 0CPoBEBYwFQ

(output of cat veds.txt|sort|uniq -c|sort -rn)

Interestingly enough, there was only one entry that started with a 1, while all other entries started with a 0. So I think we've got our first parameter:

The first character of the ved query parameter denotes whether the value is encoded as plain text or not.

Decoding the plain text variant is trivial. What's left is to deduce what the abbreviated parameter names t, r, s and i stand for. I would bet my bottom dollar that r stands for result_number and s stands for start (the pagination parameter).
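For completeness, here is a minimal sketch of a parser for the plain text variant. The function name is my own, and it assumes that all values are integers, as in the single example from my logs.

```python
def parse_plain_ved(ved):
    """Parse a plain-text ved such as '1t:429,r:65,s:500,i:199' into a dict."""
    if not ved.startswith("1"):
        raise ValueError("not a plain-text ved: %r" % ved)
    pairs = (item.split(":", 1) for item in ved[1:].split(","))
    return {name: int(value) for name, value in pairs}

print(parse_plain_ved("1t:429,r:65,s:500,i:199"))
# {'t': 429, 'r': 65, 's': 500, 'i': 199}
```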

Now let's move on to the more interesting-looking part: the ones that start with a 0. Anyone who has looked at a raw email will notice that CAUQ_AUoAA looks suspiciously like a variant of base64. It is the URL-safe alphabet (described in RFC 4648, section 5) rather than the standard one: first of all, the padding equal signs at the end are missing (because they are not really necessary, consume space and would complicate URL encoding); secondly, - and _ are used instead of + and / (again, because those would complicate things). Nevertheless, it's easy enough to decode in Python, with s being the ved without its leading 0:

import base64
base64.urlsafe_b64decode(str(s) + '=' * (-len(s) % 4))

This will give you the raw bytes of the message.

To find out what those bytes actually mean, I dumped all the veds I have as binary and sorted them:

cat veds.txt|./bindump.py|sort|uniq

bindump.py can be found on GitHub
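For readers who don't want to fetch the repository, a script of this kind might look like the following. This is a minimal sketch under my own assumptions and not the actual bindump.py: it skips everything that is not a base64-style ved and prints each byte as eight bits, separated by |.

```python
#!/usr/bin/env python3
import base64
import sys

def bindump(ved):
    """Return the bytes of a base64-style ved as '|'-separated bit strings."""
    payload = ved[1:]                    # drop the leading 0
    padding = "=" * (-len(payload) % 4)  # restore the omitted '=' padding
    raw = base64.urlsafe_b64decode(payload + padding)
    return "|".join(format(byte, "08b") for byte in raw)

if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if line.startswith("0"):
            print(bindump(line))
```

For 0CC8QFjAA this prints 00001000|00101111|00010000|00010110|00110000|00000000.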

An excerpt of the output is the following:

       a        b       c         d        e        f
1: 00001000|01111111|00010000|00010110|00110000|00001011
2: 00001000|01111111|00010000|00010110|00110000|00001110
3: 00001000|10000000|00000001|00010000|00010110|00110000|…
4: 00001000|10000000|00000001|00010000|00010110|00110000|…

There are a few interesting things to note here:

The first byte is always 00001000.
When the first bit (the most significant bit) of the second byte flips from 0 to 1, the byte in column c moves to column d (compare rows 2 and 3).

As you can see, there is a pattern emerging. The second bullet point means that this is a variable-length encoding of the values. It turns out that I've seen something like that before and, in retrospect, it's rather obvious. But it's good to know that we're on the right track…

Read how to decode the contents of the ved message.

Continue to Part 3

Google Referrer Query Strings Debunked Part 1

Every breath you take
every move you make…
-- Sting, 1983 Full Lyrics

If Google were a person, you could think of its behaviour as either romantic or compulsive. As you might know, Google is tracking every click and every impression of its search results. When you click on a link in the search results, you don't get redirected to that page right away; you're taken on another hop over http://www.google.com/url?[tracking arguments here]. If you take a look at the logs of your webserver, you'll probably notice that a great deal of your traffic got referred from such a URL.

To understand the way traffic is flowing to websites, it is common practice to analyse these referrer URLs to get as much information as possible about what caused the clicks. Recently, Google began to eliminate the q parameter from this URL. This is the reason why a great portion of keywords are showing up as "(not provided)" in Google Analytics. In effect, web masters no longer know which keywords caused the clicks. This is supposedly done to protect the users' privacy, which is a bit of a stretch considering that advertisers (allegedly) still get this information. Some even say that this information will be available again in the future via the premium version of Google Analytics (GA).

The absence of the q parameter has sparked some interest in the remaining parameters, because they might still provide intelligence to web masters.

In mid-2012, Tim Minor wrote a blog post about the different parameters but left many questions unanswered. In late 2012, a member of the Spanish SEO community blogged about the ved parameter and made some progress in decoding it.

The latter post sparked my interest. So I decided to take a closer look at the ved parameter.

In this article series, I'll guide you through the process of finding the encoding of the ved query parameter. I'll show how you can implement a decoder for this parameter (including code snippets) and give you insight into what information you can obtain by doing so.

To complement this article, I've set up an Online Demo and a Repository on GitHub.

Read which variants of the ved parameter exist and how to decode them.

Continue to part 2
