Monday, 19 August 2013

Google Referrer Query Strings Debunked Part 3

Read Part 2

In the last part we learned that there are two types of encoding for the ved parameter: plain and base64. We also found out that a variable-length encoding is used to store the ved parameter.

As mentioned, I've seen something like that before. Ever heard of protobuf, a library developed by Google to encode protocol messages?

[…]Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

(quoted from the Protobuf manual)
Bingo! After reading the document, it becomes clear that the ved parameters are protobuf-encoded.
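The varint rule quoted above can be sketched in a few lines of Python. This is only an illustration of the encoding, not code from Google or from the protobuf library; the function name decode_varint is made up for this post. The value 300 is the standard example from the protobuf documentation.

```python
def decode_varint(data, pos=0):
    """Decode one base-128 varint from `data`, starting at index `pos`.

    Returns a tuple (value, next_pos).
    """
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        # The lower 7 bits carry payload, least significant group first.
        result |= (byte & 0x7F) << shift
        # A cleared most significant bit marks the last byte.
        if not byte & 0x80:
            return result, pos
        shift += 7


# 300 is stored as the two bytes 0xAC 0x02.
print(decode_varint(bytes([0xAC, 0x02])))  # (300, 2)
```

Small values fit into a single byte, larger ones need more, which is one reason why two veds rarely have the same length.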

Protobuf messages are essentially sequences of key-value pairs. This means that Google can easily break the regex tricks some people are doing in Google Analytics just by shuffling the order of the key-value pairs.

This also explains why the ved parameter varies in length: fields can be left out, and a varint needs more bytes for larger values.
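To make the ordering point concrete, here is a small sketch that encodes the same three fields in two different orders. It assumes that every value is a varint (wire type 0); the helper names are invented for this example, and the field numbers and values are borrowed from the sample output later in this post. Each key is itself a varint holding (field_number << 3) | wire_type.

```python
import base64

WIRE_VARINT = 0  # protobuf wire type for varint-encoded integers


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(data, pos):
    """Return (value, next_pos) for the varint starting at `pos`."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_fields(pairs):
    """Encode (field_number, value) pairs in the order given."""
    out = b""
    for number, value in pairs:
        out += encode_varint((number << 3) | WIRE_VARINT) + encode_varint(value)
    return out


def decode_fields(data):
    """Decode a message made of varint fields into {field_number: value}."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type != WIRE_VARINT:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields[number], pos = read_varint(data, pos)
    return fields


a = encode_fields([(1, 58), (2, 22), (6, 3)])
b = encode_fields([(6, 3), (1, 58), (2, 22)])
print(a.hex())  # 083a10163003
print(b.hex())  # 3003083a1016
print(base64.urlsafe_b64encode(a).decode("ascii"))  # CDoQFjAD
print(base64.urlsafe_b64encode(b).decode("ascii"))  # MAMIOhAW
print(decode_fields(a) == decode_fields(b))  # True
```

Both byte strings decode to the same message, but their base64 text has nothing in common, so a regex that looks for a fixed substring would match one and miss the other.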

When you read through the protobuf manual, you even find examples that share a lot with what you'd expect to find inside the ved parameter. When I found this example, I imagined the Google engineers laughing their asses off when reading the articles that suggested using regex to decode the oh-so-secret ved parameter. Don't get me wrong, you guys did great work in analysing the data, but it's still worth a laugh.

You should note that a lot of the information and tracking that's done is already available via Google Webmaster Tools (average position, etc.).

It's all there; all that's left to do is to craft a .proto file that resembles the structure of the ved. At this point, I'd like to coin the term VisitEventDescriptor or VisibleElementDescriptor for the ved structure. I think both would fit nicely.

I hacked up a little .proto file that assumes that all values are integers (the fields are declared as repeated int64). I then began copy-pasting ved parameters from the search result pages. It is also noteworthy that the parameters for image search are not protobuf-encoded. The abbreviations give some clue about what the values could actually mean.

Here is the prototype .proto file:

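// Explicit syntax declaration; the file relies on proto2 semantics.
syntax = "proto2";

package test;

// All field names and types are placeholders until the meaning is known.
message Ved
{
        repeated int64 v1  = 1;
        repeated int64 v2  = 2;
        repeated int64 v3  = 3;
        repeated int64 v4  = 4;
        repeated int64 v5  = 5;
        repeated int64 v6  = 6;
        repeated int64 v7  = 7;
        repeated int64 v8  = 8;
        repeated int64 v9  = 9;
        repeated int64 v10 = 10;
}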

You can compile this .proto file via


protoc -I. --python_out=proto ved.proto

(Note: you have to install the protobuf command-line tools first: apt-get install protobuf-compiler python-protobuf on Debian/Ubuntu will do the trick. The output directory proto must already exist.)

Here is a Python script that'll decode veds from stdin.
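A minimal sketch of such a decoder, not the original script, could look like the following. It parses the bytes by hand instead of importing the module generated by protoc, so it runs without any protobuf installation. It rests on assumptions that are not confirmed here: that a base64 ved starts with a one-character type prefix that has to be dropped, that the rest is URL-safe base64 without padding, and that every field is a varint. The function names are made up, and the sample string is built by hand from the first record of the sample output in this post rather than captured from a real results page.

```python
import base64


def read_varint(data, pos):
    """Return (value, next_pos) for the varint starting at `pos`."""
    result = shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_ved(ved):
    """Decode one base64 ved into a list of (field_number, value) pairs."""
    payload = ved.strip()[1:]              # assumption: 1-char type prefix
    payload += "=" * (-len(payload) % 4)   # restore the stripped padding
    raw = base64.urlsafe_b64decode(payload)
    fields = []
    pos = 0
    while pos < len(raw):
        key, pos = read_varint(raw, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:                 # assumption: varint fields only
            raise ValueError("unsupported wire type %d" % wire_type)
        value, pos = read_varint(raw, pos)
        fields.append((number, value))
    return fields


def format_fields(fields):
    """Render fields as 'v<number>: <value>' lines."""
    return "\n".join("v%d: %d" % (number, value) for number, value in fields)


def print_veds(lines):
    """Decode every non-empty line and print one record per ved."""
    for line in lines:
        if not line.strip():
            continue
        try:
            print(format_fields(decode_ved(line)))
        except ValueError as err:
            print("could not decode %r: %s" % (line.strip(), err))
        print()


# Hand-built sample, equivalent to the fields v1=58, v2=22, v6=3.
print_veds(["0CDoQFjAD"])
```

To use the sketch as a filter, replace the last line with print_veds(sys.stdin) and add import sys at the top.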

Running "cat veds.txt | ./decode_veds.py" might yield the following output:


…
v1: 58
v2: 22
v6: 3

v1: 67
v2: 22
v6: 4
v7: 10

v1: 6
v2: 22
v5: 2
…

Next

Now let's play a little Sherlock Holmes and find out what information those values expose and what the names of the variables could be.

Continue to part 4

1 comment:

  1. Hi Ben, have you done any further research into the VED? The only way I've been able to generate them so far is on image search, and they're a lot longer than the ones in these posts. Would be great to see what information they hold now!
