Rigel Group

They shoot Yaks, don't they?

Map/Reduce Job to Select Specific Keys

Riak will politely tell you about all the keys in a specific bucket. All you need to do is ask, like this:

curl http://localhost:8098/riak/my_bucket

But what if you have a million keys? You can tell Riak to stream the keys to you, but what if you only want certain keys, such as all the keys that start with foo? In that case, MapReduce is your friend. In Ruby, it looks like this:

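require 'riak'

# Assumes a Riak node reachable with the client's default settings (localhost).
client = Riak::Client.new

# Map phase: emit an object's key when it matches the regex passed in :arg.
results = Riak::MapReduce.new(client)
               .add("my_bucket")
               .map("function(value, keyData, arg) {
                       var re = new RegExp(arg);
                       return value.key.match(re) ? [value.key] : [];
                     }", :keep => true, :arg => "^foo").run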

You can pass any regular expression in the :arg parameter. Since keys are unique within a Riak bucket, you will never get duplicates and don’t need a reduce phase.
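If you are curious what such a job looks like on the wire, here is a rough sketch of the JSON description that Riak's HTTP MapReduce interface accepts (a POST to /mapred, as far as we know). It only builds and prints the document, so it runs without a Riak node. The helper names `map_source` and `build_key_filter_job` are made up for this illustration and are not part of the riak-client gem; the exact payload the gem generates may differ in detail.

```ruby
require 'json'

# The JavaScript map function from the post: emit the key when it matches
# the regular expression passed in as the phase argument.
# (Hypothetical helper name, used only in this sketch.)
def map_source
  <<~JS
    function(value, keyData, arg) {
      var re = new RegExp(arg);
      return value.key.match(re) ? [value.key] : [];
    }
  JS
end

# Build the job description as a plain Hash: one input bucket and a single
# map phase whose results are kept. (Hypothetical helper name.)
def build_key_filter_job(bucket, pattern)
  {
    "inputs" => bucket,
    "query"  => [
      { "map" => { "language" => "javascript",
                   "source"   => map_source,
                   "arg"      => pattern,
                   "keep"     => true } }
    ]
  }
end

job = build_key_filter_job("my_bucket", "^foo")
puts JSON.pretty_generate(job)
```

Because there is only one phase and it is marked keep, the response is simply the list of matching keys.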

Update: Note that this code is slow on a bucket with many keys, so it is best used in background jobs, not for interactive queries. For example, on a single-node cluster running on a small EC2 instance, with 10,000 JSON objects (3K each) in a bucket, the MapReduce code above takes 60 seconds.

To see how much of that time is spent marshaling the JSON objects, we removed the JSON body of each object, leaving only the key, and ran the code again. It took 30 seconds, which is still not in the right ballpark for interactive use. Of course, YMMV.
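If you want to take similar measurements on your own data, a minimal timing sketch might look like the following. It is not the harness used for the numbers above; the `timed` helper is a hypothetical name, and the block being timed is a local stand-in so the sketch runs without a Riak node. Swap in the MapReduce call to time the real thing.

```ruby
require 'benchmark'

# Hypothetical helper: runs the block once and returns [result, seconds].
def timed
  result = nil
  seconds = Benchmark.realtime { result = yield }
  [result, seconds]
end

# Stand-in for the real query: filter a small in-memory list of keys.
keys = %w[foo1 bar2 foo3]
matches, seconds = timed { keys.grep(/^foo/) }

puts format("%d matching keys in %.3f s", matches.size, seconds)
```

A single run is noisy, so it is worth repeating the measurement a few times before drawing conclusions.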