I have an existing application that uses a mongodb to store metadata on files to make them searchable. Since searching in those records is not very time sensitive for me, I thought about migrating the whole database to S3/Athena. The advantage here would be, that there would be no longer a database to maintain, since Athena runs serverless on AWS.

I started with a JSON-export of mongodb, to have a look at the document structure.

mongoexport --db db --collection jens --out kanri.json

Unfortunately, this only worked until I discovered the different format that mongodb is storing my filesize information.


So I decided so export my data through the following php script to unify the different number formats into one.


$m = new MongoClient();
$db = $m->selectDB("db");

$cursor = $db->jens->find();
$file = 'kanri_export.json';
foreach($cursor as $item){
	echo json_encode($item)."\n";

{%endhighlight %}
So here is my document structure to begin with.

  "_id": {
    "$id": "4f75dfe6c06b88c92c000000"
  "description": "njams datasheet",
  "files": [{
    "type": "application\/pdf",
    "name": "nJAMS datasheet.pdf",
    "size": 636649,
    "md5": "4ad2052803db0e6e7455fd5af0e9e6d6"
  "modified": {
    "sec": 1344177194,
    "usec": 676000
  "tags": ["njams", "fasi", "datasheet"],
  "ts": {
    "sec": 1333132200,
    "usec": 0

I uploaded the whole document into a S3 bucket and then started to look into Athena.

As I found out, Athena (and hereby presto), has some issues with this source format, since it cannot access any element containing a special character like ‘$’ or ‘_’. For those characters I needed to create a mapping for the SerDe to handle those fields.

Here is the table definition I came up with:

CREATE external TABLE structured(
  id struct<oid:string>,
  description string,
  files array<struct<type:string,
  modified struct<sec:bigint,
  tags array<string>,
  ts struct<sec:bigint,
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location 's3://kanri-meta/';

In addition to searching column-wise, mongo also supports searching per RegEx through the complete document. Since Athena only works in columns and rows I created a second table, which only contains the raw JSON string.

create external table raw(
  content string
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
LOCATION 's3://kanri-meta/';

Now whenever I needed the Regex capability, I could join in the raw-table and run the RegEx on this column. Since Athena separates the read-logic from the actual storage, I also did not need to duplicate my data pool for this.