Converting CSV files to ORC usually requires a Hadoop cluster. Since I only wanted to convert files for later upload into an existing cluster, I tried a different approach. Searching for a tool to do the job, I arrived at Apache NiFi.
Here is the flow I used to transform my data.
- step 1 - list all existing CSV files
- step 2 - read each file into memory
- step 3 - convert the content to Avro
- sadly, Avro needs a schema of your data to do the actual conversion, so here is the simple schema I used for my data:
{
  "name": "jobs",
  "type": "record",
  "fields": [
    {"name": "jobStart",   "type": "string"},
    {"name": "jobEnd",     "type": "string"},
    {"name": "logId",      "type": "string"},
    {"name": "corrid",     "type": "string"},
    {"name": "parentid",   "type": "string"},
    {"name": "jobId",      "type": "string"},
    {"name": "process",    "type": "string"},
    {"name": "machine",    "type": "string"},
    {"name": "duration",   "type": ["null", "long"]},
    {"name": "status",     "type": "string"},
    {"name": "domain",     "type": "string"},
    {"name": "deployment", "type": "string"},
    {"name": "engine",     "type": "string"}
  ]
}
- step 4 - convert Avro to ORC
- step 5 - UpdateAttribute: set the target filename
- step 6 - write the ORC file to the target location
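Step 3 is the one that tends to fail silently when the CSV header and the Avro schema drift apart, so it can help to sanity-check them before running the flow. Here is a minimal Python sketch of such a check; it is not part of the NiFi flow, the helper name is my own, and the schema is abbreviated to three of the fields above:

```python
import csv
import io
import json

# Abbreviated version of the Avro schema from the post.
SCHEMA_JSON = """
{
  "name": "jobs",
  "type": "record",
  "fields": [
    {"name": "jobStart", "type": "string"},
    {"name": "jobEnd",   "type": "string"},
    {"name": "duration", "type": ["null", "long"]}
  ]
}
"""

def missing_columns(schema_json, csv_text):
    """Return the schema field names that the CSV header does not provide."""
    schema = json.loads(schema_json)
    expected = [field["name"] for field in schema["fields"]]
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in expected if name not in header]

# A CSV whose header lacks the 'duration' column:
print(missing_columns(SCHEMA_JSON, "jobStart,jobEnd\n08:00,09:00\n"))
# → ['duration']
```

Columns that appear in the CSV but not in the schema are simply ignored by the conversion, so the check only needs to run in one direction.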
Here is the NiFi flow I used. You will have to adjust the file locations and the data schema to actually use it.
[CsvToOrc template][1]
[1]:{{ site.url }}/assets/CsvToOrc.xml