Converting ODK XML Data to CSV

Context: Household survey in rural Tanzania

Task: Combine data from XML files and display in an Excel File

Challenges:

  • Nested loop within survey which required questions to be repeated so survey and questions were not one-to-one
  • Number of fields for survey exceeded 255
  • Each survey was in a separate file
  • >350 submission files so need mass combination and conversion

Data Provided:

  • Survey in XML format for ODK Collect
  • Submissions in XML format using ODK Collect. Naming of submission files based on survey name, data and time collected.

The following methods were attempted in extracting data. Some successful, some not, with clear reasons as to why not.

Method 1: Use Excel

Excel’s XML to CSV or XLS function is very easy to use but issues were:

  • Each XML file is saved to a separate CSV file so does not aggregate data.
  • Does not take into account nested XML and will only add in columns for the primary level in XML

Method 2: Use MS Access

The idea was to first import into Access and then to CSV. Importing to Access worked exceptionally well. It automatically creates related tables for nested XML tags. However issues were:

  • The size of the survey itself was too big to store in an Access table. It would have taken far too long to change data types one by one. In addition if the table was broken up, the automatic feature for import would not work. There is probably a way to do all this in Visual Basic but again time constraints.

Method 3: Install ODK Aggregate and use inbuilt stable version of ODK Briefcase:

ODK Briefcase is the companion tool for ODK Aggregate to import and export data from ODK generated files. However the inbuilt version could only:

  • Import one file at a time
  • ODK Aggregate is a little complicated to install. You can either install it on Google App Engine or locally but the installation procedure is a tad confusing.

Method 4: KoBo Post Processor

KoBo Post Processor is a third party tool to aggregate data from XML files. However the issues with this were:

  • Could not read the nested XML data
  • Had it’s own algorithm for displaying columns and did not follow the survey format

Method 5: Stand Alone Beta Version of ODK Briefcase

Using ODK Briefcase v1.0 Beta 2.jar is the method which worked but since there wasn’t much documentation on this, it took awhile to figure out and find. There are only a couple of threads on the ODK Developers Google Group which mention this. There were obviously a ton of issues considering it’s in Beta mode but after a number of tweaks and re-positioning data it worked. Issues faced during this process and remedies:

  • ODK Briefcase requires it reads the data from a very specific directory structure. It needs to be the exact same as that created by ODK Aggregate for forms. Example of the directory structure in the ODK Briefcase folder:

    Forms

    Tanzanian Survey

    Instances

    TanzanianSurvey-11032011-1203

    submission.xml

  • It is important that the XML data file be named submission.xml which is only possible if each submission is in a different folder. Unfortunately our data was not named submission.xml so I used Bulk Rename Utility. There are other ways to bulk rename in Windows but the search and replace using F2 method adds a number in parenthesis which would not work.
  • Regarding the nested XML, since there were only 37 files where this was the case, we went through each file and flattened it to one level. However Windows doesn’t let you search within XML files so you’d probably need to download some software like XML Search

Ideally all of this could have been done by writing a simple script and writing to the same CSV file. However I was aiming to find something which could have been replicable by a non-tech individual. Looking forward to the next version of ODK Briefcase.

  • Share/Bookmark

Leave a Reply