How to load nested XML data from GCS into BigQuery via Dataflow
If you are one of the few who still use XML to transfer data between systems and want that data loaded into BigQuery for analysis, then this post is for you.
To get started, enable the following APIs:
- Dataflow API
- Dataflow Pipelines API
- BigQuery API
- Cloud Scheduler API
- Cloud Functions API
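The APIs listed above can be enabled from the console, or in one shot with `gcloud`. A minimal sketch, assuming you have already selected a project with `gcloud config set project`:

```shell
# Enable every API the pipeline needs in a single call.
gcloud services enable \
  dataflow.googleapis.com \
  datapipelines.googleapis.com \
  bigquery.googleapis.com \
  cloudscheduler.googleapis.com \
  cloudfunctions.googleapis.com
```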
First of all, you can’t load XML data directly into BigQuery; you need to convert it to JSON first. (A sample snippet showing how to convert XML to JSON is given below.)
import json

import xmltodict

# Parse the XML document into a dict of dicts/lists,
# then serialize that dict back out as JSON.
with open("data.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

json_data = json.dumps(data_dict)
with open("data.json", "w") as json_file:
    json_file.write(json_data)
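One detail worth knowing before the load step: BigQuery’s JSON loader expects newline-delimited JSON (one object per line), not a single large JSON document, so repeated XML elements are usually split into individual records. A minimal sketch, using a hypothetical `orders/order` structure of the kind `xmltodict.parse` would produce:

```python
import json

# Hypothetical parsed structure, as xmltodict would produce for
# <orders><order>...</order><order>...</order></orders>.
data_dict = {
    "orders": {
        "order": [
            {"id": "1", "customer": {"name": "Ada"}},
            {"id": "2", "customer": {"name": "Lin"}},
        ]
    }
}

records = data_dict["orders"]["order"]
# xmltodict returns a single dict (not a list) when the element
# appears only once, so normalize to a list before writing.
if isinstance(records, dict):
    records = [records]

# Write newline-delimited JSON: one record per line, as BigQuery expects.
with open("data.ndjson", "w") as ndjson_file:
    for rec in records:
        ndjson_file.write(json.dumps(rec) + "\n")
```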
Once the XML is converted to JSON, you need to identify a common JSON schema for all your JSON data (the reason being that some transformed files may contain additional fields, while others may be missing field values).
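One way to arrive at that common schema is to take the union of all field names seen across a sample of converted records and pad each record with nulls for anything it lacks. A minimal sketch with made-up sample records (BigQuery accepts JSON `null` for NULLABLE columns):

```python
# Sample converted records: fields can be missing or extra across files.
records = [
    {"id": "1", "name": "Ada", "city": "Paris"},
    {"id": "2", "name": "Lin"},                   # "city" missing
    {"id": "3", "name": "Tom", "email": "t@x"},   # extra "email"
]

# Build the union of every field name seen across the sample.
all_fields = set()
for rec in records:
    all_fields.update(rec)

# Normalize each record to the common schema, padding missing
# fields with None (serialized as JSON null).
normalized = [{f: rec.get(f) for f in sorted(all_fields)} for rec in records]
```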
So how do you identify the schema? (Using Auto-Detect Schema in BQ!)
Well, it is a trial-and-error process: you take a set of sample JSON files (converted from XML data) and do the upload…
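Each trial upload can be driven from the `bq` CLI with schema auto-detection turned on. A sketch, where the dataset, table, and bucket names are placeholders for your own:

```shell
# Load a sample newline-delimited JSON file from GCS into a BigQuery
# table, letting BigQuery infer the schema from the data.
bq load \
  --autodetect \
  --source_format=NEWLINE_DELIMITED_JSON \
  my_dataset.my_table \
  gs://my-bucket/sample/data.json
```

If the load fails or the inferred schema looks wrong, adjust the sample set and repeat until the detected schema covers all your files.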