How to load nested XML data from GCS into BigQuery via Dataflow

Rajathithan Rajasekar
5 min read · Aug 11, 2023
Photo by Matthew Feeney on Unsplash

If you are one of the few who still use XML to transfer data between systems and want that data loaded into BigQuery for analysis, then this post is for you.

To get started, enable the following APIs:

  • Dataflow API
  • Dataflow Pipelines API
  • BigQuery API
  • Cloud Scheduler API
  • Cloud Functions API

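The APIs above can also be enabled from the command line with gcloud. The service names below are my assumption of the services backing each API in the list (in particular, the Dataflow Pipelines API is served by datapipelines.googleapis.com); verify them with `gcloud services list --available` for your project.

```shell
# Enable the APIs used in this walkthrough (run once per project).
gcloud services enable \
    dataflow.googleapis.com \
    datapipelines.googleapis.com \
    bigquery.googleapis.com \
    cloudscheduler.googleapis.com \
    cloudfunctions.googleapis.com
```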
First of all, you can’t load XML data directly into BigQuery; you need to convert it to JSON first. (A sample snippet showing how to convert XML to JSON is given below.)

import json
import xmltodict

# Parse the XML document into a dict, then serialize it as JSON.
with open("data.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

json_data = json.dumps(data_dict)

with open("data.json", "w") as json_file:
    json_file.write(json_data)
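To see how xmltodict handles nesting, here is a small sketch with a made-up XML document (the order/customer/item element names are hypothetical, not from the original post). Attributes get an "@" prefix, nested elements become nested dicts, and repeated sibling elements become lists:

```python
import json
import xmltodict

# Hypothetical nested XML sample, purely for illustration.
xml = """
<order id="1001">
  <customer><name>Ana</name></customer>
  <items>
    <item><sku>A1</sku><qty>2</qty></item>
    <item><sku>B2</sku><qty>1</qty></item>
  </items>
</order>
"""

data = xmltodict.parse(xml)
# "@id" comes from the XML attribute; the two <item> elements
# are collected into a list under data["order"]["items"]["item"].
print(json.dumps(data, indent=2))
```

Note that element text is parsed as strings (e.g. qty is "2", not 2), which matters later when BigQuery infers column types.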

After the XML-to-JSON conversion, you need to identify a common JSON schema for all of your JSON data (the reason is that some fields may appear in only some of the transformed JSON documents, with other documents missing those values entirely).
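One way to work toward that common schema is to take the union of field names across a batch of converted records and pad each record with nulls for the fields it lacks, so a single schema covers the whole batch. This is a minimal sketch with hypothetical records and field names, not the post's actual data:

```python
# Hypothetical converted JSON records; "email" and "name" are
# present in some records and missing in others.
records = [
    {"id": "1", "name": "Ana"},
    {"id": "2", "name": "Ben", "email": "ben@example.com"},
    {"id": "3"},
]

# Union of all field names seen across the batch.
all_fields = set()
for rec in records:
    all_fields.update(rec.keys())

# Normalize every record to the full field set, filling gaps
# with None (loaded as NULL by BigQuery).
normalized = [{f: rec.get(f) for f in sorted(all_fields)} for rec in records]
```

Top-level fields only here; for deeply nested documents you would apply the same idea recursively.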

So how do you identify the schema? (Using Auto-Detect Schema in BQ!)

Well, it is a trial-and-error process: you take a set of sample JSON files (converted from XML data) and do the upload…
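A sample upload with schema auto-detection can be done with the bq CLI. Two caveats: BigQuery ingests newline-delimited JSON (one object per line), not a single pretty-printed JSON document, so the converted files need to be written in that shape; and the dataset, table, and bucket names below are placeholders, not from the original post.

```shell
# Load a converted JSON file from GCS into BigQuery,
# letting BigQuery infer the (nested) schema.
bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://my-bucket/data.json
```

After the trial loads, you can export the detected schema with `bq show --schema mydataset.mytable` and use it as the fixed schema for the full pipeline.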

