How to load nested XML data from GCS into BigQuery via Dataflow

Rajathithan Rajasekar
5 min readAug 11, 2023
Photo by matthew Feeney on Unsplash

If you are one among the few who still uses XML data to transfer / communicate between systems and wants that data to be loaded into BigQuery for data analysis , then this post is for you.

To get started enable the below APIs

  • Dataflow API
  • Dataflow Pipelines API
  • BigQuery API
  • Cloud Scheduler API
  • Cloud Functions API

First of all , you can’t load xml data directly into BigQuery , you need to convert it into JSON data. (A sample snippet on how to convert xml to json is given below)

import json
import xmltodict


with open("data.xml") as xml_file:
data_dict = xmltodict.parse(xml_file.read())
json_data = json.dumps(data_dict)
with open("data.json", "w") as json_file:
json_file.write(json_data)

Post xml to json conversion , you need to identify a common JSON schema for all your JSON data (reason is , there might be additional fields or some field values missing among your transformed json data).

So how do you identify the schema? ( using Auto-Detect Schema in BQ !!)

well !, it is a Trial and Error process, you take a set a sample json files (converted from xml data) and do the upload from a GCS source to the BQ Table. An example is given below

Now use the auto detected schema as reference and prepare the schema json file manually. ( Modify the record type and mode , based on your requirements)

A sample json schema file is given in the below link

https://gist.github.com/rajathithan/84b62c10a3a1cf1bc2b6a75030bfd403

Test it with bq load command,

bq load \
--replace \
--source_format=NEWLINE_DELIMITED_JSON \
xmldata.table0009\
gs://xmltestpoc/json/test2json.json \
./JSONSCHEMA.json

--

--

Rajathithan Rajasekar

I like to write code in Python . Interested in cloud , dataAnalysis, computerVision, ML and deepLearning. https://rajathithanrajasekar.medium.com/membership