Our client is an advertising technology company focused on providing solutions that help marketers plan, execute and measure their digital media campaigns. The client maintains a huge data bank of 1.2 billion registered user profiles, which serves as the primary foundation for its people-based advertising.


The client’s aim was to develop an Audience Builder tool, the crown jewel of its deterministic identity management platform. The tool builds an actionable customer list based on factors such as age, gender, ethnicity, purchase history, online interests and TV viewing behaviour. The crux of the solution is a propensity score model covering the 1.2 billion-person data set, built by leveraging the huge volume of offline customer purchase data from direct-match data partners such as Nielsen Catalina Solutions and Neustar.
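The source does not describe the model internals, but a propensity score of this kind is commonly produced by a logistic model over behavioural features. The sketch below is purely illustrative: the feature names and weights are hypothetical, standing in for whatever the real model learned from the matched offline purchase data.

```python
import math

# Hypothetical weights; the real features and coefficients are not
# described in the source and would be learned from partner purchase data.
WEIGHTS = {"bias": -1.0, "recent_purchases": 0.8, "category_views": 0.5}

def propensity_score(profile):
    """Logistic model: linear combination of features mapped into (0, 1)."""
    z = WEIGHTS["bias"]
    z += WEIGHTS["recent_purchases"] * profile["recent_purchases"]
    z += WEIGHTS["category_views"] * profile["category_views"]
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid link

score = propensity_score({"recent_purchases": 3, "category_views": 2})
```

A score close to 1 indicates a member highly likely to respond; members above a chosen threshold are kept for the audience list.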


The problem was one of handling “BIG” data volumes. In a single run, the application must be capable of handling 134 raw segment files, each containing data on 1 million customers across roughly 900 attribute columns – amounting to about 120 billion data points in total.
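The volume estimate above follows directly from the stated figures:

```python
# 134 files x 1 million customers per file x 900 attribute columns
files = 134
rows_per_file = 1_000_000
columns = 900

total_data_points = files * rows_per_file * columns  # ~120 billion
```

This yields 120.6 billion data points, matching the rounded figure quoted in the text.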


Congruent designed a solution using Google BigQuery and Python as the primary technology stack.

The client’s people data set was downloaded from Google BigQuery into hash tables. The segment files from partners were downloaded from the FTP folder and unzipped. Data from the two sources was combined to compute a propensity score for each household member. Members with propensity scores above a set threshold were identified and written to the audience builder file in CSV format. The CSV files were then uploaded to Google BigQuery using a chunking approach.
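The thresholding and chunked-upload steps above can be sketched as follows. This is a minimal illustration, not the production pipeline: the threshold, chunk size and sample records are invented, and the actual BigQuery load call is indicated only as a comment.

```python
import csv
import io

THRESHOLD = 0.7   # hypothetical propensity cut-off
CHUNK_SIZE = 2    # tiny for illustration; real runs use far larger chunks

def filter_audience(rows, threshold=THRESHOLD):
    """Keep only members whose propensity score clears the threshold."""
    return [r for r in rows if r["score"] >= threshold]

def chunked(rows, size=CHUNK_SIZE):
    """Yield successive chunks so each upload stays within load limits."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Sample scored household members (invented data).
members = [
    {"household_id": "H1", "score": 0.91},
    {"household_id": "H2", "score": 0.40},
    {"household_id": "H3", "score": 0.75},
    {"household_id": "H4", "score": 0.82},
]

audience = filter_audience(members)
chunks = list(chunked(audience))

for chunk in chunks:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["household_id", "score"])
    writer.writeheader()
    writer.writerows(chunk)
    # Each CSV chunk would then be loaded into BigQuery, e.g. via
    # client.load_table_from_file(buf, table_ref)  -- omitted here.
```

Uploading in chunks keeps each load job small and restartable, which matters when the audience file runs to millions of rows.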

Some highlights of the solution design are: