In just a few years, Outfit7’s portfolio has grown to include 19 games with over 8 billion downloads. But big numbers bring big challenges. One of them is how to deal with such huge amounts of data in a timely and cost-effective manner. Our games hit the backend infrastructure with around 8 billion rows of data per day that take up 2.5 TB of space – and those numbers grow daily. This data is in a raw form and, to be useful, it needs to be processed, aggregated, extracted, and transformed. Above all, it has to be treated securely.
If we were to use conventional relational database solutions, building a system able to ingest and operate on such an amount of data would be a daunting task, not to mention the explosion of hardware storage space and related costs. It would require an army of sysops, devops, and database engineers.
So, to find a way around this, we decided to look at cloud platforms. Google’s solution, the Google Cloud Platform (“GCP”), is steadily becoming one of the best. Outfit7 Group heavily leverages one of GCP’s flagship services, BigQuery: a serverless, highly scalable, low-cost enterprise data warehouse. The service has no problems ingesting the data Outfit7 deals with. In fact, the querying power is just astonishing.
BigQuery Across Different Departments
We use BigQuery in numerous ways across the company. Our backend department has a dedicated data team that, in compliance with all applicable laws including GDPR, takes care of data ingestion, transformation, and delivery to the appropriate stakeholders. To give you an example, we generate around 50 aggregates that form the basis for further analysis in the Analytics and Controlling departments. The main aggregates we prepare for the Analytics team are the retention and user segment aggregates, while the Controlling team is more focused on the daily and monthly revenue aggregates, user lifetime values, and daily active user aggregates.
We also collaborate closely with the Ad Ops and Ad Sales teams. In this part of the company, daily ad sales, ad mediation, and paid user acquisition reports and aggregates need to be calculated. The data is then consumed and evaluated with each department's tool of choice. The Analytics department relies heavily on IPython and R, Controlling is mainly focused on Tableau reports and the visual representation of data, and the Ad Ops and Ad Sales departments rely on custom-made dashboards, Google Data Studio reports, and ad-hoc analysis with Excel/Sheets.
To illustrate BigQuery’s querying power, consider the following query. Anyone can run it on the public BigQuery datasets that are available out of the box with a free GCP account.
Fig. 1: Query example on BQ.
The underlying “trips” table contains 130 GB and more than 1.1 billion rows of data representing yellow taxi fares in New York City from 2009 to 2015. The query produces a yearly breakdown of vendors operating in the NYC area: their revenue, average fare cost, distance traveled, and total number of fares. It took 4.5 seconds to produce a 16-row high-level report, which could then be downloaded as a CSV or JSON file, saved to a new table, exported to a Google Sheets document, and so on.
Fig. 2: High level aggregation report from a table containing more than 1.1 billion rows of data, which took 4.5 seconds to produce.
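Since the query itself was shown only as a figure, here is a rough sketch of what a query of this kind could look like when submitted from Python. It is not the exact query behind Fig. 1. The table name `nyc-tlc.yellow.trips`, the column names (`vendor_id`, `pickup_datetime`, `total_amount`, `fare_amount`, `trip_distance`), and the helper names `build_report_sql` and `run_report` are all assumptions made for illustration. Running it also assumes the `google-cloud-bigquery` client library and configured GCP credentials.

```python
# Assumed public table; the column names used below are assumptions as well.
ASSUMED_TABLE = "nyc-tlc.yellow.trips"


def build_report_sql(table: str = ASSUMED_TABLE) -> str:
    """Return standard SQL for a yearly, per-vendor summary of taxi trips."""
    return f"""
        SELECT
          EXTRACT(YEAR FROM pickup_datetime) AS trip_year,
          vendor_id,
          ROUND(SUM(total_amount), 2) AS revenue,
          ROUND(AVG(fare_amount), 2) AS avg_fare,
          ROUND(SUM(trip_distance), 2) AS distance,
          COUNT(*) AS trips
        FROM `{table}`
        GROUP BY trip_year, vendor_id
        ORDER BY trip_year, vendor_id
    """


def run_report(table: str = ASSUMED_TABLE) -> list:
    """Run the query and return the rows as dictionaries.

    Requires google-cloud-bigquery and GCP credentials, so the import is
    done lazily and the rest of the sketch loads without the library.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.query(build_report_sql(table)).result()
    return [dict(row) for row in rows]


if __name__ == "__main__":
    # Printing the SQL needs no credentials; call run_report() to execute it.
    print(build_report_sql())
```

Keeping the SQL in a small builder function makes it easy to point the same report at a different table, and the printed statement can also be pasted straight into the BigQuery web console.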
Imagine that: it took BigQuery less time to scan 130 GB of data and spit out a condensed report than it would take you to say “Where’s the star schema and other data warehousing buzzword shenanigans?” If that isn’t impressive for a data guy, I don’t know what is. All the upsides aside, however, there’s no such thing as a free lunch. Had this not been a test, the above query would have cost $0.15, excluding storage, which would come to $2.60 per month, decreasing to $1.30 per month after three months.
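These figures can be checked with some back-of-the-envelope arithmetic. The sketch below assumes on-demand list prices of $5 per TB scanned, $0.02 per GB per month for active storage, and $0.01 per GB per month for long-term storage. These rates are assumptions rather than figures from this post, and they may have changed since. Under them, a $0.15 query corresponds to roughly 30 GB scanned rather than the full 130 GB, which fits the fact that BigQuery bills only for the columns a query actually reads.

```python
# Assumed list prices in USD; check current BigQuery pricing before relying on them.
QUERY_PRICE_PER_TB = 5.00
ACTIVE_STORAGE_PER_GB_MONTH = 0.02
LONG_TERM_STORAGE_PER_GB_MONTH = 0.01

TABLE_SIZE_GB = 130    # size of the "trips" table, from the text above
QUERY_COST_USD = 0.15  # cost of the example query, from the text above


def storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Monthly storage cost in USD, rounded to cents."""
    return round(size_gb * price_per_gb_month, 2)


def scanned_gb(query_cost: float, price_per_tb: float = QUERY_PRICE_PER_TB) -> float:
    """GB scanned implied by a query cost, taking 1 TB as 1000 GB for simplicity."""
    return round(query_cost / price_per_tb * 1000, 1)


if __name__ == "__main__":
    active = storage_cost(TABLE_SIZE_GB, ACTIVE_STORAGE_PER_GB_MONTH)
    long_term = storage_cost(TABLE_SIZE_GB, LONG_TERM_STORAGE_PER_GB_MONTH)
    print(f"Active storage:    ${active:.2f} per month")
    print(f"Long-term storage: ${long_term:.2f} per month")
    print(f"Implied scan size: {scanned_gb(QUERY_COST_USD)} GB")
```

With the assumed rates, the two storage figures come out at $2.60 and $1.30 per month, matching the numbers quoted above.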
But the days of a data backend engineer aren’t just filled with writing and running BigQuery jobs in one of the various language flavours we use, such as Bash, Python, and Java. They’re also filled with other engineering tasks that are necessary to hold the whole infrastructure together. Nevertheless, if, at the end of the day, we had to point to the one tool in the GCP arsenal that currently saves the Data and Backend teams the most time, we’d probably say it’s BigQuery. In a competitive industry such as gaming, you need a reliable support system that continues to pave the way forward and, for us, that’s BigQuery.