Engineering Insights from Unlikely AI: Accelerating Database Access at Scale with DynamoDB and EMR.

On our quest to pioneer safe, trustworthy and powerful intelligence, we want to showcase some parts of our journey through the medium of blog. We’re in stealth mode, so we can’t give away all the details, but we still want to share some of the interesting engineering work we do at Unlikely. By examining the real challenges we face, we will provide insight into the company’s expertise and innovations. We hope some of our lessons learned can inspire others who are tackling similar problems.

We recently embarked on a project to accelerate the way we access a large corpus of data that we hold in MongoDB. We have found MongoDB to be an adaptable general purpose database. However, in a recent internal Hackathon (which we hold regularly to encourage innovative ideas), we identified a particular set of frequent queries that have an outsized impact on overall latency. Migrating this data to an in-memory store (Hazelcast, in this case) greatly improved the system’s performance with a small sample of data. We then needed to see if we could apply this to our data at scale.

The hackathon project changed two things about the system: the format in which data was stored, and the system used to store it. Based on initial estimates from the Hackathon, it seemed plausible that we could use a clustered in-memory store, like Hazelcast or Redis, but we wanted to consider cheaper (and more scalable) options as well. This kicked off a performance investigation into how we could apply these ideas to a few data storage solutions, with the goal of selecting one technology for this new “index.” The best solution would balance its performance, infrastructure cost, and setup and maintenance costs.

As with most work at Unlikely, this project was collaboration heavy. Two engineers designed a few implementations, and collaborated with two more engineers to implement each approach. A large chunk of the work involved setting up a common benchmark that represented realistic queries, and a common framework for importing large amounts of data into the new systems, with support for rapid iteration. Setting up this big data pipeline is where a lot of the interesting lessons came from, as we were able to use AWS Elastic Map Reduce (EMR) to run jobs in an hour that would have taken weeks on a single machine.

Initially, we tried Apache Beam for the EMR pipeline. This is a high-level API for big data processing, which can run on many different platforms (including Apache Spark, which is supported by EMR). It’s well documented and has lots of built-in source and sink connectors that make it easy to read and write to different systems. However, we found it didn’t scale well to our full corpus. It struggled to identify a reasonable number of shards for data stored in S3, artificially limiting our parallelism. We forced it to increase parallelism by manually partitioning data, but this blew up the Spark monitoring UI. When we ran this on our whole dataset and tried to monitor it, it caused the EMR-managed monitoring site to crash with an OOM error. This severely limited our ability to monitor and debug the pipeline.

On top of this, we’ve recently been experimenting with integrating Kotlin into our mainly Java codebase (expect a future post about this), and unfortunately this resulted in some compatibility issues in Beam. Specifically, when we needed a DoFnto receive a PCollection containing a List or Iterable, Beam would throw an exception about mismatched types. We hypothesise that this is caused by an error during reflection when comparing Kotlin’s custom List and Iterable types with the java.util versions. On the whole, Beam provided a nice API, but it was abstracting away things we wanted control over (e.g. the degree of parallelism), and causing us more headaches than it solved.

Thankfully it was a fairly trivial exercise to migrate this to Apache Spark, which eliminated all of the above issues. The Spark job had far fewer steps than the one that Beam generated, and the monitoring UI was once again fast and useful. By documenting our challenges as we built this pipeline, these lessons can inform future pipeline development across the team.

What about our original project of building an accelerated index? We ultimately tested a few implementations on Redis, Hazelcast, DynamoDB with DAX (its clustered in-memory accelerator), and also a custom in-house system. Hazelcast demonstrated slower responses than expected when operating as a cluster, and Redis was a poor fit because we needed to index more data than originally estimated, so the costs would have been unreasonable. The custom system was the fastest by far, but also the least mature. While we love to push the envelope at Unlikely, we also have to be pragmatic. We therefore settled on using DynamoDB + DAX with a new data structure as our first implementation. The DynamoDB implementation demonstrated considerable advantages over MongoDB for certain query types, and the DAX cluster allowed us to achieve Redis-level response times for the most frequently accessed data, with minimal maintenance, and full control of the size and cost of that cache. We’re still keeping our custom system in mind though for our next performance leap.

Hopefully, this gives you an idea of the exciting and diverse tools we use here at Unlikely AI. There is a lot more to talk about, so expect to see regular updates on this blog. Keep an eye out on our careers site for all of our current opportunities, and please don’t hesitate to get in touch with our recruitment team to find out more.