Buckle up! Data bootstrap at full speed
Bootstrapping large volumes of data — “Why”, “How”, “Tips & Tricks”
Building new enterprise technical solutions to cater to ever-growing business needs is an everyday thing now. Newer solutions do not necessarily mean tech modernisation alone; the new solutions themselves can be groundbreaking from a customer experience and business strategy perspective. Whatever the kind of solution, there will be multiple systems and services involved, designed into a well-crafted ecosystem. And the crux of this beautifully crafted ecosystem is the data.
The data for any new system can belong to one of the following categories:
- Data Inception: Data gets created and grows over time, if the system is completely new. This is the simplest case, what we can call organic growth.
- Data Lift-and-Shift: Data gets migrated from the old system to the new system, with transformation applied where necessary. Think ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform).
- Data Bootstrap: Data already exists in different systems, but now it needs to be brought into (i.e., bootstrapped into) the new ecosystem, where each system will have to act on it, react to it, or own it, either in whole or in part.
In this article, I’d like to talk about the different aspects of data bootstrapping involving large-scale systems and high volumes of data. Before that, here’s a brief description of the different “states of data” (Source: Wikipedia) that I will mainly be considering in this article.
The data could be “at rest” — database / storage service / file system.
The data could be “in transit” — API requests/responses / async events / streaming services.
Planning
Planning the bootstrapping process is the first and foremost activity. To do this, you will first have to understand the data: how it moves across the ecosystem, how it transforms in the process, and whether it generates new sets of data along the way. Things to account for here:
1. Source of data
- Identify if the data needs to be extracted from a single source or multiple sources.
- Identify if the data from multiple sources needs to be linked / merged / aggregated.
- Identify the format of the data from each source.
Once this is identified, data extraction and data ingestion scripts will have to be created purely for bootstrapping purposes.
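For illustration, here is a minimal sketch of what such a bootstrap-only extraction and ingestion script could look like, assuming a relational source and a hypothetical REST ingestion endpoint in the new ecosystem (the table, field and URL names are all illustrative):

```python
import sqlite3
import requests

# Hypothetical source database and target ingestion endpoint (illustrative names).
SOURCE_DB = "legacy_users.db"
INGEST_URL = "https://new-ecosystem.example.com/bootstrap/users"

def extract_users(conn):
    """Pull the fields the new ecosystem needs, in the source's own format."""
    cursor = conn.execute("SELECT id, name, email, region, status FROM users")
    for row in cursor:
        yield {"id": row[0], "name": row[1], "email": row[2],
               "region": row[3], "status": row[4]}

def ingest(record):
    """Push one record into the new ecosystem through its bootstrap API."""
    response = requests.post(INGEST_URL, json=record, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    with sqlite3.connect(SOURCE_DB) as conn:
        for record in extract_users(conn):
            ingest(record)
```

The real source is rarely this simple, but even a sketch like this makes the extraction format and the ingestion contract explicit early on.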
2. Data extraction constraints
- Understanding the data extraction constraints early on is very necessary, because the extraction method itself may be complicated and you may need to invest significant time and effort in it.
- I’ve seen systems that do not have an easy way to extract the needed data. You may want to explore options like Extracting Data Dumps / Traditional ETL / ELT on Hadoop / Custom Scripts / Apache Spark, to name a few.
- I’ve also seen systems that do not support read-heavy workloads: beyond a specific read volume, the system might crash, which is undesirable. But a lot of reads are needed to extract large volumes of data. In such cases, you will have to plan for incremental reads of smaller volumes at scheduled intervals, preferably during off-peak hours.
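One way to do this is to build the incremental reads into the extraction script itself: read fixed-size pages keyed on a monotonically increasing id and pause between batches. A minimal sketch follows; the batch size, pause and table name are placeholders you would tune to what the source can actually tolerate:

```python
import time
import sqlite3

BATCH_SIZE = 5000        # sized to what the source can comfortably serve
PAUSE_SECONDS = 30       # breathing room between batches; increase during peak hours

def extract_in_batches(conn):
    """Read the source incrementally, keyed on an ever-increasing id column."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        ).fetchall()
        if not rows:
            break                      # nothing left to extract
        yield rows
        last_id = rows[-1][0]          # resume point for the next batch
        time.sleep(PAUSE_SECONDS)      # throttle so the source is not overwhelmed

if __name__ == "__main__":
    with sqlite3.connect("legacy_source.db") as conn:
        for batch in extract_in_batches(conn):
            print(f"extracted {len(batch)} records up to id {batch[-1][0]}")
```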
3. Data transformation needs
- Identify if there are any steps needed to cleanse and harmonize the data, either during data extraction or after it.
- Traditional ETL / ELT on Hadoop / Apache Spark / Apache Storm are some of the technologies that are commonly used for this purpose.
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. (Source: Wikipedia)
Data harmonization is the process of bringing together your data of varying file formats, naming conventions, and columns, and transforming it into one cohesive data set. (Source: Datorama.com)
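As a rough illustration, here is a small PySpark sketch that harmonizes two hypothetical sources with different formats and column names into one cohesive schema, then cleanses the result by normalizing values, dropping incomplete records and de-duplicating (the paths, column names and rules are assumptions for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bootstrap-cleanse").getOrCreate()

# Two hypothetical sources with different formats and naming conventions.
crm = spark.read.json("s3://legacy-crm/users/")                           # has "emailAddress"
billing = spark.read.csv("s3://billing-exports/users.csv", header=True)   # has "email"

# Harmonize: project both sources onto one common schema.
crm = crm.select(F.col("id").cast("string").alias("id"),
                 F.col("emailAddress").alias("email"),
                 F.col("region"))
billing = billing.select(F.col("id").cast("string").alias("id"),
                         F.col("email"),
                         F.col("region"))

# Cleanse: normalize values, drop records missing mandatory fields, de-duplicate.
users = (
    crm.unionByName(billing)
       .withColumn("email", F.lower(F.trim(F.col("email"))))
       .na.drop(subset=["email"])
       .dropDuplicates(["email"])
)

users.write.parquet("s3://bootstrap-staging/users/")
```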
4. Volume of data
- Understanding the volume of data is something you would have done in detail during the architecture and design phase, where you would have worked out the total data storage needs, data inflow/outflow rates and organic growth. What’s important to understand during the bootstrap phase is the volume of data that has to be bootstrapped and the rate at which it can be throttled. This will help you plan the total time (in days/weeks) needed for the entire bootstrapping to complete. While this might sound easy, I’ve seen that breaking down the total data set into subsets of a specific size and being able to throttle at desired speeds is definitely not an easy task in large-scale systems.
- Manageable subsets: Check if the total data set can be broken down into logical and manageable subsets. Let’s say, for example, we want to bootstrap the data about all the users of a fairly large social networking website. This data could run into millions or even billions of records. You could break it down and first bootstrap users data by geographic location. To break it down further, you could bootstrap “active” users data by geographic location, since active users will be needed first and the rest can come in later. This is not as easy as it looks. I’ve seen datasets where, even after breaking them down into logical subsets, the volume of data is huge. Think about a geographic location with a fairly large share of active users.
- Side-effects: One common mistake teams make is to estimate only the initial volume of data to be bootstrapped. People forget to estimate (or estimate inaccurately) the data orchestration effects (or should we say side-effects?). Consider, for example, the creation of a user. This could potentially create two additional side-effects: assigning the user to a group and activating the user. If you estimated only for the user creation event, then you are potentially accounting for just one-third of the entire event volume that is orchestrated through the ecosystem. This could be catastrophic and cause loss of data during orchestration, cause systems to fail, or cause data inconsistencies.
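A quick back-of-envelope calculation makes the side-effect multiplier explicit; the numbers and event names below are purely illustrative:

```python
# Estimate the true event volume, including orchestration side-effects.
# The multipliers are illustrative; derive yours from the actual orchestration flows.
USERS_TO_BOOTSTRAP = 10_000_000

EVENTS_PER_USER = {
    "user_created": 1,            # the event you planned for
    "user_assigned_to_group": 1,  # side-effect
    "user_activated": 1,          # side-effect
}

total_events = USERS_TO_BOOTSTRAP * sum(EVENTS_PER_USER.values())
print(f"Planned for {USERS_TO_BOOTSTRAP:,} events; "
      f"the ecosystem will actually see ~{total_events:,}")
```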
5. Capacity Estimation
- Once you’ve understood the volume of the data to be bootstrapped and the side-effects it causes through its journey, and you are able to break it down into manageable subsets, you have the basic information for infrastructure capacity estimation.
- Understand if the ecosystem of the new solution is able to withstand the volume of the data subset that you are bootstrapping. When I say withstand, it could be in terms of read/write ops, data retention size and duration on message queues / brokers, CPU and memory metrics, etc., as the bootstrapping activity could place heavier loads on the system than regular usage of the new ecosystem.
- To understand that, you would need to do load testing of your system and gather the metrics. How to do load testing is a vast topic in itself and I will not dwell on it here.
- Once you have these metrics, you will have a baseline number. Now see if the system can withstand bootstrapping of the data subset. If not, you will have to scale out your ecosystem on a temporary basis. Emphasis on the word “temporary”, as we don’t want to over-provision forever and break the bank :)
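Here is a simple sketch of that feasibility check and the resulting duration estimate, using made-up numbers; in practice the baseline capacity would come from your load-testing metrics:

```python
# All numbers below are illustrative.
total_events = 30_000_000            # e.g. from the side-effect estimate above
planned_rate_per_sec = 500           # throttled ingestion rate for the bootstrap
baseline_capacity_per_sec = 800      # sustainable writes/sec observed in load tests
bootstrap_window_hours_per_day = 6   # off-peak window available each day

if planned_rate_per_sec > baseline_capacity_per_sec:
    print("Scale out (temporarily!) or lower the throttle rate")

events_per_day = planned_rate_per_sec * bootstrap_window_hours_per_day * 3600
print(f"Estimated bootstrap duration: {total_events / events_per_day:.1f} days")
```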
Execution & Validation
Now that you’ve put in all the effort to plan things out so well and have the entire bootstrap schedule chalked out to a tee, it’s time to put that plan into action: Execution! In fact, Execution and Validation go hand in hand. After every data subset bootstrap, it is vital to validate that everything you’ve executed is indeed reflected accurately. There could be teething issues with this process itself if teams are doing this activity for the first time. Sometimes the process can be a learning experience even for highly experienced teams dealing with complex systems. Hence, this can be an iterative process.
Small and diverse data subset: As a first step, pick a handful of data records for the first iteration of the data bootstrap. This data set should be diverse in nature, meaning it should cover all use-cases and edge-cases. This definitely needs some understanding on both the Business and Engineering sides. So it’s time again for Business, Product and Engineering teams to come together to decide on the dataset and define the expected results for each data set. This serves as the foundation of all the validation strategies and scripts to be written for validation. Eventually this becomes your E2E Validation guideline for the entire bootstrapping process.
Once you have this ready, it’s time to ingest this dataset through the data bootstrap pipeline. As discussed earlier in the Planning section, we would already have the necessary scripts ready for initiating the data bootstrap.
Once this is done, it’s time for Validation!
- Validating Counts: This works well when we have to ensure that every system that was supposed to react to and own the data has the expected counts. I would say this is the easiest of all, as long as you know the target counts for each system in your ecosystem for the input data that was ingested. For example: ingest 100 records → System A should have 100 records → System B should have 80 records → and so on.
- Validating the Values: It is not necessary that every system that reacted to an event / message / API call should result in an increase of counts. It could have mutated the existing data in different ways. For example: creation of a user increases the count, but activation of a user mutates existing data. Depending on the diverse nature of the data, there could be many use-cases to account for and many systems acting on the same data, changing its value multiple times. This calls for a very thorough understanding of the journey of the data through the pipeline (a minimal validation sketch follows below).
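Here is a minimal sketch of what both kinds of checks could look like against a couple of hypothetical system endpoints; in practice the expected counts and field values would come from the E2E validation guideline defined earlier:

```python
import requests

# Hypothetical endpoints and expected counts for the ingested subset (illustrative).
EXPECTED_COUNTS = {
    "https://system-a.example.com/api/users/count": 100,
    "https://system-b.example.com/api/accounts/count": 80,
}

def validate_counts():
    """Check that every downstream system holds the expected number of records."""
    for url, expected in EXPECTED_COUNTS.items():
        actual = requests.get(url, timeout=10).json()["count"]
        status = "OK" if actual == expected else "MISMATCH"
        print(f"{status}: {url} expected={expected} actual={actual}")

def validate_values(user_id):
    """Spot-check that a mutation (here, activation) actually landed, not just a count."""
    user = requests.get(
        f"https://system-a.example.com/api/users/{user_id}", timeout=10
    ).json()
    assert user["status"] == "ACTIVE", f"user {user_id} was not activated"

if __name__ == "__main__":
    validate_counts()
    validate_values(user_id=42)
```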
By now, I’m sure you will agree that validation is very crucial. But how do we do it, especially when the volume of data is large and there are so many systems involved in the ecosystem? Even if we know what to validate, what tools can we use?
Validation tools: Here are some of the many utilities / tools / methods that you can use, depending on your requirement:
- Scheduled / On-demand Custom Scripts and Jobs
- Reporting Dashboards — Ex: SQL dashboards like Tableau, Grafana, Apache Superset / Custom dashboard applications
- Aggregated reports from logs — Ex: ELK (ElasticSearch, Logstash, Kibana) stack, Splunk reporting
- Observability via Aggregation metrics — Ex: Aggregation on event streams using Apache Storm, Apache Spark, Akka Streams, Kafka Streams etc.
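As an illustration of the last option, here is a small sketch that aggregates bootstrap event counts by type straight off a stream, assuming a Kafka topic and the kafka-python client (the topic name, broker address and event schema are placeholders):

```python
import json
from collections import Counter
from kafka import KafkaConsumer  # kafka-python client

# Placeholders: point these at your actual bootstrap event topic and brokers.
consumer = KafkaConsumer(
    "bootstrap-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

counts = Counter()
for message in consumer:
    counts[message.value["event_type"]] += 1
    if sum(counts.values()) % 10_000 == 0:
        print(dict(counts))   # periodic snapshot of events seen per type
```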
Learn & Rectify
Once you do the validation, you may find issues. The issues could be in the data ingestion scripts, performance issues in the system, or bugs in the validation scripts themselves (Yes! Validate the validation scripts!). It’s time to learn from the mistakes and take corrective actions. The learnings could be about many things:
- Data Volume estimation: A miscalculated data volume that doesn’t take into account the full side-effects of the data orchestration can cause systems to fail. We’ve talked about this in the Planning section (see Side-effects). So you may want to validate the calculations and rework your plan.
- Use-cases and Edge-cases: Even after multiple teams getting together and deciding on all the use-cases and edge-cases, the data could still surprise you. There could be some corner case!
- Performance Optimization: During the validation, pause and look at the performance, and look for opportunities to bring in optimizations. Since data bootstrap is a one-time activity, usually done just before launching the new solution or just after launching it, there may be many instances where processing of data can be optimized. Think: de-dup logic, avoiding redundant processing, stale-data checks (useful during retries), last-write-wins (where you can), etc. (a sketch of a couple of these follows after this list).
- Infrastructure Capacity resizing: Once you have your learnings from the above categories, see if capacity resizing is needed and work on it.
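To make a couple of those optimizations concrete, here is a small sketch of de-duplication and a stale-data check via last-write-wins on the consuming side; the event field names are assumptions, and the in-memory stores would be persistent (e.g. a key-value store) in a real pipeline:

```python
# In-memory stand-ins for what would be persistent stores in a real pipeline.
processed_event_ids = set()    # event ids already handled (for de-dup on retries)
latest_updated_at = {}         # record id -> latest timestamp already applied

def should_process(event):
    """Decide whether an incoming event still needs to be applied."""
    if event["event_id"] in processed_event_ids:
        return False                                   # de-dup: already handled
    previous = latest_updated_at.get(event["record_id"], "")
    if event["updated_at"] <= previous:                # ISO-8601 strings compare safely
        return False                                   # stale: a newer write already won
    return True

def apply(event):
    """Record the event as processed; the actual write to the target goes here."""
    processed_event_ids.add(event["event_id"])
    latest_updated_at[event["record_id"]] = event["updated_at"]
    # ... write event["payload"] to the target system ...
```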
Once you’ve completed this cycle of Execute → Validate → Learn → Rectify, you will now do the same with larger data subsets that you had planned for. Repeat this until you have completed bootstrapping of the entire dataset. Bootstrapping of these larger data subsets is where you will be going full throttle and full speed. It’s now time to buckle up!
Once this is done, give yourself and your teams, a big pat on the back, release any additionally provisioned hardware capacity for bootstrapping purposes (Remember! Don’t break the bank) and celebrate!! :)