Kubernetes & cloud architecture consultancy for blockchain startup
IT Svit was contacted by a cryptocurrency platform operating the most secure, performant, and cost-effective block production nodes for decentralized PoS protocols on behalf of institutional investors. The customer needed help with restructuring the system layout and adding new cryptocurrencies to the range of offers.
Industry: Cryptocurrency staking and lending
Partnership period: August 2019 – ongoing
Team size: 1 Team Lead, 2 DevOps engineers
Team location: Kharkiv, Ukraine
Services: AWS infrastructure performance optimization, CI/CD and monitoring implementation
Expertise delivered: Cloud infrastructure management and optimization, monitoring and alerting implementation, CI/CD configuration
Technology stack: AWS tools, Kubernetes, Docker, Ansible, Jenkins, Prometheus, Grafana, AlertManager, MySQL Aurora, SMTP2Go, SumoLogic, OpenVPN, EFS Provisioner, Ingress controllers, Certificate manager, Falco, Vault.
The customer required the following services:
- Build a multi-cloud architecture for running containerized blockchain nodes
- Containerize all cryptocurrency apps running in the system
- Configure a Kubernetes cluster to run EOS and deploy all the pods/services/deployments prepared
- Ensure highly available Multi-Availability-Zone deployments within a single AWS region
- Enable minute monitoring of system performance
Challenges and solutions
While working on this project, we had to solve several challenges:
- Ensuring fault tolerance with minimal downtime for all blockchains supported. This had to be done for Tendermint blockchains — Cosmos, Terra, Iris, etc. We needed to ensure only one block producer is active at any given time and the downtime between spinning up different producers is minimal.
Solution: We deployed the same infrastructure in 2 regions and launched a HashiCorp Consul server that tracks the state of the active region. This lets us monitor platform performance, and if any issue arises in the active region, the producer is stopped there and launched in the other region.
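The stop-then-start ordering described above can be illustrated with a minimal Python sketch. It is not the production implementation: `RegionState` is a hypothetical in-memory stand-in for the record kept in Consul, and the region names and health flags are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegionState:
    """Hypothetical in-memory stand-in for the state record kept in Consul."""
    active_region: str
    healthy: Dict[str, bool] = field(default_factory=dict)  # region -> last health result


def choose_active_region(state: RegionState, regions: List[str]) -> str:
    """Return the region that should run the block producer.

    The active region keeps the role while it is healthy. Otherwise the first
    healthy standby takes over. If no region is healthy, the role stays where
    it is, so a second producer is never started blindly.
    """
    if state.healthy.get(state.active_region, False):
        return state.active_region
    for region in regions:
        if region != state.active_region and state.healthy.get(region, False):
            return region
    return state.active_region


def failover(state: RegionState, regions: List[str]) -> List[str]:
    """Update the active region and return the ordered actions to perform."""
    target = choose_active_region(state, regions)
    if target == state.active_region:
        return []
    # Stop first, then start: this ordering keeps at most one producer active.
    actions = [
        f"stop producer in {state.active_region}",
        f"start producer in {target}",
    ]
    state.active_region = target
    return actions
```

Calling `failover` again after a switch returns an empty list, since the newly active region is healthy and nothing needs to change.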
- Ensuring automated system failover and data recovery. Blockchain nodes may remain offline long enough for blocks to be completed with errors. In this case, we need to evaluate the data relevance and replace the outdated blocks with the relevant data from the backups in another region.
Solution: We enabled separate node backups for every region and created a script that checks the timestamp of the last file update.
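A timestamp check of this kind could look roughly like the sketch below. The function names and the age threshold are assumptions for illustration; the original script is not shown in this case study.

```python
import os
import time
from typing import Optional


def seconds_since_update(path: str, now: Optional[float] = None) -> float:
    """Seconds elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stale(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True if the file has not been updated within `max_age_seconds`."""
    return seconds_since_update(path, now) > max_age_seconds
```

A recovery job could call `is_stale` on the local chain data and, when it returns true, restore from the other region's backup instead.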
- Ensuring continuity of operations and automated bans for blockchains after 30 minutes out of consensus. Because Tendermint blockchains produce blocks very quickly (once every 6-7 seconds), we need to be informed within 1-2 minutes if the blockchain is out of consensus.
Solution: All logs are monitored with Log exporter and alerts are sent to AlertManager in Prometheus, so we are able to react very quickly.
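To make the timing concrete, here is a small Python sketch of the arithmetic behind such an alert. It assumes the signal is "no new block seen in the logs for longer than a threshold". The 60-second threshold and the function names are illustrative, not the actual alert rules.

```python
def missed_blocks(last_block_ts: float, now: float, block_interval: float = 7.0) -> int:
    """Approximate number of blocks missed since the last one was seen."""
    return max(0, int((now - last_block_ts) // block_interval))


def should_alert(last_block_ts: float, now: float, max_silence: float = 60.0) -> bool:
    """True if no block has been seen for longer than `max_silence` seconds."""
    return now - last_block_ts > max_silence
```

With a 7-second block time, a 60-second silence corresponds to about 8 missed blocks, which leaves a wide margin before the 30-minute limit mentioned above.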
- Automated checks for new blockchain versions and building Docker images for them. We need to build the blockchain applications using the latest stable version, after checking the source code and the installed packages for security vulnerabilities. We must also ensure the apps are not run as the root user.
Solution: We solved this by configuring 2 Jenkins jobs. The first job checks for a new blockchain app version; the second one builds a new image after testing the code and packages for vulnerabilities. When the image is updated, the second job sends a notification to Slack.
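The version check performed by the first job can be sketched as follows. This is a hedged illustration: it assumes release tags of the form `v1.2.3` and treats anything with a suffix (such as `-rc1`) as not stable. The helper names are hypothetical and no Jenkins API is involved.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed tag format: optional "v" prefix followed by MAJOR.MINOR.PATCH.
_STABLE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_stable(tag: str) -> Optional[Tuple[int, ...]]:
    """Return the version as a tuple of ints, or None for non-stable tags."""
    match = _STABLE.match(tag.strip())
    return tuple(int(part) for part in match.groups()) if match else None


def latest_stable(tags: Iterable[str]) -> Optional[str]:
    """Return the tag with the highest stable version, if any."""
    parsed = [(parse_stable(tag), tag) for tag in tags]
    stable = [(version, tag) for version, tag in parsed if version is not None]
    return max(stable)[1] if stable else None


def needs_build(tags: Iterable[str], built_tag: Optional[str]) -> bool:
    """True if a stable release newer than the last built tag exists."""
    latest = latest_stable(tags)
    if latest is None:
        return False
    built = parse_stable(built_tag) if built_tag else None
    return built is None or parse_stable(latest) > built
```

Comparing versions as integer tuples rather than strings matters here: `v1.10.0` is newer than `v1.9.0`, although it sorts lower as text.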
- Enabling real-time two-level monitoring. We needed to ensure nobody was able to mishandle the blockchain apps, which required an in-depth monitoring solution capable of instantly alerting us to any suspicious activity.
Solution: We use Falco with custom rules to detect external attempts to access containers and incorrect node behavior.
- Implementing robust user authentication and access rights configuration. We had to ensure there is no way any Kubernetes cluster pod is able to access the cloud provider’s resources.
Solution: We used kiam (Kubernetes Identity & Access Management) agents to assign limited access rights to each individual pod.
The IT Svit team configured a multi-cloud architecture that can run on AWS, IBM Cloud and Google Cloud Platform. We’re running and maintaining the following blockchains so far:
- EOS CryptoKylin
- Factom on production
- Factom on staging
The ecosystem leverages Kubernetes autoscaling and auto-healing features, including load balancing and failover to several geographically distributed regions. This has enabled a robust CI/CD implementation and automated rollout of new blockchain versions. The Infrastructure as Code approach has cut the time needed to onboard a new blockchain from weeks at the start to hours now.
In-depth two-level real-time monitoring with SumoLogic, Prometheus, Grafana, Sysdig and Falco saves us a lot of time and effort in tracing and pinning down issues, and raises team awareness of relevant blockchain states. As a result, the system scales elastically to handle heavy workloads, maintains a high quality of service and improves platform popularity.
Thanks to precise project requirements, the IT Svit team was able to provide all the deliverables on time. Once the first set of blockchain apps was containerized and the monitoring implemented, the only remaining challenge was to add new cryptocurrencies to the system as soon as they became available, according to the customer's needs.