Golem is a next-generation open-source platform for building and running highly reliable services. The platform is powered by a new abstraction called durable computing, which lets you build and deploy robust services in a fraction of the time otherwise required.
Take a peek behind the scenes into how the Golem platform delivers on its promise of transparent durable execution and automatic error recovery.
The building blocks of your Golem application are worker templates. Each worker template represents one component of your overall application, compiled to WebAssembly. A worker template contains a public API (defined using WIT, an alternative to Protobuf), your business logic, and the rest of your code, including whatever frameworks and libraries you use.
Golem uses worker templates to create workers, which are running instances of your templates. Workers are isolated from each other, and workers created from different versions of the same worker template may execute concurrently without issue. You can use one worker for each incoming request, or reuse the same worker to handle related requests.
Workers execute independently from one another. Failures in one worker (even catastrophic ones, such as out of memory) have no effect on other workers. Deploying new versions of a worker template only affects new workers created from the template. Older workers created with earlier templates continue to run unaffected by the updated worker template.
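To make the worker model above concrete, here is a minimal sketch in plain Python. It is not Golem's API: `Worker`, `WorkerPool`, and every method name are hypothetical stand-ins, used only to show a fresh worker per request, a reused worker for related requests, and older workers keeping the template version they were created with.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Toy stand-in for a worker: an isolated instance with its own state."""
    worker_id: str
    template_version: int
    requests: list = field(default_factory=list)

    def handle(self, request):
        # State lives inside this worker only; other workers never see it.
        self.requests.append(request)
        return len(self.requests)


class WorkerPool:
    """Hypothetical pool that creates workers from the current template version."""

    def __init__(self):
        self.current_version = 1
        self._ids = itertools.count(1)
        self._named = {}

    def deploy_new_version(self):
        # Only workers created after this call use the new version.
        self.current_version += 1

    def fresh_worker(self):
        # One worker per incoming request.
        return Worker(f"ephemeral-{next(self._ids)}", self.current_version)

    def named_worker(self, name):
        # The same worker handles all requests that share a name.
        if name not in self._named:
            self._named[name] = Worker(name, self.current_version)
        return self._named[name]
```

In this sketch, calling `named_worker("cart-alice")` twice returns the same stateful worker, and a worker created before `deploy_new_version()` keeps its original `template_version`.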
If you execute your worker directly, it interacts directly with your operating system, and if your machine dies, your worker dies with it. When you execute on Golem, however, Golem records all of your worker’s interactions with the outside world. This log of host interactions (the oplog) can be used for recovery in the event of a failure that requires the worker to be relocated to a healthy node.
Because Golem proxies all host interactions, it can identify failed interactions that may be recoverable, such as HTTP requests, gRPC calls, and database queries. Golem automatically retries recoverable interactions with third-party cloud services, databases, and microservices, using user-configurable retry policies.
Golem actively supervises executing workers. In the event of a failure, such as when the node a worker is executing on restarts or goes down, Golem relocates the worker to a new node. Golem uses the oplog to restart the worker and play back all host interactions, restoring the complete state of the worker to the moment before the failure occurred.
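A retry policy of this kind can be sketched as exponential backoff with a cap. The example below is an illustration under assumptions, not Golem's implementation: `RetryPolicy`, its fields, `RecoverableError`, and `call_with_retries` are hypothetical names, and the backoff formula is one common choice.

```python
import time
from dataclasses import dataclass


class RecoverableError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""


@dataclass(frozen=True)
class RetryPolicy:
    """Assumed policy shape: capped exponential backoff."""
    max_attempts: int = 5
    min_delay: float = 0.1   # seconds before the first retry
    max_delay: float = 2.0   # upper bound for any single delay
    multiplier: float = 2.0  # growth factor between retries

    def delay_before_retry(self, attempt):
        # attempt is 1-based: the delay after the first failure is min_delay.
        return min(self.max_delay, self.min_delay * self.multiplier ** (attempt - 1))


def call_with_retries(operation, policy, sleep=time.sleep):
    """Run operation, retrying recoverable failures according to policy."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return operation()
        except RecoverableError:
            if attempt == policy.max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(policy.delay_before_retry(attempt))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Errors that are not `RecoverableError` propagate immediately and are never retried.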
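The record-and-replay idea behind the oplog can be illustrated with a toy model. This is a sketch of the general technique only. `DurableHost`, `checkout`, and `SimulatedCrash` are hypothetical names, and Golem's real oplog format and recovery logic are not shown here. The sketch assumes the worker's code is deterministic apart from its host calls.

```python
class SimulatedCrash(Exception):
    """Stands in for the node going down mid-execution."""


class DurableHost:
    """Toy proxy: records results of host calls, or replays recorded ones."""

    def __init__(self, oplog=None):
        self.oplog = list(oplog or [])
        self._cursor = 0

    def call(self, name, effect):
        if self._cursor < len(self.oplog):
            # Replay: return the recorded result without re-running the effect.
            recorded_name, result = self.oplog[self._cursor]
            if recorded_name != name:
                raise RuntimeError(f"replay diverged: {recorded_name} != {name}")
        else:
            # Live: perform the real interaction and record its result.
            result = effect()
            self.oplog.append((name, result))
        self._cursor += 1
        return result


def checkout(host, charges, crash_after_charge=False):
    """Hypothetical worker logic whose side effects all go through the host."""
    price = host.call("get_price", lambda: 42)

    def charge_card():
        charges.append(price)  # the externally visible side effect
        return f"receipt-{len(charges)}"

    receipt = host.call("charge_card", charge_card)
    if crash_after_charge:
        raise SimulatedCrash()
    host.call("send_email", lambda: "sent")
    return receipt
```

If a first run crashes after charging the card, a second run started with `DurableHost(first_host.oplog)` replays `get_price` and `charge_card` from the log, so the card is not charged again, and then continues live with `send_email`.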
Using our Management Console, CLI, or REST API, you can define what your API should look like and how every endpoint should map to code that you have implemented in your backend. You can choose to map each endpoint to a new worker, or use long-lived workers to statefully handle endpoints.
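As a rough illustration of mapping endpoints to workers, the sketch below resolves an incoming request to a worker name and a function. The `Route` shape, the `{user}` placeholder syntax, and the `resolve` helper are assumptions made for this example. They are not Golem's API definition format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical route: an HTTP endpoint bound to a worker and a function."""
    method: str
    path: str         # for example "/carts/{user}/items"
    worker_name: str  # for example "cart-{user}", one long-lived worker per user
    function: str     # exported function to invoke on that worker


def match(route, method, path):
    """Return the path parameters if the request matches the route, else None."""
    if method != route.method:
        return None
    pattern = route.path.strip("/").split("/")
    actual = path.strip("/").split("/")
    if len(pattern) != len(actual):
        return None
    params = {}
    for expected, segment in zip(pattern, actual):
        if expected.startswith("{") and expected.endswith("}"):
            params[expected[1:-1]] = segment  # capture a path variable
        elif expected != segment:
            return None
    return params


def resolve(routes, method, path):
    """Find the first matching route; return (worker name, function, params)."""
    for route in routes:
        params = match(route, method, path)
        if params is not None:
            return route.worker_name.format(**params), route.function, params
    return None
```

Deriving the worker name from a path variable, as in `cart-{user}`, is how this sketch models a long-lived stateful worker per user. A route that should get a new worker per request would instead generate a unique name for each call.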
Using our Management Console, CLI, or REST API, you can deploy your API to any subdomain of your choosing. Your versioned and automatically validated API will support encryption, CORS, caching, request logging, and other features without you having to write any code or manage any infrastructure.
Golem automatically scales up infrastructure to meet the load on your API. If you have mapped each endpoint to a new worker, then you have effectively infinite scalability. If you use long-lived workers to handle some endpoints, you can specify caching settings to handle high traffic.
Golem Cloud supervises each worker and performs continuous incremental snapshotting. If any of your workers are interrupted due to restarts, updates, or infrastructure failures, Golem Cloud restores and resumes the workers on new nodes, ensuring your business logic executes to completion every time.
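One common way to combine snapshots with a log is to restore the latest snapshot and replay only the entries recorded after it. The sketch below shows that general pattern with a periodic snapshot. It is an assumption-laden illustration, not a description of how Golem Cloud implements snapshotting: `SnapshottingLog`, `apply_entry`, and `recover` are hypothetical names.

```python
import copy


def apply_entry(state, entry):
    """Toy state transition: each entry adds an amount to a named counter."""
    key, amount = entry
    state[key] = state.get(key, 0) + amount


class SnapshottingLog:
    """Append-only log that saves a copy of the state every few entries."""

    def __init__(self, snapshot_every=3):
        self.entries = []
        self.snapshot_every = snapshot_every
        self.snapshot = (0, {})  # (number of entries covered, saved state)

    def append(self, entry, state):
        self.entries.append(entry)
        if len(self.entries) % self.snapshot_every == 0:
            self.snapshot = (len(self.entries), copy.deepcopy(state))


def recover(log):
    """Rebuild state from the latest snapshot plus the log tail after it."""
    covered, saved = log.snapshot
    state = copy.deepcopy(saved)
    tail = log.entries[covered:]
    for entry in tail:
        apply_entry(state, entry)
    return state, len(tail)
```

The trade-off in this sketch is between snapshot frequency and recovery time: more frequent snapshots mean a shorter tail to replay after a failure.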
Connect to running workers to see their status and output, or use our control panel and dashboard to monitor and manage them. Test long-running workers for compatibility with hot updates, and switch on features like automatic retries, logging, and telemetry (Coming Soon).
Durable execution ensures that complex, multi-step workflows are executed reliably, even in the face of system failures or restarts. With Golem, workflows involving multiple services and tasks continue seamlessly, maintaining state and progress without manual intervention. This feature is crucial for workflows that require long-running processes or depend on the completion of preceding steps, making Golem ideal for automating and streamlining intricate workflows with guaranteed execution integrity.
In managing business processes, durability is key to ensuring consistent and uninterrupted operations. Golem's durable execution allows businesses to implement processes that remain active and consistent, unaffected by system disruptions or downtimes. This robustness is particularly beneficial for critical business operations like order processing, inventory management, and customer relationship activities, where reliability directly impacts customer satisfaction and business continuity.
Infrastructure orchestration involves coordinating various IT resources and services. Golem's durable execution ensures that orchestration tasks are resilient to failures. This is vital when managing complex deployments, scaling operations, and performing system updates. By leveraging Golem, IT teams can confidently orchestrate their infrastructure, knowing that each component will maintain its state and functionality, even in unpredictable cloud environments.
Business transactions often require high levels of reliability and atomicity. Golem's approach to durable execution ensures that transactions are executed to completion, maintaining data integrity and consistency, even in the event of failures. This reliability is crucial for financial operations, e-commerce transactions, and any other scenario where transactional integrity is a must, reducing the risk of data corruption or loss.
In advanced AI applications, particularly with complex models like LLMs, execution often involves multi-step processes or iterative refinements. Golem's durable execution ensures the continuity and reliability of these long-running and computationally intensive tasks. This capability is essential for AI models requiring iterative feedback and adjustment, guaranteeing that each computational step is completed without disruption. The practical benefits include uninterrupted iterative processes, efficiency in long-running tasks, and scalability for expanding AI operations. Golem's robust framework is particularly valuable for embedding AI into applications, providing the resilience needed to handle the complexities of advanced models like LLMs efficiently.
ETL processes are critical for data management and analytics. They often involve large datasets and complex transformations, which can take a long time to complete. Golem's durable execution guarantees the resilience of ETL pipelines, ensuring data integrity and process completion, even if system disruptions occur. This is especially important for businesses relying on timely and accurate data processing for analytics and decision-making.
Streaming analytics involves analyzing data in real time as it flows through systems. With Golem's durable execution, these analytics processes become more reliable, ensuring continuous data processing and analysis, even in the face of network instability or hardware failures. As a result, real-time insights are delivered consistently, which is crucial for scenarios like fraud detection, real-time marketing, and operational monitoring.