MMS • Patrick Zhang
Uber has developed a new tool named CheckEnv to address a risk in its microservices architecture, where numerous loosely coupled services interact through remote procedure calls (RPCs). The tool is designed to quickly detect and address RPCs that cross between environments, such as production and staging, which can cause data inconsistencies or unexpected behavior.
CheckEnv uses dependency graphs, which represent service-to-service calls, to expose communication patterns and dependencies. This representation helps pinpoint cross-environment RPCs. The system applies graph analysis to automate detection and integrates the results into Uber’s monitoring and alerting systems so that such issues can be resolved promptly.
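To make the idea concrete, the following is a minimal sketch of flagging cross-environment edges in a dependency graph. It is not Uber's implementation: the `Call` record, the service names, and the environment labels are all hypothetical, and it assumes each edge already carries the environment of both endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One edge of a hypothetical dependency graph: caller -> callee."""
    caller: str
    caller_env: str
    callee: str
    callee_env: str


def cross_environment_calls(calls):
    """Return the edges whose two endpoints run in different environments."""
    return [c for c in calls if c.caller_env != c.callee_env]


# Illustrative data only; these service names are made up.
calls = [
    Call("rider-api", "staging", "pricing", "staging"),
    Call("rider-api", "staging", "payments", "production"),
    Call("pricing", "production", "geo", "production"),
]

flagged = cross_environment_calls(calls)
for c in flagged:
    print(f"{c.caller} ({c.caller_env}) -> {c.callee} ({c.callee_env})")
```

In a real system the flagged edges would feed an alerting pipeline rather than being printed.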
The tool incorporates both real-time and aggregated dependency graphs. The real-time graph is updated continually, capturing essential metrics and identifying potential issues in service dependencies. The aggregated graph, on the other hand, provides a historical perspective of service interactions, aiding in the analysis of system performance over time.
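The relationship between the two graphs can be sketched as a roll-up: individual call events, as a real-time graph might observe them, are folded into per-day edge counts that give a historical view. This is an assumption about how such aggregation could work, not a description of CheckEnv's pipeline; the event tuple layout and the daily bucket size are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def aggregate_by_day(events):
    """Fold (unix_timestamp, caller, callee) events into per-day edge counts."""
    daily = defaultdict(Counter)
    for ts, caller, callee in events:
        # Bucket each event by its UTC calendar day.
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        daily[day][(caller, callee)] += 1
    return daily


# Illustrative events only; the third one falls on the following UTC day.
events = [
    (1700000000, "rider-api", "pricing"),
    (1700000060, "rider-api", "pricing"),
    (1700090000, "rider-api", "payments"),
]

daily = aggregate_by_day(events)
for day in sorted(daily):
    print(day, dict(daily[day]))
```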
CheckEnv operates on two graph data storage systems within Uber, Grail and Local Graph. These platforms aggregate and store call graph data, and CheckEnv provides APIs to retrieve information such as a service’s dependencies and the paths that lead to production dependencies. This setup makes it easier to identify anomalies, troubleshoot issues, and optimize the microservices architecture.
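A query for "paths leading to production dependencies" could look like the breadth-first search below. This is a sketch under stated assumptions, not CheckEnv's actual API: the function name `paths_to_production`, the adjacency-list graph, and the `env_of` lookup are hypothetical. The search stops expanding once it reaches a production service, since the first crossing on a path is the one that matters here.

```python
from collections import deque


def paths_to_production(graph, env_of, start):
    """Return the shortest call path from `start` to each production
    service that is the first production hop on its path."""
    parents = {start: None}
    queue = deque([start])
    paths = {}
    while queue:
        node = queue.popleft()
        if node != start and env_of.get(node) == "production":
            # Rebuild the path by walking parent links back to the start.
            path = []
            cur = node
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            paths[node] = path[::-1]
            continue  # do not traverse beyond the first production service
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return paths


# Illustrative graph only; these service names are made up.
graph = {
    "load-test-driver": ["checkout-staging"],
    "checkout-staging": ["cart-staging", "payments-prod"],
    "cart-staging": ["inventory-prod"],
    "payments-prod": ["ledger-prod"],
}
env_of = {
    "load-test-driver": "staging",
    "checkout-staging": "staging",
    "cart-staging": "staging",
    "payments-prod": "production",
    "inventory-prod": "production",
    "ledger-prod": "production",
}

result = paths_to_production(graph, env_of, "load-test-driver")
for target, path in result.items():
    print(target, "via", " -> ".join(path))
```

Returning the full path, rather than just the offending service, shows which intermediate dependency introduced the crossing.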
An example of CheckEnv’s application is in Uber’s synthetic load testing platform, Ballast. Here, it detects potential cross-environment calls during load tests, ensuring a secure and reliable testing environment by alerting users to any potential issues before they escalate.
Looking ahead, Uber plans to expand the capabilities of CheckEnv and its underlying data ingestion pipeline, MazeX, to construct a more powerful graph. The expansion aims to improve the system’s ability to analyze communication patterns between services, optimizing data flow and improving service efficiency. Uber expects this graph-based approach to address other challenges within the microservice architecture, such as real-time fault detection and workflow management.