The center for machine learning provides enterprise-level support to several lines of business (LOBs). Some of these LOBs run critical systems that require very high availability, and that level of service availability brings its own challenges. The key challenges include implementing more relevant service level indicators (SLIs) and service level objectives (SLOs) for all tenants of the platform, standardizing the approach to resolving incidents, and minimizing the downstream impact of incidents. Left unaddressed, these issues risk degraded service quality, inconsistent reporting, and lost productivity, all outcomes the team sought to avoid. Gaps in the documentation of critical systems are especially concerning: if one of these systems goes down, a regulatory process may fail to run and key information may not reach the appropriate parties in time. For each of these challenges, we provided support and proposed improvements and action plans to mitigate them.