10 Best Practices to Improve Data Center Uptime
The cost of data center outages can be staggering, and data center professionals report that large outages are becoming more expensive. A recent survey found that 16% of outages in 2020 cost more than $1 million, up from 10% in 2019. Another 40% of outages cost between $100,000 and $1 million, up from 28% in 2019.
Not only is the cost of downtime rising, but so is the share of outages that could have been prevented. In 2019, 60% of downtime incidents were deemed preventable, and that share rose to 75% in 2020. Power and cooling issues caused 50% of outages.
Following Data Center Best Practices
Data center managers need to follow best practices to reduce downtime from these preventable incidents.
Here are the top 10 best practices that successful data center managers follow to improve uptime with Data Center Infrastructure Management (DCIM) software:
- Leverage health polling of metered devices. Use health polling to confirm that intelligent rack PDUs and other metered devices are operating and reachable over your network, so you are the first to know if you’ve lost visibility into equipment or have a power outage. Health polling sends you an immediate alert when a device is down so you can react quickly and restore service before the problem escalates.
- Set and monitor thresholds. It is best practice to monitor intelligent PDUs and other metered devices and to receive traps from them. Then, set warning and critical thresholds on the data you collect to easily understand the status of your equipment. Use an enterprise health dashboard for at-a-glance views of threshold violations with easy-to-understand red-yellow-green color-coding. If you have a violation, use your dashboard to drill down and see the exact alarms causing those warning or critical conditions.
- Use trend charts to see changes over time. Trend charts are extremely useful because even if you haven’t violated a threshold yet, you can still see whether power or temperature readings are increasing over time. This lets you act before a threshold violation turns into an incident. Send your charts to management in automatic weekly reports to keep them informed of what’s happening in the data center.
- Follow ASHRAE guidelines with psychrometric cooling charts. Ensure your equipment meets ASHRAE recommendations for temperature and humidity with cooling charts that give you the ability to see a large number of sensors in one view. You can then instantly identify which devices are operating outside of recommended ranges and act accordingly to maintain uptime.
- Visualize temperature sensor readings with heat map time-lapse videos. Turn your environmental sensor data into horizontal or vertical heat maps with time-lapse videos to quickly identify and eliminate hot spots before they damage equipment.
- Monitor cabinet capacity and redundancy. Create a daily report that highlights racks that are low on capacity or dangerously close to violating your redundancy requirements.
- Use dashboards for at-a-glance views of health, power, and cooling. Remote data center management dashboards are incredibly helpful for turning data into actionable information that is easily shareable and enables data-driven collaboration. Must-have KPIs you should monitor include peak power load per cabinet, days of power capacity remaining, cabinet power failover redundancy, power chain breaker utilization, latest temperature per cabinet, delta-T per cabinet, and maximum temperature per cabinet.
- Monitor capacity at each breaker. Use data center management software that automatically tracks the power at each breaker connection to ensure ratings are not exceeded. With live readings from inlet or outlet meters, the software will prevent you from applying a load that would exceed breaker limits.
- Balance three-phase loads. Unbalanced power can lead to premature circuit breaker trips and high voltages that can reduce the useful life of equipment. Set thresholds on three-phase power to receive alerts when a device is in violation. Then, act on this information to keep all phases balanced and maintain uptime.
- Simulate failover and test what-if scenarios. Don’t wait until it’s too late to find out what happens in the event of a failure. Use DCIM software to simulate failover and ensure that power is always available to IT equipment. You can also test what-if scenarios with reports that identify available capacity to provide coverage in the event of a failure.
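Several of the practices above come down to simple arithmetic on meter and sensor readings. The Python sketch below is illustrative only and does not describe how any particular DCIM product works. The function names, the 27 °C and 32 °C temperature thresholds, the 80% breaker derating factor, and the imbalance formula (largest deviation from the average phase current, divided by the average) are all assumptions chosen for the example, so substitute your own policies and vendor guidance.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class Threshold:
    """Hypothetical warning and critical limits for one metric."""
    warning: float
    critical: float


def classify(reading: float, threshold: Threshold) -> str:
    """Map a reading to a red-yellow-green status."""
    if reading >= threshold.critical:
        return "red"
    if reading >= threshold.warning:
        return "yellow"
    return "green"


def trend_per_sample(readings: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced readings.

    A positive value means the metric is rising, even if no
    threshold has been violated yet.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to compute a trend")
    x_mean = (n - 1) / 2
    y_mean = mean(readings)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(readings))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def breaker_headroom_amps(load_amps: float, rating_amps: float,
                          derating: float = 0.8) -> float:
    """Amps left before the load reaches the derated breaker limit.

    The 0.8 default is an assumed policy; a negative result means
    the limit is already exceeded.
    """
    return rating_amps * derating - load_amps


def phase_imbalance_pct(l1: float, l2: float, l3: float) -> float:
    """Largest deviation from the average phase current, as a percentage."""
    currents = (l1, l2, l3)
    average = mean(currents)
    if average == 0:
        return 0.0
    return max(abs(c - average) for c in currents) / average * 100.0


if __name__ == "__main__":
    inlet_temp = Threshold(warning=27.0, critical=32.0)  # assumed limits in °C
    print(classify(27.5, inlet_temp))
    print(trend_per_sample([24.0, 24.5, 25.0, 25.5]))
    print(breaker_headroom_amps(load_amps=14.0, rating_amps=20.0))
    print(phase_imbalance_pct(12.0, 10.0, 8.0))
```

Under these assumptions, a 20 A breaker carrying 14 A has 2 A of headroom, and phase currents of 12 A, 10 A, and 8 A give an imbalance of 20%. Feeding the same functions with live readings is one way to drive the warning and critical alerts described in the list above.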
Don’t Wait for a Data Center Outage to Occur
The value of preventing outages is enormous. The best data center managers recognize this and follow these best practices to maintain uptime. Follow their example and adopt a complete DCIM solution with best-in-class monitoring and reporting capabilities, and you could save your organization millions.
Want to see how Sunbird DCIM helps improve uptime? Take a free test drive today.