Top 20 tips for best practice datacentre maintenance
Complying with modern datacentre best practices for design and operations is challenging enough, but facilities must be properly maintained to keep up a reliable level of service.
- Safety First. Datacentres contain numerous hazards that can impact the life and health of technicians. Datacentre staff need to be aware of potential safety hazards when performing preventive maintenance activities. Ensure datacentre staff are familiar with health and safety processes by documenting them and providing regular safety training.
- Scheduled Maintenance. Performing preventive maintenance on UPS and batteries greatly reduces the chance of failure during power outages. The same is true for other critical systems such as HVAC and generators. Regular preventive maintenance can reduce the chance of failure, reduce the amount of energy consumed and extend equipment lifetime. The manufacturer's recommended preventive maintenance schedule is a good guide.
- Standardised Checklists. These ensure datacentre staff know what to do during preventive maintenance and that the same standard checks are performed every time.
- Enforce compliance. It is important to complete preventive maintenance and to complete it on time. The easiest way to do this is to measure and enforce PM compliance. Your preventive maintenance compliance (PMC) score is the percentage of scheduled PM work orders that get done on time. Keeping it high reduces variation in the intervals between maintenance tasks, which improves reliability.
- Documentation. If things go wrong, insufficient documentation can cause further problems. Well-documented reports ensure the data is readily available whenever auditors come to inspect, and historical work order information can be used to identify chronic equipment problems or unacceptable levels of downtime.
- Tiles. Holes in tiles should be covered and openings protected with safety cones or temporary guardrails. No more than four contiguous floor tiles should be removed at any time. These best practices will prevent injuries, minimise the amount of air lost through the openings and keep the floor structure stable. Tiles should always be lifted with a reliable tile puller and set aside where they won’t create a tripping hazard.
- Cables. Pay attention to your cable management. Cables strung across the floor during installation are a trip hazard. They can also shed dirt or create static build-up as they’re pulled across floors. Hold a damp cloth around the cable bundle to remove surface dirt as the bundle is pulled from the box. Always block off the aisle(s) in which cables are strung and never leave cables on the floor any longer than absolutely necessary, and certainly never over breaks or overnight.
- Cooling. Good cooling is essential to keeping equipment reliable, and a maintenance contract with a qualified service company is just the start. Check that blanking plates are installed in unused rack and cabinet spaces. Make sure that the filters in the air conditioners are checked, as well as the filters and heat sinks in computing equipment, and make cleaning or replacing these filters routine. Temperature and humidity readings should be verified at least annually. Facilities using cold aisle containment should calibrate the differential pressure sensors, and all air conditioning monitoring systems should be tested regularly to ensure that alarms work.
- Noise. Don’t forget about protecting your staff from noise damage. Cooling equipment and server fans can be very noisy. Ensure staff use hearing protection and make it readily available to everyone, with instruction on where and how to use it.
- Load balance. Unbalanced loads are energy inefficient and can lead to the unnecessary replacement of a UPS through the mistaken belief that it is running near its maximum capacity. Large UPS systems deliver three-phase power, and many racks and cabinets today are circuited with either two or all three of those phases. Power draws should be checked regularly at each point in the power chain: racks and cabinets, power distribution units and finally at the UPS. Maintaining load balance will get maximum power from your UPS at the highest efficiency.
- Battery Monitoring. Invest in a good battery monitoring system. Weak batteries are the most common cause of UPS failures. Battery failure usually happens at the very worst time, when power goes out and load is suddenly put onto the system. A good monitor can alert you to failing cells before it’s too late. It can also extend the life of the full battery string by identifying cells for replacement before they degrade the rest of the cells.
- Qualified electricians. Qualified electricians must perform any electrical work, but it is also essential that anyone working in the datacentre understands the sensitivity of computer operations and the associated risks in a live operating environment.
- Generator Maintenance. The two most common causes of generator failure are dead start batteries and fuel contamination. In cold climates, check block heater operation. Good generator maintenance is critical.
- Water. Anything that carries water should be checked regularly, such as sprinkler pipes; floor, system and air conditioner condensation drains; and liquid detectors. Also make sure that the roof is regularly surveyed for leaks, along with any other water sources above your datacentre.
- Fire. Ensure all staff have full fire incident training, which should cover the locations and proper use of fire extinguishers. The maintenance plan should also include verifying that extinguishers are properly charged.
- Storage. Don’t store equipment inside the IT area. Boxes bring in particulate contamination. Opening boxes or uncrating equipment creates serious contamination that can clog filters and heat sinks, raising the operating temperatures of computing hardware and contributing to early failure. Cardboard and paper are also fuels for a fire.
- Cleaning. Keep your datacentre clean. Make sure the floor is routinely damp-mopped. Foot wipe mats at entrances should be changed regularly.
- Even Floors. If you have a raised access floor, make sure it is level; uneven floors leak expensive cool air and create a tripping hazard.
- Lifting. Use lifting gear. Mechanical lifters are faster, more efficient and safer for installing equipment.
- Food & Drink. Food and drinks should never be allowed inside the datacentre. They create mess and contamination.
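The PMC score described under "Enforce compliance" is simple arithmetic, and a short sketch may help when setting it up in a spreadsheet or maintenance system. The `WorkOrder` record, its field names and the sample dates below are all hypothetical, and the sketch assumes that an open work order counts as not done on time.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WorkOrder:
    """Hypothetical PM work order record; field names are illustrative only."""
    task: str
    due: date
    completed: Optional[date] = None  # None means the work order is still open


def pm_compliance(orders: list[WorkOrder]) -> float:
    """Percentage of scheduled PM work orders completed on or before the due date."""
    if not orders:
        raise ValueError("no scheduled work orders to score")
    on_time = sum(
        1 for o in orders if o.completed is not None and o.completed <= o.due
    )
    return 100.0 * on_time / len(orders)


# Made-up sample data for illustration.
orders = [
    WorkOrder("UPS battery inspection", date(2024, 3, 1), date(2024, 2, 28)),
    WorkOrder("Air conditioner filter change", date(2024, 3, 5), date(2024, 3, 9)),  # late
    WorkOrder("Generator test run", date(2024, 3, 10), date(2024, 3, 10)),
    WorkOrder("Fire extinguisher check", date(2024, 3, 15)),  # still open
]

print(f"PMC: {pm_compliance(orders):.0f}%")  # prints "PMC: 50%"
```

Two of the four sample work orders were finished by their due date, so the score is 50%. What counts as "on time" (for example, whether a grace period applies) is a policy choice that this sketch does not try to settle.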
Regular, scheduled maintenance can easily pay for itself by preventing unplanned downtime events caused by battery or capacitor failure, clogged air filters, welded relays and even outdated firmware.
This is a quick checklist of suggestions to include in your datacentre facilities operation and maintenance plan. It is by no means exhaustive, but use it to check against your current plan and update as needed. If you don’t have a datacentre-specific maintenance program, then why not use this list as a handy starting guide to develop your own and get everyone on board?
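As one illustration of turning a tip into a routine check, the "Load balance" item can be expressed as a small calculation over per-phase current readings. The sketch below assumes one common way of expressing imbalance (the largest deviation from the average phase current, as a percentage of that average). The location names, readings and 10% alert threshold are all made up for illustration and are not recommendations.

```python
def phase_imbalance(amps_a: float, amps_b: float, amps_c: float) -> float:
    """Percent imbalance: largest deviation from the mean phase current, over the mean."""
    currents = (amps_a, amps_b, amps_c)
    mean = sum(currents) / 3
    if mean == 0:
        return 0.0  # no load on any phase, so nothing to balance
    return 100.0 * max(abs(i - mean) for i in currents) / mean


# Hypothetical readings (amps per phase) at each point in the power chain.
readings = {
    "rack A12": (10.0, 14.0, 12.0),
    "PDU 3": (80.0, 82.0, 78.0),
    "UPS 1": (210.0, 260.0, 190.0),
}

ALERT_THRESHOLD_PCT = 10.0  # assumed value; set this from your own design limits

for location, amps in readings.items():
    pct = phase_imbalance(*amps)
    flag = "CHECK" if pct > ALERT_THRESHOLD_PCT else "ok"
    print(f"{location}: {pct:.1f}% imbalance [{flag}]")
```

Because a UPS is limited by its most heavily loaded phase, logging the per-phase figures alongside the total makes it easier to tell a unit that is genuinely near capacity from one that is simply unbalanced.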