Monday, September 17, 2012

SysAdmin 101: Typical power-related symptoms - or TG4's Leinstergate for Dummies

If you are a sysadmin, and probably on call for at least some nights and weekends, one thing you need to learn very fast is how to recognise data centre problems, hardware problems and their typical patterns.

Basically, power and network problems have very similar symptoms. If you've got a multi-level service or application that consists of, say, a database sitting on a Linux or Unix cluster, served out by a Windows cluster-based application, all of it sitting on shared storage, you'll see typical alert patterns.

When power fails, you generally notice that everybody and everything loses access. Applications elsewhere that depend on the service or monitor it go ballistic. Network alerts indicate loss of link. This is important for the next type of alert. Users cannot access the application, but this is often secondary to the fact that they are scrambling with their powered-down PCs in the dark anyway. Some data centres may go into generator mode as secondary power kicks in. With the latest UPSes, you'd hardly notice. More commonly, though, local users cannot use their own PCs, and the UPS generally allows a good sysadmin sufficient time to talk local staff through a safe shutdown.

If there is a local WAN failure and the application or service has no external dependencies, users will all work away just fine, but you'll see a flood of missed responses. Generally this is relatively easy to deal with, as long as you're not the network admin. What can be complex these days, however, is that most sites have backup network lines that often don't have identical configurations. So sometimes the traffic shaping that previously gave your critical business application priority over internet traffic might not apply. Your WAN provider should be sorting this one out. This is important to monitor for 3rd party service providers, as alerts often have to be managed even if the network is outside the support boundary.
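As a rough illustration of the two alert patterns described so far, here is a minimal Python sketch of a triage rule. Everything in it is an assumption made for the example: the `SiteAlerts` fields, the `triage` function and the idea that link-loss alerts accompany a power failure while a WAN failure shows up only as missed responses. Real monitoring tools will report these differently, so treat it as a thinking aid rather than a working detector.

```python
from dataclasses import dataclass


@dataclass
class SiteAlerts:
    # All field names are hypothetical, for illustration only.
    link_loss_alerts: int   # network alerts reporting loss of link
    missed_responses: int   # monitoring polls that got no reply
    hosts_monitored: int    # number of hosts polled at the site


def triage(alerts: SiteAlerts) -> str:
    """Return a first guess at the failure type from the alert pattern."""
    if alerts.hosts_monitored == 0 or alerts.missed_responses == 0:
        return "no outage pattern"
    # "All or nothing": every monitored host has gone quiet.
    everything_silent = alerts.missed_responses >= alerts.hosts_monitored
    if everything_silent and alerts.link_loss_alerts > 0:
        return "suspect power failure"
    if everything_silent:
        return "suspect WAN failure"
    return "partial outage, investigate individual hosts"


print(triage(SiteAlerts(link_loss_alerts=4, missed_responses=10, hosts_monitored=10)))
print(triage(SiteAlerts(link_loss_alerts=0, missed_responses=10, hosts_monitored=10)))
```

The only thing the rule really encodes is the order of the questions: is everything silent, and if so, did the network also report loss of link?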

Power failure is very much an all or nothing scenario. You don't get any response from a server or router that is powered off. If there is a front end server elsewhere, it will probably indicate missing content. This isn't, however, what TG4's front end reported on Saturday - it indicated that a page pool had run out. As a once suffering backup administrator, I find this familiar: it's common for servers to exhaust the paged pool and basically stop permitting new connections. There are a few cures. Reboot the server - and wait until the same problem occurs again as available memory is overwhelmed. Add additional memory - a good solution if it's only occurring occasionally. Cluster the service onto a number of servers and use load balancers to share out the load. Or replace the server with more powerful front ends, use more resilient applications, or filter traffic via network devices.
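The point that a powered-off server says nothing at all, while an overloaded one still answers with an error, can be sketched in a few lines of Python. The function name `classify_probe` and the use of an HTTP status code as the probe result are assumptions for the sake of the example, not a description of how TG4's front end was actually monitored.

```python
from typing import Optional


def classify_probe(status_code: Optional[int]) -> str:
    """Classify one probe of a front end.

    status_code is None when the probe got no reply at all
    (hypothetical convention for this sketch).
    """
    if status_code is None:
        # Silence is what a powered-off or unreachable box looks like.
        return "no response: consistent with power or network failure"
    if status_code >= 500:
        # Any reply, even an error page, means the box has power.
        return "server error: host is powered on but struggling"
    return "responding"


print(classify_probe(None))
print(classify_probe(503))
print(classify_probe(200))
```

An error message about an exhausted resource falls into the second case: the server was up and talking, so power is not the explanation.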

But as the TV licence goes, one thing that is not a cure is to wait until the problem occurs, and then completely misrepresent the actual problem to viewers on the assumption that they won't be able to break down what actually occurred. The reality of the Leinster debacle on Saturday was that TG4 did not have sufficient capacity to stream the Leinster game live. There may also be secondary issues with suppliers, but their excuses today are hollow.
