I'm going to go off on a bit of a (somewhat grumpy) lecture here in hopes that people will stop long enough to listen. A little Gestalt therapy, if you will. Ultimately I hope at least one person recognizes a need and acts on it.
If I had a dime for every time I have personally seen this one issue bite someone in the backside, I'd be a rich man. There are a zillion things that can go wrong on a mission-critical network, but of those things there are actually just a few that account for a substantial portion of the issues that typically bring critical services down.
So, if you run a network and have not addressed the one issue I will describe below, please take the time out of your day to start a plan to remediate the problem ASAP. Along the same lines, if you are not sure where you stand with regard to the issue, or if you have never checked but you feel confident because everything works today and always has so it can't possibly be an issue... Again, please just take the time to inspect your infrastructure and put a plan in place.
I should also say that if I had a dime for every time I've said exactly what you just read in the paragraph above, I'd be a rich man. I lost count long, long ago of the number of hours spent watching people try to avoid - in any way possible - checking the obvious and addressing it. Usually that's due to those egg-on-face concerns that go along with being the guy who missed something so simple and critical (albeit not too obvious) when it came time to learn the detailed intricacies of running a high-availability network.
Okay, enough with the harshness. Time for the issue at hand.
The number one network mistake I have seen people make on IP networks, over and over again, is leaving their switches and servers on the default settings, which let the network interfaces auto-negotiate speed and duplex.
Seriously, if your requirement is to provide high availability and your SLAs require your services be up, do not neglect the critical (but often skipped) process of manually configuring your NICs and switches to the proper settings, and make sure both ends of every link are set the same way. Just because the interface says it's running at 100 Mbps and full-duplex doesn't mean it's working, and when your network takes a dive and you start losing packets you'll be sorry.
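If you want a quick way to audit the server side, here is a minimal sketch in Python. It assumes a Linux host, where the kernel exposes each interface's current speed and duplex under /sys/class/net. The function names, the interface name eth0 and the expected values are mine, purely for illustration. Keep in mind that this only shows what the NIC believes about itself; you still have to check the switch port at the other end of the cable.

```python
from pathlib import Path


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) as the Linux kernel reports them.

    Either value is None when the kernel cannot report it, for example
    when the link is down or the driver does not expose the attribute.
    """
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            # Missing file, or the read is refused while the link is down.
            return None

    raw_speed = read("speed")
    duplex = read("duplex")

    try:
        speed = int(raw_speed) if raw_speed is not None else None
    except ValueError:
        speed = None
    if speed is not None and speed < 0:
        # The kernel uses -1 for "unknown".
        speed = None
    if duplex == "unknown":
        duplex = None
    return speed, duplex


def matches_expected(iface, expected_speed, expected_duplex,
                     sysfs_root="/sys/class/net"):
    """True only if the interface reports exactly the settings you intended."""
    return read_link_settings(iface, sysfs_root) == (expected_speed, expected_duplex)


if __name__ == "__main__":
    # Hypothetical interface name and target settings; adjust for your host.
    print(read_link_settings("eth0"))
    print(matches_expected("eth0", 100, "full"))
```

Run something like this across your servers and compare the results against what the switch ports report. Any interface where the two sides disagree is exactly the kind of problem this post is about.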
Along the same lines, never assume that one half of one percent of packet loss is no big deal. If you are seeing retransmits on your network interfaces, something is likely wrong. Also, chances are that 0.5% loss is not being scattered evenly across your traffic. It may all be happening at once in bursts, and that hurts - a lot.
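To see why the average hides the pain, here is a small Python sketch with made-up numbers (the function name and the sample data are hypothetical, not measurements from a real network). It takes a record of which packets were lost and reports both the overall loss rate and the worst loss rate seen in any run of consecutive packets.

```python
def loss_profile(lost, window):
    """Return (overall_loss_rate, worst_loss_rate_in_any_window).

    lost   -- sequence of booleans, True where a packet was lost
    window -- number of consecutive packets to examine at a time
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(lost)
    if n == 0:
        return 0.0, 0.0

    overall = sum(lost) / n
    window = min(window, n)

    # Slide a fixed-size window across the sequence, tracking the loss count.
    current = sum(lost[:window])
    worst = current
    for i in range(window, n):
        current += lost[i] - lost[i - window]
        worst = max(worst, current)
    return overall, worst / window


# 1000 packets, 5 lost, spread evenly (every 200th packet).
spread = [i % 200 == 0 for i in range(1000)]

# 1000 packets, 5 lost, all in one burst.
bursty = [False] * 1000
bursty[500:505] = [True] * 5

print(loss_profile(spread, 10))  # (0.005, 0.1)
print(loss_profile(bursty, 10))  # (0.005, 0.5)
```

Both samples average out to the same 0.5% loss, but in the bursty case half of the packets in one ten-packet stretch are gone. An application that happens to be mid-transfer during that stretch sees something very different from "half a percent."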
Again, if I had a dime for every time I (or someone working with me) recommended inspecting the interface settings, recommended changing them, and flagged interfaces where traffic analysis showed data transmission loss that was obviously causing network apps to fail... Well, let's just say it's amazing how hard it is to convince some people that their network is the cause of the issue.
Why am I being so blatantly blunt about this? Because I hope that the message will carry, that administrator egos will be set aside, and that people will understand that the real-world evidence based on years of actual experience, proven over and over again, bears out the fact that this will eventually happen to you if you have not already taken the steps to ensure it doesn't. Don't let that happen. Protect that ego now, rather than waiting for it to be damaged.
Finally, don't fall prey to the idea that just because you have high-grade HP, IBM and Dell servers and Cisco switches, the money you (smartly) spent negates the need to set things up the right way, or that these vendors have everything figured out for you and set as defaults. Point of fact, this issue occurs just as often (if not more so) with your expensive, data-center class hardware. In fact, Cisco switches have been somewhat famous for requiring intervention of the manual-configuration type. They even have a troubleshooting support article here that you can refer to for your configuration needs.
You have been advised. Now go do something about it. And forward this to every network administrator you know. The network (and ego) you save may be theirs. :)