Flowmon

Journey from Reactive IT to Proactive Control

23/03/20

This is the story of a medium-sized company (250-1000 employees) and its transition from a reactive IT department that acted like a firefighter to a modern team with full control of, and visibility into, its digital environment.

It was the first day in my new job. I had just been introduced by my manager, the company CFO, as the head of IT to my new team. The time was just right. The company was growing, there were already more than 600 users in the Active Directory, and the company business relied more and more on modern technologies, such as automation and SaaS applications. My team had two people taking care of the server and network environment and three people dealing with user requirements and regular stuff, such as user accounts, new equipment, virus removal and multifunctional printers. The next day I planned one-on-one meetings with my team members. The conclusion was clear. These people worked like firefighters, responding to users who were having IT problems. There was no time to work on the strategic development of IT and adopting new technologies.

I became familiar with the company environment. It was Windows heavy with a few Linux-based servers, and a few managers had MacBooks. The employees worked on desktops or laptops. The company network had been delivered two years earlier by an outsourcing partner who still had a valid service contract. My IT department had limited access to the network and limited knowledge of how to manage it. Security was provided by a standard firewall, which also served as the VPN gateway for remote users, and by an antivirus from a major vendor installed on all Windows-based user devices. There was also wifi for the internal network, separated from the guest wifi network.

By the end of the first week it was very clear that I was lacking infrastructure monitoring: standard SNMP/WMI-based monitoring that would give me a single dashboard to understand the availability and utilization of my network, server environment and critical services. I did not really want to go for an open source option. I understood that this would require knowledge and capacity in the team that I did not have at the time, and as infrastructure monitoring is really a commoditized technology, it made more sense to look for a professional solution with reasonable implementation costs, quick time to value/ROI, and professional support. I approached my CFO with a quick presentation demonstrating the benefits, showing the implementation plan and requesting a corresponding budget. My CFO was the right man in the right place. He knew why he had hired me, and my request was approved without much discussion.

Finally, my infrastructure monitoring was in place and properly configured. It took a few weeks, but I could at last present a dashboard showing that all the servers were running properly, that CPU and memory utilization was at a reasonable level, and that internet bandwidth utilization was less than half of the provided capacity. We started to have trending data, so we could plan upgrades of disk storage and server infrastructure. As a benefit to the business, I could present availability reports for critical systems and services, and show how IT actually contributed to the success of the company.

Additionally, we had alerts, so I did not need to wait until users called us. As time went on, I encountered a type of alert we had not seen before: our internet uplink was fully saturated. I knew something strange was going on. My team was on the case, trying to figure out what was happening and what the root cause was. And users were already calling us. My infrastructure monitoring tool collected network telemetry, including flow data from the internet router, but as this was just before the firewall we could see only our public IP address uploading a lot of data to the internet. Firewall logs did not help us either, and we did not have a tool in place to analyze them anyway. We tried to track it through bandwidth utilization on our switches, but this simply did not work, as we could not distinguish between internal and internet traffic. The situation got back to normal in an hour, but my frustration persisted for a few days and I did not have any answers for my manager. And I knew that it could happen again. I did not know, though, just how bad it was soon going to get.
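To illustrate why flow data from that observation point could not name the culprit, here is a minimal Python sketch. It assumes the firewall performs address translation (NAT), which the story implies but does not state outright. The flow records, addresses and byte counts are invented for illustration, and the helper `top_uploaders` is hypothetical, not part of any monitoring product.

```python
from collections import Counter

def top_uploaders(flows):
    """Sum uploaded bytes per source address, largest first."""
    totals = Counter()
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    return totals.most_common()

# The same three uploads as they would look on the internal side of the
# firewall (all values invented; addresses come from documentation ranges).
inside_firewall = [
    ("10.0.5.23", "198.51.100.7", 9_000_000),
    ("10.0.5.23", "198.51.100.7", 6_000_000),
    ("10.0.8.41", "198.51.100.9", 200_000),
]

# Assumption: NAT rewrites every internal source to one public address,
# so this is what an exporter on the internet side would report.
PUBLIC_IP = "203.0.113.10"
outside_firewall = [(PUBLIC_IP, dst, nbytes) for _src, dst, nbytes in inside_firewall]

print(top_uploaders(outside_firewall))  # one entry: the public address
print(top_uploaders(inside_firewall))   # the individual internal hosts
```

Seen from the internet side, all uploads collapse into a single source, which matches what we observed. Only an observation point inside the network can attribute the traffic to an individual device.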

01-infra.png

On Friday morning everything looked normal. Our infrastructure and service dashboard was green. I was exploring options for getting rid of some legacy stuff and taking advantage of software as a service consumed from the cloud. It was one of the priorities we had agreed on with the CFO for the next year. Suddenly, we received a user complaint that one folder on the shared disk storage looked strange and files could not be opened. I acted immediately and we disconnected the data storage. We soon confirmed that we had been hit by a ransomware attack. The impact did not look that critical, as we knew we could restore the data from the nightly backup, but it would probably be a weekend job. On the other hand, we had to take down the local data storage, which affected the productivity of around 100 users for the whole day. The most important thing was to find the source. We knew that infrastructure monitoring would not help us, and it was really a pity that the antivirus did not catch the infection. We explored the logs on the data storage and just after lunch (which we skipped) we identified the infected laptop. It was removed from the network and reinstalled promptly. Two people from my team took a Saturday shift to recover the system from the attack and restore the data, and I spent my weekend on the phone.

We were lucky to recognize it that early and find it in the logs that were not yet rotated. It was clear to me that we needed a central collection of logs from critical servers but I also understood that it would not help us to prevent such issues in the future. The budget for the log management appliance was approved on Monday without any discussion. In three weeks it was in place and we spent some additional time configuring all the servers and firewall to provide log data. So far so good. On the other hand it was fair to say that our strategic IT development projects suffered from a lack of resources as the team was disturbed by these issues.

During the summer holiday we scheduled an upgrade of our order processing system with the supplier. Based on my experience, I also asked for acceptance tests to make sure that the new version was bug free and performed well on our infrastructure. All this was done and the tests passed. Over the weekend we migrated the data and switched the users to the new system. I was proud that everything worked, but I could not have been more naive. During Monday we received many complaints about the slow response of the system and even about some errors. Our infrastructure monitoring, however, did not show any issues with service availability, and resource utilization on the server was completely normal. On Tuesday we opened a case with the supplier, but as all the acceptance tests had passed and the system was running on our infrastructure, their conclusion was straightforward: sorry, but it is your network. Maybe. But what about the application errors? I was not happy with the conclusion from our supplier, but I lacked the proof.

It was already Thursday, and our company's productivity had suffered: we could process only two-thirds as many orders as before. This led to overtime in the order processing team, and it was the first time I had a very unpleasant conversation with our CFO. But he was right. It was the IT department's responsibility to handle such situations. We were not able to move back to the older version, as one week’s worth of data was already in the new system and there was no downgrade procedure.

While researching a solution to the internet bandwidth utilization issues we had experienced a few times, I had come across NPMD technology. NPMD stands for Network Performance Monitoring and Diagnostics. This technology gives you a breakdown of the network traffic by user, service and application, and helps you to understand bandwidth utilization. But there is much more. The technology also helps you to understand what the network delay is and what the application delay is, which was exactly what we needed right then. I found a vendor of such technology pretty close to us, as our network outsourcing partner had the technology in their portfolio.
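The distinction between network delay and application delay can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular NPMD product computes its metrics: the `Transaction` fields are hypothetical timestamps (in seconds) that a passive probe might record for one TCP transaction, and the sample values are invented.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    """Hypothetical timestamps (seconds) seen by a passive probe."""
    syn: float             # client opens the connection
    syn_ack: float         # server answers
    ack: float             # client completes the handshake
    request_end: float     # last packet of the client request
    response_start: float  # first packet of the server response

def network_delay(t: Transaction) -> float:
    # Full round trip as seen at the probe: SYN -> SYN-ACK covers the
    # server side, SYN-ACK -> ACK covers the client side.
    return t.ack - t.syn

def application_delay(t: Transaction) -> float:
    # Waiting time for the first response packet, minus the server-side
    # round trip, so that only server processing time remains.
    server_side_rtt = t.syn_ack - t.syn
    return max(0.0, (t.response_start - t.request_end) - server_side_rtt)

def likely_bottleneck(t: Transaction) -> str:
    return "application" if application_delay(t) > network_delay(t) else "network"

# Invented example: a 3 ms round trip, but a response that takes seconds.
slow_query = Transaction(syn=0.000, syn_ack=0.002, ack=0.003,
                         request_end=0.010, response_start=2.512)
print(likely_bottleneck(slow_query))
```

With numbers like these, the network contributes a few milliseconds while the server takes seconds to answer, which is the kind of evidence needed to settle a "it is your network" dispute.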

We agreed on a pilot project with a single goal: to isolate the root cause of the performance issues. Within a week we received a visit from an engineer who brought along a server. He explained to us that it was a network probe with a built-in collector, able to generate and analyze network telemetry. He configured a mirror port on our core switch, connected the probe, and surprised us with very detailed network visibility from the moment we plugged the device into the network. Together we configured specific reports and views for our system. The network delay looked normal, but at times we saw very high values of application delay, especially between the web server and the database server. So it started to become obvious that it was not the network, but that something was wrong with the database.

The engineer had an ace up his sleeve. He configured the application performance monitoring capability, which completely revealed the transactions between the web server and the database server. Some SQL statements took seconds to execute and some of them even failed. We got an outstanding level of visibility, as well as proof of where the problem was. This helped us to resolve the problem with our supplier quickly and get things back to normal.

03-apm.png

We also configured reports to understand internet utilization and provided those reports to the CFO. With such technology, identification of the root cause was a matter of minutes. We were curious what else we could do. The engineer from our partner demonstrated another software module of the system, called Anomaly Detection System. It took us one hour to configure it, and during that configuration we realized that our IP address plan and our inventory records in general should be improved. During a quick overview we learned that two workstations in the finance department were doing extensive cryptocurrency mining due to a malware infection. We also identified several terminals in the labs trying to connect to systems that we had replaced some time ago, so this was an obsolete configuration. We also noted a server that had its management interface available from the public internet through the firewall. This required an update of the firewall rules, as it created a potential entrance to our network. It was amazing what was hidden in the network traffic.
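Two of these findings can be expressed as simple rules over flow records. The Python sketch below is only a toy illustration of that idea and says nothing about how the Anomaly Detection System works internally. The address plan, the retired host, the list of management ports and the function `check_flow` are all assumptions made up for this example.

```python
import ipaddress

# All of the following values are assumptions for illustration only.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # assumed internal address plan
RETIRED_HOSTS = {"10.0.2.50"}                      # hypothetical replaced system
MGMT_PORTS = {22, 3389}                            # assumed management ports (SSH, RDP)

def is_internal(addr: str) -> bool:
    return ipaddress.ip_address(addr) in INTERNAL_NET

def check_flow(src: str, dst: str, dst_port: int) -> list[str]:
    """Return a list of findings for a single flow record."""
    findings = []
    # Obsolete configuration: someone still talks to a replaced system.
    if dst in RETIRED_HOSTS:
        findings.append("connection attempt to retired system")
    # Exposure: an external source reaches an internal management port.
    if not is_internal(src) and is_internal(dst) and dst_port in MGMT_PORTS:
        findings.append("management port reachable from the internet")
    return findings

print(check_flow("10.0.7.12", "10.0.2.50", 445))
print(check_flow("198.51.100.20", "10.0.1.5", 22))
```

Rules like these depend on an accurate address plan and inventory, which is why the configuration exercise exposed the gaps in ours.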

With the results we had achieved, there were enough arguments to approve the purchase of network and application performance monitoring to cover our major needs. Calculating the return on investment in this case was no issue, as it was still very fresh in the memory of the whole management how we had fared after upgrading the order processing system. In fact, the system had already paid for itself. And we planned a budget for the anomaly detection module for the next year. During the two days of training we familiarized ourselves with the system, and I appointed one of my team members as its primary administrator. It was obvious that we were going to get much more value from it, such as understanding the performance of SaaS applications, planning our future bandwidth needs, troubleshooting network connectivity issues, or even extending the deployment into a virtual or public cloud environment in the future, thanks to the flexibility of the solution. Last but not least, we wanted to minimize the number of systems we needed to work with on a daily basis. So we configured logging to our log management server, and SNMP traps for the most important alerts towards our infrastructure monitoring tool, which remained our main single view of our infrastructure, now extended with network traffic and application performance metrics.

You might ask what I deployed. In our case it was PRTG, followed by Logmanager and Flowmon.

This was not the end of my story. There were still open topics for us. We needed to remove employees’ phones and tablets from the internal wifi network and move them to the guest network, as we did not really have control of those devices. We realized that with the adoption of SaaS services we needed a backup internet connection. And we had no visibility into our two remote locations at the time. But it had been one year since I had joined the company, and look at how far we had moved: from being in the dark and having a reactive approach to understanding the environment, having network visibility, and being at least a bit more proactive, and, what is more important, having a vision and a plan for the future. These technologies and their proper use gave us back our time to work on strategic topics, instead of dealing with day-to-day firefighting.