Boosting Enterprise IT to the Next Level

It was the first day in my new job as the CIO of a well-established media company. The CIO’s role was new in the company and had been created as part of the new company strategy “digital first”. I reported directly to the CEO and had the mandate and budget to introduce changes and onboard new solutions. One of the management concerns was security, especially after recent ransomware attacks on various industry verticals. So, it was time to become familiar with the IT team and company infrastructure.

Setup

The IT team was not formally structured. There were lead senior engineers for specific domains, such as network management, server/storage/virtualization, and general support for users and endpoints. The IT department was trying to follow best practices based on ITIL to keep track of assets, services, user requests and configuration changes. The network team took care of Cisco switches and was also responsible for firewalls that included integrated intrusion detection features. The server/storage/virtualization team took care of compute nodes and resources needed to operate servers and run company applications. There were no formal security procedures or practices. The support team worked according to requests tracked in Jira to deal with user-related issues, and was responsible for Active Directory and endpoint security.

There was a sprawl of infrastructure monitoring tools in place. The network team had its own monitoring, based on Nagios and additional open source tools, to ensure smooth network operations. Servers and services were monitored by the SolarWinds toolset, which was used by the server/storage/virtualization team. Responsibility for security of the whole digital environment was not formalized; it was spread among the teams without any governance, and proper tools were absent. There was a partial implementation of log management that collected logs from server infrastructure. There were no tools for threat hunting in the company environment, and individual tools were isolated.

Let's do a little security test

Having all the information, I decided to conduct a simple test and invited an ethical hacker I knew to scan the internal environment and look for potential security issues. We discovered various vulnerable services that would be exposed to attack once malware got past the perimeter protection or bypassed it. However, my main concern was that nobody noticed or reported this activity. In other words, we were not able to recognize such activities in our internal network. When I asked the network team to analyze the network activity of a specific IP address, they figured out that some kind of malicious activity had happened but could not provide sufficient detail to track the attacker’s steps. It was clear to me that we lacked network monitoring and anomaly detection, as well as vulnerability assessment that would help us improve and prioritize security patching.
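
To make the gap concrete, here is a minimal sketch of the kind of check that would have flagged the scan. It assumes flow records (for example, exported via NetFlow or IPFIX) have already been reduced to a simple source/destination/port form. The `Flow` schema, the `find_scanners` function and the threshold of 100 targets are illustrative assumptions, not part of any specific product.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Flow(NamedTuple):
    """One simplified flow record (hypothetical schema)."""
    src_ip: str
    dst_ip: str
    dst_port: int


def find_scanners(flows: Iterable[Flow], max_targets: int = 100) -> dict[str, int]:
    """Return sources that contacted more than max_targets distinct (host, port) pairs."""
    targets: dict[str, set[tuple[str, int]]] = defaultdict(set)
    for flow in flows:
        targets[flow.src_ip].add((flow.dst_ip, flow.dst_port))
    return {src: len(seen) for src, seen in targets.items() if len(seen) > max_targets}


if __name__ == "__main__":
    # Synthetic data: one workstation sweeping 200 ports, one behaving normally.
    flows = [Flow("10.0.0.99", "10.0.1.5", port) for port in range(1, 201)]
    flows += [Flow("10.0.0.10", "10.0.1.5", 443)] * 50
    print(find_scanners(flows))
```

A real anomaly detection system would apply this per time window and combine it with other signals; the point here is only that the raw data needed for detection is modest.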

Let's check our troubleshooting and root cause analysis capabilities

A typical example showing the need for network traffic monitoring and analysis was troubleshooting and root cause analysis. I remember a case where a user complained about not being able to reach several internal systems, despite the internet working completely fine. Instead of examining the user’s network sessions, we had to analyze the device and its configuration and test connectivity to different systems manually. Finally, it turned out the device had a manually configured, obsolete DNS server that was no longer able to resolve the local domain, while internet domains worked fine. With proper network performance monitoring and diagnostics we could have seen immediately that the device was not properly translating domain names to IP addresses and that in fact no sessions were being established towards internal systems. With the proper setup we would even have been able to detect and alert on such situations immediately.
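
As an illustration of how such a misconfiguration could be caught from traffic data alone, the following sketch flags clients that send DNS queries to servers outside an approved list. The `Flow` schema, the `APPROVED_DNS` addresses and the function name are hypothetical; a real deployment would take these from its own flow collector and network documentation.

```python
from typing import Iterable, NamedTuple

# Hypothetical addresses of the resolvers that clients are supposed to use.
APPROVED_DNS = frozenset({"10.0.0.53", "10.0.0.54"})


class Flow(NamedTuple):
    """One simplified flow record (hypothetical schema)."""
    src_ip: str
    dst_ip: str
    dst_port: int


def clients_using_unapproved_dns(
    flows: Iterable[Flow], approved: frozenset[str] = APPROVED_DNS
) -> dict[str, set[str]]:
    """Map each client to the unapproved DNS servers it sent traffic to on port 53."""
    offenders: dict[str, set[str]] = {}
    for flow in flows:
        if flow.dst_port == 53 and flow.dst_ip not in approved:
            offenders.setdefault(flow.src_ip, set()).add(flow.dst_ip)
    return offenders


if __name__ == "__main__":
    flows = [
        Flow("10.0.2.15", "10.0.0.53", 53),     # correctly configured client
        Flow("10.0.2.77", "192.168.10.1", 53),  # client pointing at an old server
        Flow("10.0.2.77", "10.0.1.20", 443),    # non-DNS traffic is ignored
    ]
    print(clients_using_unapproved_dns(flows))
```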

Next, I focused on how we tracked and reported on user experience and what kind of metrics we collected to understand application performance for our major systems. I was able to get reports on service availability and response times from synthetic monitoring right away. But nobody was able to confirm the real user experience based on monitoring the ongoing interactions of users with systems. So, I asked the team how they dealt with user complaints about application performance. The typical answer was that it was probably the network and that it was not happening all the time. So, this was my next argument for network monitoring that would also track real network and application performance.

One day the status quo was disrupted by an upgrade of our internal information system deployed over the weekend. While everything worked well in the sandbox environment, the production system experienced a significant deterioration of application response times. The issue got attention from top management, as users were complaining severely. I had to prioritize and involve engineers from the system supplier to help remediate the situation. My team spent a couple of days working on the issue. At the same time I knew that we could have handled, and maybe even prevented, the issue had we had real user experience monitoring tools in place.
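
A simple way to picture what real user experience monitoring adds here is a comparison of measured response times before and after a release. The sketch below uses a nearest-rank 95th percentile and a tolerance factor of 1.5. Both choices, as well as the function names, are assumptions made for illustration and not a description of how any particular tool works.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct between 1 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def regressed(baseline_ms: Sequence[float], current_ms: Sequence[float],
              tolerance: float = 1.5) -> bool:
    """True if the current p95 response time exceeds the baseline p95 by the tolerance factor."""
    return percentile(current_ms, 95) > tolerance * percentile(baseline_ms, 95)


if __name__ == "__main__":
    # Synthetic response times in milliseconds, before and after an upgrade.
    before = list(range(1, 101))
    after = [3 * value for value in before]
    print(regressed(before, after))
```

Run against measurements from the first hours after a deployment, a check like this would raise the alarm before the complaints reach management.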

Of course, we experienced various incidents over time, ranging from troubleshooting users who could not connect to applications, to infected laptops that encrypted data on shared drives. The work of the IT team on those issues was hard, mainly because we had no tools to help us, no single source of truth, and no proactive tooling in place.

Another day we received a user complaint that one of the folders on the shared disk storage looked strange and files could not be opened. I acted immediately, and we disconnected the data storage. We soon confirmed that we had been hit by a ransomware attack. The impact did not look that critical as we knew we could restore the data from the nightly backup, but it was probably going to be a night shift. It was crucial to find the source. It took us some time to analyze the logs on the data storage, and after four hours we identified the infected laptop. It was removed from the network and reinstalled promptly. Two people from my team took a shift to recover the network from the attack. Once again it showed us the importance of proper handling of logs from all the systems. Analysis by downloading the logs and working with them offline took too much time, and with network visibility and anomaly detection we would even have been able to recognize the activity automatically.
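
The four hours of manual log analysis essentially boiled down to one question: which client modified the most files? With logs collected centrally, that question can be answered in seconds. The sketch below assumes a file-access log exported as CSV with `timestamp`, `client_ip`, `operation` and `path` columns. This format and the operation names are hypothetical, since actual storage systems log in their own formats.

```python
import csv
import io
from collections import Counter
from typing import TextIO

# Operations that change data; the names are assumptions about the log format.
MODIFYING_OPS = {"write", "rename", "delete"}


def rank_clients_by_modifications(log_file: TextIO) -> list[tuple[str, int]]:
    """Count modifying operations per client, most active client first."""
    counts: Counter[str] = Counter()
    for row in csv.DictReader(log_file):
        if row["operation"] in MODIFYING_OPS:
            counts[row["client_ip"]] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = io.StringIO(
        "timestamp,client_ip,operation,path\n"
        "2024-01-01T10:00:00,10.0.0.7,write,/share/a.docx\n"
        "2024-01-01T10:00:01,10.0.0.7,rename,/share/a.docx\n"
        "2024-01-01T10:00:02,10.0.0.8,read,/share/b.docx\n"
        "2024-01-01T10:00:03,10.0.0.8,write,/share/b.docx\n"
    )
    print(rank_clients_by_modifications(sample))
```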

Preparing a plan

After a few months I was able to come up with a short-term plan. The following priorities had to be addressed:

Make infrastructure monitoring a single, consolidated system
Deploy a central log management system and systematically collect all log data
Implement a network and application performance monitoring and diagnostic toolset
Implement network traffic analysis and anomaly detection
Put a regular vulnerability assessment in place
Train the IT team in the knowledge and skills needed

The next step would focus on the following areas:

Expand monitoring capabilities into the public cloud environment
Evaluate the benefits of SIEM to provide event correlation across the infrastructure
Implement network access control for company infrastructure and BYOD management

Consolidating IT monitoring tools

To consolidate infrastructure monitoring, we chose to continue with the SolarWinds platform, which fulfilled the current requirements, could replace Nagios, and was ready for the future needs of public cloud. There was a variety of log management systems on the market, but again consolidating different tools was imperative, so we decided to extend our infrastructure monitoring with log analytics from the same vendor. Network and application performance monitoring and diagnostics was the domain of vendors such as Riverbed or Netscout, which focused on the needs of large enterprises.

For network anomaly detection I could deploy the Cisco Stealthwatch platform or look at newer vendors, such as Darktrace or Vectra. Surprisingly, I did not choose any of them. My idea was again to consolidate and connect network and security operations. I did not want to spend time and effort on integration, so my choice was Flowmon, a single platform connecting network and security operations. Another reason was its broad support for network traffic monitoring in public cloud. For vulnerability assessment Qualys would do the job. Last but not least was knowledge and skills. My team had my trust to handle all the new tooling. In fact, they were really good technicians. They had simply been applying the wrong tools to the infrastructure under their command.

Next on my list was a proof of concept for SIEM. I knew that proper SIEM deployment and use would not happen without growing the team. In return we would get a security umbrella over all the tools we had in place.