As I promised in "R75.40VS – VSX installation Odyssey – My first SK", I will keep you updated on the progress we make troubleshooting this installation.
The good news: plenty of new problems to troubleshoot and write about! Well, not plenty, but this time an interesting one:
5. Stopping a cluster member (cpstop) leads to Cluster Outage
Yeah, right, the good stuff just keeps coming! After a lot of email back and forth, Check Point support said that they had found an error in the cluster synchronization code and provided us with a hotfix.
As we had well-working Check Point clusters in the past, we decided to just stop the passive cluster member, apply the patch, and see how it went before deciding whether to fail over and continue patching the other member.
So we issued a cpstop and BOOM: total traffic outage across the firewall cluster…
The passive cluster member was stopped, and the active cluster member went to “Active Attention” and showed that a couple of networks/interfaces were having a problem (excerpt):
vsid 2:
------
Required interfaces: 2
Required secured interfaces: 1
Sync      Inbound: UP   Outbound: DOWN (21.1 secs)   sync(secured), broadcast
wrpj257   UP                                         non sync(non secured), broadcast
wrpj320   UP                                         non sync(non secured), broadcast
eth2-01   Inbound: UP   Outbound: DOWN (15.3 secs)   non sync(non secured), multicast (eth2-01.1257)
This really threw me off. Why would a stopped cluster member interfere with the active cluster member? As Check Point support had mentioned problems with the sync mechanisms, I speculated in that direction, but as it turned out the issue was a different one:
After we spent a Sunday morning in the office applying the hotfix on both cluster members, with downtime, we could finally answer a million questions from Check Point support and assure them that we had properly applied their patches. So they agreed to perform a remote session with us.
One thing you've got to love is that Israelis work Sundays! Apparently their weekend runs from Friday to Saturday. Bad if you need support on Thursday though ;)
So we got to spend another Sunday at the office, and a whole one this time. But it was worth it, as we could narrow down the issue pretty well in the end.
First we had to set up a separate dial-up line, because the tech would get disconnected whenever the firewall went down. Once we had this out of the way, he was able to troubleshoot the issue.
At some point we hit a pretty weird phenomenon: when a tcpdump was running on both cluster members while we stopped the passive cluster member, the problem did not appear. To this day I don’t have a definite answer as to why, but I have some ideas, as I will explain in a bit.
But now to the interesting part: at some point we captured the ARP traffic on the interface facing the transfer subnet that leads to the internal network. We examined the capture in Wireshark:
Checkpoint-aa:bb:cc = active cluster member
Checkpoint-dd:ee:ff = passive cluster member
Cisco-* = Cisco Nexus router running HSRP
- Packets 1-9 show the active Check Point cluster member and the Cisco router sending gratuitous ARP packets to each other.
- Packet 7 is the first interesting packet, as the active cluster member announces a MAC address for an internal VSX IP address to the network.
- Packets 10-11 are weird, as now the passive cluster member also announces an internal VSX IP address to the network.
- Packets 12-21 are even weirder: the passive cluster member now scans the network for active hosts. Normally ClusterXL does this to determine connectivity. However, this was after issuing a cpstop!
- Packets 22-23: the Cisco routers respond to the ARP requests of the passive cluster member.
- Packets 24-26: the active cluster member now asks for the internal VSX IP of the passive cluster member.
- Packets 27-39: the passive cluster member scans the network for active hosts again (still cpstopped!).
- Packets 40-47: the active cluster member and the Cisco routers are exchanging gratuitous ARPs again.
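A walk-through like the one above can also be checked mechanically. The following is only a minimal sketch of that idea: it takes (sender MAC, sender IP) pairs as they might be exported from a capture and flags every IP address that more than one MAC address claims. The MAC and IP values are made-up placeholders, not the addresses from our capture, and the assumption that the passive member uses a shared IP as the ARP sender address is mine.

```python
from collections import defaultdict


def find_conflicting_claims(arp_packets):
    """Return {ip: set_of_macs} for every IP used as ARP sender by more than one MAC."""
    claims = defaultdict(set)
    for sender_mac, sender_ip in arp_packets:
        claims[sender_ip].add(sender_mac)
    return {ip: macs for ip, macs in claims.items() if len(macs) > 1}


# Hypothetical sample data (placeholder MACs, documentation IP range).
ACTIVE_MAC = "02:00:00:aa:bb:cc"   # stands in for the active cluster member
PASSIVE_MAC = "02:00:00:dd:ee:ff"  # stands in for the passive cluster member
ROUTER_MAC = "02:00:00:00:00:01"   # stands in for a Cisco router

SAMPLE = [
    (ACTIVE_MAC, "192.0.2.1"),     # active member announces the shared IP
    (ROUTER_MAC, "192.0.2.254"),   # router announces its own IP
    (PASSIVE_MAC, "192.0.2.1"),    # assumption: passive member probes with the same IP
]

conflicts = find_conflicting_claims(SAMPLE)
for ip, macs in conflicts.items():
    print(f"{ip} is claimed by: {', '.join(sorted(macs))}")
```

Any IP that shows up in the result is a candidate for exactly the kind of confusion on neighbouring devices that we were chasing.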
So that was what we saw in the captures. But remember: when we dumped the traffic, the problem did not occur! So this capture represents an unproblematic cpstop of the passive cluster member.
But why does a running capture on both cluster members prevent the problem from happening? I suspect that packets 40-47 in the capture above could be the result of the capture trying to resolve hostnames via DNS, or of some other totally unrelated coincidence.
Unfortunately it was Sunday, and no Cisco administrator was available to perform a capture on the Cisco side while we issued a cpstop without running captures. What we were able to do, though, was log into the Cisco routers and watch the ARP table during a problematic cpstop.
We then found out what was going on. After we issued a cpstop, the Cisco routers learned the MAC address of the passive cluster member for the VIP of the LAN-facing interface!
To make this a bit more graphic, here is a diagram focusing on this error:
After we identified this, we ended the remote session, and the Check Point tech took this information and a couple of kernel debugs to Check Point R&D.
A couple of days later they told us that the problem is that the Cisco routers learn the MAC address of the passive cluster member from the interface-active checks (ARP requests) it performs.
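To illustrate how a device can pick up a new MAC address from a mere ARP request, here is a small simulation of the generic reception rules from RFC 826: if the sender's IP is already in the ARP table, the entry is overwritten with the sender's MAC, no matter who the request was addressed to. This is a sketch of the textbook algorithm, not of the actual Nexus implementation, and it assumes that the passive member's probes carry the VIP as sender IP. All addresses are placeholders.

```python
def process_arp(cache, my_ips, sender_ip, sender_mac, target_ip):
    """Apply RFC 826 style reception rules to a dict-based ARP cache.

    cache  : dict mapping IP -> MAC (modified in place and returned)
    my_ips : set of IPs owned by the receiving device
    """
    merged = False
    if sender_ip in cache:
        # Known sender IP: refresh the entry, even for requests aimed at others.
        cache[sender_ip] = sender_mac
        merged = True
    if target_ip in my_ips and not merged:
        # The packet is for us and the sender is new: learn it.
        cache[sender_ip] = sender_mac
    return cache


VIP = "192.0.2.1"                  # placeholder for the cluster VIP
ACTIVE_MAC = "02:00:00:aa:bb:cc"   # placeholder for the active member
PASSIVE_MAC = "02:00:00:dd:ee:ff"  # placeholder for the passive member

# The router initially resolves the VIP to the active member.
router_cache = {VIP: ACTIVE_MAC}
router_ips = {"192.0.2.254"}

# Assumption: the stopped passive member probes some host, using the VIP as sender IP.
process_arp(router_cache, router_ips,
            sender_ip=VIP, sender_mac=PASSIVE_MAC, target_ip="192.0.2.50")

# router_cache now maps the VIP to the passive member's MAC.
print(router_cache)
```

Under these assumptions a single broadcast probe is enough to redirect all traffic for the VIP to a member that no longer forwards anything, which matches the outage we saw.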
As a solution they proposed either to “deactivate this feature on Cisco” or to set these kernel parameters on the Check Point side:
- fwha_ips_scanned_at_a_time = 0 (default 5)
- fwha_monitor_if_link_state = 1 (default 0)
What changing these fw kernel parameters does is practically dumb down ClusterXL so that it no longer probes attached networks but instead just relies on interface link state (up/down).
So we applied these fw kernel parameters and they worked. We can now “cpstop” a cluster member without any outage.
FYI: fw kernel parameters changed at runtime will not survive a reboot. Refer to sk26202 to learn how to set them permanently!
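For reference, here is a small helper that renders the two parameters in both forms. It is only a sketch under assumptions: it assumes the usual "fw ctl set int <name> <value>" syntax for runtime changes and the "name=value" line format of fwkern.conf for permanent ones. Please verify both against sk26202 for your version, especially on VSX, before using the output.

```python
# The two parameters and values proposed by support.
PARAMS = {
    "fwha_ips_scanned_at_a_time": 0,   # default 5
    "fwha_monitor_if_link_state": 1,   # default 0
}


def runtime_commands(params):
    """Build the commands for a runtime change (assumed 'fw ctl set int' syntax)."""
    return [f"fw ctl set int {name} {value}" for name, value in params.items()]


def fwkern_lines(params):
    """Build the lines for a permanent change (assumed fwkern.conf 'name=value' format)."""
    return [f"{name}={value}" for name, value in params.items()]


if __name__ == "__main__":
    print("Runtime (lost on reboot):")
    for command in runtime_commands(PARAMS):
        print("  " + command)
    print("Permanent (see sk26202):")
    for line in fwkern_lines(PARAMS):
        print("  " + line)
```

The script only prints text and changes nothing on the gateway, so you can review the result before typing anything on a production cluster member.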
New problem found, new problem solved. All the other issues described in my last post sadly still exist. We will continue to work on them and I will keep you posted about possible solutions to those problems.