R75.40VS – The Saga continues – VSX Cluster Cisco ARP problem

Hello,

As I promised in R75.40VS – VSX installation Odyssey – My first SK, I will keep you updated on the progress we have made troubleshooting this installation.

The good news: plenty of new problems to troubleshoot and write about! Well, not plenty, but this time an interesting one:

5. Stopping a cluster member (cpstop) leads to Cluster Outage

Yeah right, the good stuff just keeps coming! After a lot of email back and forth, Checkpoint support said they had found an error in the cluster synchronization code and provided us with a hotfix.

As our Checkpoint clusters had worked well in the past, we decided to just stop the passive cluster member, apply the patch and see how it went before deciding whether to fail over and patch the other member as well.

So we issued a cpstop and BOOM, total traffic outage across the firewall cluster…
The passive cluster member was stopped, and the active cluster member went to “Active Attention”, showing that a couple of networks/interfaces were having a problem (excerpt):

vsid 2:
------
Required interfaces: 2
Required secured interfaces: 1

Sync Inbound: UP Outbound: DOWN (21.1 secs) sync(secured), broadcast
wrpj257 UP non sync(non secured), broadcast
wrpj320 UP non sync(non secured), broadcast
eth2-01 Inbound: UP Outbound: DOWN (15.3 secs) non sync(non secured), multicast (eth2-01.1257)
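
For context, this excerpt is the per-VS interface monitoring output. On VSX you would typically pull it roughly like this (a sketch; vsid 2 is taken from the excerpt above):

  vsenv 2           # switch into the affected Virtual System context
  cphaprob state    # overall cluster member states ("Active Attention" etc.)
  cphaprob -a if    # per-interface status, including the sync interface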

This really threw me off. Why would a stopped cluster member interfere with the active one? As Checkpoint support had mentioned problems with the sync mechanisms, I speculated in that direction, but as it turned out the issue was a different one:

After we had spent a Sunday morning in the office applying the hotfix on both cluster members (with downtime), we could finally answer a million questions from Checkpoint support and assure them that we had properly applied their patches. So they agreed to perform a remote session with us.

One thing you have got to love is that Israelis work Sundays! Apparently their weekend runs from Friday to Saturday. Bad if you need support on a Thursday though ;)

So we got to spend another Sunday at the office, and a whole one this time. But it was worth it, as we were able to single out the issue pretty well in the end.

First we had to set up a separate dial-up line, because the tech would get disconnected whenever the firewall went down. But once we had this out of the way, he was able to troubleshoot the issue.

At some point we hit a pretty weird phenomenon: when a tcpdump was running on both cluster members while stopping the passive cluster member, the problem did not appear. Up until now I don’t have a definite answer why that was, but I have some ideas, as I will explain in a bit.
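
For reference, the captures themselves were nothing exotic. A minimal sketch of what we ran on each cluster member (interface name and file path are placeholders, not the exact commands from the session):

  # capture only ARP on the LAN-facing interface and write it to a file for Wireshark
  tcpdump -i eth2-01 -w /var/log/arp-capture.pcap arp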

But now to the interesting part: at some point we captured the ARP traffic on the interface facing the transfer subnet towards the internal network. We examined the capture in Wireshark:

[Screenshot: arp-capture]

Checkpoint-aa:bb:cc = active cluster member
Checkpoint-dd:ee:ff = passive cluster member
Cisco-* = Cisco Nexus routers running HSRP

  • Packets 1-9 show the active Checkpoint cluster member and the Cisco router sending gratuitous ARP packets to each other.
  • Packet 7 is the first interesting packet, as the active cluster member is announcing a MAC address for an internal VSX IP address to the network.
  • Packets 10-11 are weird, as now the passive cluster member also announces an internal VSX IP address to the network.
  • Packets 12-21 are even weirder: the passive cluster member now scans the network for active hosts. Normally ClusterXL does this to determine connectivity. However, this was after issuing a cpstop!
  • Packets 22-23: the Cisco routers respond to the ARP requests of the passive cluster member.
  • Packets 24-26: the active cluster member now asks for the internal VSX IP of the passive cluster member.
  • Packets 27-39: the passive cluster member scans the network for active hosts again (still cpstopped!).
  • Packets 40-47: the active cluster member and the Cisco routers are exchanging gratuitous ARPs again.

So that was what we saw in the captures. But remember: when we dumped the traffic the problem did not occur! So this is a capture that represents an unproblematic cpstop of the passive cluster member.

But why does a running capture on both cluster members prevent the problem from happening? I suspect that packets 40-47 from the above capture could be a side effect of the capture trying to resolve hostnames via DNS, or some other totally unrelated coincidence.
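
If the captures were printing live to the console (that is when tcpdump does reverse DNS lookups by default), one simple way to rule this theory in or out would be to repeat the test with name resolution disabled; a hypothetical re-run, not something we got to try:

  # -n: do not resolve addresses to names, so the capture itself triggers no DNS traffic
  tcpdump -n -i eth2-01 arp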

Unfortunately it was Sunday and no Cisco administrator was available to perform a capture on the Cisco side while we issued a cpstop without running captures. What we were able to do, though, was log into the Cisco routers and watch the ARP table during a problematic cpstop.

That is how we found out what was going on: after issuing a cpstop, the Cisco routers learned the MAC address of the passive cluster member for the VIP of the LAN-facing interface!
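
If you want to repeat that check yourself, something along these lines on the Nexus side should do (the VIP address is just a placeholder):

  ! look up the ARP entry the router currently holds for the cluster VIP
  show ip arp 192.0.2.1
  ! or watch the whole table and filter for the VIP
  show ip arp | include 192.0.2.1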

To make this a bit more graphic, here is a diagram focusing on this error:

[Diagram: cisco mac problem]

After we had identified this, we ended the remote session and the Checkpoint tech took this information and a couple of kernel debugs to Checkpoint R&D.

A couple of days later they told us that the problem is that the Cisco routers learn the MAC address of the passive cluster member from the active interface checks (ARP requests) that it performs.

As a solution they proposed to either “deactivate this feature on Cisco” or to set these kernel parameters on the Checkpoint side:

  • fwha_ips_scanned_at_a_time = 0 (default 5)
  • fwha_monitor_if_link_state = 1 (default 0)

Changing these fw kernel parameters practically dumbs down ClusterXL so that it no longer probes attached networks and instead just goes by the interface link state (up/down).
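
Setting them at runtime looks roughly like this (standard fw ctl syntax; a sketch only, so double-check the parameter names against what support gives you):

  # check the current values first
  fw ctl get int fwha_ips_scanned_at_a_time
  fw ctl get int fwha_monitor_if_link_state
  # apply the values proposed by Checkpoint support
  fw ctl set int fwha_ips_scanned_at_a_time 0
  fw ctl set int fwha_monitor_if_link_state 1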

So we applied these fw kernel parameters and they worked. We can now “cpstop” a cluster member without any outage.

FYI: fw kernel parameters changed at runtime will not survive a reboot. Refer to sk26202 to learn how to set them permanently!
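
In short (a sketch only; sk26202 is the authoritative reference), the permanent settings go into $FWDIR/boot/modules/fwkern.conf, one name=value pair per line with no spaces:

  # $FWDIR/boot/modules/fwkern.conf
  fwha_ips_scanned_at_a_time=0
  fwha_monitor_if_link_state=1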

New problem found, new problem solved. All the other issues described in my last post sadly still exist. We will continue to work on them and I will keep you posted about possible solutions to those problems.

Regards
Sebastian


9 Responses to R75.40VS – The Saga continues – VSX Cluster Cisco ARP problem

  1. Anonymous says:

    Hi Sebastian,
    We have just gone through the exact same procedure you have described above. You have saved us a ton of time… Thanks!!

  2. SebastianB says:

    Hello,

    really nice to hear that!

    We still have not fixed that cluster up 100%. As it turns out, the code for Virtual Routers in R75.40VS may have some flaws. Then again, maybe it’s just our unique setup.

    So we are still waiting for Checkpoint support to produce a fix, and we are thinking about removing Virtual Routers from our configuration altogether.

    If you would like to share a bit more about your setup and your problems with R75.40VS(X), I would be glad to learn about it and get a better picture of the reliability of this version.

    Again I am glad to have helped you.
    Regards
    Sebastian

    • Anonymous says:

      Hi Sebastian, we have a similar setup as per your diagram, minus the Virtual Routers. When we issued cpstop on the standby member, the whole thing just went bad. Will let you know how it goes after we implement the fix/workaround.

  3. Borek says:

    Hey Sebastian, what are the other issues you’ve got with VR or 75.40VS in general? Can you share them with us, or was the main issue the downtime problem after the cpstop on the standby cluster member?

    • SebastianB says:

      Hello,

      Our main issue with 75.40VS was that we could not enable SecureXL, because it would bring down ClusterXL, and that we saw cluster failovers and dropped traffic on policy push.
      Checkpoint support found some “SecureXL tagging” bugs in the Virtual Router code and provided us with several patches that we tested over several months.

      In the end we are now able to enable SecureXL on our firewall Virtual Systems, but not on VS0 and not on the Virtual Router. However, as the firewall VSs are the ones with the highest load, this reduced CPU load by around 10-15%, and we are now able to push policy without problems like 99% of the time.

      The cluster still does not “feel” perfect, but it is working reliably now. From time to time we still see traffic being dropped on policy installation, and the CPU load averages between 50-60% during the day, with an average of 50k connections and maybe 300 Mbit of throughput. Only the Firewall and IPS blades are activated.
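
      If you want to verify per Virtual System which ones actually run accelerated, something like this should work (a sketch; the VS ID is just an example):

        vsenv 2         # switch into the Virtual System context you want to check
        fwaccel stat    # reports whether SecureXL is on or off for that VS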

      I hope this answers your question.
      Regards
      Sebastian

  4. Pingback: R75.40VS – Light at the end of the tunnel? | IT-Unsecurity

  5. Anonymous says:

    Very helpful blog. Thanks, Sebastian, for sharing your experience in such great detail.
    Excellent work.
    -Ashok

  6. Basha says:

    Hi,

    We are facing a weird issue in an R75.40VS environment. After we upgraded to R75.40 and put both appliances into production in Active/Standby mode, we ran into a cluster fluctuation issue: the primary sometimes shows Active/Down while the secondary on the same VSX also shows Active/Down, and traffic sometimes works and sometimes does not.

    We did a remote session with CP support, and at one point they issued the cphaconf set_ccp broadcast command, which calmed the firewall down. But a new problem appeared: 3 VSs are showing Active/Down on both primary and secondary. Support is still analysing the issue but has not come up with a solution so far.

    Any idea on this issue?

    Appreciate your response.

    Thank you.

    • SebastianB says:

      Hello,

      Sadly, I do not have a solution for you.

      After nearly one year of problems with R75.40VS we decided to completely remove it from our environment again.

      CP support did a lot of debug sessions with us and produced a lot of patches, but the cluster never became anywhere near stable for us…

      As to your problem description: we also saw a lot of ClusterXL flapping issues in our VSX cluster. Disabling SecureXL seemed to help, but afterwards the performance was very poor compared to the data sheet specs.

      Just a quick thought about “cphaconf set_ccp broadcast”: did you verify that you set the sync to broadcast for all Virtual Systems and not just VS0?
      A “cphaprob -a if” should show the sync mechanism per VS (broadcast vs. multicast).

      Regards
      Sebastian
