R75.40VS – Light at the end of the tunnel?

Hello,

It has been a while since my last post, "R75.40VS – The Saga continues – VSX Cluster Cisco ARP problem".

This is because I wanted to hold this post back until we had either stable operations or another major problem worth blogging about. Gladly, no major problems arose, which means we now have a mostly stable firewall after working on it with Checkpoint support for half a year (yay).

Let me walk you through the last open issues we had with the cluster (mentioned in my first post) and what we did to stabilize it:

  • ClusterXL problems when turning on SecureXL: This was solved by Checkpoint Support through a hotfix based on 3-4 debugs I had to take while running a temporary debug kernel (just temporarily replacing a couple of files and rebooting). In the end we can now enable SecureXL on our firewall Virtual Systems. Strangely enough, cpha still goes down when we enable SecureXL on VS0, but we can live without that.
  • High CPU Load: With SecureXL enabled and only the Firewall and IPS blades running, we now see an average CPU load of around 40% during normal daily operations (around 50,000 connections and 300 Mbit/s throughput). This still sounds too high for my taste, but for now it is running stably, and there might be some room for optimization by disabling IPS protections (though I don't see the point in implementing an IPS and then disabling all protections). I would love to hear some performance comparisons from your environment in the comments!
  • Cluster Failovers on policy push: This behavior accompanied us until two weeks ago; since then we have not seen any unwanted failovers. We have implemented a couple of kernel parameters that Checkpoint suggested to stabilize the cluster in this regard. However, I am not 100% certain whether one specific kernel parameter fixed this problem. Two weeks is not a long time, but it is probably the longest stretch since the beginning that this cluster has worked without any glitch! You can find the exact kernel parameters we have set below.
  • Sync Problems: I did not mention this specifically in my earlier posts. After an unwanted cluster failover we sometimes saw the number of active sessions double or even triple, and it went back to normal levels only very slowly, sometimes over 2-3 hours. We could not see any real connections rising above our normal level, so this pointed to a session synchronization bug. Since our cluster has not failed over unexpectedly in the last two weeks, I have not seen this problem again. Manually initiated failovers also worked as expected without raising the active session count.
    Checkpoint Support told us that they found problems in their syncing code, and I am currently waiting for a hotfix.

So if this last hotfix works and the cluster keeps running as smoothly as it has in the last weeks, it seems that we have finally tamed this beast! I am very glad about this, but at the same time it leaves me with a bitter taste about the maturity of Checkpoint's software!
I got a response from another Checkpoint customer who confirmed these exact same issues in his environment, and Checkpoint Support acknowledged bugs in their code, for which they provided me patches.

I still love the idea/model behind the central Checkpoint Management and VSX Firewall Virtualization; however, I would advise anybody against running R75.40VS in their production environment! Checkpoint Support told me that they incorporated the hotfixes that were provided to me, and more, into R76. If I had the luxury of starting all over again, I would wait for R76.10 (or whatever release follows) and hope that all of these problems are fixed by then.

Kernel Parameters we have now set:

  • fwha_enable_state_machine_by_vs=0 – This was already preset. It disables VSLS.
  • fwha_freeze_state_machine_timeout=60 – This is supposed to freeze the cluster status for 60 seconds after a policy push so that a policy install does not fail over the cluster. I have seen this not work properly at times!
  • fwha_ips_scanned_at_a_time=0 – Disables scanning of IP addresses in attached subnets to assess link functionality (see last post).
  • fwha_monitor_if_link_state=1 – Assesses link status only by physical link state instead of by smart probing (see last post).
  • fwha_enable_early_probing=1 – See sk31665.
  • fwha_skip_first_retrans_req=1 – Eases up the sync retransmissions to prevent a failover under load caused by a few lost sync packets.
  • fwha_cul_mechanism_enable=1 – No deep explanation was given by support. It is supposed to ease failover in high-load scenarios.

Please do not regard this list as a suggestion to implement in your environment. Rather, see it as a starting point if you have similar issues, and discuss these parameters with the Checkpoint support engineer on your case.

FYI: fw kernel parameters changed at runtime will not survive a reboot. Refer to sk26202 to learn how to set them permanently!
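
To make the difference between runtime and permanent settings a bit more concrete, here is a minimal Python sketch that renders the parameter list above in both forms. It only builds text and does not touch a gateway. The `fw ctl set int <name> <value>` syntax and the `$FWDIR/boot/modules/fwkern.conf` file with one `name=value` line per parameter are my understanding of what sk26202 describes, so treat both as assumptions and verify them against the SK for your version. The helper names `runtime_commands` and `fwkern_conf` are made up for this illustration.

```python
# Kernel parameters from the list above, as name -> integer value.
CLUSTER_PARAMS = {
    "fwha_enable_state_machine_by_vs": 0,
    "fwha_freeze_state_machine_timeout": 60,
    "fwha_ips_scanned_at_a_time": 0,
    "fwha_monitor_if_link_state": 1,
    "fwha_enable_early_probing": 1,
    "fwha_skip_first_retrans_req": 1,
    "fwha_cul_mechanism_enable": 1,
}


def runtime_commands(params):
    """Return one 'fw ctl set int' command per parameter.

    Assumption: this is the runtime syntax; values set this way
    are lost on reboot.
    """
    return [f"fw ctl set int {name} {value}" for name, value in params.items()]


def fwkern_conf(params):
    """Return file content with one 'name=value' line per parameter.

    Assumption: this is the format expected in
    $FWDIR/boot/modules/fwkern.conf for settings that persist.
    """
    return "".join(f"{name}={value}\n" for name, value in params.items())


# Print both forms so they can be reviewed before anything is applied.
for command in runtime_commands(CLUSTER_PARAMS):
    print(command)
print(fwkern_conf(CLUSTER_PARAMS), end="")
```

Generating both forms from one dictionary keeps the runtime values and the persistent file from drifting apart, which is easy to get wrong when parameters are changed one at a time during a support case.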

So hopefully this post will actually end this series on R75.40VS. We are going to implement another firewall cluster in the near future. If this VSX cluster performs well until then, I will write a quick post summarizing either how we built another stable R75.40VS cluster or how R76.? (VSX) performs.

As always, I hope this information helps someone out there, and I encourage you to contact me or leave comments with questions regarding the topic.

Greetings
Sebastian


5 Responses to R75.40VS – Light at the end of the tunnel?

  1. Concerning performance, you have to look into it in more detail. How many cores are enabled? How are they assigned to NICs, VSs, etc.? What is actually taking most of the CPU time? Is it a process, system, IRQs?

    • SebastianB says:

      Hey,

      Yeah, performance troubleshooting can be a tricky thing. I actually planned on doing this once the firewall is running stable / surviving policy pushes without problems.

      Sadly the situation is worse again and checkpoint support is not really able to resolve this issue for good…

  2. Anonymous says:

    Great post. Will see if those suggested fixes will make it onto R76.10.

  3. alex says:

    Apparently, Checkpoint published it
    http://dl3.checkpoint.com/paid/34/List_of_relevant_kernel_parameters_for_cluster_flapping_prevention.pdf?HashKey=1378083653_33f2a05fe391349d5fd8d4cb4a5f098b&xtn=.pdf

    I have the feeling R76 might suffer the same fate as R75.40VS.
    Maybe this is fixed in R76.10 & R77??
