Just yesterday we encountered a problem with one of our VPN Sites that lost its VPN connectivity right when we wanted to go to lunch >;(
As we often encounter glitches with Edges, we suspected the problem on the Edge’s end of the VPN tunnel rather than on our central Check Point VPN firewall cluster. After an hour of fruitless Edge rebooting, vpn tu resets, and removing and re-adding the Edge to VPN Communities, followed by endless policy installations, we noticed a lot of “Unknown SPI” entries in the logs when filtering on the public IPs of both VPN peers:
Searching the Check Point KB, we found sk36375, which states:
Repeated logs may indicate that the relevant kernel tables are full and new VPN-related data cannot be recorded.
However, the SK is not really helpful in explaining how to check whether this is actually the case. But the solution at least names the two FW kernel tables in question: ‘vpn_queues’ and ‘IKE_SA_table’.
So we queried these tables using fw tab (which you might know from ‘fw tab -t connections -s’):
[Expert@GW:5]# fw tab -t IKE_SA_table
localhost:
-------- IKE_SA_table --------
dynamic, id 367, attributes: keep, sync, kbuf 1, expires 3600, , hashsize 8192, implies 366, limit 1200
<00000000, 701daac2, c194d907, 1f72a622, e6c6630c; 031cf004, 00000000, 50f7818c, 00000004; 29233/86400>
<00000000, 23335609, b9cda941, 00425d38, d69deaba; 03b42806, 00000002, 50f7de56, 00000004; 52987/86400>
<00000000, 347a9a25, 29eb6b33, ac789b72, 89ceb056; 03fa0803, 00000002, 50f8100c, 00000004; 65713/86395>
<00000000, d5578d0d, 4519d3b5, 530c0906, d04d4a17; 033fd803, 00000000, 50f7144d, 083cfce0; 1266/1440>
<00000000, 96287d30, 1bf5a67c, ac804d62, a1317827; 03b18803, 00000002, 50f8286d, 00000004; 71954/86394>
<00000000, 96c6dfbd, b91ca581, 3a3bfedf, 84600a18; 03423806, 00000002, 50f80f86, 00000004; 65579/86400>
<00000000, f1d14c1b, 7e634b84, 259f54ad, 202aeabe; 03917006, 00000004, 50f7aed1, 00000004; 40822/86399>
<00000000, 913506bb, c25f8baf, e501cec6, 2b52666f; 03ebc006, 00000004, 50f74909, 00000004; 14766/86399>
<00000000, 2860a0ba, c3570502, 015a820f, 4b3862bb; 03528003, 00000004, 50f7ab03, 00000000; 39848/39905>
<00000000, dcd49307, 9f90921d, 2d67e09a, 13f68898; 03365800, 00000004, 50f7d238, e9f16320; 49885/86396>
<00000000, 0c7dc142, 5796eb81, 553d4a2b, e81efcd8; 03733800, 00000004, 50f7dd32, 00000004; 52695/86399>
<00000000, 3af2a649, 8983c6dd, ba01fdab, 5bd6112b; 03426800, 00000002, 50f7cc02, 00000004; 48295/86400>
<00000000, 55158ac9, c2cbc5f4, 0a742c4d, d69e2e94; 038e8003, 00000004, 50f851c6, 00000004; 82539/86399>
<00000000, bea80f86, b826724a, 7bb0ab12, a01d02c1; 032c9803, 00000004, 50f8529f, 00000004; 82756/86400>
<00000000, 8678ce33, 62c4180e, bd851e3e, 00742416; 03459803, 00000004, 50f82cf4, 00000004; 73113/86399>
<00000000, 272318b6, a718b7ad, 40b73a5b, adf3ba45; 036be003, 00000004, 50f83bfa, 00000004; 76959/86399>
…(1183 More)
[Expert@GW:5]#
As you can see from the header, the table size is limited to 1200 rows, and with the 16 rows shown plus the 1183 more we were at 1199 entries. So our problem really was a filled-up kernel table.
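For a quick fill-level check you do not have to dump the whole table: ‘fw tab -t IKE_SA_table -s’ prints a one-line summary including the current number of values. If you already have a full dump, the limit can also be scraped from the header line. The snippet below is only a sketch of that scraping step, using the header captured above as sample input; on a real gateway you would pipe in live fw tab output instead, and the parsing approach is our own ad-hoc idea, not a Check Point tool.

```shell
#!/bin/sh
# Sketch: pull the entry limit out of an fw tab header line.
# On a gateway you would use real output, roughly:
#   fw tab -t IKE_SA_table | sed -n 's/.*limit \([0-9]*\).*/\1/p'
# Here the header line from the dump above serves as sample input.
header='dynamic, id 367, attributes: keep, sync, kbuf 1, expires 3600, , hashsize 8192, implies 366, limit 1200'
limit=$(printf '%s\n' "$header" | sed -n 's/.*limit \([0-9]*\).*/\1/p')
echo "IKE_SA_table limit: $limit"
```

Comparing that limit against the number of entries in the dump (or the value count from ‘fw tab -s’) tells you immediately whether the table is about to overflow.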
It doesn’t happen often that you search the Check Point KB for a Tracker message, get exactly one hit, and that hit addresses exactly your problem, does it?
So now we were left with figuring out what to do next: since we have maybe 40-50 site-to-site VPNs, we could not imagine why we would need 1200 IPsec SAs for those. If you really have that many VPNs terminated on one gateway, you can open the cluster object in SmartDashboard and raise the simultaneous IPsec SA limit under Capacity Optimization. The value entered there does not translate into the table limit 1:1, but the limit is derived from it, so just play around and see what fits for you.
But for us the issue lay elsewhere. Running “vpn tu” and listing all IPsec and IKE SAs showed that another Edge was constantly building up new SAs and was in fact flooding our table. After removing the Edge from the encryption domain and pushing policy to disable VPN with that device, the table entries dropped to around 200.
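If you are hunting for the flooding peer yourself, counting occurrences per peer IP is the fastest way to make it stand out. Since “vpn tu” is an interactive menu, one practical approach is to export the “Unknown SPI” log lines and count them per source. The snippet below sketches only that counting step; the log format is invented for illustration and a real SmartView Tracker export will look different.

```shell
#!/bin/sh
# Sketch: count "Unknown SPI" log lines per peer IP to spot a flooding device.
# The log format below is made up for illustration (assumption); adapt the
# awk field index to whatever column holds the peer IP in your export.
cat > /tmp/unknown_spi.log <<'EOF'
203.0.113.10 Unknown SPI
203.0.113.10 Unknown SPI
198.51.100.7 Unknown SPI
203.0.113.10 Unknown SPI
EOF
# The peer with the highest count is the likely flooder.
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' /tmp/unknown_spi.log | sort -rn
```

In our case this kind of per-peer tally would have pointed straight at the misbehaving Edge.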
So we exchanged the faulty Edge, which had long been overdue for replacement by an N model anyhow, and our problem was fixed.
I hope this will help someone out there!