Reply To: Only 4/7 memory expansion nodes join primary

David
Participant

I haven't tried rebooting yet. Responding to your other questions first:
The fabric is a single subnet, FDR for the most part, ~500 nodes. The primary is on one 36-port Mellanox switch; the other 7 nodes are all on one leaf of an 18-leaf director switch. They are the entries without node names in the listing below: node924-927 came up, but 928, 930, and 932 did not.
Switch 36 "S-e41d2d0300445400" # "MF0;mlnx20:SX6518/L17/U1" base port 0 lid 5 lmc 0
[1] "H-0cc47affff5f56e8"[1](cc47affff5f56e9) # "node917 HCA-1" lid 41 4xFDR
[2] "H-0cc47affff5f569c"[1](cc47affff5f569d) # "node918 HCA-1" lid 51 4xFDR
[3] "H-0cc47affff5f5304"[1](cc47affff5f5305) # "node919 HCA-1" lid 49 4xFDR
[4] "H-0cc47affff5f56ec"[1](cc47affff5f56ed) # "node920 HCA-1" lid 46 4xFDR
[5] "H-0cc47affff5f7bcc"[1](cc47affff5f7bcd) # "node921 HCA-1" lid 57 4xFDR
[6] "H-0cc47affff5f8018"[1](cc47affff5f8019) # "node922 HCA-1" lid 61 4xFDR
[7] "H-0cc47affff5f7c88"[1](cc47affff5f7c89) # "node923 HCA-1" lid 75 4xFDR
[8] "H-0cc47affff5f7ce4"[1](cc47affff5f7ce5) # "MT25408 ConnectX Mellanox Technologies" lid 73 4xFDR
[9] "H-0cc47affff5f800c"[1](cc47affff5f800d) # "MT25408 ConnectX Mellanox Technologies" lid 42 4xFDR
[10] "H-0cc47affff5f82c0"[1](cc47affff5f82c1) # "MT25408 ConnectX Mellanox Technologies" lid 59 4xFDR
[11] "H-0cc47affff5f82ac"[1](cc47affff5f82ad) # "MT25408 ConnectX Mellanox Technologies" lid 63 4xFDR
[12] "H-0cc47affff5f7a54"[1](cc47affff5f7a55) # "MT25408 ConnectX Mellanox Technologies" lid 79 4xFDR
[13] "H-0cc47affff5f7d98"[1](cc47affff5f7d99) # "node929 HCA-1" lid 56 4xFDR
[14] "H-0cc47affff5f7a5c"[1](cc47affff5f7a5d) # "MT25408 ConnectX Mellanox Technologies" lid 76 4xFDR
[15] "H-0cc47affff5f7dc0"[1](cc47affff5f7dc1) # "node931 HCA-1" lid 54 4xFDR
[16] "H-0cc47affff5f833c"[1](cc47affff5f833d) # "MT25408 ConnectX Mellanox Technologies" lid 52 4xFDR
[17] "H-0cc47affff5f7d90"[1](cc47affff5f7d91) # "node933 HCA-1" lid 44 4xFDR
[18] "H-0cc47affff5f7c40"[1](cc47affff5f7c41) # "node934 HCA-1" lid 74 4xFDR
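
For what it's worth, the unnamed entries can be pulled out and re-checked from any host on the fabric. This is a rough sketch, assuming infiniband-diags is installed; the LIDs are the ones from the listing above:

# list HCAs still carrying the default ConnectX node description
ibhosts | grep 'MT25408'

# query the current node description of each unnamed entry, by LID
for lid in 73 42 59 63 79 76 52; do
    smpquery nodedesc $lid
done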

All the nodes were running GPFS with VERBS RDMA over the same HCAs (integrated on the motherboards) before we pulled them from production to try ScaleMP.
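
(If it's relevant, this is roughly how we'd confirm the GPFS RDMA settings those HCAs were serving; a sketch, assuming the standard GPFS admin commands:)

mmlsconfig verbsRdma
mmlsconfig verbsPorts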

Each node has 128 GB of memory. There is no other vSMP instance here, though we first attempted to boot two other nodes (IBM 3755-M3, quad core, with 512 GB of memory) as secondaries; these crashed when booting off the USB sticks. I tried to open a ticket but was told that mixing AMD and Intel was not supported. The crash still occurred with the primary powered off, so I'm skeptical of that explanation.
Perhaps there is something on the primary that remembers these nodes and needs to be cleaned up?