Only 4/7 memory expansion nodes join primary
August 7, 2017 at 2:15 pm #6177
Trying to bring up 8 nodes using the free edition. The primary came up fine, and the first four secondary nodes that I booted joined the backplane. However, the last three are stuck between “Configuring Backplane OK (ID: 14)” and the next line, which should be “Configuring System”.
Why would these nodes fail to join? I started all seven and the first four progressed past this point before I brought up the primary.
Is there a way to look at the list of secondary boards, or add another secondary board once the primary decides it has its fill?
Brown University CCV
August 7, 2017 at 5:14 pm #6178
those 3/7 secondaries that fail to connect – are those always the same 3 (i.e. if you power-cycle all 8 nodes, would the same 3 fail to connect?)
If yes – then I’d try to trace the issue to the hardware. It could be the HCAs, could be the cables, could even be the USB key holding the boot image. You can swap those and see if the problem follows the parts.
If no, then maybe this is a software limit or similar issue; how much memory does each node carry?
and, regardless of the answer to the question above, connectivity issues can also come from the fabric: (1) is there another vSMP instance booting on the same fabric? (2) is the fabric shared with some other cluster nodes (e.g. has another subnet manager running on it)?
Benzi
August 7, 2017 at 5:30 pm #6179
I haven’t tried rebooting yet. Responding to your other questions first:
Fabric is single subnet, FDR for the most part, ~500 nodes.
The primary is on one 36-port Mellanox switch; the other 7 nodes are all on one leaf of an 18-leaf director switch. They are the ones without node numbers below.
node924-927 came up, 928,930,932 did not.
Switch 36 "S-e41d2d0300445400" # "MF0;mlnx20:SX6518/L17/U1" base port 0 lid 5 lmc 0
 "H-0cc47affff5f56e8"(cc47affff5f56e9) # "node917 HCA-1" lid 41 4xFDR
 "H-0cc47affff5f569c"(cc47affff5f569d) # "node918 HCA-1" lid 51 4xFDR
 "H-0cc47affff5f5304"(cc47affff5f5305) # "node919 HCA-1" lid 49 4xFDR
 "H-0cc47affff5f56ec"(cc47affff5f56ed) # "node920 HCA-1" lid 46 4xFDR
 "H-0cc47affff5f7bcc"(cc47affff5f7bcd) # "node921 HCA-1" lid 57 4xFDR
 "H-0cc47affff5f8018"(cc47affff5f8019) # "node922 HCA-1" lid 61 4xFDR
 "H-0cc47affff5f7c88"(cc47affff5f7c89) # "node923 HCA-1" lid 75 4xFDR
 "H-0cc47affff5f7ce4"(cc47affff5f7ce5) # "MT25408 ConnectX Mellanox Technologies" lid 73 4xFDR
 "H-0cc47affff5f800c"(cc47affff5f800d) # "MT25408 ConnectX Mellanox Technologies" lid 42 4xFDR
 "H-0cc47affff5f82c0"(cc47affff5f82c1) # "MT25408 ConnectX Mellanox Technologies" lid 59 4xFDR
 "H-0cc47affff5f82ac"(cc47affff5f82ad) # "MT25408 ConnectX Mellanox Technologies" lid 63 4xFDR
 "H-0cc47affff5f7a54"(cc47affff5f7a55) # "MT25408 ConnectX Mellanox Technologies" lid 79 4xFDR
 "H-0cc47affff5f7d98"(cc47affff5f7d99) # "node929 HCA-1" lid 56 4xFDR
 "H-0cc47affff5f7a5c"(cc47affff5f7a5d) # "MT25408 ConnectX Mellanox Technologies" lid 76 4xFDR
 "H-0cc47affff5f7dc0"(cc47affff5f7dc1) # "node931 HCA-1" lid 54 4xFDR
 "H-0cc47affff5f833c"(cc47affff5f833d) # "MT25408 ConnectX Mellanox Technologies" lid 52 4xFDR
 "H-0cc47affff5f7d90"(cc47affff5f7d91) # "node933 HCA-1" lid 44 4xFDR
 "H-0cc47affff5f7c40"(cc47affff5f7c41) # "node934 HCA-1" lid 74 4xFDR
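For anyone wanting to pick the unnamed ports out of a listing like the one above without reading it by eye, here is a minimal Python sketch. It assumes the line shape shown above (quoted GUID, port GUID in parentheses, quoted description, LID, link type) and treats any description that does not start with `node` as "unnamed"; both are assumptions drawn from this one listing, not a general parser for fabric discovery output. The function names are made up for illustration.

```python
import re

# Line shape assumed from the listing above:
#   "H-<guid>"(<port guid>) # "<description>" lid <n> <link>
LINE_RE = re.compile(
    r'"H-(?P<guid>[0-9a-f]+)"\((?P<port_guid>[0-9a-f]+)\)\s*#\s*'
    r'"(?P<desc>[^"]*)"\s+lid\s+(?P<lid>\d+)\s+(?P<link>\S+)'
)


def normalize_quotes(text):
    """Turn typographic quotes (common after a web copy/paste) into plain ones."""
    for ch in "\u201c\u201d\u2033":
        text = text.replace(ch, '"')
    return text


def parse_hosts(listing):
    """Return one dict per host (HCA) line; switch lines are skipped."""
    hosts = []
    for raw in normalize_quotes(listing).splitlines():
        m = LINE_RE.search(raw)
        if m:
            hosts.append({
                "guid": m["guid"],
                "desc": m["desc"],
                "lid": int(m["lid"]),
                "link": m["link"],
            })
    return hosts


def unnamed_hosts(hosts):
    """Hosts whose description is not a node name (assumption: names start with 'node')."""
    return [h for h in hosts if not h["desc"].startswith("node")]


SAMPLE = '''
 "H-0cc47affff5f7c88"(cc47affff5f7c89) # "node923 HCA-1" lid 75 4xFDR
 "H-0cc47affff5f7ce4"(cc47affff5f7ce5) # "MT25408 ConnectX Mellanox Technologies" lid 73 4xFDR
 "H-0cc47affff5f7d98"(cc47affff5f7d99) # "node929 HCA-1" lid 56 4xFDR
'''

for host in unnamed_hosts(parse_hosts(SAMPLE)):
    print(host["guid"], "lid", host["lid"], host["link"])
```

Every unnamed port in the listing has a LID and reports 4xFDR, so a check like this only confirms that the ports are visible on the fabric; it says nothing about why a node fails to join.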
All the nodes were running GPFS with VERBS RDMA over the same HCAs (integrated on the motherboards) before we pulled them from production to try ScaleMP.
Each node has 128 GB of memory. There is no other vSMP instance here, though we first attempted to boot two other nodes (IBM 3755-M3 quad core with 512 GB of memory) as secondaries; these crashed when booting off the USB sticks. I tried to open a ticket but was told that mixing AMD and Intel was not supported. The crash still occurred when the primary was powered off, so I’m skeptical of this statement.
Perhaps there is something on the primary that remembers these nodes, that I need to clean up?
August 7, 2017 at 6:07 pm #6180
Tried rebooting, and the same four nodes had the same behavior.
I got a message on the primary when I booted it that
vSMP ??? version mismatch
???? ???? 04:00:0#1 =>..:
The DARK RED type on top of the DARK BLUE background
is IMPOSSIBLE TO READ!!!!!!!!
The last .12.14.16 correspond to the missing nodes.
I made all the flash drives from one copy of the latest free download.
I rewrote the ones that had been used in the AMD SMP nodes.
Tried to hit F5 on the primary, but running remote VNC desktop (Fast-X)
and watching the boot using Java console emulation, the keystrokes were
ignored. The machine room is a mile away.
August 7, 2017 at 7:10 pm #6181
Sorry about the color palette. I’ll ask for that to be reproduced, and if we see the same unreadable palette used, request a change to it as an enhancement.
As for the problem at hand – the output you provided suggests that 5 of the nodes share the same IB switch while the other nodes are on a different IB switch (and those switches are interconnected).
1. Could those nodes be on physically different switches? If yes, is each of those switches possibly running a different subnet manager?
2. Are those nodes identical from a hardware perspective? Or are they in a different rack/switch because they are a different generation? If they differ, could that be the issue? If all are the same hardware, could it be that those nodes have different BIOS settings (e.g. some have CPU virtualization support turned on, and some not)?
I would recommend the following:
1. Power all nodes off.
2. Power on only the nodes that are ‘ok’ and connect to the primary. Leave the others powered off.
3. When the nodes finish the aggregation, you should be prompted to “hit ESC to continue with X out of 8 nodes” (or you could wait for that to time out). The system should be able to come up. If it does, we know the SW can operate on that hardware model with your environment/settings, and should then focus on understanding what is different with the other nodes (the most probable suspect would be the different fabric).
Benzi
August 7, 2017 at 7:23 pm #6182
The 7 nodes are identical, there is only one IB fabric. One subnet manager.
If you notice, there are LIDs assigned to all the ports on the shared leaf switch. The 3 that don’t come up are on the same leaf switch as the 4 that do come up.
The 8th node (primary) is newer, on a separate but connected switch, same fabric. Same HCA, same amount of memory, same motherboard, newer processors.
Broadwell vs Haswell.
I would consider moving the primary to another Haswell node, but that would probably entail reissuing the license. I would very much rather not do that.
I am considering rewriting the USB sticks of the three misbehaving nodes.
The Primary node is up and running right now with 5/8 nodes.
I still need to figure out how to get into F5 setup to tell the system
to expose the QDR card in the Primary node for OFED use, so I can mount GPFS.
Would much prefer a system which is serial-console friendly.
-- ddj
August 7, 2017 at 7:49 pm #6183
Does the HCA firmware version need to match exactly?
node1049: fw_ver: 2.35.5100
node924: fw_ver: 2.33.5100
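With more than two nodes, a comparison like the one above gets tedious by hand. The following Python sketch groups nodes by reported firmware version. It assumes one `node: fw_ver: version` line per node, exactly as in the two lines above; how those lines are gathered from each node is not covered here, and `group_by_firmware` is a name made up for illustration.

```python
from collections import defaultdict


def group_by_firmware(report):
    """Map each firmware version to the list of nodes reporting it."""
    groups = defaultdict(list)
    for line in report.splitlines():
        parts = [p.strip() for p in line.split(":")]
        # Expect exactly: <node>, "fw_ver", <version>
        if len(parts) == 3 and parts[1] == "fw_ver":
            groups[parts[2]].append(parts[0])
    return dict(groups)


# The two lines quoted above.
REPORT = """\
node1049: fw_ver: 2.35.5100
node924: fw_ver: 2.33.5100
"""

groups = group_by_firmware(REPORT)
for version, nodes in sorted(groups.items()):
    print(version, ", ".join(nodes))
if len(groups) > 1:
    print("firmware versions differ across nodes")
```

This only reports that the versions differ; whether vSMP Foundation requires them to match is the open question in this post.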
The red lettering says
vSMP Foundation version mismatch
(various versions)
August 7, 2017 at 7:58 pm #6184
On one hand, this would make the error message misleading.
On the other hand, clearly a difference between the nodes is causing an issue – so if possible I’d recommend upgrading the firmware to at least be the same on all nodes (and even better, if all nodes were upgraded to the latest firmware revision available from Mellanox for those HCAs).
August 9, 2017 at 3:12 pm #6185
After a lot of head scratching and experimentation, it turns out that four of the USB thumb drives we were using were of extremely poor quality, with no serial numbers, a Vendor ID of ABCD and a Product ID of 1234. Substituting proper USB sticks made the problem go away.
Hope nobody else runs into this problem…. thanks for all the help over the last few days.
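For anyone checking their own sticks before writing boot images, here is a small Python sketch that flags devices with the placeholder ID described above. It assumes the sticks are plugged into a Linux machine and that you paste in `lsusb` output in its usual `Bus ... Device ...: ID vvvv:pppp description` form. The sample lines and the `suspect_devices` name are made up for illustration; only the ABCD/1234 pair comes from this thread.

```python
import re

# Vendor:product pair reported in this thread for the bad thumb drives.
SUSPECT_IDS = {("abcd", "1234")}

# Matches the "ID vvvv:pppp" field of an lsusb line.
ID_RE = re.compile(r"\bID\s+([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\b")


def suspect_devices(lsusb_output):
    """Return the lsusb lines whose vendor:product ID is in SUSPECT_IDS."""
    flagged = []
    for line in lsusb_output.splitlines():
        m = ID_RE.search(line)
        if m and (m.group(1).lower(), m.group(2).lower()) in SUSPECT_IDS:
            flagged.append(line.strip())
    return flagged


# Made-up sample input; the first description is hypothetical.
SAMPLE = """\
Bus 001 Device 002: ID abcd:1234 Hypothetical generic flash disk
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
"""

for line in suspect_devices(SAMPLE):
    print("suspect:", line)
```

A match is only a hint that a stick reports placeholder identifiers; a stick with a plausible ID can still be bad, so swapping in known-good media remains the real test.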