Only 4/7 memory expansion nodes join primary



  • #6177

    David
    Participant

    Trying to bring up 8 nodes using the free edition. The primary came up fine, and the first four secondary nodes that I booted joined the backplane. However, the last three are stuck between "Configuring Backplane OK (ID: 14)" and the next line, which should be "Configuring System".

    Why would these nodes fail to join? I started all seven and the first four progressed past this point before I brought up the primary.

    Is there a way to look at the list of secondary boards, or add another secondary board once the primary decides it has its fill?

    Thanks,
    — ddj
    Dave Johnson
    Brown University CCV

    #6178

    Benzi
    Keymaster

    Hi Dave,

    Those 3/7 secondaries that fail to connect: are they always the same 3 (i.e. if you power-cycle all 8 nodes, would the same 3 fail to connect)?
    If yes, then I'd try to trace the issue to the hardware. It could be the HCAs, could be the cables, could even be the USB key holding the boot image. You can swap those and see if the problem follows the parts.
    If no, then maybe this is a software limit or similar issue; how much memory does each node carry?

    And, regardless of the answers above, connectivity issues can also come from the fabric: (1) is there another vSMP instance booting on the same fabric? (2) is the fabric shared with other cluster nodes (e.g. is another subnet manager running on it)?
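    (For point (2), a quick way to check from any node is with the standard infiniband-diags utilities, assuming they are installed; these are generic examples, not commands taken from this setup:)

        sminfo                           # LID, GUID, state and priority of the master subnet manager
        ibstat                           # local HCA port state and the SM LID the port is bound to
        ibnetdiscover | grep ^Switch     # one line per switch chip visible on the fabric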

    Thanks,
    Benzi

    #6179

    David
    Participant

    I haven’t tried rebooting yet. Responding to your other questions first:
    Fabric is single subnet, FDR for the most part, ~500 nodes.
    The primary is on one 36-port Mellanox switch; the other 7 nodes are all on one leaf of an 18-leaf director switch. They are the entries without node names in the listing below: node924-927 came up; 928, 930, and 932 did not.
    Switch 36 "S-e41d2d0300445400" # "MF0;mlnx20:SX6518/L17/U1" base port 0 lid 5 lmc 0
    [1] "H-0cc47affff5f56e8"[1](cc47affff5f56e9) # "node917 HCA-1" lid 41 4xFDR
    [2] "H-0cc47affff5f569c"[1](cc47affff5f569d) # "node918 HCA-1" lid 51 4xFDR
    [3] "H-0cc47affff5f5304"[1](cc47affff5f5305) # "node919 HCA-1" lid 49 4xFDR
    [4] "H-0cc47affff5f56ec"[1](cc47affff5f56ed) # "node920 HCA-1" lid 46 4xFDR
    [5] "H-0cc47affff5f7bcc"[1](cc47affff5f7bcd) # "node921 HCA-1" lid 57 4xFDR
    [6] "H-0cc47affff5f8018"[1](cc47affff5f8019) # "node922 HCA-1" lid 61 4xFDR
    [7] "H-0cc47affff5f7c88"[1](cc47affff5f7c89) # "node923 HCA-1" lid 75 4xFDR
    [8] "H-0cc47affff5f7ce4"[1](cc47affff5f7ce5) # "MT25408 ConnectX Mellanox Technologies" lid 73 4xFDR
    [9] "H-0cc47affff5f800c"[1](cc47affff5f800d) # "MT25408 ConnectX Mellanox Technologies" lid 42 4xFDR
    [10] "H-0cc47affff5f82c0"[1](cc47affff5f82c1) # "MT25408 ConnectX Mellanox Technologies" lid 59 4xFDR
    [11] "H-0cc47affff5f82ac"[1](cc47affff5f82ad) # "MT25408 ConnectX Mellanox Technologies" lid 63 4xFDR
    [12] "H-0cc47affff5f7a54"[1](cc47affff5f7a55) # "MT25408 ConnectX Mellanox Technologies" lid 79 4xFDR
    [13] "H-0cc47affff5f7d98"[1](cc47affff5f7d99) # "node929 HCA-1" lid 56 4xFDR
    [14] "H-0cc47affff5f7a5c"[1](cc47affff5f7a5d) # "MT25408 ConnectX Mellanox Technologies" lid 76 4xFDR
    [15] "H-0cc47affff5f7dc0"[1](cc47affff5f7dc1) # "node931 HCA-1" lid 54 4xFDR
    [16] "H-0cc47affff5f833c"[1](cc47affff5f833d) # "MT25408 ConnectX Mellanox Technologies" lid 52 4xFDR
    [17] "H-0cc47affff5f7d90"[1](cc47affff5f7d91) # "node933 HCA-1" lid 44 4xFDR
    [18] "H-0cc47affff5f7c40"[1](cc47affff5f7c41) # "node934 HCA-1" lid 74 4xFDR
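    (For reference, output in this format comes from ibnetdiscover; to re-check the leaf and the link state of an individual port, something like the following should work, assuming infiniband-diags is installed, with the switch LID 5 taken from the listing above:)

        ibswitches                       # list the switches on the fabric with their GUIDs and LIDs
        ibportstate 5 8                  # query the state of port 8 (one of the unnamed nodes) on the leaf at LID 5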

    All the nodes were running GPFS with VERBS RDMA over the same HCAs (integrated on the motherboards) before we pulled them from production to try ScaleMP.

    Each node has 128 GB of memory. No other vSMP instance here, though we first attempted to boot two other nodes (IBM 3755-M3 quad-core with 512 GB memory) as secondaries; these crashed when booting off the USB sticks. I tried to open a ticket but was told mixing AMD and Intel is not supported. The crash still occurred when the primary was powered off, so I'm skeptical of that explanation.
    Perhaps there is something on the primary that remembers those nodes that I need to clean up?

    #6180

    David
    Participant

    Tried rebooting, and the same four nodes had the same behavior.
    I got a message on the primary when I booted it:
    vSMP ??? version mismatch
    ???? ???? 04:00:0#1 =>..:
    17-33-16-19-33:10.12.14.16

    The DARK RED type on top of the DARK BLUE background is IMPOSSIBLE TO READ!

    The last octets .12, .14, and .16 correspond to the missing nodes.

    I made all the flash drives from one copy of the latest free download.
    I rewrote the ones that had been used in the AMD SMP nodes.
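    (If it helps anyone reproducing the setup: a generic way to write such a raw image to a stick, assuming the free-edition download is a raw USB image, which may not match ScaleMP's own instructions. The image file name is a placeholder and /dev/sdX must be replaced with the actual device:)

        # WARNING: dd overwrites the target device; triple-check /dev/sdX before running
        dd if=vSMP_Free.img of=/dev/sdX bs=4M conv=fsync
        sync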

    I tried to hit F5 on the primary, but running a remote VNC desktop (Fast-X) and watching the boot through Java console emulation, the keystrokes were ignored. The machine room is a mile away.

    #6181

    Benzi
    Keymaster

    Hi Dave,

    Sorry about the color palette. I'll ask for that to be reproduced, and if we see the same unreadable palette, we'll file an enhancement request to change it.

    As for the problem at hand, the output you provided suggests that 5 of the nodes share the same IB switch while the other nodes are on a different IB switch (and those switches are interconnected).
    1. Can it be that those nodes are on physically different switches? If yes, could those switches each be running a different subnet manager?
    2. Are those nodes identical from a hardware perspective, or are they in a different rack/switch because they are a different generation? If they differ, could that be the issue? If all are the same hardware, could it be that those nodes have different BIOS settings (e.g. some have CPU virtualization support turned on and some do not)? A quick check is sketched below.
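    (A minimal way to compare the virtualization setting across nodes from Linux, assuming Intel CPUs; no output from the first command means VT-x is disabled or hidden by the BIOS:)

        grep -o vmx /proc/cpuinfo | sort -u     # prints "vmx" if VT-x is exposed to the OS
        lscpu | grep -i virtualization          # the same information from lscpu's summary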

    I would recommend the following:
    1. Power all nodes off.
    2. Power on only the nodes that are 'ok' and connect to the primary; leave the others powered off.
    3. When the nodes finish the aggregation, you should be prompted to "hit ESC to continue with X out of 8 nodes" (or you could wait for that to time out). The system should be able to come up. If it does, we know the software can operate on that hardware model with your environment/settings, and we should then focus on understanding what is different about the other nodes (the most probable suspect being the different fabric).

    Regards,
    Benzi

    #6182

    David
    Participant

    The 7 nodes are identical, and there is only one IB fabric with one subnet manager.
    If you look at the listing, there are LIDs assigned to all the ports on the shared leaf switch; the 3 that don't come up are on the same leaf switch as the 4 that do.
    The 8th node (the primary) is newer, on a separate but connected switch on the same fabric: same HCA, same amount of memory, same motherboard, newer processors (Broadwell vs. Haswell).

    I would consider moving the primary to another Haswell node, but that would probably entail reissuing the license. I would very much rather not do that.

    I am considering rewriting the USB sticks of the three misbehaving nodes.

    The Primary node is up and running right now with 5/8 nodes.
    I still need to figure out how to get into F5 setup to tell the system
    to expose the QDR card in the Primary node for OFED use, so I can mount GPFS.

    Would much prefer a system which is serial-console friendly.

    — ddj

    #6183

    David
    Participant

    Does the HCA firmware version need to match exactly?

    node1049: fw_ver: 2.35.5100
    node924: fw_ver: 2.33.5100
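    (For anyone comparing their own nodes: the fw_ver value can be read per node with the standard verbs/diagnostic utilities, assuming they are installed; these are generic examples:)

        ibv_devinfo | grep fw_ver         # firmware version of the local HCA (libibverbs utility)
        ibstat | grep -i firmware         # the same value as reported by infiniband-diags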

    The red lettering says
    vSMP Foundation version mismatch
    (various versions)

    #6184

    Benzi
    Keymaster

    On one hand, that would make the error message misleading.
    On the other hand, a difference between the nodes is clearly causing an issue, so if possible I'd recommend upgrading the firmware to at least the same revision on all nodes (better still, upgrade all nodes to the latest firmware revision available from Mellanox for those HCAs). A sketch of that flow is below.
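    (A minimal sketch of the query/upgrade flow with the Mellanox Firmware Tools, assuming MFT is installed on the node; the device name under /dev/mst/ is an example and will differ per system:)

        mst start                                # load the MST driver and create the /dev/mst/ devices
        mst status                               # list the MST device node for the ConnectX HCA
        flint -d /dev/mst/mt4099_pci_cr0 query   # show the current firmware version and PSID
        mlxfwmanager --online -u                 # fetch and apply the latest image from Mellanox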

    #6185

    David
    Participant

    After a lot of head scratching and experimentation, it turned out that four of the USB thumb drives we were using were of extremely poor quality, with no serial numbers, a vendor ID of ABCD, and a product ID of 1234. Substituting proper USB sticks made the problem go away.
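    (For anyone who wants to screen their sticks for the same thing: the generic vendor/product ID pair is easy to spot from Linux; lsusb is part of usbutils, and abcd:1234 is the ID pair mentioned above:)

        lsusb                                    # list USB devices; watch for "ID abcd:1234" entries
        lsusb -v -d abcd:1234 | grep -i iserial  # verbose view; an empty or missing serial is the giveaway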

    Hope nobody else runs into this problem. Thanks for all the help over the last few days.

    — ddj
