Your first paragraph is generally correct (at a simplified, 30,000 ft level). vSMP Foundation aggregates several x86 nodes into a single virtual system running a single Linux OS, which sees the underlying virtual machine as a NUMA shared-memory system. (I say this is simplified because we actually do much more than mere ccNUMA, but that’s a discussion better suited to a whiteboard.)
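If you want to see what NUMA layout the Linux OS reports, here is a minimal sketch that reads the standard Linux sysfs tree. It is generic Linux, not anything specific to vSMP Foundation; the function names and the `sysfs_root` parameter are just for illustration, and it assumes your kernel exposes `/sys/devices/system/node`.

```python
import os
import re

def list_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found under sysfs_root.

    Returns an empty list if the directory does not exist
    (for example, on a kernel built without NUMA support).
    """
    if not os.path.isdir(sysfs_root):
        return []
    ids = []
    for name in os.listdir(sysfs_root):
        # Node directories are named node0, node1, ...; skip other entries.
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)

def node_cpulist(node_id, sysfs_root="/sys/devices/system/node"):
    """Return the raw cpulist string the kernel reports for one node."""
    path = os.path.join(sysfs_root, f"node{node_id}", "cpulist")
    with open(path) as f:
        return f.read().strip()

if __name__ == "__main__":
    for node in list_numa_nodes():
        print(f"node{node}: cpus {node_cpulist(node)}")
```

Running it prints one line per NUMA node with the logical CPUs the kernel attaches to that node.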
We try to avoid the term CPU, as it may confuse the reader. The terminology we prefer is: processor == socket == physical_silicon_package. If we do use CPU, we mean it in the sense of a logical processing unit, e.g. a core (or an SMT thread, where relevant). You have 8 dual-socket nodes, each socket carrying 4 cores. You did not say how much RAM you have on each node, so for the sake of example I will assume 96 GB RAM per node. If you apply vSMP Foundation Free to this 8-node cluster, the result would be a system with 8 cores (2 sockets x 4 cores, from a single node) and more than 700 GB RAM (8 x 96 GB = 768 GB raw).
The available cores cannot come from across all nodes, because the vSMP Foundation Free license model allows computing and IO resources to be used from one node only, while RAM is aggregated with other (up to 8) nodes. In other words, all but one node are “memory-only” nodes. During installation you designate which of the 8 nodes provides the compute and IO for the cluster.
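To make the arithmetic concrete, here is a rough sketch of the calculation for your cluster. The function name and parameters are made up for illustration, the 96 GB per node is my assumed figure from above, and the RAM result is the raw sum across nodes, not an exact usable number.

```python
def free_license_view(nodes, sockets_per_node, cores_per_socket, ram_gb_per_node):
    """Estimate what the OS sees under the Free license model described above.

    Compute and IO come from a single node; RAM is summed over all nodes.
    Returns (cores, raw_ram_gb). The RAM value is the raw total only.
    """
    # Only one node contributes cores, so the node count does not multiply here.
    cores = sockets_per_node * cores_per_socket
    # Every node, including the memory-only ones, contributes its RAM.
    raw_ram_gb = nodes * ram_gb_per_node
    return cores, raw_ram_gb

if __name__ == "__main__":
    # 8 dual-socket nodes, 4 cores per socket, assumed 96 GB RAM per node.
    cores, raw_ram_gb = free_license_view(8, 2, 4, 96)
    print(f"{cores} cores, {raw_ram_gb} GB raw RAM")  # 8 cores, 768 GB raw RAM
```

Adding more memory-only nodes raises the RAM figure but leaves the core count at 8.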
In terms of performance, you should actually expect similar or better performance compared with a similarly configured hard-wired SMP system. For example, a workload requiring 300 GB RAM is expected to run at similar or better speed than the same workload on a 4-socket system with the same-generation core architecture and 512 GB physically installed, when using the same number of cores. We have a track record of outperforming HW-only SMPs when it comes to large-memory workloads, and you can see many examples at http://scalemp.mywebdev.a2hosted.com/performance.
Of course, when scaling to use compute and RAM on multiple nodes (System Expansion, as opposed to Memory Expansion), actual performance will vary, depending mostly on how the application is implemented. You can see some nice multi-threaded examples at the same URL referred to above.
Hope this helps.