First, note that the diff for the optimized MPICH2 source code is available from ScaleMP, so you can review all of the changes in the code yourself.
The optimizations ScaleMP made to MPICH2 are mostly around the following components:
• Improve the shared-memory CPU affinity control for MPI ranks
• Improve the way shared-memory communication is implemented, especially by aligning data to the larger cache-line size on vSMP Foundation and by taking advantage of Large-Block Copy capabilities
• Take advantage of the available virtual hardware assists provided for asynchronous shared-memory communication – specifically for small messages/indicators
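
To make the alignment point above more concrete, here is a minimal C sketch of a shared-memory flag slot padded to a full cache line, so that small indicators written by different ranks never share a line. This is only an illustration and is not taken from the ScaleMP diff: the names `ASSUMED_LINE_SIZE`, `padded_flag_t`, `post_flag`, and `poll_flag` are hypothetical, and the value 128 is a placeholder rather than the actual line size on vSMP Foundation.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: placeholder value only. The real coherency line size on
 * vSMP Foundation is not stated here; substitute the documented value. */
#define ASSUMED_LINE_SIZE 128

/* Hypothetical per-rank indicator slot. Aligning the member to the line
 * size forces the whole struct to be aligned to, and padded up to, one
 * full line, so adjacent slots in an array never share a line. */
typedef struct {
    alignas(ASSUMED_LINE_SIZE) atomic_uint value;
} padded_flag_t;

/* Compile-time checks that the layout is what the comment above claims. */
static_assert(sizeof(padded_flag_t) == ASSUMED_LINE_SIZE,
              "each slot must occupy exactly one line");
static_assert(alignof(padded_flag_t) == ASSUMED_LINE_SIZE,
              "each slot must start on a line boundary");

/* Sender side: publish a small indicator value with release ordering,
 * so earlier writes to the message payload are visible to the reader. */
static void post_flag(padded_flag_t *slot, unsigned v)
{
    atomic_store_explicit(&slot->value, v, memory_order_release);
}

/* Receiver side: read the indicator with acquire ordering. */
static unsigned poll_flag(padded_flag_t *slot)
{
    return atomic_load_explicit(&slot->value, memory_order_acquire);
}
```

In a real setup, an array of such slots (one per rank) would live in a shared-memory segment; each rank would write only its own slot with `post_flag` and read its peers' slots with `poll_flag`. The larger the line size, the more an unpadded layout costs, because more unrelated flags end up bouncing between nodes together.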