On modern processors a cache miss is expensive, so a language and runtime that help a program achieve better data locality can improve its performance.
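As a minimal sketch of what data locality means in practice, the two C functions below compute the same sum over a dense two-dimensional array stored in row-major order. The function names `sum_row_major` and `sum_col_major` are hypothetical and chosen only for this illustration. The first function reads memory in address order, so every fetched cache line is fully used. The second strides through memory by one full row per access, which on large arrays typically causes many more cache misses. How large the difference is depends on the hardware and the array size, so no timing is claimed here.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols array stored in row-major order, visiting the
   elements in the order they are laid out in memory. The inner loop
   touches consecutive addresses (good spatial locality). */
double sum_row_major(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double total = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            total += a[i * cols + j];
    return total;
}

/* Sum the same array column by column. Consecutive accesses in the
   inner loop are cols * sizeof(double) bytes apart, so for large
   arrays successive accesses usually fall on different cache lines
   (poor spatial locality). */
double sum_col_major(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double total = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            total += a[i * cols + j];
    return total;
}
```

Both functions return the same sum and differ only in their memory access pattern, which is the kind of property a locality-aware language or runtime would try to improve without changing the program's meaning.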
We parallelized LM_NS3D, a program for three-dimensional low Mach number flow simulation, using PVM as the message-passing environment. Communication is carefully optimized, and data locality is achieved.