
COMSOL on cluster is limited to 2 nodes


Hi,

I've been trying to set up a small cluster to run parametric sweeps, using 4 computers, all running Ubuntu 11.10 (kernel 3.0.0-16-generic). Three nodes are built from modified Xubuntu live CDs; the "main" node runs Kubuntu.

I've set up the ssh/mpd communications and have verified that the mpd ring is operating correctly, or at least I think it is: running "comsol mpd ringtest" gives a response in the millisecond range. I've also checked manually that each node can ssh into every other node without being prompted for authentication. The COMSOL install directory is NFS-mounted on each node.
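The manual all-pairs ssh check described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the node file holds one hostname per line (as with "clusternodes"), and the names check_pairs and SSH_CMD are hypothetical, introduced here so the loop can be dry-run with a stand-in for ssh. BatchMode and ConnectTimeout are standard OpenSSH options.

```shell
# SSH_CMD is a hypothetical override hook; by default the real ssh is used.
SSH_CMD="${SSH_CMD:-ssh}"

# check_pairs NODEFILE
# For every ordered pair (src, dst) of distinct hosts, log in to src and
# from there try to log in to dst without any password prompt.
check_pairs() {
    nodefile=$1
    failures=0
    for src in $(cat "$nodefile"); do
        for dst in $(cat "$nodefile"); do
            [ "$src" = "$dst" ] && continue
            # BatchMode=yes makes ssh fail instead of asking for a password.
            if $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "$src" \
                ssh -o BatchMode=yes -o ConnectTimeout=5 "$dst" true </dev/null
            then
                echo "ok   $src -> $dst"
            else
                echo "FAIL $src -> $dst"
                failures=$((failures + 1))
            fi
        done
    done
    # Succeed only if every pair worked.
    [ "$failures" -eq 0 ]
}
```

Usage would be something like "check_pairs clusternodes"; any FAIL line points at a pair that still prompts for authentication or cannot connect.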

If I run the model on only two nodes:

comsol -nn 2 batch -inputfile <file_to_run>

The model runs as expected: two nodes are initialized, and the parameters are distributed across the nodes and evaluated. If I run with -nn greater than 2, the cluster initializes, but I get no log output and nothing appears to be solved. On each active node a "comsollauncher" process is spawned; these processes either run at 100% CPU indefinitely (I let it run for 30 minutes on 4 hardware nodes before quitting) or use no CPU time at all.

Further, if I use only two hardware nodes and try -nn 4 (i.e., two software nodes on each hardware node), the same problem occurs, whereas -nn 2 works correctly. It does not matter which hardware nodes I use; I have checked each combination.

By contrast, if I use just one computer and run with -nn 4, the model runs correctly.
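Since the failing runs hang rather than exit, one way to step through different -nn values without watching each one by hand is to wrap the batch command from above in a time limit. The sketch below assumes GNU coreutils timeout is available; try_nn, COMSOL_CMD, LIMIT and LOGDIR are hypothetical names introduced for illustration, and the 600 second default is an arbitrary choice.

```shell
# Hypothetical knobs, all overridable from the environment.
COMSOL_CMD="${COMSOL_CMD:-comsol}"   # launcher to run
LIMIT="${LIMIT:-600}"                # seconds before a run counts as hung
LOGDIR="${LOGDIR:-.}"                # where per-run logs are written

# try_nn NN INPUTFILE
# Runs "comsol -nn NN batch -inputfile INPUTFILE" under a time limit and
# prints a one-line verdict; all output of the run goes to a log file.
try_nn() {
    nn=$1
    inputfile=$2
    status=0
    # GNU timeout exits with status 124 when the limit is reached.
    timeout "$LIMIT" $COMSOL_CMD -nn "$nn" batch -inputfile "$inputfile" \
        > "$LOGDIR/comsol_nn${nn}.log" 2>&1 || status=$?
    case $status in
        0)   echo "nn=$nn finished" ;;
        124) echo "nn=$nn timed out after ${LIMIT}s" ;;
        *)   echo "nn=$nn exited with status $status" ;;
    esac
}
```

For example, "for n in 1 2 3 4; do try_nn "$n" <file_to_run>; done" would give one verdict per node count. Note that timeout only stops the local process; any "comsollauncher" processes left on other nodes may still need to be cleaned up by hand.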

Has anyone encountered a problem like this? Any help would be appreciated.

Other information:

MPD is launched with:
comsol -nn 4 mpd boot -v -r ssh -f clusternodes
where "clusternodes" is the file containing the hostname of each node

Running:
comsol -nn 4 server
reproduces the same problem as described above

Output of "comsol mpd trace":

$ comsol mpd trace
Precision390-Ubuntu
livecluster01
livecluster03
livecluster02

Output of "comsol mpd ringtest":

$ comsol mpd ringtest
time for 1 loops = 0.00281715393066 seconds

COMSOL version is the latest: 4.2.1.166

Each node is connected via a 100/1000 Ethernet switch, which allows connection to the license manager.



6 Replies Last Post Apr 6, 2012, 8:14 a.m. EDT
Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago Feb 24, 2012, 11:09 p.m. EST
Do you have the cluster module installed along with the rest of the COMSOL installation? Do you have the floating network license?

We have only installed on a RHEL cluster here. However, we have wondered about setting up a pseudo-cluster with available computers in the domain, similar to what you are doing. Perhaps there is something special about the RHEL cluster that is missing?

Posted: 1 decade ago Feb 25, 2012, 9:37 a.m. EST

[QUOTE]
Do you have the cluster module installed along with the rest of the COMSOL installation? Do you have the floating network license?

We have only installed on a RHEL cluster here. However, we have wondered about setting a pseudo-cluster with available computers in the domain, similar to what you are doing. Perhaps there is something special about the RHEL cluster that is missing?
[/QUOTE]


Yes, we have a floating network license. I should have mentioned before that I have used a RHEL cluster before with much success, and actually, I have the same model I've been trying to run locally submitted to the job queue at that cluster. The job queue for the RHEL cluster is ridiculous right now, hence me trying to construct my own.

I'm quite sure that cluster support is installed. Is that even a module? I thought that cluster support was included by default now. Either way, the install works on the RHEL cluster so I don't believe this to be the problem, though I will check.

Anyway, I figure that I am missing something that the RHEL cluster has and that I haven't set up correctly, though for the life of me, I can't figure out what that is. As I stated before, my local cluster works when using only two hardware nodes, regardless of which ones, and fails when using more than two. This leads me to believe that the license and install are correct. I'll investigate more on Monday, focusing on the NFS mounts.

Thanks for the reply.
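One possible starting point for the NFS check mentioned above is to confirm that every node sees an identical copy of some file inside the shared COMSOL install directory. The sketch below compares md5sum output gathered over ssh; same_on_all and SSH_CMD are hypothetical names, the node file is assumed to hold one hostname per line, and the file path is whatever file under the NFS mount you choose to probe.

```shell
# SSH_CMD is a hypothetical override hook; by default the real ssh is used.
SSH_CMD="${SSH_CMD:-ssh}"

# same_on_all NODEFILE PATH
# Prints the distinct checksums of PATH as seen from each host and
# succeeds only if every host could read the file and all sums agree.
same_on_all() {
    nodefile=$1
    path=$2
    distinct=$(
        for host in $(cat "$nodefile"); do
            sum=$($SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "$host" \
                md5sum "$path" </dev/null | cut -d' ' -f1)
            # An unreadable file gets a per-host marker so it never matches.
            echo "${sum:-unreadable-on-$host}"
        done | sort -u
    )
    printf '%s\n' "$distinct"
    case $distinct in
        *unreadable-on-*) return 1 ;;
    esac
    # Exactly one distinct checksum means all nodes see the same file.
    [ "$(printf '%s\n' "$distinct" | grep -c .)" -eq 1 ]
}
```

A matching checksum on all nodes only shows that the mount is present and consistent for that one file; it does not rule out other NFS issues such as locking or permissions.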

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 1 decade ago Feb 25, 2012, 10:37 a.m. EST
Kyle, I don't remember the cluster mode being installed by default. It only comes with the floating network license. If you look at the list of modules to be installed from the DVD right before the actual install step (the installer has read your license and uses it to pick what gets installed), you will see the cluster mode in the list. Be sure that it is there, or cluster mode will not work. You can also see cluster mode listed in the license file itself, although that file is difficult for humans to interpret.

Also, there is an open-source equivalent of RHEL called CentOS that also includes the cluster setup. I have a colleague in academia who uses it with COMSOL in parallel, and it works fine. Because it is open source with nothing proprietary, there may be some information there on what it takes to come up with a setup equivalent to RHEL. If you figure it out, please document what you do. I think it would be valuable to lots of folks who cannot spend the money or training on RHEL.

Posted: 1 decade ago Apr 5, 2012, 5:55 p.m. EDT
Hi Kyle,

Sorry, I don't have an answer to your question. Quite the opposite, I have a new question. I'm planning to build a similar thing in my lab with 4 Ubuntu systems. Do you have any resource or tutorial I could use to guide me through the setup process? It would be of great help. Thank you in advance and good luck.

Posted: 1 decade ago Apr 6, 2012, 8:00 a.m. EDT

[QUOTE]
Hi Kyle,

Sorry I don't have an answer to your question. Quite the opposite, I have a new question. I'm planning to build a similar thing in my lab with 4 Ubuntu systems. Do you have any resource/tutorial I could use to guide me through the set up process? It would be of great help. Thank you in advance and good luck
[/QUOTE]


Hi Roger,

Unfortunately, I never found a good tutorial on what needed to be done, so I ended up piecing almost everything together myself, and I can barely understand my notes file. I'd be happy to help; however, I'm quite busy now (finishing up my doctorate) and will be for two more weeks.

I have this thread tagged, so hopefully I'll remember to get back to you when my load has lightened. I hope you can wait that long.

Posted: 1 decade ago Apr 6, 2012, 8:14 a.m. EDT
Thanks Kyle. I am just in the second year of my PhD and still have 2 more ahead, so I can still waste time. I started tinkering with the cluster configuration. I might end up writing a tutorial if everything goes as expected. Thanks again and good luck with your thesis.
