The Perils and Pitfalls of Scientific Programming

As I sit here writing this, I’m waiting on some custom software to finish running on my other computer. That software is still in a testing phase, and the current run will tell me whether I have finally found the source of a persistent positive bias in its output. While waiting, I began to think about adding a graphical user interface to handle the input of the code’s various parameters instead of relying on a parameter file. This then got me thinking about how the current state of scientific software is likely a barrier to many who might otherwise enter a specific field of research.

Let me start by talking briefly about a colleague from graduate school. She was immensely intelligent and was always able to figure out how to use the software she needed for her research, but doing so often stalled her progress and caused a lot of undue frustration. Even I, someone who is fairly computer literate, often have a difficult time getting someone else’s software to run properly. Since I know how to program (at least to some extent), it’s often easier to write my own code instead.

Now, there are arguments to be made that writing your own code is actually a good thing. First and foremost, it means that you actually understand what the code does, and you can trust the results (assuming you don’t have any bugs in the code). Second, it means you have a strong working knowledge of the science behind the code. These are positive things, and very important. However, there is a lot of open source science software out there (especially in the physics world), so as long as you can read the code you can figure out what it’s doing, and you can learn the theory behind it without having to write it yourself.

So, why isn’t it easier to use some standardized scientific software? Let’s take a look at a code that has been used by a large number of people in the field of astronomy/astrophysics, GADGET2. This is software that is capable of running simulations of large numbers of particles under the influence of gravity and hydrodynamics on even modest hardware. I personally used it for one of my papers, running it on an old quad-core AMD CPU (2.4 GHz) with 8 GB of RAM, so not exactly a supercomputer. However, it took me quite a while to figure out how to get it to install and run properly, since I had to manually set up a configuration file pointing to the locations of many different libraries on my computer. Once it was set up, I had to write parameter files for the different runs of the software. These are barriers to entry.
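To make the parameter-file complaint a bit more concrete, here is a minimal sketch in Python of a reader that lowers this kind of barrier: every parameter has a default, so a new user only writes down what they want to change, and a misspelled name is rejected with a clear message instead of being silently ignored. This is not GADGET2’s actual format or code; the `[run]` section, the parameter names, and the `load_params` function are all hypothetical, chosen only for illustration.

```python
import configparser

# Hypothetical defaults for an imaginary simulation code.
# These names are illustrative and are NOT GADGET2 parameters.
DEFAULTS = {
    "output_dir": "output",
    "num_particles": "100000",
    "box_size": "100.0",
}


def load_params(text):
    """Parse parameter-file text, fill in defaults, and validate the values."""
    parser = configparser.ConfigParser(defaults=DEFAULTS, interpolation=None)
    parser.read_string(text)
    if not parser.has_section("run"):
        parser.add_section("run")  # an empty file means "use all defaults"
    run = parser["run"]

    # Catch typos up front: any name not in DEFAULTS is an error.
    unknown = sorted(set(run) - set(DEFAULTS))
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(unknown))

    # Convert the strings to the types the code expects.
    try:
        params = {
            "output_dir": run.get("output_dir"),
            "num_particles": run.getint("num_particles"),
            "box_size": run.getfloat("box_size"),
        }
    except ValueError as err:
        raise ValueError(f"bad value in parameter file: {err}") from None

    # Basic sanity checks on the physical values.
    if params["num_particles"] <= 0:
        raise ValueError("num_particles must be positive")
    if params["box_size"] <= 0:
        raise ValueError("box_size must be positive")
    return params


if __name__ == "__main__":
    # The user overrides only the box size; everything else is a default.
    print(load_params("[run]\nbox_size = 250.0\n"))
```

With something like this, a parameter file can be a single line long, and a typo such as `box_sise` produces an error naming the offending parameter rather than a run that quietly uses the wrong box size.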

Another problem is the lack of standardized software for some problems, and too much variety in software for other problems. Consider measuring the galaxy power spectrum from data or mock galaxy catalogs. This is something that is done by a lot of different researchers (sorry for the glut of examples associated with cosmology/astrophysics, but since I work in that field they are the ones I know). Even though it is done by so many different researchers, to my knowledge there is no standard software that does this. This means that every paper reporting a galaxy power spectrum uses software written by its authors. At the other extreme, if you need a halo finder to locate objects in an N-body simulation, you’ll have no problem finding software; the difficulty there is choosing among the many options.

When reading most papers these days, I inevitably come across some part of the methods where the authors must have written software, or used something like Mathematica to do some numerical analysis, but they make no specific mention of the software. I’ve done the same: I’ve written papers that relied on custom software to get the results, and only mentioned it briefly, or simply said that the calculation was done numerically.

Clearly, there are problems with scientific software that create barriers to entry for those new to the field, as well as a lot of duplication of effort for those already in it. So, how do we fix it?

The first thing that comes to mind, in order to reduce duplication of effort, is for authors of papers to publish their custom source code, making it available to anyone. There are sites like GitHub that allow you to create a free public repository. Then, when writing your paper, you can still be fairly vague in your methods section (which is a problem for discussion on another day), but when you say you did something numerically, or are discussing how you got a particular result, simply add a footnote saying the software can be found at the given URL. Even if others have trouble compiling your code, they can see exactly what the source code did, and recreate the program much more quickly. This would be a good first step, and is something I think we should all start doing as soon as possible. Personally, I have set up a GitHub repository and will be posting my code for future papers.

However, simply publishing our code doesn’t solve the larger problem of software being a barrier to entry. For published code to be more broadly useful, people need to be able to install and run it easily. This is where more effort will be required on the part of scientists to become much better programmers. I don’t think it’s a secret that scientists write ugly code that would make most professional programmers cringe. Add to that the fact that writing easy-to-install, easy-to-use software is quite difficult, and it will be hard to get people to change. One way forward that I see is for us to reach out to the computer science departments at our universities. We could find students, who are likely much better programmers than we are, to work with us on improving our software, perhaps designing a GUI program that can be easily installed on many different computing platforms (Windows, Mac OS X and at least some popular Linux distributions). This would give the students something to put on their CVs, and perhaps even authorship on a publication, and in the process scientists may start learning to be better programmers.

Of course, we could also take the time to learn to be better programmers ourselves, which I hope all of us make an effort to do as we continue in our careers. However, I realize that we all tend to be quite busy, and finding the time to learn something that is secondary to our research can be difficult. If we work with people who already know more than we do, it should accelerate progress towards removing the barrier to entry that is scientific software.

What has your experience with scientific software been? Is there something in your field that frustrates you? Is the software in your field actually easy to use? Have an idea about how to improve things? Please share in the comments!

