Open-Source Learning

Posted on Tue 06 June 2017 in science

New researchers, both graduate and undergraduate students, spend much of their first six months to a year in a research group learning the techniques the lab uses to do its science. This is especially true in computational fields, where numerous software packages (open- and closed-source) are used in specific domains. The challenges of joining a computation-focused research group are exacerbated by the fact that many undergraduate curricula in applied research fields lag behind in the use of software tools. In my experience, relatively few undergraduates in engineering consider themselves to be proficient software developers (CS, CE, etc. aside). Research groups need better ways to teach new scientists software fluency.

Most research groups have some kind of onboarding procedure for new members, which often involves a seemingly insurmountable stack of theory-filled textbooks and literature that is at once too broad and too narrow to solve the introductory problems a new student should be exposed to. Eventually, these piles of literature are eschewed in favor of brute-forcing the initial research tasks with heavy involvement and guidance from senior group members. This approach is wasteful for the student, who may either miss some key theoretical background or dwell too much on theory and be less productive, ultimately delaying their first publication.

The more senior researchers tasked with guiding the new student as they orient themselves towards their research goals also suffer from the inefficiency of this process. Heavily guided introductions by senior group members are a cornerstone of academic culture, and every early-, mid-, and late-career scientist has been a part of this process at one time or another. The passing of knowledge down academic trees is not going away anytime soon, but it can be made more efficient.

Instead of teaching new students skills entirely case by case and in person, each research group can develop a manual that couples theory and practice. Of course, this is not at all a new idea: it has been done many times over the years, and several models have been used for maintaining such documentation with varying degrees of success. In the past several years, better software development practices have become more accessible to scientists and more widely adopted by them. Scientists have readily adopted resources such as GitHub for hosting their code, but GitHub hosts more than code: entire academic journals exist there, as does the DFT book by John Kitchin. In particular, the DFT book and the Hacking Materials group handbook by Anubhav Jain have served as inspiration to develop open guides for starting at the ground level with a new research group.

Thus, I am proposing that other research groups join this effort to open up the onboarding process for new researchers and maintain open documentation on how to get started as a researcher and as a member of the group, institution, or company. I have initiated such a project for the Phases Research Lab at Penn State, where we have an onboarding guide freely available on GitHub. The goal of the repository, called prl-onboard, is to guide new researchers (and possibly new materials scientists) through learning Python in the context of the tools we use in our group: pycalphad for computational thermodynamics and atomate for DFT calculations. Primarily, this is accomplished through a complete 'course' that students can progress through, alternating between reading relevant material and actually using what they are reading about. Other goals are to introduce people to the foundational knowledge required for success in our field and to help them get settled at Penn State. Eventually, we hope that the prl-onboard repo will be useful to people outside of the Phases Research Lab.

Openness is not only about being accessible to people outside of our research group. The increased visibility, ease of navigation, and rich formatting offered by markup languages on services like GitHub can be coupled with a tight feedback loop to ensure that the project doesn't die after the person who initiated the effort leaves or becomes too busy to maintain the resource.

This effort is still in its initial stages at the time of writing, but I hope to address several long-term goals and challenges in future posts. One of these challenges is our reliance on closed-source software (we depend on VASP and Thermo-Calc) and on textbooks for the reading material. We have access to these resources at the group scale, but scaling the material beyond our group, which would make the time it takes to develop it more worthwhile, is currently a major challenge.