Open Source and Research Collaboration

Hussien Ballouk
12/20/2023
5 min read

The most impactful code I've ever written has zero stars on GitHub. It's a 200-line Python script that helps linguistics researchers analyze language patterns. I built it for a professor at UBC who was manually categorizing thousands of text samples.
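To make that concrete, here's a toy sketch of what a categorizer in that spirit might look like. The categories, keywords, and helper names below are invented for illustration; the real script isn't reproduced here.

```python
from collections import Counter
from pathlib import Path

# Hypothetical keyword lists for illustration only; the real categories
# belonged to the professor's project and are not reproduced here.
CATEGORIES = {
    "formal": {"therefore", "moreover", "furthermore", "hence"},
    "informal": {"gonna", "kinda", "yeah", "stuff"},
}

def categorize(text: str) -> str:
    """Assign a text sample to the category whose keywords appear most often."""
    words = text.lower().split()
    counts = Counter({cat: sum(w in kws for w in words)
                      for cat, kws in CATEGORIES.items()})
    best, hits = counts.most_common(1)[0]
    return best if hits else "uncategorized"

def categorize_corpus(folder: str) -> dict:
    """Categorize every .txt sample in a folder instead of tagging them by hand."""
    return {p.name: categorize(p.read_text(encoding="utf-8"))
            for p in Path(folder).glob("*.txt")}

if __name__ == "__main__":
    print(categorize("Moreover, the results were therefore conclusive."))
```

Nothing about it is clever. Its value came entirely from automating a task someone was doing by hand, thousands of times over.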

The script saved her months of work. She shared it with colleagues. They improved it. Now it's being used by research groups in 12 countries, has spared researchers countless hours of manual labor, and has contributed to dozens of published papers.

None of this would have happened if the code hadn't been open source.

The Hidden Infrastructure of Science

Most people think scientific breakthroughs come from brilliant individuals having eureka moments in labs. The reality is more mundane: science runs on software.

Data analysis software. Simulation software. Statistical packages. Visualization tools. Database systems. Web scrapers. The boring, unglamorous code that turns raw data into insights.

Almost all of this software is open source. Not because researchers are ideologically committed to free software (though some are), but because open source is the only thing that works at the scale and pace of modern research.

When a researcher in Tokyo discovers a new statistical method, researchers in Toronto need to be able to use it next week, not next year. When someone finds a bug in a widely-used analysis package, it needs to be fixed immediately across thousands of research projects.

Proprietary software can't move that fast.

Why Research Software Is Different

Building software for researchers is weird. The requirements are constantly changing because researchers are literally trying to do things that have never been done before. The user base is highly technical but also highly impatient. And everyone has opinions about how things should work.

I learned this the hard way when we open-sourced our research data processing pipeline. Within a week, we had 47 GitHub issues, 12 pull requests, and one very angry email from a professor who insisted our algorithm was "statistically nonsensical" (it wasn't, but his use case was an edge case we hadn't considered).

This level of engagement would be overwhelming for a normal software project. For research software, it's exactly what you want. Every issue is a real researcher with a real problem who's willing to help you solve it.

The Collaboration Multiplier Effect

The best research software projects don't just solve one problem – they create platforms for solving entire classes of problems.

Take scikit-learn, the machine learning library. It started as a simple collection of algorithms for one research group. Now it's used by millions of researchers worldwide and has enabled thousands of research projects that couldn't have existed otherwise.
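Part of what makes it a platform is how little code a typical analysis now takes. Here's a minimal sketch using scikit-learn's public API on a bundled toy dataset; the dataset and model choice are stand-ins, not from any particular study.

```python
# A minimal sketch of the workflow scikit-learn standardizes across fields.
# Synthetic/bundled data and an arbitrary model, purely for illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)           # stand-in for a researcher's own data
model = LogisticRegression(max_iter=1000)   # one of dozens of interchangeable estimators
scores = cross_val_score(model, X, y, cv=5) # the same evaluation idiom everywhere
print(f"mean accuracy: {scores.mean():.3f}")
```

Swap the estimator and the data and the same few lines cover a huge range of published analyses. That interchangeability is what turns a library into a platform rather than a single tool.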

Or Jupyter notebooks, originally built for interactive Python computing in scientific contexts. Now they're the standard way researchers share and reproduce computational work across dozens of fields.

These tools succeeded because they were built in the open, with input from diverse research communities, solving real problems that lots of people shared.

The Real Challenges

Open source research software faces some unique challenges that commercial software doesn't:

The Sustainability Problem

Grad students and postdocs build amazing research tools, then graduate and move on to industry jobs. The software becomes orphaned. Nobody has time to maintain it. Users find bugs that never get fixed.

I've seen brilliant research tools disappear because the original author got a job at Google and didn't have time to review pull requests anymore.

The research community is slowly figuring out sustainable funding models for critical open source infrastructure, but it's still a major problem.

The Documentation Problem

Researchers are great at writing papers. They're terrible at writing documentation. I can't count how many research tools I've found that solve exactly the problem I have, but with README files that assume I already know what the tool does and how to use it.

Good documentation takes time and effort that doesn't count toward academic promotions. The incentives are all wrong.

The "Not Invented Here" Problem

Every research group wants to build their own version of everything. Sometimes this is necessary – you're doing something genuinely novel that requires custom tools. But often it's just because learning someone else's tool is harder than building your own.

This leads to a fragmented ecosystem where there are dozens of tools that solve the same problem slightly differently, instead of one tool that solves it really well.

What Actually Works

After three years of building and maintaining open source research tools, here's what I've learned:

Start with a real problem you actually have. Don't build tools for hypothetical users. Build tools that solve a problem you're personally experiencing.

Make the first version ridiculously simple. Researchers will find ways to use your tool that you never imagined. Start with the simplest possible version that works, then let users tell you what's missing.

Document everything like you're explaining it to yourself six months from now. Because you probably will be. (A short example of what that looks like follows these tips.)

Respond to issues quickly, even if you can't fix them quickly. A "thanks for reporting this, I'll look into it" goes a long way.

Find collaborators who care about sustainability. Individual genius is overrated. Long-term maintenance is underrated.
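On the documentation point, a docstring that states inputs, assumptions, and side effects answers most of the questions future-you will ask. Here's a small hypothetical example; the function and the pipeline it mentions are invented for illustration.

```python
def normalize_counts(counts, total=None):
    """Convert raw token counts into proportions.

    Expects `counts` as a dict mapping token -> int. If `total` is omitted,
    it is computed from the counts themselves (useful when the dict already
    covers the whole sample). Returns a new dict and never modifies the input.
    """
    total = sum(counts.values()) if total is None else total
    if total == 0:
        return {token: 0.0 for token in counts}
    return {token: n / total for token, n in counts.items()}
```

Five lines of docstring, written while the details were fresh, beats an hour of re-reading the code later.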

The Bigger Picture

Open source research software isn't just about making research more efficient (though it does that). It's about making research more democratic.

A graduate student in Nigeria can use the same computational tools as a professor at Harvard. A research group with no budget can access the same algorithms as a well-funded lab. Good ideas can come from anywhere and spread everywhere.

This levels the playing field in ways that are fundamentally important for how science works.

What We're Building

At Trixode Studios, every research tool we build is open source by default. Not because we're trying to save the world, but because open source makes our tools better.

Our users find bugs we miss. They suggest features we wouldn't think of. They adapt our tools for use cases we never considered. They make our software more robust, more useful, and more impactful than we could make it alone.

If you're a researcher reading this, consider open sourcing your next analysis script or data processing tool. Even if it's messy, even if it's specific to your project, even if you think nobody else will use it.

You might be wrong. And even if you're right, the act of preparing code for public release will make it better for your own use.

The future of research is collaborative. Code should be too.
