Bioinformatics Base Basics (Part 3)
Basics of Coding
Introduction: Babbling for Bioinformatics
In the world of bioinformatics, coding is as essential as understanding the biological concepts themselves. If you're new to the field, you might feel overwhelmed by the idea of coding — after all, it can seem like a foreign language even if English is your native language. But think of it this way: just as you learn a new spoken language to communicate with others, learning to code is about communicating with computers. Let's explore this analogy further and demystify the coding process for those interested in bioinformatics.
Basics of Language
Every human language has its own set of rules: grammar, vocabulary, and syntax. Similarly, programming languages have their own syntax and structure. Just as you wouldn't write a letter in a language without following its grammatical rules, coding requires you to adhere to specific syntax so that the computer understands your commands. When you write a program, you are essentially giving the computer instructions. These instructions can be thought of as sentences in a language, but each command must be clear and precise, with no mistakes. In human communication, a misused word or poor pronunciation may still be understood because the other person can guess what you meant. Computer code, by contrast, does exactly what you tell it to. If it is written incorrectly, one of two things happens: the computer refuses to do the task, or it does the task wrongly without telling you. Therefore, to analyse data using a programming language, you need to specify exactly how you want to analyse it, using the correct syntax, step by step. In this way it’s similar to cooking, where you follow a set of instructions step by step to get the end product. In fact, there are lots of coding ‘cookbooks’ which give you recipes for achieving certain goals, such as making graphs that look visually appealing and professional.
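To make the recipe idea concrete, here is a minimal sketch in Python. The sequence is made up purely for illustration, and the variable names are our own choices rather than anything standard. It works out the proportion of G and C bases in a short piece of DNA one step at a time, then shows that a single missing bracket is enough for Python to refuse a line.

```python
# A tiny "recipe": work out the GC content of a DNA sequence, step by step.
sequence = "ATGCGCTA"              # step 1: the ingredient (a made-up sequence)
g_count = sequence.count("G")      # step 2: count the G bases
c_count = sequence.count("C")      # step 3: count the C bases
gc_fraction = (g_count + c_count) / len(sequence)  # step 4: divide by the length
print(gc_fraction)                 # prints 0.5

# The line below is missing its closing bracket. We keep it as text and ask
# Python to check it, so that the rest of this example can still run.
broken_line = 'print("GC content:", gc_fraction'
try:
    compile(broken_line, "<example>", "exec")
    syntax_ok = True
except SyntaxError:
    syntax_ok = False
print(syntax_ok)                   # prints False
```

Each step only does what it was told. If step 3 were left out, Python would still run happily and simply report a smaller number.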
A question you may have at this point is ‘why do we need multiple languages?’. To help understand this, we can use human languages as an analogy. Consider the differences between Braille, sign language, English and even mathematics. Each has its own strengths and weaknesses, which result in specific use cases. For example, although mathematics can express an elegant equation, you still need another language, such as English, to explain what the variables are and what the equation means. On the other hand, Braille allows written communication for blind people and sign language allows visual communication for deaf people. The same is true of programming languages: some are more useful for statistical analysis, whereas others are better for moving and manipulating lots of files around a computer. It is important to note that programming languages often have additional “libraries” that provide more functions than the base package, similar to how downloadable content (DLC) adds extra content to video games.
Code for Humans vs. Code for Computers
One of the critical aspects of coding in bioinformatics is writing code that is readable for humans, not just for computers. If you are just starting out with no coding experience then you may have no real concept of what this means in practice. Put simply, it means that files and commands are written clearly and concisely, in a way that is easy for the person reading the code to understand. Going back to the analogy of the English language, you would write shorter sentences with less punctuation and use common, unambiguous words. Consider the terms ‘bi-weekly’ and ‘bi-monthly’, for example: do these phrases mean twice per week or month, or once every two weeks or months? In reality these terms are used in both ways, which can be very confusing. Therefore, when writing code, make sure that things are explicit and clear so that users don’t need to guess what is meant. Keep in mind that when learning a new spoken language, the clearer and more straightforward your sentences are, the easier it is for others (and yourself) to comprehend. Similarly, when writing code, keep it simple, modular, and reusable. This involves breaking your code into smaller, manageable components, much like using paragraphs to organise ideas in an essay.
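As a rough illustration of the difference, the sketch below writes the same calculation twice in Python. Both versions give the computer identical instructions; only the second is written for a human reader. The function names are hypothetical and chosen just for this example.

```python
# Hard to read: what are f, s, x and i supposed to be?
def f(s):
    x = 0
    for i in s:
        if i in "GC":
            x += 1
    return x / len(s)


# Easier to read: descriptive names, small reusable pieces, short descriptions.
def count_gc_bases(sequence):
    """Return how many bases in the sequence are G or C."""
    return sum(1 for base in sequence if base in "GC")


def gc_fraction(sequence):
    """Return the fraction of bases that are G or C (between 0.0 and 1.0)."""
    return count_gc_bases(sequence) / len(sequence)


print(f("ATGC"))            # prints 0.5
print(gc_fraction("ATGC"))  # prints 0.5
```

Because count_gc_bases is its own small component, it can be reused elsewhere without copying the loop, which is the coding equivalent of organising ideas into paragraphs.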
On top of this, you should write comments in your code that explain what each part does and what the expected behaviour is. By doing this, you will increase the chance that someone (including yourself later on) will be able to understand the code and detect whether it is working properly. If you are new to programming you may not be aware that computers are actually fairly stupid: they don’t always know when a program is not behaving the way you expected it to. It’s true that programs can give you error messages when things don’t work, but you will often encounter ‘silent errors’ which, as the name implies, are errors that the computer doesn’t ‘notice’. Because of this, it is vital that you document and test your code properly.
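Here is a small, hedged example of a silent error in Python, together with the kind of documentation and testing that catches it. The function names and test sequences are invented for illustration. The first function assumes the DNA sequence is written in capital letters, so lowercase input produces a wrong answer with no error message at all.

```python
def gc_fraction_naive(sequence):
    """Fraction of G or C bases. Silently wrong for lowercase input."""
    return sum(1 for base in sequence if base in "GC") / len(sequence)


def gc_fraction(sequence):
    """Return the fraction of G or C bases in a DNA sequence.

    Expected behaviour: upper- and lowercase letters are treated the same,
    and an empty sequence raises a ValueError with a clear message.
    """
    if not sequence:
        raise ValueError("sequence is empty")
    sequence = sequence.upper()  # treat "g" and "G" as the same base
    return sum(1 for base in sequence if base in "GC") / len(sequence)


# No error message appears, but the answer is wrong: a silent error.
print(gc_fraction_naive("atgc"))  # prints 0.0

# Small tests state the expected behaviour and fail loudly if it changes.
assert gc_fraction("ATGC") == 0.5
assert gc_fraction("atgc") == 0.5
```

The comments and the description at the top of gc_fraction say what should happen, and the two assert lines check that it actually does.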
In bioinformatics, you will encounter various programming languages, each tailored for different tasks. For beginners, it’s good to start with either R or Python; the choice often depends on the type of data that you are analysing. In my personal experience, it has been more common to see R taught and used in biology classes, because they involve a lot of statistics and there are many established R packages and pipelines that are regularly used in research. These packages and pipelines are more often than not published to Bioconductor, an open source project and repository for R packages related to the analysis of biological data. The project “aims to develop and share open source software for precise and repeatable analysis of biological data”. Not only are many of the most widely used bioinformatics R packages maintained through this site, it also has great learning resources such as workflows, training, and textbooks, so it’s important to be familiar with it. On the other hand, you could also begin by learning Python. Python is often recommended due to its simplicity and readability. Think of Python as a widely spoken language with many speakers; it's straightforward and has a wealth of resources available for learners. Although Python was historically used less often in biology than R, it is becoming increasingly popular, likely due to high quality packages becoming available for next-generation sequencing data analysis and the increasing prevalence of machine-learning based projects.
In addition to R and Python, understanding the command line and using a command line programming language (such as Bash) is another essential skill in bioinformatics. Normally, people interact with their computers through a graphical user interface (GUI) and click buttons to make the computer perform a task. However, some software doesn’t have a GUI and must be accessed through the command line instead, where you type commands instead of pressing buttons.
While various GUI applications do exist, many bioinformatics tools rely on the command line for efficiency and flexibility. Although it might sound strange at first that typing commands is more convenient than clicking, the command line is well suited to repetitive, large-scale tasks such as renaming all files in a folder (called a directory) and moving or copying them elsewhere, something which would be tedious and time-consuming to do by hand. It might seem a bit intimidating at first but it is incredibly powerful once you get the hang of it. Another thing to consider is that, because of the large file sizes often involved in bioinformatics, it’s common to use a high-performance computing (HPC) system, which is also accessed via the command line (because it is a very powerful computer located somewhere else on campus, with no monitor for you to sit in front of). The default software you use to access the command line depends on the operating system of your computer, e.g. Terminal on macOS and Command Prompt or PowerShell on Windows. It’s important to know that there are multiple command line programming languages, so you need to know which one your computer uses. Even if you're primarily using a GUI, familiarising yourself with the command line will give you a deeper understanding of how software operates under the hood, likely increasing your overall coding ability.
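To give a feel for why typed instructions beat clicking on repetitive jobs, here is a minimal sketch of the renaming task described above. It is written in Python rather than a command line language such as Bash, simply to keep to one language for examples, and the file names, the "sample1_" prefix and the function name are all made up for illustration. It runs inside a temporary directory so that it cannot touch any real files.

```python
import tempfile
from pathlib import Path


def add_prefix_to_files(directory, prefix, pattern="*.fastq"):
    """Rename every file matching the pattern by adding a prefix to its name."""
    renamed = []
    # sorted() collects the matching files first, so renaming inside the
    # loop cannot cause a file to be picked up twice.
    for path in sorted(Path(directory).glob(pattern)):
        new_path = path.with_name(prefix + path.name)
        path.rename(new_path)
        renamed.append(new_path.name)
    return renamed


# Try it safely in a throwaway directory containing three empty files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.fastq", "b.fastq", "notes.txt"]:
        (Path(tmp) / name).touch()
    renamed_files = add_prefix_to_files(tmp, "sample1_")
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(renamed_files)  # prints ['sample1_a.fastq', 'sample1_b.fastq']
print(remaining)      # prints ['notes.txt', 'sample1_a.fastq', 'sample1_b.fastq']
```

The same few lines would work whether the directory held two files or two thousand, which is where doing it by hand stops being practical.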
In summary, just as you might learn multiple spoken languages to communicate with different people, knowing more than one programming language can be beneficial in bioinformatics. Each language can help you tackle specific challenges, whether it's data analysis, visualisation, or software development. As you embark on your journey into bioinformatics, remember that learning to code is like learning a new language. Embrace the process, practice regularly, and don't hesitate to ask for help along the way. With practice and patience, you will become fluent in the coding languages that power bioinformatics, enabling you to contribute meaningfully to this exciting field. The skills you acquire will not only enhance your ability to analyse biological data but will also open up new avenues for research and discovery.
Getting Started & Recommended Resources
There are several great resources for getting started with coding for bioinformatics. However, before getting started with anything bioinformatics-specific, we would strongly recommend learning more about computing and coding first. If you wish to watch a class about it, Harvard University have made their introduction to computer science course available for free if you audit the class on edX. In addition, Harvard have several courses related to data science and life sciences on edX (also free through audit) which teach statistics and R, such as this one. Although watching classes, taking courses and reading textbooks definitely have their place, it’s impossible to learn to code without practising it. For practical learning, we personally recommend Codecademy, as we have had a great experience using it. They provide courses for lots of programming languages, including the key languages used in bioinformatics (R, Python, Bash). The courses include projects and provide step-by-step tutorials and AI assistance should you need it. You can take introductory courses for free but you must pay for more advanced courses.
Once you have some understanding of coding in general, you should then move on to bioinformatics-specific material. The highly regarded forum Biostars has an excellent handbook available which provides a strong foundation in many areas. The Biostars forum itself is also full of tutorials and guides that will be useful for you later on but could be intimidating at the start of your journey when you have minimal coding experience. Another excellent textbook that provides foundational information and advice is Bioinformatics Data Skills by Vince Buffalo. Additionally, there are several bioinformatics courses on Coursera from top universities, including the University of California San Diego.


The introductory material for bioinformatics and data science is similar. Here are some suggestions from the DS path. Kaggle Learn has excellent and free tutorials on Python, data science & ML (https://www.kaggle.com/). The swirl package teaches you R and statistics from inside the R console (https://swirlstats.com/). R for Data Science (https://r4ds.hadley.nz/) and Python for Data Analysis (https://wesmckinney.com/book/) are top notch and free.
Thank you so much for all these resources. As a beginner, I was overwhelmed by the amount available online but this helped me a lot. Thank you so much.😊