The Biotechnology, Bioinformatics and Computational Biology fields are growing on multiple fronts, including data volume and complexity. The amount of data generated is far larger than the ability to consume or make sense of it. While data volume may sound like the main problem, a more pressing issue lies in existing processes and the general accessibility of data.
At Mammoth Analytics, our mission is to provide frictionless access to data across industries and roles. This article focuses on Biotech — the data problems faced by scientists, and the solutions that the Mammoth platform delivers.
In the world of genetics and genomics, processing, analyzing and comparing gene expression and mutation data across hundreds or thousands of samples is an inefficient process. Raw sequencing data needs to be processed and mapped to a reference genome, but the real challenge is to then analyze results across many batches of data.
The tools available in the market today are either insufficient or require coding skills. Scientists often partner with bioinformaticians or computational biologists who compare sample groups and find genes whose expression is statistically different. The current process can be described in the following diagram:
The above diagram illustrates two scenarios:
a) Researcher 1 gets their data analyzed via the Bioinformatics team and b) Researcher 2 runs a Python script
In either case, they are delving beyond their core skillset. The key problem: the researchers know what they want out of their data, but they face too many friction points. They either have to learn a highly technical skill (e.g. coding in Python) or rely on the bioinformatics team (who do not understand the objectives as well as the researchers do).
Data preparation is the act of pre-processing raw data into a form that can be readily and accurately analysed. In the world of Biotech, gene expression data needs to be pivoted, annotated, joined with another dataset, compared against threshold values, etc. This process is more data science than biology. The data owners (researchers) don’t have the right skills to perform these tasks.
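To make those steps concrete, here is a minimal sketch of a pivot, an annotation join and a threshold comparison on a toy expression table. The gene IDs, sample names, values and the `prepare` function are all hypothetical, and the sketch uses only the Python standard library; a real script would more likely use a data-frame library.

```python
from collections import defaultdict

# Hypothetical long-format input: one record per (gene, sample) pair.
expression_rows = [
    {"gene_id": "G1", "sample": "tumor_1", "tpm": 120.0},
    {"gene_id": "G1", "sample": "normal_1", "tpm": 4.0},
    {"gene_id": "G2", "sample": "tumor_1", "tpm": 15.0},
    {"gene_id": "G2", "sample": "normal_1", "tpm": 14.0},
    {"gene_id": "G3", "sample": "tumor_1", "tpm": 60.0},
    {"gene_id": "G3", "sample": "normal_1", "tpm": 2.5},
]

# Hypothetical annotation lookup: gene ID -> readable gene name.
gene_names = {"G1": "GENE_A", "G2": "GENE_B", "G3": "GENE_C"}


def prepare(rows, names, sample, min_tpm):
    """Pivot, annotate and threshold a long-format expression table."""
    # Pivot: collect each gene's values into one record keyed by sample.
    wide = defaultdict(dict)
    for row in rows:
        wide[row["gene_id"]][row["sample"]] = row["tpm"]

    prepared = []
    for gene_id, values in wide.items():
        record = {"gene_id": gene_id, **values}
        # Annotate: join the gene name from the lookup table.
        record["gene_name"] = names.get(gene_id, "unknown")
        # Threshold: flag genes whose expression in `sample` exceeds min_tpm.
        record["above_threshold"] = values.get(sample, 0.0) > min_tpm
        prepared.append(record)
    return prepared


prepared = prepare(expression_rows, gene_names, sample="tumor_1", min_tpm=50.0)
for record in prepared:
    print(record)
```

Each step is routine for someone who writes code every day, but none of it is biology, which is why researchers are rarely equipped to do it themselves.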
This results in the following problems:
At Mammoth Analytics, we’ve been working hard to empower the data owners (scientists and researchers) with all the tools they need to become data-driven.
The image above describes a new, improved method. The Mammoth Platform provides a plethora of tools to reshape data in powerful ways without any code or dependency on others.
Some of the key features include:
Mammoth provides powerful functions to reshape and transform data in new ways: joining disparate datasets, Lookups, Fuzzy Search, Window Functions and a ton more, all in an easy, point-and-click interface without writing any code.
In Mammoth, all data changes are recorded in an editable and easy-to-digest data pipeline. The original data stays intact while the changes are saved in a new version. This leaves room for a high level of experimentation, exploration, analysis and collaboration that was not possible before.
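The idea of a recorded, editable pipeline can be illustrated in a few lines of Python. This is a conceptual sketch only, not Mammoth's implementation or API: the step functions, `run_pipeline` and the sample rows are hypothetical. It shows the two properties described above: the original data is never modified, and editing one recorded step produces a new version.

```python
import copy


def drop_low_counts(rows, min_count=10):
    """Keep rows whose count meets the minimum."""
    return [r for r in rows if r["count"] >= min_count]


def add_label(rows, label="reviewed"):
    """Attach a status label to every row."""
    return [{**r, "status": label} for r in rows]


def run_pipeline(original, steps):
    """Apply each (function, kwargs) step to a copy; the original is untouched."""
    data = copy.deepcopy(original)
    for func, kwargs in steps:
        data = func(data, **kwargs)
    return data


original = [
    {"gene_id": "G1", "count": 250},
    {"gene_id": "G2", "count": 3},
    {"gene_id": "G3", "count": 40},
]

# The pipeline is plain data, so a step can be edited and the flow re-run.
steps = [
    (drop_low_counts, {"min_count": 10}),
    (add_label, {"label": "reviewed"}),
]
version_1 = run_pipeline(original, steps)

# Edit one recorded step and re-run to get a second version.
steps[0] = (drop_low_counts, {"min_count": 100})
version_2 = run_pipeline(original, steps)

print(len(original), len(version_1), len(version_2))
```

Because the steps are stored rather than applied destructively, trying a different threshold costs one edit and one re-run.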
Implicit in the data pipeline is the ability to automate the data flow. This helps users save time by eliminating repetitive tasks.
The platform is code-free and designed for anyone to use, regardless of technical ability. For researchers who don’t know how to code, this opens up possibilities that didn’t exist before. For coders, the platform addresses multiple pain-points faced during data preparation and offers significant time-savings.
Exploration, discovery and ad-hoc data preparation
Mammoth’s users can discover trends and anomalies with a couple of clicks. The exploration features are unique and powerful, and are designed to provide instant insights with drill-down and pivot capabilities.
One of the benefits of a code-free data transformation platform is the ability to make ad-hoc changes. The exploration ability allows users to drill down into specifics and instantly re-shape the data, without any reliance on IT or a technical skillset.
It is hard to overstate the benefits of giving researchers direct control without the learning-curve or resource-dependency barrier. Geneticists can now experiment far more freely. In addition, the turnaround time to insights is drastically reduced.
Direct control of data preparation, without the learning curve, cuts turnaround time by anywhere from days to weeks. Consider the following factors: the average salary of a researcher is $72K, and that of a bioinformatician is $100K. By avoiding the iterative process of coding or reliance on the bioinformatician, a typical project using Mammoth saves between $10K and $30K.
With Mammoth, it is not just the end result, but the process that gets captured. All the researchers are now able to participate in the creation of the steps required to achieve the various outcomes. The process now becomes the intellectual property of the organization.
By removing the bottlenecks and providing the ability to prepare gene-expression data instantaneously, the researchers can now run multiple scenarios in a fraction of the time. There are substantial benefits to this approach. Whether it is private companies or institutional labs, empowering the scientists with the right tools allows for unexpected discoveries and insights.
In this case study, we highlight some work in a cancer research lab in a large biotech firm. A sequencing process for all the samples resulted in the generation of gene-expression data.
Determine, with proper annotation, the genes that are expressed higher in the target population than in the rest of the “normal” tissue (toxicity).
Different tissues have different “thresholds” since some organs are more sensitive than others. Also, to avoid getting just tissue-specific genes, researchers were looking for a tighter threshold on the matching organ (e.g. normal liver for liver cancer). There is an art to setting these: too loose, and the results end up unspecific (false positives); too tight, and you miss things (false negatives).
Typically, filtering a list of gene expressions requires extensive data handling and manipulation. This is either done via complex MS Excel functions (non-repeatable, black box), Python (repeatable, black box), or sending it to Bioinformaticians (black box, high turnaround time). None of these options is optimal due to the time and complexity involved.
Below is an example of the Python code that performs the tasks:
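A minimal sketch of this kind of filter, not the lab's actual script, might look like the following. The tissue names, threshold values, gene records and the `select_candidates` function are all hypothetical. It also assumes that a "tighter" threshold means a lower ceiling on normal-tissue expression, and that a gene must clear every normal tissue to be kept.

```python
# Hypothetical per-tissue ceilings on normal-tissue expression (arbitrary units).
# A lower ceiling is a tighter threshold, e.g. for a more sensitive organ.
NORMAL_CEILING = {"heart": 2.0, "liver": 10.0, "kidney": 5.0, "skin": 20.0}

# Hypothetical tighter ceiling for the organ that matches the cancer type.
MATCHED_ORGAN_CEILING = 1.0


def ceiling_for(tissue, target_tissue):
    """Return the allowed normal expression for a tissue."""
    ceiling = NORMAL_CEILING[tissue]
    if tissue == target_tissue:
        # Tighten the threshold on the matching organ.
        ceiling = min(ceiling, MATCHED_ORGAN_CEILING)
    return ceiling


def select_candidates(genes, target_tissue, min_target_expression):
    """Keep genes high in the target and below every normal-tissue ceiling."""
    selected = []
    for gene in genes:
        if gene["target"] < min_target_expression:
            continue  # not expressed highly enough in the target population
        within_limits = all(
            level <= ceiling_for(tissue, target_tissue)
            for tissue, level in gene["normal"].items()
        )
        if within_limits:
            selected.append(gene["name"])
    return selected


genes = [
    {"name": "GENE_A", "target": 80.0,
     "normal": {"heart": 0.5, "liver": 0.4, "kidney": 1.0, "skin": 3.0}},
    {"name": "GENE_B", "target": 95.0,
     "normal": {"heart": 0.2, "liver": 6.0, "kidney": 0.8, "skin": 1.0}},
    {"name": "GENE_C", "target": 70.0,
     "normal": {"heart": 3.5, "liver": 0.1, "kidney": 0.3, "skin": 2.0}},
    {"name": "GENE_D", "target": 12.0,
     "normal": {"heart": 0.1, "liver": 0.1, "kidney": 0.1, "skin": 0.1}},
]

liver_candidates = select_candidates(genes, "liver", min_target_expression=50.0)
print(liver_candidates)
```

In this toy data, GENE_B would pass the general liver ceiling but fails the tighter matched-organ ceiling, which is the tissue-specific case the tighter threshold is meant to exclude.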
The Mammoth platform allows for a code-free, editable data pipeline to achieve the same set of tasks. The following is an example of one of the pipelines:
Mammoth’s ease-of-use, transparent pipeline, and automation make data manipulation accessible to all researchers.
There is no more reliance on learning how to code or on Bioinformaticians for basic transformation tasks. Creating the data pipeline took one day; after that, the four researchers on this project reduced their data preparation time from days to minutes.
Mammoth removes inefficiencies by empowering the researchers with the right tools. In reality, the platform goes beyond just solving for inefficiencies — it helps researchers discover new insights which in turn creates new opportunities.
To learn more, feel free to reach us at firstname.lastname@example.org. If you want to give it a spin, sign up for a demo here.