Explore the NCBI taxonomy of organisms in a graph database
The evolution of life is a beautiful and insightful field of study that traces our origins back to the beginning of life. It helps us understand where we came from and where we are potentially going. The relationships between species are often depicted in the tree of life, a model used to describe how various species are related. Since a tree structure is a form of a graph, it makes sense to store these relationships in a graph database, where they can be analyzed and visualized.
In this blog post, I have decided to import the NCBI taxonomy of organisms into Neo4j, a graph database, where we can easily traverse and analyze relationships between various species.
Environment and dataset setup
To follow the code examples in this post, you will need to download the Neo4j Desktop application. I have prepared a database dump that you can use to get the Neo4j database up and running without having to import the dataset yourself. Take a look at my previous blog post if you need some help with restoring the database dump.
The original dataset is available on the NCBI website.
I have used the new taxonomy dump folder downloaded on 13th June 2022 to create the above database dump. While no explicit license is specified for the dataset, the NCBI website states that all information is available within the public domain.
I have made the code used to import the taxonomy into Neo4j available on my GitHub if you want to evaluate the process or make any changes.
Graph schema
I have imported the following files into Neo4j:
- nodes.dmp
- names.dmp
- host.dmp
- citations.dmp
The other files contain redundant information that is already present in the nodes.dmp file, which contains the taxonomy of organisms. I have looked a bit at the genetic code files, but since I don't know what to do with genetic code names and their translations, I have skipped them during import.
Using the above four files, I have constructed the following graph schema.
I have added a generic label Node to all nodes present in the nodes.dmp file. The nodes with the generic label Node contain multiple properties that can be used to import other files and help experts better analyze the dataset. For us, only the name property will be relevant. The taxonomy hierarchy is represented with the PARENT relationship between nodes. The dataset also contains a file that describes potential hosts of various species. Lastly, some of the nodes are mentioned in various medical sources, which are represented as Citation nodes.
All the nodes with the generic label Node have a secondary label that describes their rank. Some examples of ranks are Species, Family, and Genus. There are too many of them to list here, so I have prepared a screenshot with all available node labels.
Exploratory analysis
All the code in this analysis is available on GitHub in the form of a Jupyter notebook, although the queries there have been modified to work with Pandas DataFrames instead of visualization tools.
I looked for the Homo sapiens species in the dataset but couldn't find it. Interestingly, the folks at NCBI decided to name our species simply human. We can examine the taxonomy neighborhood up to four hops away with the following Cypher statement:
MATCH p=(n:Node {name:"human"})-[:PARENT*..4]-()
RETURN p
Results
I am making the visualizations in Neo4j Bloom as it offers a hierarchical layout, which is perfect for visualizing taxonomies. One of the advantages of using Neo4j Bloom is that it allows users who aren't experienced with Neo4j or Cypher to inspect and analyze graphs. Follow this link if you want to learn more about Neo4j Bloom.
So, the human node is a species that belongs to the humans genus, which is a part of the Pongidae family. After a quick Google search, it seems that the Pongidae taxon is obsolete and Hominidae should be used instead, which is represented in the NCBI taxonomy as a superfamily. Interestingly, the human species has two subspecies, namely neanderthals and denisovans, the latter represented under the homo sp altai node. I just learned something new about our history.
The NCBI taxonomy dataset contains only about 10% of the described species of life on the planet, so don't be surprised if some species are missing from the dataset.
Let's examine how many species there are in the dataset with the following Cypher statement:
MATCH (s:Species)
RETURN count(s) AS speciesCount
There are almost two million species described in the dataset, which means there is plenty of room to explore.
Next, we can examine the taxonomy hierarchy for the human species all the way to the root of the tree using a simple query:
MATCH (:Node {name:'human'})-[:PARENT*0..]->(parent)
RETURN parent.name AS lineage, labels(parent)[1] AS rank
Results
It seems that 31 traversals are needed to get from the human node to the root node. For some reason, the root node has a self-loop (a relationship with itself), and that is why it shows up twice in the results. In addition, a clade, a group of organisms that have evolved from a common ancestor, shows up multiple times in the hierarchy. It looks like the NCBI taxonomy is richer than what you will find with a quick Google search.
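To make the variable-length PARENT*0.. traversal a bit more concrete, here is a minimal Python sketch that walks parent pointers on a tiny, hand-made taxonomy. The PARENT dictionary and the lineage function are assumptions made up for illustration; they are not the NCBI data or the import code, and the toy lineage is heavily simplified. The only detail borrowed from the dataset is that the root points to itself.

```python
# Toy parent map (child -> parent). Hand-made and heavily simplified;
# the real NCBI lineage for human is much longer.
PARENT = {
    "human": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "root",
    "root": "root",  # self-loop on the root, as observed in the dataset
}


def lineage(name, parent=PARENT):
    """Return the path from `name` up to the root, visiting each node once."""
    path = [name]
    while True:
        nxt = parent.get(path[-1])
        # Stop when there is no parent or when we hit a self-loop (the root).
        if nxt is None or nxt == path[-1]:
            return path
        path.append(nxt)


print(lineage("human"))
```

Stopping at the self-loop is a deliberate choice in this sketch: it keeps the root from being listed twice, which is what happens in the Cypher results above.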
Graph databases like Neo4j are also great at finding the shortest paths between nodes in the graph. Now, we can answer the critical question of how close apples are to oranges in the taxonomy or, in this case, bananas to oranges.
MATCH (h:Node {name:'Valencia orange'}), (g:Node {name:'sweet banana'})
MATCH p=shortestPath( (h)-[:PARENT*]-(g))
RETURN p
Results
It seems that the closest common ancestor of the sweet banana and the Valencia orange is the Mesangiospermae clade. Mesangiospermae is a clade of flowering plants.
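In a tree, the shortest path between two nodes always runs through their closest common ancestor, which is why the shortest path query doubles as an ancestor lookup. The following Python sketch illustrates the idea on a hand-made, heavily abbreviated parent map; the dictionary and all function names are hypothetical and only stand in for the real graph.

```python
# Toy parent map (child -> parent), abbreviated for illustration only.
PARENT = {
    "Valencia orange": "Citrus",
    "Citrus": "Rutaceae",
    "Rutaceae": "Mesangiospermae",
    "sweet banana": "Musa",
    "Musa": "Musaceae",
    "Musaceae": "Mesangiospermae",
    "Mesangiospermae": "root",
    "root": "root",  # the root points to itself
}


def ancestors(name, parent=PARENT):
    """Return `name` followed by all of its ancestors up to the root."""
    path = [name]
    while parent.get(path[-1]) not in (None, path[-1]):
        path.append(parent[path[-1]])
    return path


def closest_common_ancestor(a, b, parent=PARENT):
    """Return the first ancestor of `b` that is also an ancestor of `a`."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    return None


def shortest_path(a, b, parent=PARENT):
    """Go up from `a` to the common ancestor, then down to `b`."""
    meet = closest_common_ancestor(a, b, parent)
    if meet is None:
        return None
    up_a = ancestors(a, parent)
    up_b = ancestors(b, parent)
    left = up_a[: up_a.index(meet) + 1]
    right = up_b[: up_b.index(meet)]
    return left + right[::-1]


print(closest_common_ancestor("Valencia orange", "sweet banana"))
print(shortest_path("Valencia orange", "sweet banana"))
```

On the real dataset, Neo4j does this work for us; the sketch is only meant to show why the node where the two lineages meet sits in the middle of the returned path.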
Another use case for traversing relationships could be finding all the species in the same family as a particular species. Here, we will visualize all the genera in the same family as the sweet banana.
MATCH (:Node {name:'sweet banana'})-[:PARENT*0..]->(f:Family)
MATCH p=(f)<-[:PARENT*]-(s:Genus)
RETURN p
Results
The sweet banana belongs to the Musa genus and the Musaceae family. Interestingly, there is a Musella genus, which sounds like a small Musa. In fact, after googling the Musella genus, it looks like it contains only a single species, commonly known as the Chinese dwarf banana.
Inference with Neo4j
In the last example, we will look at how to develop inference queries in Neo4j. Inference means we create new relationships between nodes based on a set of rules and either store them in the database or use them at query time only. Here, I will show you an example of inference queries that use new relationships only at query time when analyzing potential hosts.
First, we will evaluate which organisms have described potential parasites in the dataset.
MATCH (n:Node)
RETURN n.name AS organism,
labels(n)[1] AS rank,
size((n)<-[:POTENTIAL_HOST]-()) AS potentialParasites
ORDER BY potentialParasites DESC
LIMIT 5
Results
It seems that humans are the most described, and the only, species with potential parasites. I would venture a guess that most, if not all, of the potential parasites for humans are also potential parasites for vertebrates, as the counts are so close.
We can inspect how many potential hosts organisms have with the following Cypher statement.
MATCH (n:Node)
WHERE EXISTS { (n)-[:POTENTIAL_HOST]->()}
WITH size((n)-[:POTENTIAL_HOST]->()) AS ph
RETURN ph, count(*) AS count
ORDER BY ph
Results
18359 organisms have only one known host, while 163434 have two known hosts. This supports my hypothesis that most parasites that attack humans also potentially attack all vertebrates.
Here is where inference queries come into play. We know that vertebrates is a higher-level taxon in the taxonomy of organisms. Therefore, we can traverse from vertebrates down to the species level to examine which species could potentially be used as hosts.
We will use the example of the Monkeypox virus as it is relevant at this time. First, we can evaluate its potential hosts.
MATCH (n:Node {name:"Monkeypox virus"})-[:POTENTIAL_HOST]->(host)
RETURN host.name AS host
Results
Notice that both human and vertebrates are described as potential hosts of the Monkeypox virus. However, let's say we want to examine all the species that are potentially endangered by the virus.
MATCH (n:Node {name:"Monkeypox virus"})-[:POTENTIAL_HOST]->()<-[:PARENT*0..]-(host:Species)
RETURN host.name AS host
LIMIT 10
Results
We have used a limit as there are a lot of vertebrates. Unfortunately, we don't know which of them are extinct; that would help us filter them out and identify only the potential victims of the Monkeypox virus that are still alive. Nevertheless, it is still a good example of inference in Neo4j, where we create or infer a new relationship based on a predefined set of rules at query time.
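The inference rule used above can be summed up as: if a taxon is a potential host, then every species below it in the taxonomy is a potential host too. Here is a minimal Python sketch of that rule on toy data. The three dictionaries and the inferred_species_hosts function are hypothetical and hand-made for illustration; only the two Monkeypox virus hosts come from the query results above.

```python
from collections import defaultdict

# Toy taxonomy (child -> parent) and ranks, hand-made for illustration.
PARENT = {
    "human": "mammals",
    "house mouse": "mammals",
    "chicken": "birds",
    "mammals": "vertebrates",
    "birds": "vertebrates",
}
RANK = {
    "human": "Species",
    "house mouse": "Species",
    "chicken": "Species",
    "mammals": "Class",
    "birds": "Class",
    "vertebrates": "Clade",
}
# Hosts as described in the dataset for the Monkeypox virus.
POTENTIAL_HOST = {"Monkeypox virus": ["human", "vertebrates"]}


def inferred_species_hosts(parasite):
    """Expand every described host to all species at or below it."""
    # Invert the child -> parent map so we can walk downwards.
    children = defaultdict(list)
    for child, parent in PARENT.items():
        children[parent].append(child)

    found = set()
    stack = list(POTENTIAL_HOST.get(parasite, []))
    while stack:
        node = stack.pop()
        if RANK.get(node) == "Species":
            found.add(node)
        stack.extend(children[node])
    return sorted(found)


print(inferred_species_hosts("Monkeypox virus"))
```

The set removes duplicates such as human, which is reachable both directly and through vertebrates. The Cypher query above would likely return such a species more than once, so adding DISTINCT to its RETURN clause would mirror this behavior.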
Conclusion
I really enjoyed writing this article as it gave me an opportunity to explore the taxonomy of bananas and oranges. You can use this dataset as a hobbyist to explore your favorite species, or even in a more professional setting. Simply download the database dump, load it into Neo4j, and get started.
The code is available on GitHub.