Topics in molecular & Organismal Evolution R studio Project Part 1 COVID-related

Topics in molecular & Organismal Evolution R studio Project Part 1
COVID-related molecular evolution analysis of the Human Gene ACE-2
 Background reading 
The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68-74.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics App Note 27: 2156-2158.
Gheblawi M, Wang K, Viveiros A, Nguyen Q, Zhong J-C, Turner AJ, Raizada MK, Grant MB, Oudit GY. 2020. Angiotensin-Converting Enzyme 2: SARS-CoV-2 Receptor and Regulator of the Renin-Angiotensin System. Circulation Research, 126(10):1456-1474
Hoffmann M, et al. 2020. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell, 181(2):271-280.e8
David A, Khanna T, Beykou M, Hanna G, Sternberg MJE. [Preprint]. Structure, function and variants analysis of the androgen-regulated TMPRSS2, a drug target candidate for COVID-19 infection. bioRxiv, accessed 10SEP20, DOI: 10.1101/2020.05.26.116608
Sriram K, Insel P, Loomba R. May 14, 2020. What is the ACE2 receptor, how is it connected to coronavirus and why might it be key to treating COVID-19? The experts explain. The Conversation. (Links to an external site.). (Links to an external site.)
Question 1.1. What is the 1000 genomes project?
Question 1.2. What are the super populations and sub-populations that contributed genomic DNA to the 1000 genomes project?
Question 1.3. Why are the ACE2 and TMPRSS2 human genes of interest when it comes to understanding the severity of covid infection?
 Understanding the files we will work with
Figure 1 provides an overview of how the sequence data for the 1000 genomes project was generated. The sequencing technology they used was the propriety Illumina platform (NGS or Next Generation Sequencing in the figure). This process involves shearing the genomic DNA into numerous small fragments (i.e. the “shotgun” approach), making a clonal bacterial library of these fragments (for large genomes only > 500Mb, smaller genomes do not require this library and can be sequences directly).
Figure 1: Overview of NGS and the files generated 
The next step is to sequence these fragments to a target coverage depth (coverage refers to how many copies of each fragment to generate – since the fragments overlap – the higher the coverage (i.e. more copies of a fragment), the more accurate your eventual re-assembly back into contiguous chromosomes will be as well as the eventual calling of variants). The initial, raw sequence files are in FASTQ file format. FASTQ files contain the sequences themselves along with various quality scores for each base call. You can read more about FASTQ files here: (Links to an external site.). The raw reads are then passed through reference genome alignment algorithms to generate SAM/BAM files. SAM stands for Sequence Alignment / Map format. You can read more about them here: (Links to an external site.). BAM files are the compressed, binary version of SAM files (up to 128MB of sequence). These files typically contain an entire genome worth of DNA sequence information as well as positional information on where in the genome each and every piece of sequence falls in the genome (file sizes are typically in hundreds of Gigabytes, and often exceed a Terabyte). These are NOT files you would want to typically download onto your laptop, rather you would work with these on a server, which has the capacity to work with such large files efficiently. For our purposes (molecular evolution analysis of a single gene, ACE2), VCF (Variant Call Format) files are sufficient, and small enough to work with on our personal laptops and Rstudio. VCF files are the result of picking out only the variant nucleotide positions from a SAM/BAM file (i.e. loci where the sequences differ from the reference genome). You can find more information on VCF files here: (Links to an external site.)
VCF files are sufficient for most population genetic / molecular evolution analyses.
Question 2.1. Make labelled figures for, or describe (or do both) the headers of a typical FASTQ, SAM/BAM and VCF file, include column heading descriptions where appropriate – highlight what they have in common and what is unique to each file type. The goal here is to be able to recognize when you have successfully created / downloaded a VCF file and understand what it looks like so that we can import it into Rstudio  
 Using EnsEMBL to get VCF files from the 1000 genomes project. 
You can find general information of how to get data from the 1000 genomes project using genome browsers (of which EnsEMBL is one) here: (Links to an external site.)
Step by Step instructions:
Step 1: Finding ACE2
Go to the website: (Links to an external site.)
Find the search bar in the top left-hand corner and type in “ACE2” Make sure the “category” drop-down menu is set to “Search all categories.” Click “Go.”
The first result to come up should be called “ACE2 (Human Gene)”.
Clicking on that will bring you to the “home page” for the gene ACE2
Question 3.1: what does ACE stand for?
Question 3.2. On which chromosome do we find the human ACE2 gene?
Question 3.3. How many different orthologous genes of ACE2 have been sequenced and name at least 1 vertebrate and 1 invertebrate in which it is found. 
Step 2: This may or may not work –  use “dataslicer” to get a VCF file containing only the data you would like to analyze
The Data Slicer in EnsEMBL is a convenient way to get only the amount of data that you want without using a separate program to cut it out yourself or having to write some code to do it. We will try to use this tool to get the data for our analyses of ACE2. We will be taking one slice of data that contains all of the SNPs (Single Nucleotide Polymorphisms) in ACE2. The link to the Data Slicer is available here: (Links to an external site.)
First off, in the “Name for this job” category, let’s name it after the population we will be sampling from – the Yoruba population (which shows the most variation in the ACE2gene), so name it “YRI_ACE2”.
The file format should be set for VCF. If it’s not, click the drop-down menu and select VCF.
In the “region lookup” bar, copy and paste in the location X:15561033-15602148. These are the GRCh38.p13 version alignment coordinates for the gene ACE2.
In in the “Genotype file URL…” paste this URL: (Links to an external site.)
This will ensure you get data from the last phase of the 1000 Genomes Project.
In the “filters” category, select “By populations”. Paste in thuis URL (Links to an external site.)
. Select the YRI population so that you only get the data for that population.
This may or may not work, (data slicer is buggy) if not, then we need to get a bit more sophisticated and perform some command line magic.
 Using the terminal on macs or command prompt in windows
This link is a nice gentle introduction to terminal basics – without the basics, you will be lost, so please work your way through this. Since the Data slicer most likely will not work, we will have to learn some terminal basics to download the data set we want to analyze. It is also one of those skills that is absolutely crucial if you want to do anything mildly bioinformatics these days. (Links to an external site.)
Question 4.1: what does BASH stand for?
Question 4.2: What is a directory? Do directories have extensions like .exe or .txt?
Question 4.3: using some of the commands you are learning about, briefly describe what the man command does (make sure you understand what the command q does beforehand)