Adapted from the lesson by Tracy Teal and Adina Howe. Original contributors: Paul Wilson, Milad Fatenejad, Sasha Wood, Radhika Khetani, and Adina Howe for Software Carpentry (http://software-carpentry.org/)
We showed a little how to search within a file using less
. We can also
search within files without even opening them, using grep
. Grep is a command-line
utility for searching plain-text data sets for lines matching a string or regular expression.
Let’s give it a try!
Let’s navigate into directory untrimmed_fastq
Hint: it’s in directory dc_sample_data
, use cd
Suppose we want to see how many reads in our file have really bad, with 10 consecutive Ns
Let’s search for the string NNNNNNNNNN
in file
$ grep NNNNNNNNNN SRR098026.fastq
We get back a lot of lines. What is we want to see the whole fastq record for each of these read. We can use the ‘-B’ argument for grep to return the matched line plus one before (-B 1) and two lines after (-A 2). Since each record is four lines and the last second is the sequence, this should give the whole record.
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
for example:
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Exercise
1) Search for the sequence GNATNACCACTTCC
in SRR098026.fastq
.
In addition to finding the sequence, have your search also return
the name of the sequence.
2) Search for that sequence in both fastq files.
We’re excited we have all these sequences that we care about that we just got from the FASTQ files. That is a really important motif that is going to help us answer our important question. But all those sequences just went whizzing by with grep. How can we capture them?
We can do that with something called “redirection”. The idea is that we’re redirecting the output to the terminal (all the stuff that went whizzing by) to something else. In this case, we want to print it to a file, so that we can look at it later.
The redirection command for putting something in a file is >
Let’s try it out:
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
The prompt should sit there a little bit, and then it should look like nothing
happened. But type ls
. You should have a new file called good-data.txt. Take
a look at it and see if it has what you think it should.
If we use ‘»’, it will append to rather tha overwrite a file. This can be useful for saving more than one search, for example:
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
$ grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt
Exercise
Let’s put all the sequences that contain TTATCCGGATTTATTGGGTTTAAAGGGT
from all the files in to another file called ‘good-data.txt’
There’s one more useful redirection command that we’re going to show, and that’s
called the pipe command, and it is |
. It’s probably not a key on
your keyboard you use very much. What |
does is take the output that
scrolling by on the terminal and then can run it through another command.
When it was all whizzing by before, we wished we could just slow it down and
look at it, like we can with less
. Well it turns out that we can! We pipe
the grep
command through less
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less
Now we can use the arrows to scroll up and down and use q
to get out.
We can also do something tricky and use the command wc
. wc
stands for
word count
. It counts the number of lines or characters. So, we can use
it to count the number of lines we’re getting back from our grep
command.
And that will magically tell us how many sequences we’re finding. We’re
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc
That tells us the number of lines, words and characters in the file. If we
just want the number of lines, we can use the -l
flag for lines
.
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
Redirecting is not super intuitive, but it’s really powerful for stringing together these different commands, so you can do whatever you need to do.
NOTE:
The philosophy behind these command line programs is that none of them
really do anything all that impressive. BUT when you start chaining
them together, you can do some really powerful things really
efficiently. If you want to be proficient at using the shell, you must
learn to become proficient with the pipe and redirection operators:
|
, >
, >>
.
Finally, let’s use the new tools in our kit and a few new ones to example our SRA metadata file.
$ cd
$ cd dc_sample_data/
Let’s ask a few questions about the data
1) How many of the read libraries are paired end?
First, what are the column headers?
$ head -n 1 SraRunTable.txt
BioSample_s InsertSize_l LibraryLayout_s Library_Name_s LoadDate_s MBases_l MBytes_l ReleaseDate_s Run_s SRA_Sample_s Sample_Name_s Assay_Type_s AssemblyName_s BioProject_s Center_Name_s Consent_s Organism_Platform_s SRA_Study_s g1k_analysis_group_s g1k_pop_code_s source_s strain_s
That’s only the first line but it is a lot to take in. ‘cut’ is a program that will extract columns in tab-delimited files. It is a very good command to know. Lets look at just the first four columns in the header using the ‘|’ readirect and ‘cut’
$ head -n 1 SraRunTable.txt | cut -f1-4
BioSample_s InsertSize_l LibraryLayout_s Library_Name_s
‘-f1-4’ means to cut the first four fields (columns). The LibraryLayout_s column looks promising. Let’s look at some data for just that column.
$ cut -f3 SraRunTable.txt | head -n 10
LibraryLayout_s
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
SINGLE
PAIRED
We can see that there are (at least) two categories, SINGLE and PAIRED. We want to search all entries in this column for just PAIRED and count the number of hits.
$ cut -f3 SraRunTable.txt | grep PAIRED | wc -l
2
2) How many of each class of library layout are there?
We can use some new tools ‘sort’ and ‘uniq’ to extract more information. For example, cut the third column, remove the header and sort the values. The ‘-v’ option for greap means return all lines that DO NOT match.
$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort
This returns a sorted list (too long to show here) of PAIRED and SINGLE values. Now we can use ‘uniq’ with the ‘-c’ flag to count the different categories.
$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s |sort | uniq -c
2 PAIRED
35 SINGLE
3) Sort the metadata file by PAIRED/SINGLE and save to a new file We can use if ‘-k’ option for sort to specify which column to sort on. Note that this does something similar to cut’s ‘-f’.
$ sort -k3 SraRunTable.txt > SraRunTable_sorted_by_layout.txt
4) Extract only paired end records into a new file Do we know PAIRED only occurs in column 4? WE know there are only two in the file, so let’s check.
$ grep PAIRED SraRunTable.txt | wc -l
2
OK, we are good to go.
$ grep PAIRED SraRunTable.txt > SraRunTable_only_paired_end.txt
Exercise
1) How many sample load dates are there? Hint: sort -u
outputs unique records only.
2) How many samples were loaded on each date
3) Filter subsets into new files bases on load date
The find
program can be used to find files based on arbitrary
criteria. Navigate to your home directory (hint: cd
or cd ~
) and enter the following
command:
find . -print
This prints the name of every file or directory, recursively, starting from the current directory. Let’s exclude all of the directories:
find . -type f -print
This tells find
to locate only files. Now try these commands:
find . -type f -name "*7*"
find . -type f -name "*7*" -or -name "*6*" -print
find . -type f -name "*7*" -and -name "*6*" -print
The find
command can acquire a list of files and perform some
operation on each file. Try this command out:
find . -type f -exec grep Load {} \;
This command finds every file starting from .
. Then it searches each
file for a line which contains the word “Volume”. The {}
refers to
the name of each file. The trailing \;
is used to terminate the
command. This command is slow, because it is calling a new instance
of grep
for each item the find
returns.
A faster way to do this is to use the xargs
command:
find . -type f -print | xargs grep Load
find
generates a list of all the files we are interested in,
then we pipe them to xargs
. xargs
takes the items given to it
and passes them as arguments to grep
. xargs
generally only creates
a single instance of grep
(or whatever program it is running).
Exercise
Navigate to your home directory. Use one find
command to perform each
of the operations listed below (except number 2, which does not
require a find
command):
Find any file whose name contains “youfoundit” from your home directory and determine what’s in the file
In your home directory, create a new directory called data_copies
Find all of the fastq files from your home directory and copy them to the data_copies
directory.
From your home directory, find all fastq files in diectory data_copies
and rename all of the copied files to ensure that they end in .txt
(note: it is ok for the file name to end in .fastq.txt
)