Preprocessing Eland Output, Mappability, GC content, & Ambiguity (N) Scores


We have a growing library of preprocessed mosaics files. Before generating your own, please check whether what you need is already available among these data: mosaics preprocessed data.

Here are the detailed instructions for generating these data. We assume that perl, python, and a JVM are in your path.
The examples below also assume tag length (tagL) = 28, expected fragment length (fragL) = 200, and bin size (binsize) = 50.
First, download the scripts for preprocessing and extract them in your working directory.

> tar -zxvf mosaics_preprocessing_scripts.tar.gz


Preprocess Eland output to bin-level files

You need to have "preprocess_eland.pl" from scripts for preprocessing in your working directory.
Let [chrID], [fragL], [binsize], [collapse], and [infile] be the chromosome number,
fragment length, bin size, counts to be collapsed, and Eland output file name, respectively.

> perl preprocess_eland.pl chr[chrID] [fragL] [binsize] [collapse] [infile]

For example, assume that "eland_results.out" is the Eland output file name.
To generate the bin-level tag count file for chromosome 2,

> perl preprocess_eland.pl chr2 200 50 3 eland_results.out

This command generates "chr2_eland_results.out_fragL200_bin50.txt".
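When several chromosomes need to be processed, the command above can be generated in a loop. The following is a minimal Python sketch, not part of the distributed scripts: the helper names (eland_command, eland_output_name) are hypothetical, and the output file name pattern is only inferred from the chr2 example above, so it is an assumption that it holds for other chromosomes and input files.

```python
import shlex


def eland_command(chr_id, frag_len, bin_size, collapse, infile):
    """Build the argument list for one preprocess_eland.pl call."""
    return ["perl", "preprocess_eland.pl", f"chr{chr_id}",
            str(frag_len), str(bin_size), str(collapse), infile]


def eland_output_name(chr_id, frag_len, bin_size, infile):
    """Predict the output file name (pattern inferred from the chr2 example)."""
    return f"chr{chr_id}_{infile}_fragL{frag_len}_bin{bin_size}.txt"


if __name__ == "__main__":
    # Chromosome IDs here are illustrative; list the ones you actually need.
    for chr_id in ["1", "2", "3"]:
        cmd = eland_command(chr_id, 200, 50, 3, "eland_results.out")
        out = eland_output_name(chr_id, 200, 50, "eland_results.out")
        # Only print the command; pass `cmd` to subprocess.run to execute it.
        print(shlex.join(cmd), "->", out)
```

The sketch only prints the commands, so it can be used to review what would be run before executing anything.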


Preprocess PeakSeq nucleotide-level mappability (chr*b.out) to binary files

You need to have "cal_binary_map_score.py" and "CountMap.py"
from scripts for preprocessing in your working directory.
Put all the chr*b.out files from PeakSeq nucleotide-level mappability in the same directory.
For hg18 and mm9, you can just download preprocessed nucleotide-level mappability instead.
(hg18) (mm9)

Let [chrID], [chr_length], and [output_file] be the chromosome number,
the size of the chromosome, and the name of the output file, respectively.
To preprocess chr*b.out to a binary file,

> python cal_binary_map_score.py [chrID] 1 [chr_length] > [output_file]

For example, in order to preprocess "chr2b.out" to "chr2_map_binary.txt"
(size of chromosome 2 of mouse mm9 reference genome is 181748087),

> python cal_binary_map_score.py 2 1 181748087 > chr2_map_binary.txt

This command generates the mappability binary file, named "chr2_map_binary.txt".
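Because this script writes its result to standard output, a wrapper has to take care of the redirection itself. The following is a minimal Python sketch under stated assumptions: the helper names (binary_map_command, run_to_file) are hypothetical, and the literal "1" argument is copied from the usage line above without interpreting it.

```python
import os
import subprocess


def binary_map_command(chr_id, chr_length):
    """Build the argument list for one cal_binary_map_score.py call."""
    # The "1" is taken verbatim from the documented usage line.
    return ["python", "cal_binary_map_score.py", str(chr_id), "1", str(chr_length)]


def run_to_file(cmd, out_path):
    """Run cmd and write its standard output to out_path (like `cmd > out_path`)."""
    with open(out_path, "w") as handle:
        # check=True raises CalledProcessError if the command exits with an error.
        subprocess.run(cmd, stdout=handle, check=True)


if __name__ == "__main__":
    cmd = binary_map_command(2, 181748087)  # mm9 chromosome 2, as in the example
    if os.path.exists("cal_binary_map_score.py"):
        run_to_file(cmd, "chr2_map_binary.txt")
    else:
        print("script not found in working directory; would run:", " ".join(cmd))
```

Using check=True means a failed run is reported immediately instead of silently leaving a truncated output file behind.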


Preprocess mappability binary files to bin-level files

You need to have "process_score_java.pl" and "CalcMappability.class"
from scripts for preprocessing in your working directory.
Let [input_binary_file], [output_pre], [tagL], [fragL], and [binsize] be the name of the binary file
to be preprocessed, the output file name prefix, tag length, fragment length, and bin size, respectively.

> perl process_score_java.pl [input_binary_file] [output_pre] [tagL] [fragL] [binsize]

For example, in order to preprocess mappability binary files to bin-level files for chromosome 2,

> perl process_score_java.pl chr2_map_binary.txt 2_map 28 200 50

This command generates the bin-level mappability file, named "chr2_map_fragL200_bin50.txt".
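After running this step for many chromosomes, it can help to check that every expected bin-level file exists. The following is a minimal Python sketch; the helper names are hypothetical, and the naming rule ("chr" + output prefix + "_fragL" + fragment length + "_bin" + bin size + ".txt") is inferred from the example above, where the prefix "2_map" yields "chr2_map_fragL200_bin50.txt".

```python
import os


def bin_level_name(output_pre, frag_len, bin_size):
    """Predict the bin-level file name for an output prefix (inferred pattern)."""
    return f"chr{output_pre}_fragL{frag_len}_bin{bin_size}.txt"


def missing_outputs(prefixes, frag_len, bin_size, directory="."):
    """Return the expected bin-level file names that are not present in directory."""
    expected = [bin_level_name(p, frag_len, bin_size) for p in prefixes]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


if __name__ == "__main__":
    # Prefixes are illustrative; use the ones you passed as [output_pre].
    for name in missing_outputs(["1_map", "2_map"], 200, 50):
        print("missing:", name)
```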


Preprocess genome assembly FASTA files (chr*.fa) to GC and N binary files

You need to have "cal_binary_GC_N_score.pl" from scripts for preprocessing in your working directory.
Download the reference genome assembly FASTA files that you are interested in
from the UCSC genome browser. We assume that these files are named "chr*.fa".
Put all the chr*.fa files in the same directory.

In order to generate GC and N binary files for chromosome [chrID],

> perl cal_binary_GC_N_score.pl chr[chrID].fa [chrID] 1

This script generates GC and N binary files, named as
"chr[chrID]_GC_binary.txt" and "chr[chrID]_N_binary.txt", respectively.

For example, in order to create GC and N binary files for chromosome 2 from "chr2.fa",

> perl cal_binary_GC_N_score.pl chr2.fa 2 1

This command generates "chr2_GC_binary.txt" and "chr2_N_binary.txt".
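To make the idea of per-nucleotide GC and N indicators concrete, the following Python sketch flags each base of a short sequence. It is a conceptual illustration only: the function name is hypothetical, the actual file format written by cal_binary_GC_N_score.pl is not described here, and treating lowercase (soft-masked) bases the same as uppercase ones is an assumption.

```python
def gc_n_indicators(sequence):
    """Return (gc, n): two lists of 0/1 flags, one entry per base."""
    seq = sequence.upper()  # assumption: case is ignored
    gc = [1 if base in "GC" else 0 for base in seq]   # 1 where the base is G or C
    n = [1 if base == "N" else 0 for base in seq]     # 1 where the base is ambiguous (N)
    return gc, n


if __name__ == "__main__":
    gc, n = gc_n_indicators("ACGTNnc")
    print("GC:", gc)
    print("N: ", n)
```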


Preprocess GC & N binary files to bin-level files

You need to have "process_score.pl" from scripts for preprocessing in your working directory.
Let [input_binary_file], [output_pre], [fragL], and [binsize] be the name of the binary file
to be preprocessed, the output file name prefix, fragment length, and bin size, respectively.

> perl process_score.pl [input_binary_file] [output_pre] [fragL] [binsize]

For example, in order to preprocess GC & N binary files to bin-level files for chromosome 2,

> perl process_score.pl chr2_GC_binary.txt 2_GC 200 50
> perl process_score.pl chr2_N_binary.txt 2_N 200 50

These commands generate bin-level GC content & ambiguity score files,
named as "chr2_GC_fragL200_bin50.txt" & "chr2_N_fragL200_bin50.txt", respectively.
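The mappability, GC, and N steps for one chromosome can be listed in order before running them. The following is a minimal Python sketch that only assembles the commands documented above; the function name plan_chromosome is hypothetical, the default parameter values are the ones assumed at the top of this page, and the intermediate file names follow the chromosome 2 examples, which is an assumption for other chromosomes.

```python
import shlex


def plan_chromosome(chr_id, chr_length, tag_len=28, frag_len=200, bin_size=50):
    """Return (argv, stdout_file) pairs; stdout_file is None if no redirection is needed."""
    c = str(chr_id)
    map_binary = f"chr{c}_map_binary.txt"
    steps = [
        # Nucleotide-level mappability -> binary file (written via stdout).
        (["python", "cal_binary_map_score.py", c, "1", str(chr_length)], map_binary),
        # Mappability binary file -> bin-level file.
        (["perl", "process_score_java.pl", map_binary, f"{c}_map",
          str(tag_len), str(frag_len), str(bin_size)], None),
        # FASTA -> GC and N binary files.
        (["perl", "cal_binary_GC_N_score.pl", f"chr{c}.fa", c, "1"], None),
    ]
    # GC and N binary files -> bin-level files.
    for kind in ("GC", "N"):
        steps.append((["perl", "process_score.pl", f"chr{c}_{kind}_binary.txt",
                       f"{c}_{kind}", str(frag_len), str(bin_size)], None))
    return steps


if __name__ == "__main__":
    for argv, stdout_file in plan_chromosome(2, 181748087):
        line = shlex.join(argv)
        if stdout_file is not None:
            line += " > " + stdout_file
        print(line)
```

Printing the plan first makes it easy to compare the generated lines against the commands shown on this page before executing them.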