This is an outline on how to use TiMAT2 (aka T2) to process your tiling microarray data.
The applications are designed to be user friendly and require only moderate familiarity
with command line programs. When in doubt, type 'java -jar pathTo/T2/Apps/ApplicationName' to pull up a menu of options.
See the readme file for installation instructions.
- I think it is important to have three replicas for every experiment. These don't necessarily have to be biological replicas, e.g. three different isolates of a fibroblast cell line processed in parallel. Technical replicas are preferred in most cases, e.g. pool multiple plates of fibroblasts, split the pool after fixing/lysing/preparing chromatin, and process the splits in parallel. Technical replicas are extremely informative. They provide a measure of the sample prep/microarray processing consistency without any confounding biological variability.
- If you choose to go with only two replicas, be prepared for lots of qPCR validation. The symmetric null method of confidence estimation works reasonably well for non complex organisms but rarely for human or mouse. Three IPs and three input replicas are the minimum needed for a random label permutation confidence estimation.
- It is rather critical to perform all of these preps using the same reagents, same equipment, at the same time, etc. Be prepared to throw out all of the data generated during the pilot/optimization phase of the experiment and repeat it alongside the other samples. If you do plan to make use of the pilot data, prep/IP/amplify the other technical replicas in parallel and freeze them while deciding whether the pilot worked. If it did, then run the matched samples. One good way to kill a tiling experiment is to prep samples on different days using different reagents and expect some sort of magic to happen upon hybridization and analysis. Technical inconsistency is the primary cause of tiling microarray failure.
- For two color arrays, there is no need to have a matching input control on every chip. You can treat the different colors independently and think of them as a way of getting two readouts from the same chip. For example, if I performed a ChIP-chip experiment with 4 antibodies on the same fibroblast cells I would use 9 chips (6 samples x 3 technical replicas = 18 labeled preps, two per chip). For each antibody, I would prepare three technical replicas, label them with Cy3 or Cy5, and place them on different chips. I would do likewise with one input sample split into three technical replicas. For the last sample, I would perform mock IPs, three technical replicas, using IgG or a non specific antibody, e.g. anti-GST or anti-FLAG. See below.
- For your experiments involving an IP, consider performing a mock IP using IgG or an antibody that doesn't bind anything within your sample, e.g. anti-GST or anti-FLAG. It is best to use a type matched antibody. TiMAT2 is set up to generate empirical FDRs based on the mock IP, which is a great way to estimate confidence in your real IP data without having to perform exhaustive qPCR. One consideration, though: the mock IP must be performed in parallel with the real IPs. Subtle differences in washing and amplification really affect the output.
- Don't over wash your beads during the IP. Use the minimum number of washes (3x 5min?) that gives good fold enrichment for known targets by qPCR. Over washing can lead to a PCR bottleneck effect where a small number of regions get overly amplified and will look exactly like real regions on the microarray even though nothing specific was pulled down by the IP.
- You might want to prepare twice the amount of chromatin you will need to be sure to not run out if and when parts of the experiment need to be repeated.
- Save enough pooled pre amplification real IP and input material for ~50 qPCR rxns each.
For ChIP-chip experiments:
- Three technical replicas for each anti-transcription factor antibody (i.e. hybridize three independent parallel processed IPs to three chips)
- Three technical replicas for a mock IgG IP.
- Three technical replicas for just input chromatin.
- Commercial chips are very consistent. Technical repeats (using the same hybridization solution on different chips) are unnecessary and can emphasize false positives in certain testing situations.
- For Affy data, skip mismatch transformation for ChIP-chip experiments. Just median scale intensity values to 100 and quantile normalize across all chips.
- Use the Correlate app to check the correlation coefficient between your replicas and appropriate clustering of like chips.
For IPs, the r^2 values between replicas must be > 0.5 (or > 50 when reported as 100 * r^2) and should typically be > 0.8.
If they fall below these values, your IP/PCR/labeling protocols must be optimized. It cannot be emphasized enough: do not proceed with the full experiment until these
values have been achieved. Otherwise, considerable time and money will be wasted. Garbage in = garbage out.
Speckles and spots on the chip can throw off the correlation coefficient. If this is the case, use full quantile normalization and the robust pseudo median window scanning statistic.
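As a rough illustration of the normalization and replica check described above, here is a minimal Python sketch. It is not the code TiMAT2 runs. It median scales each chip to 100, quantile normalizes across chips, and reports r^2 between each pair of replicas. The function names and the toy intensities are made up for the example, and tied intensities are not treated specially.

```python
from statistics import mean, median


def median_scale(values, target=100.0):
    """Scale one chip so that its median intensity equals target."""
    m = median(values)
    return [v * target / m for v in values]


def quantile_normalize(chips):
    """Give every chip the same distribution: the mean of the sorted chips."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean across chips at each rank.
    reference = [mean(sc[rank] for sc in sorted_chips) for rank in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized


def r_squared(x, y):
    """Squared Pearson correlation coefficient between two chips."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Toy intensities for three replicas (made-up numbers, not real data).
replicas = [
    [50.0, 120.0, 80.0, 400.0, 95.0],
    [60.0, 150.0, 90.0, 520.0, 100.0],
    [40.0, 100.0, 75.0, 350.0, 85.0],
]
scaled = [median_scale(chip) for chip in replicas]
normalized = quantile_normalize(scaled)
for i in range(len(normalized)):
    for j in range(i + 1, len(normalized)):
        r2 = r_squared(normalized[i], normalized[j])
        verdict = "OK" if r2 > 0.5 else "optimize the protocol first"
        print(f"replica {i} vs {j}: r^2 = {r2:.3f} ({verdict})")
```

In practice the Correlate app does this check for you; the sketch is only meant to show what the 0.5 cutoff is being applied to.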
- Java version 1.4 or greater.
- These command line applications have been tested on MacOSX and Linux. They have not been tested on a Windows machine. All the source code, with extensive documentation, is included in the TiMAT2 package.
- A computer with > 2 GigaBytes RAM. One can use a machine with less memory but some programs will run rather slowly, hours instead of minutes.
Tiling analysis follows five basic steps:
- Obtain or make map file(s) to associate microarray intensities with genomic coordinates; these are specific to a genome release
- Normalize raw microarray intensities
- Map normalized intensities to the genome
- Use window smoothing to create an array of window summary scores
- Identify enriched (and/or reduced) regions by picking a threshold summary score and merging overlapping windows that exceed the threshold
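The last two steps can be pictured with a small Python sketch. This is a simplification under stated assumptions, not the TiMAT2 implementation: it uses a plain mean as the window summary score (TiMAT2 offers other statistics such as a trimmed mean or pseudo median), takes the start of the last oligo as the window end, and expects positions sorted by coordinate. The function names and the toy log ratios are hypothetical.

```python
def window_scores(positions, values, window_size):
    """Score a window starting at every oligo.

    A window holds all oligos whose start lies within window_size bp of
    the first oligo; its score is the mean of their values.
    """
    windows = []
    for i, start in enumerate(positions):
        j = i
        while j < len(positions) and positions[j] < start + window_size:
            j += 1
        members = values[i:j]
        windows.append((start, positions[j - 1], sum(members) / len(members)))
    return windows


def merge_windows(windows, threshold):
    """Merge overlapping windows scoring >= threshold into intervals."""
    intervals = []
    for start, end, score in windows:
        if score < threshold:
            continue
        if intervals and start <= intervals[-1][1]:
            # Overlaps the previous interval: extend it and keep the best score.
            last_start, last_end, best = intervals[-1]
            intervals[-1] = (last_start, max(last_end, end), max(best, score))
        else:
            intervals.append((start, end, score))
    return intervals


# Toy data: oligo start positions and made-up log2(IP/input) ratios.
positions = [0, 55, 110, 165, 220, 275, 330, 385, 440, 495]
ratios = [0.1, 0.0, 1.5, 2.0, 1.8, 0.2, -0.1, 0.1, 1.9, 2.1]

windows = window_scores(positions, ratios, window_size=120)
intervals = merge_windows(windows, threshold=1.0)
for start, end, best in intervals:
    print(f"interval {start}-{end}, best window score {best:.2f}")
```

Lowering the threshold widens and joins intervals, which is why the choice of threshold matters so much in the final step.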
Obtaining Map files
- TiMAT2 uses the Affymetrix tpmap file format. This is a text version of the binary probe map file, the bpmap. A bpmap can be converted to a tpmap using the bpmap2tpmap command line utility. Since this tool is not officially released, ask your Affy rep for the tpmap version of the officially released bpmaps.
- TPMaps can be generated from the NimbleGen NDF files using the ConvertNimblegenNDF2TPMap app.
- TPMaps might be generated from the Agilent txt results file, provided sequences are present, using the ConvertAgilentData app.
- You can create your own; it is a simple tab delimited txt file containing:
- Two comment lines: '#seq_group_name' and '#version'
- Oligo sequence (5'->3', sense strand)
- t or f (boolean describing whether the real oligo on the microarray will hybridize with the sense target strand)
- Chromosome name (e.g. chr1, as in the example below)
- Start position of the oligo, zero base coordinates
- Perfect match X microarray coordinate
- Perfect match Y microarray coordinate
- (Optional) Mismatch X microarray coordinate
- (Optional) Mismatch Y microarray coordinate
- (Optional) Mystery number 1 (for backward compatibility?)
- Example of a Sense, PM only tpmap
ACAATGTACAATACCGAAGGACTATTGCAGATAAATTCTTACAAGGTGATGTACGTGATC t chr1 0 290 2061
TGTAGGTCATGGGCTAGAAAGGCAAACTTGATATCGTTAGTTTACTATGCTTGGAACAAT t chr1 55 489 2571
TTGTGGTTTGACAAAGTTTGAACATGCCGACAGTGATATGGCTTACACTTCTGACTGTAG t chr1 110 422 781
ACCACAACTTGCGAGAATGTAAACTACGTATTCGATTTCAAGGACTTTGTTTATTTTGTG t chr1 165 108 2431
ACCGGGAAAGCATGCCCCGTACGCTTATCTACTTGTTATAAATGTGGCAAGGCCGACCAC t chr1 220 25 3181
AGATGAATGTCTAAGTTGCACTCTTGGTAACCCAGACACAGAAATTCGTGCTCATACCGG t chr1 275 279 411
GGGTAGACTCAACTGACATTTCAGCACAACTCGAGCGATTCTTTCGAGTGTATAAAGATG t chr1 330 130 371
CGTACCATTCATTTCGCCGAGTATAAAAGTCGTGTGCAAGCCGTCAAAAAACAATGGGTA t chr1 385 519 951
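If you build your own tpmap it is worth reading it back before using it. The following Python sketch parses lines laid out like the example above. It is only an illustration: the TpmapRow type and parse_tpmap function are hypothetical names, the column order is taken from the example rows, and the two comment lines are shown without values because their exact content is not specified here.

```python
from typing import NamedTuple, Optional


class TpmapRow(NamedTuple):
    sequence: str
    sense: bool            # True for 't', False for 'f'
    chromosome: str
    start: int             # zero based start of the oligo
    pm_x: int
    pm_y: int
    mm_x: Optional[int] = None
    mm_y: Optional[int] = None


def parse_tpmap(lines):
    """Return (comment_lines, rows) from whitespace/tab delimited tpmap lines."""
    comments, rows = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            comments.append(line)
            continue
        f = line.split()
        # Mismatch coordinates are optional; read them only when both are present.
        has_mm = len(f) > 7
        rows.append(TpmapRow(
            f[0], f[1] == "t", f[2], int(f[3]), int(f[4]), int(f[5]),
            int(f[6]) if has_mm else None,
            int(f[7]) if has_mm else None,
        ))
    return comments, rows


example = [
    "#seq_group_name",
    "#version",
    "ACAATGTACAATACCGAAGGACTATTGCAGATAAATTCTTACAAGGTGATGTACGTGATC\tt\tchr1\t0\t290\t2061",
    "TGTAGGTCATGGGCTAGAAAGGCAAACTTGATATCGTTAGTTTACTATGCTTGGAACAAT\tt\tchr1\t55\t489\t2571",
]
comments, rows = parse_tpmap(example)
for row in rows:
    print(row.chromosome, row.start, row.pm_x, row.pm_y, len(row.sequence))
```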
Building Map Files
Building map files is a very complex process fraught with many pitfalls. It should only be performed by tiling microarray professionals.
- Obtain fasta files for the genome you wish to map. Complex organisms should be RepeatMasked with X's or N's, not lower case.
- Convert the fasta headers into the convention used by UCSC and IGB. (e.g. > chromosome 1 : NCBI build 35.1 to >chr1)
- (Optional) Obtain fasta files of control sequences (bacterial, Arabidopsis, intergenic regions...) and include these with the genomic sequences. You may also like to create one particular fasta file and call it chrCtrls.fasta. This can be used in median scaling during array normalization.
- Create or obtain a xxx.1lq file. This can be obtained directly from your Affy rep. Do not use a rotated 7G specific xxx.1lq file. For NimbleGen and Agilent, you will need to create it. The PrintSelectColumns app may be helpful. You'll need to copy and paste entire columns, and most spreadsheet applications cannot handle the large map files. You may also need to transform the array coordinates into zero based coordinates. It is a tab delimited txt file containing:
- A variety of header lines; only one is required: 'X Y Seq Destype'
- X coordinate on the array (zero based)
- Y coordinate on the array (zero based)
- Oligo sequence (the original 1lq format specified 3'-5', you can leave it 5'-3' and throw the appropriate flag in MummerMapper)
- Destype (the original 1lq format specified a number used to describe the class of oligo, PM, MM, S, AS, Bright Control, Dark Control, etc. You can put anything here and use a regular expression in MummerMapper to select appropriate probes to map.)
- Example of a custom xxx.1lq file that can be read by MummerMapper:
X Y Seq Destype
0 3 TATATTTCTGCATATACATCATAGTCTCTAAAAACTGTACTAGGTCATACTCCATAGAGG Spo_P101594-R|Spombe|complement_chr2:8636..8695
0 4 AGTTATGACGAGCTTTGGAACGTTTATGACGACTGTTAGGATCGCAGTTTCCTGAAGTCT Spo_P110716-R|Spombe|complement_chr2:511391..511450
0 5 TCGTAGACCATCAAAAACTGTAACGAATAGTTCCAAAACGGTCACCACTAGACTTAGCAC Spo_P022386|Spombe|chr1:1232221..1232280
0 6 GACAATAATTTGGCGGCGCTGCTACCAATATAGCTGGAAGCAGTAAAGTATTAATCGTGC Spo_P043768-R|Spombe|complement_chr1:2408231..2408290
0 7 TAGGCTAAGCATCGAGTTCATGGCTAAATTAATAAAAGACTTTAAAAGTTTGTTAAAATG Spo_P000762-R|Spombe|complement_chr1:42901..42960
0 8 CCACGACCTTTTCCGTGAGTCCGGGGGTTGTAAGTCTTCTTTATACCGTAGCGGGTAGAA Spo_P041016-R|Spombe|complement_chr1:2256871..2256930
0 9 CATCAGCTATGAACTTTCACTTTTGTCTACAACTTTTTTAATCAGCTACAAATTTACAAA Spo_P042235-R|Spombe|complement_chr1:2323916..2323975
0 10 CAGGATTACAACAGGTTAAATGTCGATTTAACAAGGAAAGACTTACGTCACCACAGAGAA Spo_P105785|Spombe|chr2:240186..240245
- Run the MummerMapper app. Questions to address:
- How many exact matches to allow?
- Stranded? S, AS, or Both?
- Which probe Destypes?
- (Affy specific) Is this to be used on a pre 7G or 7G scanner?
- Are you loading large mammalian chromosomes? Run on a 64 bit machine!
- Do you want partial matches? What is the minimum sized oligo to match?
- Do you need to reverse the oligo sequences?
- Are MM oligos present?
- Test, test, and test the finished tpmap using known datasets. One mistake may create hundreds of hours of analysis headaches later on. Building tpmaps is a lot like taking pictures for your friend's wedding. Sounds easy but mess it up and you'll be mistrusted for life.
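Several of the questions above (which Destypes, zero based coordinates, reversing the oligos) can be checked on the xxx.1lq file before it ever reaches MummerMapper. Here is a minimal Python sketch of such a check. It is not MummerMapper and does not use its flags; read_1lq and its parameters are hypothetical names, and the rows are copied from the custom 1lq example above.

```python
import re


def read_1lq(lines, destype_pattern, coordinate_offset=0, reverse_sequences=False):
    """Select probes from 1lq lines whose Destype matches a regular expression.

    coordinate_offset is subtracted from X and Y (use 1 to turn one based
    array coordinates into zero based ones). reverse_sequences flips the
    oligo string end to end (3'-5' <-> 5'-3'); it does not complement it.
    """
    pattern = re.compile(destype_pattern)
    probes = []
    in_table = False
    for line in lines:
        fields = line.split()
        if not in_table:
            # Skip any other header lines until the required one is seen.
            in_table = fields[:4] == ["X", "Y", "Seq", "Destype"]
            continue
        if len(fields) < 4 or not pattern.search(fields[3]):
            continue
        seq = fields[2][::-1] if reverse_sequences else fields[2]
        probes.append((int(fields[0]) - coordinate_offset,
                       int(fields[1]) - coordinate_offset, seq, fields[3]))
    return probes


example = [
    "X\tY\tSeq\tDestype",
    "0\t3\tTATATTTCTGCATATACATCATAGTCTCTAAAAACTGTACTAGGTCATACTCCATAGAGG\tSpo_P101594-R|Spombe|complement_chr2:8636..8695",
    "0\t4\tAGTTATGACGAGCTTTGGAACGTTTATGACGACTGTTAGGATCGCAGTTTCCTGAAGTCT\tSpo_P110716-R|Spombe|complement_chr2:511391..511450",
    "0\t5\tTCGTAGACCATCAAAAACTGTAACGAATAGTTCCAAAACGGTCACCACTAGACTTAGCAC\tSpo_P022386|Spombe|chr1:1232221..1232280",
    "0\t6\tGACAATAATTTGGCGGCGCTGCTACCAATATAGCTGGAAGCAGTAAAGTATTAATCGTGC\tSpo_P043768-R|Spombe|complement_chr1:2408231..2408290",
]

# Sense strand chr1 probes only: the pipe must sit directly before 'chr1:'.
sense_chr1 = read_1lq(example, r"\|chr1:")
complement = read_1lq(example, r"complement_")
print(len(sense_chr1), "sense chr1 probe(s);", len(complement), "complement probe(s)")
```

Counting how many probes each pattern selects is a cheap first test of the finished map.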
Convert Raw Microarray Data Into xxx.cela Files
To use T2 you must convert your raw microarray intensity data into a format that can be read by TiMAT, serialized binary Java float files. These are compressed, load quickly, and are used by multiple TiMAT2 applications.
- Affymetrix text xxx.cel files can be converted to xxx.cela using the CelFileConverter app.
- Affymetrix binary xxx.cel files can be converted to text xxx.cel files using the CelFileConversion tool
- Agilent xxx.txt results files are converted to xxx.cela using the ConvertAgilentData app. This application parses the processed intensities and raw intensities. The latter are recommended for use in T2.
- NimbleGen text xxx.PAIR data files are converted to xxx.cela using the ConvertNimblegenPAIR2Cela app.
Option 1: Process your tiling data using the T2 application (Standard Analysis)
The T2 application is a wrapper for many TiMAT2 applications and can work with a cluster to farm out big jobs for rapid analysis.
- Convert your raw intensity data to binary xxx.cela or text xxx.cel files, see above.
- Create processed TPMapFiles by running the TPMapProcessor app on your tpmap file(s).
- Complete the t2ParamaterFile and launch the T2 app.
Option 2a: Process your single chip set tiling data using individual TiMAT applications (Custom Analysis)
This approach can be used to customize your analysis for non standard tiling experiments. Many options exist here that are not available in the T2 app.
- Convert your raw intensity data to binary xxx.cela files, see above.
- Create processed TPMapFiles by running the TPMapProcessor app on your tpmap file(s). Pay close attention to the size of the window option.
- (Optional) Run the VirtualCel application on a directory of converted cel files (xxx.cela) to create image files for visual inspection. Any large spots should be masked using the CelMasker app.
- Run the CelProcessor app on sets of xxx.cela files to scale, normalize, and map the data. This application generates xxx.celp files.
- (Optional) Use the ScatterPlot app to draw a simple scatter plot and calculate a Pearson correlation coefficient on processed cel files. For example, compare the different treatment chips to one another. There should be a good internal correlation (>=0.8).
- (Optional) Use HierarchicalClustering to cluster your xxx.celp files. Any file that fails to cluster with an r^2*100 value of > 50 should be removed from the analysis.
- Run ScanChip
- (Optional) Run ScanGenes to perform a basic expression array analysis using your tiling data.
- If you have performed mock IPs, run the FDRWindowConverter app to calculate an empirical FDR for each window.
- Merge high scoring, overlapping Windows into an array of Intervals with the IntervalMaker application. The biggest difficulty is where to set the threshold for merging windows. Two confidence estimations are provided by TiMAT2: an empirical FDR based on a mock IP and a statistical FDR based on Richard Bourgon's non-parametric symmetric p-test. Set these generously and manually filter. If you have multiple replicas, or have used different antibodies you can merge the different Window arrays using the MultiWindowIntervalMaker app.
- (Optional) The SetNumberIntervalMaker can also be used to make multiple interval arrays, each containing a specific number of intervals. This is useful, and recommended, for analysis independent of thresholds.
- Load the interval array(s) with oligo information using the LoadIntervalOligoInfo app.
- For ChIP-chip data, find the best average intensity difference sub window within each Interval, as well as enrichment peaks, using FindSubBindingRegions.
- (Optional) Score Intervals for the presence of a transcription factor using ScoreIntervals. You can also score whole chromosomes or a set of FASTA sequences using ScoreChromosomes and ScoreSequences respectively.
- (Optional) Filter Intervals with a variety of parameters using IntervalFilter. Sorts intervals into pass and fail.
- (Optional) Compare and split different Interval sets based on their overlap with the OverlapCounter. Use the IntersectRegions app for a more robust and sophisticated analysis.
- Print results:
- Print interval plots for processed intervals using IntervalPlotter. These are graphic representations of the individual oligo intensities for each data set, the averaged treatment chip, the averaged control chip, the intensity difference, the intensity ratio, smoothed (trimmed mean) ratios, as well as hits to a position specific probability matrix, number of matches, and the best window, and the best sub window. These plots can be saved to disk as PNG files or manipulated directly to print chromosomal coordinates for a point or region (including the genomic sequence).
- Print interval reports in a spreadsheet or page format using IntervalReportPrinter.
- Print Intervals as .sgr files for import into Affymetrix's IGB browser with IntervalGraphPrinter.
- Print all the oligo values as .sgr files for import into IGB using OligoIntensityPrinter.
- Print any serialized intensity values (i.e. processed xxx.celp files) associated with each oligo in the tpmap file as .sgr files for import into IGB using IntensityPrinter.
- Print the Intervals in GFF3 format with IntervalGFFPrinter.
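To give a feel for what a mock IP based empirical FDR means when choosing a threshold for merging windows, here is a small Python sketch. It assumes one common definition, the number of mock IP windows at or above a threshold divided by the number of real IP windows at or above it. This is an assumption made for illustration and is not necessarily how FDRWindowConverter computes its values. The function name and window scores are made up.

```python
def empirical_fdr(real_scores, mock_scores, threshold):
    """Fraction of real hits expected to be false, estimated from mock hits.

    Assumes the real and mock window sets are the same size and were
    scored the same way. Returns None when no real window passes.
    """
    real_hits = sum(1 for s in real_scores if s >= threshold)
    mock_hits = sum(1 for s in mock_scores if s >= threshold)
    if real_hits == 0:
        return None
    return min(1.0, mock_hits / real_hits)


# Made-up window summary scores for a real IP and a parallel mock IP.
real = [0.2, 0.5, 1.1, 1.4, 2.0, 2.5, 3.1, 0.9, 1.8, 2.2]
mock = [0.1, 0.3, 0.4, 0.8, 1.2, 0.6, 0.2, 1.0, 0.5, 0.7]

for threshold in (0.5, 1.0, 1.5):
    fdr = empirical_fdr(real, mock, threshold)
    print(f"threshold {threshold}: empirical FDR = {fdr:.2f}")
```

Scanning a few thresholds this way shows the trade-off between the number of intervals recovered and the confidence in them, which is the reason to set thresholds generously and filter afterwards.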
Option 2b: Process your multi chip set tiling data using individual TiMAT applications (Custom Analysis)
Multi-chip set analysis is very similar to single-chip set analysis and is the pipeline wrapped by the T2 app. The key differences are noted below.
- Set the -r flag when running the CelProcessor app to break the mapped intensities down by chromosome. The results in this case will be a directory instead of a single xxx.celp file for each array.
- Run the MakeChromosomeSets app to merge the chromosome specific normalized data directories together.
- Run ScanChromosomes instead of ScanChip.
- Run MergeWindowArrays if you processed individual chromosomes in ScanChromosomes.
- Use the LoadChipSetIntervalOligoInfo instead of the LoadIntervalOligoInfo.
Useful higher level analysis and utility applications
Be sure to check out the General Analysis and Utility applications. They might save you days of coding.