An article published in Genome Biology presents GreenHill, a novel scaffolding and phasing tool that reconstructs chromosome-level haplotypes from Hi-C data.
]), Hi-C scaffolding, and Hi-C phasing. CPU time, real (wall-clock) time, and peak memory usage of the tools were measured with the GNU time command on a computer with an Intel Xeon Gold 6342 CPU and 512 GB of RAM. The number of threads was set to 48 for each process whenever it was configurable (see Table S4). GreenHill required similar or less time than the other approaches, and it generated assemblies within approximately 1 h for data with small genome sizes.
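To make the three reported quantities concrete, the following is a minimal sketch of how CPU time, real time, and peak memory can be collected for a child process. It is not the authors' benchmarking script: the article states only that GNU time was used, and the helper name `measure` and the placeholder command are assumptions made for illustration. The sketch relies on Python's standard `resource` module, so it runs on Unix-like systems only.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (cpu_seconds, wall_seconds, peak_rss).

    Hypothetical helper, roughly analogous to what GNU time reports.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time is user + system time added by the child we just waited for.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)

    # ru_maxrss is a high-water mark over all waited-for children, not a
    # per-run delta. Units are platform dependent (KiB on Linux, bytes on macOS).
    peak_rss = after.ru_maxrss
    return cpu, wall, peak_rss


if __name__ == "__main__":
    # Placeholder workload; a real benchmark would launch the assembler here.
    cpu, wall, peak = measure([sys.executable, "-c", "sum(range(10**6))"])
    print(f"cpu={cpu:.2f}s wall={wall:.2f}s peak_rss={peak}")
```

With multithreaded tools, CPU time can exceed real time because it is summed over all threads, which is why both figures are usually reported side by side.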
In the third and fourth benchmarks on real data, the CLR input contigs were contiguous and relatively fragmented, respectively. Short contigs are difficult to order and orient because few Hi-C reads map to them, so fragmented contigs can be challenging for scaffolders. GreenHill achieved the highest values for all phasing-quality metrics with both contiguous and fragmented input contigs, suggesting its versatility.
We tested GreenHill’s performance on species with heterozygosity ranging from 0.21 to 1.47. The heterozygosity of a genome can have a significant impact on assembly quality: high heterozygosity can lead to increased fragmentation and misassembly, while low heterozygosity leaves few heterozygous sites and therefore makes phasing difficult. On all datasets, GreenHill constructed highly accurate and contiguous haplotypes, showing high robustness to heterozygosity.
In terms of algorithms, the distinctive features of GreenHill include the simultaneous use of long reads and Hi-C reads, and error correction using Hi-C contact information with variance-based threshold selection [