Sometime during January this year, I was surfing the internet and came upon a microbial identification challenge sponsored by the US FDA. The challenge was about to get underway, and it reminded me of a project I’d started with the Army’s Biotechnology High Performance Computing Software Applications Institute (BHSAI) and the National Cancer Institute (NCI) in Frederick, Maryland. The aim of that project, which I’d started in 2015, was to rapidly identify battlefield and hospital pathogens, with the hope of using a small mobile DNA sequencer (like the Oxford Nanopore MinION) and a cloud-based system for taxonomic identification. The unique twist that I brought to the project was experience with an exacting local best-match algorithm, Smith-Waterman, to use for the identification.
But by February 2018, I was slammed with work starting up a new genetic test, and I never thought I’d have time to compete in the challenge. Yet, as luck would have it, about the middle of the month, I was out of a job!
After a few weeks of R&R and recovery time, I set about enrolling and competing in the challenge. I revived my old code, tried working the examples on my MacBook Pro laptop (the only computer conveniently at my disposal) and started running the 4-core machine continuously, overnight. My 2012 vintage machine was top of the line when it was purchased, and I’d put in a 2 TB hard drive late in February. It was time to teach the Fast Artificial Neural Net (FANN) to adjust the quality scores for the FDA sequences of interest. Time to focus, then, on being able to identify Salmonella enterica subsp. enterica serovar Newport and distinguish it from all other strains!
The motivation for the detection software was the nosocomial outbreak of carbapenem-resistant Klebsiella pneumoniae (KPC) that had killed 11 patients at the NIH Clinical Center in 2011. That, combined with the report by my friend Chris Mason that Bacillus anthracis could be found in the NYC subway, provided sufficient impetus for running alignments and writing scripts.
In 2013-14, it was still early days for Oxford Nanopore sequencing, but I knew that the error profile would improve with time. It seemed foolish not to begin work.
And so I labored over the program, hoping that, at some point, I would be able to publish the work with a small cadre of colleagues. Here is the precis that I included with the FDA challenge submission. I hope to provide more details a month or two from now:
Each sample was aligned to the 14 whole-genome Newport strains listed in the MBGD database (Uchiyama et al., 2015) using MosaikAligner 2.2.3 (Lee et al., 2014) after compression with MosaikBuild. The aligner uses a striped Smith-Waterman implementation to search for local homology and a Fast Artificial Neural Net library to refine the sequence search. The alignment parameters were: align all reads to all positions, a maximum mismatch percent threshold of 0.1, a minimum percent alignment threshold of 0.5, and a hash size of 15 with a hash position threshold of 100 bp. A Perl script was used to retain only those alignments with a samtools MAPQ >= 35 and a perfect match of 150 bp or greater. Computing was done entirely on a mid-2012 vintage MacBook Pro with a 2.9 GHz Intel Core i7 microprocessor and 8 GB of 1600 MHz DDR3 RAM. The serovar was identified through comparative counting of qualified alignments. Additional work involved alignment to the 408 Salmonella FASTA files listed in the MBGD database. This final step was not completed in time for submission to the challenge. A manuscript describing the methods in more detail is in preparation.
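As a rough illustration of the filtering and counting steps (this is not the original Perl script), a minimal Python sketch might look like the following. It assumes the alignments are available as SAM text, and it interprets "perfect match" as a single match block of at least 150 bp in the CIGAR string together with an NM:i:0 edit-distance tag. Both of those are assumptions on my part, as are the function names and the strain names in the usage note.

```python
import re
from collections import Counter

# Assumption: a "perfect match" is a CIGAR made of one M (or =) block only,
# with no insertions, deletions, or clipping.
SINGLE_MATCH_BLOCK = re.compile(r"(\d+)[M=]")


def qualified_reference(sam_line, min_mapq=35, min_match=150):
    """Return the reference name if the alignment passes the filter, else None."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:  # a SAM record has 11 mandatory columns
        return None
    rname, mapq, cigar = fields[2], int(fields[4]), fields[5]
    if rname == "*" or mapq < min_mapq:
        return None
    block = SINGLE_MATCH_BLOCK.fullmatch(cigar)
    if block is None or int(block.group(1)) < min_match:
        return None
    # Assumption: the aligner reports edit distance in an NM tag,
    # and zero mismatches means NM:i:0.
    if "NM:i:0" not in fields[11:]:
        return None
    return rname


def count_by_reference(sam_lines, min_mapq=35, min_match=150):
    """Count qualified alignments per reference sequence."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        ref = qualified_reference(line, min_mapq, min_match)
        if ref is not None:
            counts[ref] += 1
    return counts
```

With a SAM file exported from the aligner, something like `count_by_reference(open("sample.sam")).most_common(3)` would list the references with the most qualified alignments, which is the kind of comparative count described above. The file name here is hypothetical.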
References:
Lee WP, Stromberg MP, Ward A, Stewart C, Garrison EP, Marth GT. MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One. 2014 Mar 5;9(3):e90581. doi: 10.1371/journal.pone.0090581. eCollection 2014. PubMed PMID: 24599324; PubMed Central PMCID: PMC3944147.
Uchiyama I, Mihara M, Nishide H, Chiba H. MBGD update 2015: microbial genome database for flexible ortholog analysis utilizing a diverse set of genomic data. Nucleic Acids Res. 2015 Jan;43(Database issue):D270-6. doi: 10.1093/nar/gku1152. Epub 2014 Nov 14. PubMed PMID: 25398900; PubMed Central PMCID: PMC4383954.
Acknowledgement:
The author wishes to thank the FDA for motivating participation in the challenge and John Plaschke, Stefan Stefanov of NCBI, Kate Im, Brian Bushnell and Chris Mason for supporting earlier versions of this work.