- Major update to the documentation.
- Exposed option to set PRNG seed when subsampling reads.
- Fixed issue #14: ‘detect’ and ‘error’ commands were broken. This involved rewriting those commands to use the same pipeline and reporting frameworks as the ‘trim’ and ‘qc’ commands.
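
To illustrate why a settable PRNG seed matters for subsampling, here is a minimal sketch, not the Atropos implementation. The function name `subsample` and its parameters are hypothetical; the point is only that a fixed seed makes the selected subset reproducible between runs.

```python
import random


def subsample(reads, fraction, seed=None):
    """Keep each read with probability `fraction` (hypothetical helper).

    A private generator is used so the global random state is untouched;
    passing the same seed yields the same subset.
    """
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


reads = [f"read{i}" for i in range(1000)]
first = subsample(reads, 0.1, seed=42)
second = subsample(reads, 0.1, seed=42)
```

With the same seed, `first` and `second` contain exactly the same reads; with `seed=None` the subset would generally differ from run to run.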
- Updated Dockerfile to use smaller, Alpine-based image.
- Added Docker image for v1.1.2 to Docker Hub.
- Updated Travis config to automatically build Docker images for each release.
- Ported over improvements to adapter parsing (635eea9) from Cutadapt.
- Fixed #12: tqdm progress bar not working.
- Fixed #13: unnecessary differences in summary output between Cutadapt and Atropos.
- New ‘qc’ command computes read-level statistics.
- The ‘trim’ command can also compute read-level statistics pre- and/or post-trimming using the new ‘--stats’ option.
- Major refactoring and improvement of reporting:
- Text report now has data lined up in columns.
- Reports can also be generated in JSON, YAML, and pickle formats.
- Added optional dependency on jinja2, which enables generating reports using templates (including user-defined).
- Major internal code reorganization.
- Static code analysis (pylint).
- Switched to pytest for testing.
- Command-specific help will now show with ‘atropos ‘ or ‘atropos -h’
- Fixed adapter masking in InsertAligner (issue #7; thanks @lllaaa).
- Added developer/contributor documentation and guidelines.
- Implemented Atropos module for MultiQC, which reads reports in JSON format. This is currently available here and will hopefully soon be merged into MultiQC.
- Ported some recent enhancements over from Cutadapt.
- Identified a subtle bug having to do with insufficient memory in
multi-threaded mode. The main thread appears to hang waiting for the
next read from the input file. This appears to occur only under a
strictly regulated memory cap, such as in a cluster environment. This
bug is not fixed, but I added the following:
- Set the default batch size based on the queue sizes
- Warn the user if their combination of batch and queue sizes might lead to excessive memory usage.
- Bug fixes
- Abstracted the ErrorEstimator class to enable alternate implementations.
- Added a new ShadowRegressionErrorEstimator that uses the ShadowRegression R package (Wang et al.) to more accurately estimate sequencing error rate. This requires that R, the ShadowRegression package, and its dependencies be installed: MASS and ReadCount, which in turn depend on a bunch of Bioconductor packages. At some point, this dependency will go away when I reimplement the method in pure Python.
- The error command now reports the longest matching read fragment, which is usually a closer match for the actual adapter sequence than the longest matching k-mer.
- Changed the order of trimming operations - OverwriteReadModifier is now after read and quality trimming.
- Refactored the main Atropos interface to improve testability.
- Added more unit tests.
- Fixed a major bug in OverwriteReadModifier, and in the unit tests for paired-end trimmers.
- Added OverwriteReadModifier, a paired-end modifier that overwrites one read end with the other if the mean quality over the first N bases (where N is user-specified) of one is below a threshold value and the mean quality of the other is above a second threshold. This dramatically improves the number of high-quality read mappings in data sets where there are systematic problems with one read-end.
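
The overwrite rule described above can be sketched as follows. This is an illustration under stated assumptions, not the actual OverwriteReadModifier code: reads are modeled as `(sequence, qualities)` tuples with integer Phred scores, all names are hypothetical, and the replacement read is assumed to be the reverse complement of the good read (with reversed qualities), since mates are sequenced from opposite strands.

```python
from statistics import mean

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def overwrite_poor_end(read1, read2, n, low, high):
    """Replace the poor read end with the good one (hypothetical sketch).

    The mean quality over the first `n` bases of each read is compared
    against `low` and `high`. Reads are assumed to be non-empty.
    """
    seq1, qual1 = read1
    seq2, qual2 = read2
    mean1 = mean(qual1[:n])
    mean2 = mean(qual2[:n])
    if mean1 < low and mean2 > high:
        # read1 is poor and read2 is good: rebuild read1 from read2
        return (reverse_complement(seq2), qual2[::-1]), read2
    if mean2 < low and mean1 > high:
        # read2 is poor and read1 is good: rebuild read2 from read1
        return read1, (reverse_complement(seq1), qual1[::-1])
    return read1, read2  # neither condition holds: leave the pair alone


good = ("ACGTAC", [38, 38, 38, 38, 38, 38])
poor = ("NNGNAC", [2, 2, 5, 2, 30, 30])
new1, new2 = overwrite_poor_end(good, poor, n=4, low=10, high=30)
```

Here `new1` is unchanged and `new2` becomes the reverse complement of the good read, carrying its qualities in reverse order.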
- Perform error correction when insert match fails but adapter matches are complementary
- Improvements to handling of cached adapter lists
- Merged reads are no longer written to --too-short-output by default
- Many bugfixes and improvements in deployment (including a Makefile)
- Migrate to Versioneer for version management.
- Enable stderr as a valid output using the ‘_’ shortcut.
- Add ability to specify SAM/BAM as input format.
- Add option to select which read to use when treating a paired-end interleaved or SAM/BAM file as single-end.
- Remove restrictions on the use of --merge-overlapping, and enable error correction during merging.
- We are beginning to move towards the use of commands for all operations, and read-trimming now falls under the ‘trim’ command. Currently, calling atropos without any command will default to the ‘trim’ command.
- When InsertAdapterCutter.symmetric is True and mismatch_action is not None, and the insert match fails but at least one adapter match succeeds and the adapter matches (if there are two) are complementary, the reads are treated as overlapping and error correction is performed. This leads to substantial improvements when one read is of good quality while the other is of poor quality.
- Fixed missing import bug in ‘detect’ command.
- Added estimate of fraction of contaminated reads to output of ‘detect’ command.
- Optionally cache list of known contaminants rather than re-download it every time.
- Implemented _align.MultiAligner, which returns all matches that satisfy the overlap and error thresholds. align.InsertAligner now uses MultiAligner for insert matching, and tests all matches in decreasing size order until it finds one with adapter matches (if any).
- Major improvements to the accuracy of the ‘detect’ command.
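
The match-selection order described for align.InsertAligner above might look roughly like the sketch below. The `Match` class and `pick_insert_match` function are hypothetical stand-ins, not the real `_align.MultiAligner` API: candidates are tried from largest to smallest overlap, and the first one that also has adapter matches wins.

```python
from dataclasses import dataclass


@dataclass
class Match:
    overlap: int        # length of the candidate insert overlap
    has_adapters: bool  # whether adapter matches were found alongside it


def pick_insert_match(matches):
    """Return the largest match that has adapter matches, else None."""
    for match in sorted(matches, key=lambda m: m.overlap, reverse=True):
        if match.has_adapters:
            return match
    return None


candidates = [Match(20, False), Match(35, False), Match(28, True)]
best = pick_insert_match(candidates)
```

In this example the longest candidate (overlap 35) is rejected because it has no adapter matches, so the 28-base candidate is selected.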
- Added options for how to correct mismatched bases for which qualities are equal.
- Added option to select a single pair of adapters from multiple sequences in a fasta file.
- Fixed report when insert-match is used.
- Fixed several bugs when using the “message” progress bar (thanks to Thomas Cokelaer!).
- Fixed a segmentation fault that occurs when trying to trim zero-length reads with the insert aligner.
- Several other bugfixes.
- Add options to specify max error rates for insert and adapter matching within insert aligner.
- Add new command to estimate empirical error rate in data set from base qualities.
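
As a rough illustration of estimating an error rate from base qualities, the sketch below converts each Phred score Q to an error probability of 10^(-Q/10) and averages over all bases. It assumes Phred+33 encoded quality strings; the function names are hypothetical and this is not necessarily how the Atropos command computes its estimate.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def estimate_error_rate(quality_strings, offset=33):
    """Average per-base error probability over all quality strings.

    `offset` is the ASCII offset of the encoding (33 assumed here).
    Returns 0.0 when there are no bases at all.
    """
    total = 0.0
    n_bases = 0
    for quals in quality_strings:
        for char in quals:
            total += phred_to_error_prob(ord(char) - offset)
            n_bases += 1
    return total / n_bases if n_bases else 0.0


# 'I' encodes Q40 and '5' encodes Q20 in Phred+33
rate = estimate_error_rate(["IIII", "5555"])
```

For this input, half the bases have an error probability of 0.0001 and half have 0.01, so the estimate is their average, 0.00505.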
- Add ability to correct errors during insert-match adapter trimming.
- Implement additional adapter-detection algorithms.
- Fix bug where default output file is force-created in parallel-write mode
- Clarify and fix issues with bisulfite trimming. Notably, rrbs and non-directional are now allowed independently or in combination.
- Introduced new ‘detect’ command for automatically detecting adapter sequences.
- Options are now required to specify input files.
- Major updates to documentation.
- Bugfix release
- Reverted previously introduced (and no longer necessary) dependency on bitarray.
- Switched the insert aligner back to the default implementation, as the one that ignores indels is not any faster.
- Re-engineered modifiers.py (and all dependent code) to enable use of modifiers that simultaneously edit both reads in a pair.
- Add --op-order option to enable the user to specify the order of the first four trimming operations.
- Implemented insert-based alignment for paired-end adapter trimming. This is currently experimental. Benchmarking against SeqPurge and Skewer using simulated reads showed that the method Cutadapt uses to align adapters, while optimal for single-end reads, is much less sensitive and specific than the insert match algorithms used by SeqPurge and Skewer. Our algorithm is similar to the one used by SeqPurge but leverages the dynamic programming model of Cutadapt.
- Based on tests, worker compression is faster than writer compression when more than 8 threads are available, so set this to be the default.
- Internal code reorganization: compression code moved to a separate module
- Eliminated the --worker-compression option in favor of --compression (whose value is either ‘worker’ or ‘writer’)
- More documentation improvements
- Significant performance improvements:
- Start an extra worker once the main process is finished loading reads
- Use system-level gzip for writer compression
- Use writer compression by default
- More documentation fixes
- Disable quality trimming if all cutoffs are set to 0
- Eliminated the --parallel-environment option
- Fix documentation bugs associated with migration from optparse to argparse
- Initial release (forked from cutadapt 1.10)
- Re-wrote much of filters.py and modifiers.py to separate
modifying/filtering from file writing.
- File writing is now managed by a separate class (seqio.Writers)
- There are container classes for managing filters (filters.Filters) and modifiers (modifiers.Modifiers)
- Re-wrote all of the output-oriented code in seqio.py
- Formatting Sequence objects is now separate from writing data
- There is a container class (seqio.Formatters) that manages the formatters for output files
- Added support for interleaved output
- Implemented multiprocessing
- Added several new options in scripts.atropos to control parallelization
- Wrote all of the parallel processing code in atropos.multicore
- Renamed scripts.atropos.process_single_reads() to scripts.atropos.run_serial() and rewrote to work similarly to atropos.multicore.run_parallel()
- Added ability to merge report statistics from multiple worker threads
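
Merging report statistics from several workers can be pictured as summing counters key by key. The sketch below assumes each worker produces a (possibly nested) dictionary of numeric counts; `merge_stats` is a hypothetical helper, not the function used in atropos.multicore.

```python
def merge_stats(dest, src):
    """Add the counts in `src` into `dest`, recursing into nested dicts."""
    for key, value in src.items():
        if isinstance(value, dict):
            merge_stats(dest.setdefault(key, {}), value)
        else:
            dest[key] = dest.get(key, 0) + value
    return dest


worker_reports = [
    {"reads": 10, "adapters": {"A": 2}},
    {"reads": 5, "adapters": {"A": 1, "B": 4}},
]
merged = {}
for report in worker_reports:
    merge_stats(merged, report)
```

After the loop, `merged` holds the totals across both workers, and the per-worker dictionaries are left unmodified.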
- Added miRNA and bisulfite sequencing options to scripts.atropos
- Added progress bar support
- Switched argument parsing to argparse
- Reorganized the monolithic scripts.atropos.main() into multiple functions
- Dropped all support for Python 2.x