Welcome to grep_vcf’s documentation!
User Guide
grep_vcf is a small tool that filters a vcf file based on a position file, and vice versa. The position file must be a tab-separated file with a genomic position in the first column. The tool is designed to handle big files without consuming a large amount of memory.
Usage
positional arguments:
- positions: The text file with the positions to look for in the vcf file. It must be a tsv file (https://en.wikipedia.org/wiki/Tab-separated_values) where the positions are in the first column. Lines starting with ‘#’ are considered comments.
optional arguments:
- -h, --help: Show this help message and exit.
- --vcf VCF: The path to the vcf file. By default grep_vcf searches for the same path as the position file but with ‘.vcf’ as extension.
- --out OUT: The path to an output file; the default is stdout. If the file exists, it will be replaced.
- --invert, -v: Invert the sense of matching, to select non-matching vcf lines.
- --switch: Filter the position file to keep the lines whose position matches in the vcf file.
- --version, -V: Display version information and quit.
Requirements
grep_vcf needs Python >= 3.6 (tested with 3.6, 3.7 and 3.8).
Installation
pip install git+https://github.com/bneron/grep_vcf.git#egg=grep_vcf
Developer Guide
Installation
The recommended way to install grep_vcf for development is to use a virtualenv:
python -m venv grep_vcf
cd grep_vcf
source bin/activate
git clone https://github.com/bneron/grep_vcf.git
cd grep_vcf
pip install -e .[dev]
Overview
There are 2 main files:
- grep_vcf/grep_vcf.py, which is the module
- grep_vcf/scripts/grep_vcf.py, which is the entry point to run grep_vcf from the command line.
API
Module API
The module mainly contains two functions:
- match_generator, which keeps the lines of the target file whose position is found in the reference file.
- invert_match_generator, which filters out the lines of the target file whose position is found in the reference file.
These two functions are generators, so that they can work in constant memory even with big files.
Note
In both cases, lines starting with # are considered comments and are ignored.
The other functions are helpers.
- grep_vcf.grep_vcf._parse_line(file)
Go to the next line and parse it: extract the first field and convert it to an int. Comments (lines starting with #) are ignored.
Parameters: file (a file object) – the file to parse. It must be a tsv file with an integer as first column.
Returns: the parsed position
Return type: int
Raises:
- StopIteration – when the end of the file is reached
- ValueError – when the first column cannot be cast to an integer
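The behaviour described for this helper can be illustrated with a minimal sketch. The function name `parse_position` and the sample data are hypothetical and are not taken from the grep_vcf sources; the sketch only assumes what the description states (skip comments, return the first column as an int, let `StopIteration` and `ValueError` propagate).

```python
import io


def parse_position(file):
    """Return the integer found in the first column of the next data line."""
    while True:
        line = next(file)  # raises StopIteration at the end of the file
        if line.startswith("#"):
            continue  # comment lines are skipped
        # int() raises ValueError when the first field is not an integer
        return int(line.split("\t", 1)[0])


# Hypothetical in-memory tsv content standing in for a position file.
positions = io.StringIO("# header\n12\tA\n30\tC\n")
first = parse_position(positions)
second = parse_position(positions)
```

Relying on `next(file)` lets the end-of-file signal come from the file object itself, so the helper needs no extra bookkeeping.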
- grep_vcf.grep_vcf._until_the_end(file)
Iterate over the lines until the end of the file. Skip lines starting with ‘#’.
Parameters: file – the file to iterate over
Returns: lines
Return type: str
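As an illustration only, a generator with this behaviour could look like the sketch below. The name `remaining_lines` and the sample content are hypothetical; the real helper may differ in its details.

```python
import io


def remaining_lines(file):
    """Yield every remaining line of the file, skipping '#' comments."""
    for line in file:
        if not line.startswith("#"):
            yield line


# Hypothetical file content; the first data line is consumed beforehand
# to show that iteration resumes from the current position in the file.
handle = io.StringIO("1\ta\n# comment\n5\tx\n# another\n9\ty\n")
next(handle)
rest = list(remaining_lines(handle))
```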
- grep_vcf.grep_vcf.invert_match_generator(ref_file, target_file)
Create a generator that iterates over the lines of target_file whose position does not appear in the reference file. The positions are extracted from the first column of ref_file and target_file.
Parameters:
- ref_file (file object) – the text file to extract the positions from
- target_file (file object) – the vcf to compare
Returns: a generator
Return type: generator
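One way to filter out matching lines in constant memory is to stream both files in parallel. The sketch below is not the grep_vcf implementation: the names `_records` and `sketch_invert_match` are hypothetical, and the approach assumes that both files are sorted by increasing position, which this documentation does not state.

```python
import io


def _records(file):
    """Yield (position, line) for every non-comment line of a tsv file."""
    for line in file:
        if not line.startswith("#"):
            yield int(line.split("\t", 1)[0]), line


def sketch_invert_match(ref_file, target_file):
    """Yield the target lines whose position is absent from ref_file.

    Assumption: both files are sorted by increasing position.
    """
    ref_positions = (pos for pos, _ in _records(ref_file))
    ref_pos = next(ref_positions, None)
    for pos, line in _records(target_file):
        # Advance the reference until it catches up with the target position.
        while ref_pos is not None and ref_pos < pos:
            ref_pos = next(ref_positions, None)
        if ref_pos != pos:
            yield line


# Hypothetical sample data.
ref = io.StringIO("# positions\n10\n30\n")
target = io.StringIO("# vcf header\n10\tA\n20\tC\n30\tG\n40\tT\n")
kept = list(sketch_invert_match(ref, target))
```

Only one line of each file is held at a time, so memory use does not grow with the file size.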
- grep_vcf.grep_vcf.match_generator(ref_file, target_file)
Create a generator that iterates over the lines of target_file whose position appears in the reference file. The positions are extracted from the first column of ref_file and target_file.
Parameters:
- ref_file (file object) – the text file to extract the positions from
- target_file (file object) – the vcf to compare
Returns: a generator
Return type: generator
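To show how matching lines can be kept while reading both files only once, here is a minimal sketch. It is not the grep_vcf implementation: `_records` and `sketch_match` are hypothetical names, and the sketch assumes that both files are sorted by increasing position, which this documentation does not state.

```python
import io


def _records(file):
    """Yield (position, line) for every non-comment line of a tsv file."""
    for line in file:
        if not line.startswith("#"):
            yield int(line.split("\t", 1)[0]), line


def sketch_match(ref_file, target_file):
    """Yield the target lines whose position is present in ref_file.

    Assumption: both files are sorted by increasing position.
    """
    ref_positions = (pos for pos, _ in _records(ref_file))
    ref_pos = next(ref_positions, None)
    for pos, line in _records(target_file):
        # Skip reference positions that are smaller than the target position.
        while ref_pos is not None and ref_pos < pos:
            ref_pos = next(ref_positions, None)
        if ref_pos is None:
            return  # reference exhausted: no further match is possible
        if ref_pos == pos:
            yield line


# Hypothetical sample data.
ref = io.StringIO("# positions\n10\n30\n")
target = io.StringIO("# vcf header\n10\tA\n20\tC\n30\tG\n40\tT\n")
kept = list(sketch_match(ref, target))
```

Because the result is a generator, lines can be written to the output as they are produced instead of being collected in a list first.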
Scripts API
- grep_vcf.scripts.grep_vcf.get_version_message()
Returns: the version information
Return type: str
- grep_vcf.scripts.grep_vcf.main(args=None)
Parameters: args (list of str) – the arguments to use for the run
- grep_vcf.scripts.grep_vcf.parse_args(args)
Parameters: args (list of strings, without the program name) – the arguments provided on the command line
Returns: the parsed arguments
Return type: argparse.Namespace object
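For orientation, a parser accepting the options listed in the Usage section could be built with argparse roughly as follows. This is a sketch rather than the actual parse_args code: the function name `sketch_parse_args`, the help strings and the file names are hypothetical, and the --version option is left out because the version string is not given here.

```python
import argparse


def sketch_parse_args(args):
    """Parse a list of command-line arguments (without the program name)."""
    parser = argparse.ArgumentParser(prog="grep_vcf")
    parser.add_argument("positions",
                        help="tsv file with the positions in the first column")
    parser.add_argument("--vcf",
                        help="path to the vcf file")
    parser.add_argument("--out",
                        help="path to the output file (default: stdout)")
    parser.add_argument("--invert", "-v", action="store_true",
                        help="select the non-matching vcf lines")
    parser.add_argument("--switch", action="store_true",
                        help="filter the position file instead of the vcf")
    return parser.parse_args(args)


# Hypothetical file names, used only to exercise the parser.
parsed = sketch_parse_args(["variants.tsv", "--vcf", "sample.vcf", "-v"])
```

Passing the argument list explicitly, rather than reading sys.argv inside the function, keeps the parser easy to call from tests.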