Last updated: 2023-04-05
Checks: 7 passed, 0 failed
Knit directory: GlobalStructure/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20230404) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version baf2fd4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Ignored: 1_Raw/.DS_Store
Ignored: 2_Derived/.DS_Store
Ignored: 3_Results/.DS_Store
Ignored: renv/library/
Ignored: renv/staging/
Unstaged changes:
Modified: 3_Results/SlipAndJump.HomologyAndRepeats.txt
Modified: 3_Results/heatmap_folding_stability.svg
Modified: 3_Results/heatmap_global_folding_sw100.pdf
Modified: 3_Results/heatmap_microhomology_AIC.pdf
Modified: 3_Results/heatmap_microhomology_AIC_wt_deledions.svg
Modified: 3_Results/violin_rep_center_release.pdf
Modified: 3_Results/violin_rep_flanked_length_release.pdf
Modified: 3_Results/violin_rep_folding_infsign_np.pdf
Modified: 3_Results/violin_rep_folding_infsign_p.pdf
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/methods.Rmd) and HTML (docs/methods.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | baf2fd4 | Evgenii O. Tretiakov | 2023-04-05 | Start workflowr project. |
# Presentation
library("glue")
library("knitr")
# JSON
library("jsonlite")
# Tidyverse
library("tidyverse")
dir.create(here::here("3_Results", DOCNAME), showWarnings = FALSE)
write_bib(c("base", "reshape2", "tidyverse", "here", "furrr", "ggpubr",
"cowplot", "patchwork", "raster", "skimr", "pspearman",
"ggstatsplot", "ggasym", "gridExtra", "dplyr", "tidyr",
"magrittr", "stringr", "future", "purrr", "workflowr", "knitr",
"kableExtra", "rmarkdown"),
file = here::here("3_Results", DOCNAME, "packages.bib"))
versions <- list(
cowplot = packageVersion("cowplot"),
dplyr = packageVersion("dplyr"),
furrr = packageVersion("furrr"),
future = packageVersion("future"),
ggasym = packageVersion("ggasym"),
ggplot2 = packageVersion("ggplot2"),
ggpubr = packageVersion("ggpubr"),
ggstatsplot = packageVersion("ggstatsplot"),
gridExtra = packageVersion("gridExtra"),
here = packageVersion("here"),
kableExtra = packageVersion("kableExtra"),
knitr = packageVersion("knitr"),
magrittr = packageVersion("magrittr"),
pandoc = rmarkdown::pandoc_version(),
patchwork = packageVersion("patchwork"),
pspearman = packageVersion("pspearman"),
purrr = packageVersion("purrr"),
python = "3.8.8",
R = str_extract(R.version.string, "[0-9\\.]+"),
raster = packageVersion("raster"),
reshape2 = packageVersion("reshape2"),
rmarkdown = packageVersion("rmarkdown"),
skimr = packageVersion("skimr"),
stringr = packageVersion("stringr"),
tidyr = packageVersion("tidyr"),
tidyverse = packageVersion("tidyverse"),
viridis = packageVersion("viridis"),
workflowr = packageVersion("workflowr")
)
For each MitoBreak deletion located in the major arc (positions 5781-16569), we computed its midpoint. Next, each real deletion was moved to a random position within the major arc, and the midpoints of the relocated deletions were computed as well. The standard deviations of the midpoints were then obtained for the observed and the randomly simulated deletions and compared.
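The randomization described above can be sketched in R as follows. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, and it assumes that each deletion keeps its length and is placed uniformly at random so that it stays entirely within the major arc.

```r
# Major arc boundaries as given in the text.
major_arc <- c(5781, 16569)

# Midpoint of each deletion from its 5' and 3' coordinates.
midpoints <- function(start, end) (start + end) / 2

# Assumption: a relocated deletion keeps its length and receives a new start
# drawn uniformly from all positions that keep it inside the arc.
relocate_deletions <- function(start, end, arc = major_arc) {
  len <- end - start
  stopifnot(all(len >= 0), all(len <= arc[2] - arc[1]))
  n_positions <- arc[2] - arc[1] - len + 1
  new_start <- arc[1] + floor(runif(length(len)) * n_positions)
  data.frame(start = new_start, end = new_start + len)
}

# Compare the observed SD of midpoints with the SDs from n_perm relocations.
compare_midpoint_sd <- function(start, end, n_perm = 1000, arc = major_arc) {
  observed <- sd(midpoints(start, end))
  simulated <- replicate(n_perm, {
    r <- relocate_deletions(start, end, arc)
    sd(midpoints(r$start, r$end))
  })
  list(observed = observed,
       simulated = simulated,
       # Share of simulations whose spread is as small as or smaller than observed.
       p_lower = mean(simulated <= observed))
}
```

Calling `compare_midpoint_sd(start, end)` on the 5’ and 3’ coordinates of the deletions returns the observed standard deviation together with its simulated null distribution.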
The publicly available mtDNA contact matrix was visualized using Juicebox (Robinson et al. 2018). The methodology used to obtain these Hi-C data, derived from a human lymphoblastoid cell line, is described by Rao et al. (2014). Additionally, we obtained six Hi-C mtDNA contact matrices from olfactory receptors of COVID-19 patients and controls. Details of the in situ Hi-C protocol and of the bioinformatics analyses are given in the original paper (Zazhytska et al. 2022). These matrices were also visualized using Juicebox (Robinson et al. 2018).
We used the heavy chain of the reference human mtDNA sequence (NC_012920.1) because, according to the asymmetric model of mtDNA replication, it spends the most time in a single-stranded state (Persson et al. 2019). Using Mfold (Zuker 2003) with parameters set for DNA folding and a circular sequence, we prevented everything except the major arc from forming base pairs. We thereby obtained the global (genome-wide) secondary structure, which we translated into the number of hydrogen bonds connecting our regions of interest (100 bp windows for the analyses and visualization). Next, within the single-stranded heavy chain of the major arc, we defined 100 bp windows and hybridized all possible pairs of such windows using the ViennaRNA Package 2 (Lorenz et al. 2011). The Gibbs free energy obtained for each pair of windows was used as a measure of the strength of the potential interaction between the two single-stranded DNA regions.
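As an illustration of how a predicted secondary structure can be translated into hydrogen bonds per pair of 100 bp windows, a minimal R sketch follows. It is written under stated assumptions and is not the original implementation: the input format (a table of paired positions `i` and `j` plus a vector of nucleotides indexed by genome position), the function names, and the bond weights (3 for G-C, 2 for A-T, 0 for any other pair) are all chosen here for illustration.

```r
# Index of the 100 bp window that contains a genome position.
window_index <- function(pos, arc_start = 5781, width = 100) {
  (pos - arc_start) %/% width + 1
}

# pairs: data.frame with columns i and j (paired genome positions, major arc).
# bases: character vector where bases[p] is the nucleotide at position p.
hbond_matrix <- function(pairs, bases,
                         arc_start = 5781, arc_end = 16569, width = 100) {
  # Assumed weights: Watson-Crick pairs only; anything else contributes 0.
  bonds <- c(AT = 2, TA = 2, GC = 3, CG = 3)
  h <- unname(bonds[paste0(bases[pairs$i], bases[pairs$j])])
  h[is.na(h)] <- 0

  n_win <- ceiling((arc_end - arc_start + 1) / width)
  wi <- window_index(pairs$i, arc_start, width)
  wj <- window_index(pairs$j, arc_start, width)

  # Symmetric window-by-window matrix of summed hydrogen bonds.
  m <- matrix(0, n_win, n_win)
  for (k in seq_along(h)) {
    m[wi[k], wj[k]] <- m[wi[k], wj[k]] + h[k]
    if (wi[k] != wj[k]) m[wj[k], wi[k]] <- m[wj[k], wi[k]] + h[k]
  }
  m
}
```

The resulting matrix has one row and one column per window, so it can be plotted directly as a heatmap or compared cell by cell with other window-based matrices.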
For each pair of 100 bp windows, we estimated the number of nucleotides involved in at least one degraded inverted or direct repeat, where a repeat is counted only if one of its arms lies in the first window and the other arm in the second window. All degraded repeats in the human mtDNA (with a similarity of at least 80% between arms) were called using our previously described algorithm (Shamanskiy et al. 2019).
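The counting step for a single pair of windows might look like the sketch below. The column names (`arm1_start`, `arm1_end`, `arm2_start`, `arm2_end`) and the function name are hypothetical, and two choices are assumptions rather than documented details of the analysis: an arm must lie entirely inside its window, and covered nucleotides are counted in both windows.

```r
# repeats: data.frame with columns arm1_start, arm1_end, arm2_start, arm2_end.
# w1, w2: c(start, end) coordinates of the two windows.
repeat_coverage <- function(repeats, w1, w2) {
  inside <- function(s, e, w) s >= w[1] & e <= w[2]

  # Keep repeats with the first arm in w1 and the second arm in w2.
  hit <- repeats[inside(repeats$arm1_start, repeats$arm1_end, w1) &
                   inside(repeats$arm2_start, repeats$arm2_end, w2), ]

  # Positions covered by at least one arm; overlapping arms are counted once.
  pos1 <- unlist(Map(seq, hit$arm1_start, hit$arm1_end))
  pos2 <- unlist(Map(seq, hit$arm2_start, hit$arm2_end))
  length(unique(pos1)) + length(unique(pos2))
}
```

Applying this function to every pair of windows fills a window-by-window matrix of repeat coverage.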
For clustering, we used all MitoBreak (Damas et al. 2014) deletions from the major arc. The 5’ and 3’ coordinates of each deletion served as input for a hierarchical density-based clustering algorithm, HDBSCAN (Python hdbscan v0.8.24) (McInnes and Healy 2017). HDBSCAN builds on DBSCAN, a well-known density-based clustering algorithm that detects clusters as regions where sample points are more densely located and labels the remaining points as outliers. An advantage of HDBSCAN is that it supports soft clustering. We varied the cluster density parameters to assess cluster stability and found that the clusters remain relatively stable over a wide range of parameter values. Thus, HDBSCAN produces a robust set of clusters, providing additional evidence for regions with elevated deletion rates. As a data exploration step, we also performed affinity propagation clustering (Frey and Dueck 2007), which likewise yields robust clusters.
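The parameter sweep used to judge cluster stability can be illustrated in R. Note that the analysis itself used the Python hdbscan package; the sketch below swaps in `hdbscan()` from the CRAN package dbscan, which is assumed to be installed, and its `minPts` argument does not map exactly onto the Python parameters. The function name and the parameter grid are hypothetical.

```r
# coords: numeric matrix with one row per deletion (5' and 3' coordinates).
# Returns, for each minPts value, the number of clusters and the noise share.
cluster_stability <- function(coords, min_pts_grid = c(5, 10, 15, 20)) {
  # Assumption: the 'dbscan' package stands in for Python hdbscan here.
  stopifnot(requireNamespace("dbscan", quietly = TRUE))
  do.call(rbind, lapply(min_pts_grid, function(k) {
    fit <- dbscan::hdbscan(coords, minPts = k)
    data.frame(
      minPts = k,
      # Cluster label 0 marks noise points in dbscan::hdbscan().
      n_clusters = length(setdiff(unique(fit$cluster), 0)),
      noise_fraction = mean(fit$cluster == 0)
    )
  }))
}
```

If the number of clusters and the noise fraction change little across the grid, the clustering can be regarded as stable in the sense described above.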
The list of perfect direct repeats with a length of 10 bp or more was obtained using our algorithm described in Guo et al. (2010).
We used our database of degraded mtDNA repeats (Shamanskiy et al. 2019) with a length of 10 bp or more and a similarity of 80% or more. We considered only direct repeats with both arms located in the major arc. We grouped repeats with similar motifs into clusters such that each cluster contains at least three repeat arms and at least one deletion is associated with two of them. We further restricted this set of clusters by considering as non-realized repeats only those pairs of arms in which at least one arm (the first or the second) is shared with a realized repeat. Visually, in Fig 2D, this means that within each cluster we compare realized repeats (red dots) with non-realized ones (grey dots) located on the same horizontal line (the same Y coordinate) or the same vertical line (the same X coordinate). This yielded 618 clusters.
Pairwise alignments for microhomology matrix: A measure of the degree of similarity between segments of the major arc was obtained by aligning small windows of the mitochondrial major arc sequence with each other. We sliced the major arc sequence into pieces of 100 nucleotides and aligned them against each other using EMBOSS Needle (Needleman and Wunsch 1970) with default parameters (match +5, gap open -10, gap extend -0.5). The alignment scores were then parsed out to obtain the data for the microhomology matrix.
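To make the construction of the microhomology matrix concrete, here is a minimal base R sketch. It is not the EMBOSS Needle pipeline used in the analysis: it implements Needleman-Wunsch global alignment with a simple linear gap penalty instead of the affine (open/extend) penalties, and the mismatch score of -4 is an assumption. All function names are hypothetical.

```r
# Global alignment score (Needleman-Wunsch) with a linear gap penalty.
# Simplification: EMBOSS Needle uses affine gaps; mismatch = -4 is assumed.
nw_score <- function(a, b, match = 5, mismatch = -4, gap = -10) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  prev <- (0:length(y)) * gap          # first row: only gaps
  for (i in seq_along(x)) {
    curr <- numeric(length(y) + 1)
    curr[1] <- i * gap                 # first column: only gaps
    for (j in seq_along(y)) {
      s <- if (x[i] == y[j]) match else mismatch
      curr[j + 1] <- max(prev[j] + s,      # align x[i] with y[j]
                         prev[j + 1] + gap, # gap in b
                         curr[j] + gap)     # gap in a
    }
    prev <- curr
  }
  prev[length(y) + 1]
}

# Cut a sequence into consecutive non-overlapping windows of fixed width;
# an incomplete trailing window is dropped.
slice_windows <- function(dna, width = 100) {
  stopifnot(nchar(dna) >= width)
  starts <- seq(1, nchar(dna) - width + 1, by = width)
  substring(dna, starts, starts + width - 1)
}

# Symmetric matrix of alignment scores between all pairs of windows.
microhomology_matrix <- function(dna, width = 100, ...) {
  w <- slice_windows(dna, width)
  k <- length(w)
  m <- matrix(NA_real_, k, k)
  for (i in seq_len(k)) {
    for (j in i:k) {
      m[i, j] <- m[j, i] <- nw_score(w[i], w[j], ...)
    }
  }
  m
}
```

For the major arc with 100-nucleotide windows, `microhomology_matrix(major_arc_sequence)` would return a square matrix whose cell (i, j) holds the alignment score between windows i and j, where `major_arc_sequence` is a placeholder for the sequence string.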
Visualisations and figures were primarily created using the ggplot2 (v3.4.2), cowplot (v1.1.1) (Wilke 2020), patchwork (v1.1.2) (Pedersen 2022), ggasym (v0.1.6) (Cook 2021), ggpubr (v0.6.0) (Kassambara 2023) and ggstatsplot (v0.11.0) (Patil 2021) packages, using the viridis colour palette (v0.6.2) for continuous data. Data manipulation was performed using other packages in the tidyverse (v2.0.0) (Wickham 2023), particularly dplyr (v1.1.1) (Wickham et al. 2023), tidyr (v1.3.0) (Wickham, Vaughan, and Girlich 2023) and purrr (v1.0.1) (Wickham and Henry 2023). The analysis project was managed using the workflowr (v1.7.0) (Blischak, Carbonetto, and Stephens 2021) package, which was also used to produce the publicly available website displaying the analysis code, results and output. Reproducible reports were produced using knitr (v1.42) (Xie 2023) and R Markdown (v2.21) (Allaire et al. 2023) and converted to HTML using Pandoc.
versions <- purrr::map(versions, as.character)
versions <- jsonlite::toJSON(versions, pretty = TRUE)
readr::write_lines(versions,
here::here("3_Results", DOCNAME, "package-versions.json"))
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.1
[5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[9] ggplot2_3.4.2 tidyverse_2.0.0 jsonlite_1.8.4 knitr_1.42
[13] glue_1.6.2 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 xfun_0.38 bslib_0.4.2 colorspace_2.1-0
[5] vctrs_0.6.1 generics_0.1.3 htmltools_0.5.5 yaml_2.3.7
[9] utf8_1.2.3 rlang_1.1.0 jquerylib_0.1.4 later_1.3.0
[13] pillar_1.9.0 withr_2.5.0 bit64_4.0.5 lifecycle_1.0.3
[17] munsell_0.5.0 gtable_0.3.3 evaluate_0.20 tzdb_0.3.0
[21] callr_3.7.3 fastmap_1.1.1 httpuv_1.6.9 ps_1.7.4
[25] parallel_4.2.2 fansi_1.0.4 Rcpp_1.0.10 renv_0.17.2
[29] promises_1.2.0.1 scales_1.2.1 cachem_1.0.7 vroom_1.6.1
[33] bit_4.0.5 fs_1.6.1 hms_1.1.3 digest_0.6.31
[37] stringi_1.7.12 processx_3.8.0 getPass_0.2-2 rprojroot_2.0.3
[41] grid_4.2.2 here_1.0.1 cli_3.6.1 tools_4.2.2
[45] magrittr_2.0.3 sass_0.4.5 crayon_1.5.2 whisker_0.4.1
[49] pkgconfig_2.0.3 timechange_0.2.0 rmarkdown_2.21 httr_1.4.5
[53] rstudioapi_0.14 R6_2.5.1 git2r_0.31.0 compiler_4.2.2