Last updated: 2023-04-05

Checks: 7 passed, 0 failed

Knit directory: GlobalStructure/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The checks listed below are the reproducibility checks that were applied when the results were created, and the table of past versions lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20230404)

The command set.seed(20230404) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: baf2fd4

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version baf2fd4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    1_Raw/.DS_Store
    Ignored:    2_Derived/.DS_Store
    Ignored:    3_Results/.DS_Store
    Ignored:    renv/library/
    Ignored:    renv/staging/

Unstaged changes:
    Modified:   3_Results/SlipAndJump.HomologyAndRepeats.txt
    Modified:   3_Results/heatmap_folding_stability.svg
    Modified:   3_Results/heatmap_global_folding_sw100.pdf
    Modified:   3_Results/heatmap_microhomology_AIC.pdf
    Modified:   3_Results/heatmap_microhomology_AIC_wt_deledions.svg
    Modified:   3_Results/violin_rep_center_release.pdf
    Modified:   3_Results/violin_rep_flanked_length_release.pdf
    Modified:   3_Results/violin_rep_folding_infsign_np.pdf
    Modified:   3_Results/violin_rep_folding_infsign_p.pdf

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/methods.Rmd) and HTML (docs/methods.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	baf2fd4	Evgenii O. Tretiakov	2023-04-05	Start workflowr project.

# Presentation
library("glue")
library("knitr")

# JSON
library("jsonlite")

# Tidyverse
library("tidyverse")

dir.create(here::here("3_Results", DOCNAME), showWarnings = FALSE)

# Write BibTeX entries for the packages used (each package listed once)
write_bib(c("base", "reshape2", "tidyverse", "here", "furrr", "ggpubr",
            "cowplot", "patchwork", "raster", "skimr", "pspearman",
            "ggstatsplot", "ggasym", "gridExtra", "dplyr", "tidyr",
            "magrittr", "stringr", "future", "purrr", "workflowr", "knitr",
            "kableExtra", "rmarkdown"),
          file = here::here("3_Results", DOCNAME, "packages.bib"))

versions <- list(
    cowplot     = packageVersion("cowplot"),
    dplyr       = packageVersion("dplyr"),
    furrr       = packageVersion("furrr"),
    future      = packageVersion("future"),
    ggasym      = packageVersion("ggasym"),
    ggplot2     = packageVersion("ggplot2"),
    ggpubr      = packageVersion("ggpubr"),
    ggstatsplot = packageVersion("ggstatsplot"),
    gridExtra   = packageVersion("gridExtra"),
    here        = packageVersion("here"),
    kableExtra  = packageVersion("kableExtra"),
    knitr       = packageVersion("knitr"),
    magrittr    = packageVersion("magrittr"),
    pandoc      = rmarkdown::pandoc_version(),
    patchwork   = packageVersion("patchwork"),
    pspearman   = packageVersion("pspearman"),
    purrr       = packageVersion("purrr"),
    python      = "3.8.8",
    R           = str_extract(R.version.string, "[0-9\\.]+"),
    raster      = packageVersion("raster"),
    reshape2    = packageVersion("reshape2"),
    rmarkdown   = packageVersion("rmarkdown"),
    skimr       = packageVersion("skimr"),
    stringr     = packageVersion("stringr"),
    tidyr       = packageVersion("tidyr"),
    tidyverse   = packageVersion("tidyverse"),
    viridis     = packageVersion("viridis"),
    workflowr   = packageVersion("workflowr")
)

Distribution of the centers:

For each MitoBreak deletion located in the major arc (positions 5781-16569), we computed its midpoint. Next, each of the real deletions was moved to a random position within the major arc, and the midpoints of these simulated deletions were computed as well. The standard deviations of the midpoints of the observed deletions and of the randomly simulated ones were then obtained and compared.
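The randomization described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it assumes that a moved deletion keeps its length and that its new start is drawn uniformly within the major arc. The deletion coordinates and the function names are hypothetical.

```python
import random
import statistics

ARC_START, ARC_END = 5781, 16569  # major arc bounds quoted in the text


def midpoint(start, end):
    """Midpoint of a deletion given its 5' and 3' coordinates."""
    return (start + end) / 2


def move_randomly(start, end, rng):
    """Move a deletion to a random place in the major arc.

    Assumption: the deletion keeps its length and the new start is uniform.
    """
    length = end - start
    new_start = rng.randint(ARC_START, ARC_END - length)
    return new_start, new_start + length


def midpoint_sd(deletions):
    """Standard deviation of the midpoints of a list of deletions."""
    return statistics.stdev([midpoint(s, e) for s, e in deletions])


# Hypothetical deletions, for illustration only
observed = [(8469, 13447), (7800, 14500), (9000, 15000), (6300, 13900)]
rng = random.Random(20230404)
simulated = [move_randomly(s, e, rng) for s, e in observed]

sd_observed = midpoint_sd(observed)
sd_simulated = midpoint_sd(simulated)
```

In practice the simulation would be repeated many times to build a null distribution of `sd_simulated` against which `sd_observed` is compared.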

Hi-C mtDNA contact matrix:

The publicly available mtDNA contact matrix was visualized using Juicebox (Robinson et al. 2018). The methodology used to obtain these Hi-C data from a human lymphoblastoid cell line is described in Rao et al. (2014). Additionally, we obtained six Hi-C mtDNA contact matrices from olfactory receptors of COVID-19 patients and controls. Details of the in situ Hi-C protocol and of the bioinformatics analyses are described in the original paper (Zazhytska et al. 2022). These matrices were also visualized using Juicebox (Robinson et al. 2018).

In silico folding:

We used the heavy strand of the reference human mtDNA sequence (NC_012920.1) because it spends the most time single-stranded according to the asymmetric model of mtDNA replication (Persson et al. 2019). Using Mfold (Zuker 2003) with parameters set for DNA folding and a circular sequence, we prevented everything except the major arc from forming base pairs. We thus obtained the global (genome-wide) secondary structure, which we then translated into the number of hydrogen bonds connecting our regions of interest (100 bp windows for the analyses and visualization). Next, within the single-stranded heavy strand of the major arc, we defined 100 bp windows and hybridized all potential pairs of such windows using the ViennaRNA Package 2 (Lorenz et al. 2011). The Gibbs free energy obtained for each pair of windows was used as a measure of the strength of the potential interaction between two single-stranded DNA regions.
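As an illustration of how a predicted secondary structure can be translated into hydrogen bonds between 100 bp windows, here is a minimal sketch. It assumes the structure is available as a list of paired positions, and it counts only Watson-Crick pairs (2 bonds for A-T, 3 for G-C); any other pair contributes zero. The function names, the input format and the toy data are hypothetical and do not reproduce the Mfold output format.

```python
from collections import Counter

WINDOW = 100
# Assumption: only Watson-Crick pairs are counted
H_BONDS = {frozenset("AT"): 2, frozenset("GC"): 3}


def window_index(pos, arc_start, window=WINDOW):
    """Index of the window that contains genome position `pos`."""
    return (pos - arc_start) // window


def hbonds_by_window_pair(seq, pairs, arc_start, window=WINDOW):
    """Sum hydrogen bonds for every pair of windows.

    seq   -- major arc sequence; seq[0] is genome position `arc_start`
    pairs -- iterable of (i, j) genome positions that are base-paired
    """
    totals = Counter()
    for i, j in pairs:
        bases = frozenset((seq[i - arc_start], seq[j - arc_start]))
        w1 = window_index(i, arc_start, window)
        w2 = window_index(j, arc_start, window)
        totals[tuple(sorted((w1, w2)))] += H_BONDS.get(bases, 0)
    return totals


# Toy 300 nt "arc" and three hypothetical base pairs
toy_seq = "G" * 100 + "A" * 100 + "C" * 50 + "T" * 50
toy_pairs = [(5781, 6030), (5782, 6029), (5881, 6031)]
totals = hbonds_by_window_pair(toy_seq, toy_pairs, arc_start=5781)
```

Here the two G-C pairs link windows 0 and 2 and the A-T pair links windows 1 and 2.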

The density of inverted/direct repeats:

For each pair of 100 bp windows, we estimated the number of nucleotides involved in at least one inverted or direct degraded repeat. Each such repeat must have one arm located in the first window and the other arm located in the second window. All degraded repeats in the human mtDNA (with a similarity of at least 80%) were called using our previously described algorithm (Shamanskiy et al. 2019).
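The per-window-pair count can be sketched as below. This is an illustration under stated assumptions: repeats are given as hypothetical `(arm1_start, arm2_start, length)` tuples with 0-based offsets inside the major arc, and each arm is assigned to the window that contains its first nucleotide, which simplifies the handling of arms that cross a window boundary.

```python
from collections import defaultdict

WINDOW = 100


def repeat_density(repeats, window=WINDOW):
    """Count distinct nucleotides covered by repeats linking two windows.

    repeats -- iterable of (arm1_start, arm2_start, length)
    Returns {(window1, window2): number of covered nucleotides}.
    """
    covered = defaultdict(set)
    for arm1, arm2, length in repeats:
        key = (arm1 // window, arm2 // window)
        # A nucleotide shared by several repeats is counted once
        covered[key].update(range(arm1, arm1 + length))
        covered[key].update(range(arm2, arm2 + length))
    return {key: len(positions) for key, positions in covered.items()}


# Hypothetical repeats: two overlapping ones between windows 0 and 2,
# and one between windows 1 and 3
toy_repeats = [(10, 250, 12), (15, 260, 10), (120, 300, 11)]
density = repeat_density(toy_repeats)
```

Using a set of positions per window pair is what makes the count "nucleotides involved in at least one repeat" rather than a sum of repeat lengths.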

Clustering of deletions:

For clustering, we used all MitoBreak (Damas et al. 2014) deletions from the major arc. We used the 5’ and 3’ coordinates of each deletion as input for a hierarchical density-based clustering algorithm, HDBSCAN (Python hdbscan v0.8.24) (McInnes and Healy 2017). HDBSCAN is a density-based clustering algorithm that detects clusters as regions where the samples are more densely located and also identifies outlier samples. An advantage of this method is that it supports soft clustering. We varied the cluster density parameters to assess cluster stability and found that the clusters stay relatively stable over a wide range of parameters. Thus, HDBSCAN produces a robust set of clusters, providing additional evidence for regions with elevated deletion rates. We also performed affinity propagation clustering (Frey and Dueck 2007) as a data exploration experiment, which also yielded robust clustering.
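To show the idea of density-based clustering on (5’, 3’) coordinates, here is a minimal pure-Python sketch of plain DBSCAN. It is a deliberately simpler stand-in and not the hierarchical HDBSCAN implementation from the hdbscan package that was used for the analysis. The parameters `eps` and `min_samples` and the breakpoint coordinates are hypothetical.

```python
import math


def dbscan(points, eps, min_samples):
    """Plain DBSCAN: one label per point, -1 marks outliers."""

    def neighbours(i):
        # Indices of all points within distance eps of point i (itself included)
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # outlier for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep growing
    return labels


# Hypothetical (5', 3') breakpoints: two dense groups and one isolated deletion
breakpoints = [(8470, 13447), (8480, 13450), (8465, 13440), (8475, 13455),
               (6330, 13990), (6335, 14000), (6325, 13985), (6340, 13995),
               (10000, 15500)]
labels = dbscan(breakpoints, eps=50, min_samples=3)
```

Re-running such a function over a grid of `eps` and `min_samples` values is one simple way to check how stable the cluster assignments are.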

Perfect direct repeats of the human mtDNA:

The list of perfect direct repeats with a length of 10 bp or more was obtained using our algorithm described in Guo et al. (2010).
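For readers unfamiliar with the notion, perfect direct repeats of at least 10 bp can be found with a simple k-mer index, as sketched below. This is a generic illustration and not the algorithm of Guo et al. (2010); the function name and the toy sequence are hypothetical, and overlapping arms are not filtered out.

```python
from collections import defaultdict


def perfect_direct_repeats(seq, k=10):
    """Return sorted (start1, start2, length) for perfect direct repeats >= k.

    Identical k-mers are used as seeds and extended to the right; seeds that
    merely continue a repeat starting further left are skipped.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)

    repeats = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                i, j = positions[a], positions[b]
                if i > 0 and seq[i - 1] == seq[j - 1]:
                    continue  # already covered by a repeat starting earlier
                length = k
                while j + length < len(seq) and seq[i + length] == seq[j + length]:
                    length += 1
                repeats.append((i, j, length))
    return sorted(repeats)


# Toy sequence with one 12 nt unit present twice
unit = "ACGTACGGTCAA"
toy_seq = "TT" + unit + "GGGCC" + unit + "T"
found = perfect_direct_repeats(toy_seq)
```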

Realized and non-realized direct degraded repeats:

We used our database of degraded mtDNA repeats (Shamanskiy et al. 2019) with a length of 10 bp or more and a similarity of 80% or more. We took into account only direct repeats with both arms located in the major arc. We grouped repeats with similar motifs into clusters so that each considered cluster contained at least three arms of the repeat, with at least one deletion associated with two of them. We additionally restricted our subset of clusters by considering as non-realized repeats only those pairs of arms in which at least one arm (the first or the second) is the same as in a realized repeat. Visually, in Fig 2D, this means that within each cluster we compare realized repeats (red dot) with non-realized ones (grey dot) located on the same horizontal (the same Y coordinate) or vertical (the same X coordinate) axis. This yielded 618 such clusters.

Pairwise alignments for microhomology matrix:

A measure of the degree of similarity between segments of the major arc was obtained by aligning small windows of the mitochondrial major arc sequence with each other. We sliced the mitochondrial major arc sequence into 100-nucleotide pieces and aligned them against each other using EMBOSS Needle (Needleman and Wunsch 1970) with default parameters (match +5, gap open -10, gap extend -0.5). We then parsed out the alignment scores, thus obtaining the data for the microhomology matrix.
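The scoring behind such a microhomology matrix can be illustrated with a small global alignment routine with affine gap penalties. This is a sketch, not a reimplementation of EMBOSS Needle: the match, gap open and gap extend values come from the text, while the mismatch score of -4 is an assumption, and end gaps are penalized here, which may differ from the EMBOSS settings. The function names and toy windows are hypothetical.

```python
NEG_INF = float("-inf")


def needle_score(a, b, match=5.0, mismatch=-4.0, gap_open=10.0, gap_extend=0.5):
    """Global alignment score with affine gaps (a gap of n costs open + (n-1)*extend)."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


def score_matrix(windows):
    """All-against-all alignment scores for a list of sequence windows."""
    return [[needle_score(w1, w2) for w2 in windows] for w1 in windows]


# Toy 4 nt windows instead of the 100 nt windows used in the analysis
toy_matrix = score_matrix(["ACGT", "ACGA", "TTTT"])
```

The resulting matrix is symmetric, so in practice only the upper triangle needs to be computed.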

Other packages

Visualisations and figures were primarily created using the ggplot2 (v3.4.2), cowplot (v1.1.1) (Wilke 2020), patchwork (v1.1.2) (Pedersen 2022), ggasym (v0.1.6) (Cook 2021), ggpubr (v0.6.0) (Kassambara 2023) and ggstatsplot (v0.11.0) (Patil 2021) packages, using the viridis colour palette (v0.6.2) for continuous data. Data manipulation was performed using other packages in the tidyverse (v2.0.0) (Wickham 2023), particularly dplyr (v1.1.1) (Wickham et al. 2023), tidyr (v1.3.0) (Wickham, Vaughan, and Girlich 2023) and purrr (v1.0.1) (Wickham and Henry 2023). The analysis project was managed using the workflowr (v1.7.0) (Blischak, Carbonetto, and Stephens 2021) package, which was also used to produce the publicly available website displaying the analysis code, results and output. Reproducible reports were produced using knitr (v1.42) (Xie 2023) and R Markdown (v2.21) (Allaire et al. 2023) and converted to HTML using Pandoc.

Summary

Output files

versions <- purrr::map(versions, as.character)
versions <- jsonlite::toJSON(versions, pretty = TRUE)
readr::write_lines(versions,
                   here::here("3_Results", DOCNAME, "package-versions.json"))

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2023. Rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.

Blischak, John, Peter Carbonetto, and Matthew Stephens. 2021. Workflowr: A Framework for Reproducible and Collaborative Data Science. https://github.com/workflowr/workflowr.

Cook, Joshua H. 2021. Ggasym: Asymmetric Matrix Plotting in Ggplot2. https://github.com/jhrcook/ggasym https://jhrcook.github.io/ggasym/.

Damas, Joana, João Carneiro, António Amorim, and Filipe Pereira. 2014. “MitoBreak: the mitochondrial DNA breakpoints database.” Nucleic Acids Research 42 (Database issue): D1261–1268. https://doi.org/10.1093/nar/gkt982.

Frey, Brendan J., and Delbert Dueck. 2007. “Clustering by passing messages between data points.” Science (New York, N.Y.) 315 (5814): 972–76. https://doi.org/10.1126/science.1136800.

Guo, Xinhong, Konstantin Yu Popadin, Natalya Markuzon, Yuriy L. Orlov, Yevgenya Kraytsberg, Kim J. Krishnan, Gábor Zsurka, Douglas M. Turnbull, Wolfram S. Kunz, and Konstantin Khrapko. 2010. “Repeats, longevity and the sources of mtDNA deletions: evidence from ’deletional spectra’.” Trends in genetics: TIG 26 (8): 340–43. https://doi.org/10.1016/j.tig.2010.05.006.

Kassambara, Alboukadel. 2023. Ggpubr: Ggplot2 Based Publication Ready Plots. https://rpkgs.datanovia.com/ggpubr/.

Lorenz, Ronny, Stephan H. Bernhart, Christian Höner Zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F. Stadler, and Ivo L. Hofacker. 2011. “ViennaRNA Package 2.0.” Algorithms for molecular biology: AMB 6 (November): 26. https://doi.org/10.1186/1748-7188-6-26.

McInnes, Leland, and John Healy. 2017. “Accelerated Hierarchical Density Based Clustering.” In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 33–42. New Orleans, LA: IEEE. https://doi.org/10.1109/ICDMW.2017.12.

Needleman, S. B., and C. D. Wunsch. 1970. “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” Journal of Molecular Biology 48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.

Pedersen, Thomas Lin. 2022. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.

Persson, Örjan, Yazh Muthukumar, Swaraj Basu, Louise Jenninger, Jay P. Uhler, Anna-Karin Berglund, Robert McFarland, et al. 2019. “Copy-choice recombination during mitochondrial L-strand synthesis causes DNA deletions.” Nature Communications 10 (1): 759. https://doi.org/10.1038/s41467-019-08673-5.

Rao, Suhas S. P., Miriam H. Huntley, Neva C. Durand, Elena K. Stamenova, Ivan D. Bochkov, James T. Robinson, Adrian L. Sanborn, et al. 2014. “A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping.” Cell 159 (7): 1665–80. https://doi.org/10.1016/j.cell.2014.11.021.

Robinson, James T., Douglass Turner, Neva C. Durand, Helga Thorvaldsdóttir, Jill P. Mesirov, and Erez Lieberman Aiden. 2018. “Juicebox.js Provides a Cloud-Based Visualization System for Hi-C Data.” Cell Systems 6 (2): 256–258.e1. https://doi.org/10.1016/j.cels.2018.01.001.

Shamanskiy, Viktor A., Valeria N. Timonina, Konstantin Yu Popadin, and Konstantin V. Gunbin. 2019. “ImtRDB: a database and software for mitochondrial imperfect interspersed repeats annotation.” BMC genomics 20 (Suppl 3): 295. https://doi.org/10.1186/s12864-019-5536-1.

Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Lionel Henry. 2023. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Wickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2023. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wilke, Claus O. 2020. Cowplot: Streamlined Plot Theme and Plot Annotations for Ggplot2. https://wilkelab.org/cowplot/.

Xie, Yihui. 2023. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Zazhytska, Marianna, Albana Kodra, Daisy A. Hoagland, Justin Frere, John F. Fullard, Hani Shayya, Natalie G. McArthur, et al. 2022. “Non-cell-autonomous disruption of nuclear architecture as a potential cause of COVID-19-induced anosmia.” Cell 185 (6): 1052–1064.e12. https://doi.org/10.1016/j.cell.2022.01.024.

Zuker, Michael. 2003. “Mfold web server for nucleic acid folding and hybridization prediction.” Nucleic Acids Research 31 (13): 3406–15. https://doi.org/10.1093/nar/gkg595.

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.1    
 [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
 [9] ggplot2_3.4.2   tidyverse_2.0.0 jsonlite_1.8.4  knitr_1.42     
[13] glue_1.6.2      workflowr_1.7.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0 xfun_0.38        bslib_0.4.2      colorspace_2.1-0
 [5] vctrs_0.6.1      generics_0.1.3   htmltools_0.5.5  yaml_2.3.7      
 [9] utf8_1.2.3       rlang_1.1.0      jquerylib_0.1.4  later_1.3.0     
[13] pillar_1.9.0     withr_2.5.0      bit64_4.0.5      lifecycle_1.0.3 
[17] munsell_0.5.0    gtable_0.3.3     evaluate_0.20    tzdb_0.3.0      
[21] callr_3.7.3      fastmap_1.1.1    httpuv_1.6.9     ps_1.7.4        
[25] parallel_4.2.2   fansi_1.0.4      Rcpp_1.0.10      renv_0.17.2     
[29] promises_1.2.0.1 scales_1.2.1     cachem_1.0.7     vroom_1.6.1     
[33] bit_4.0.5        fs_1.6.1         hms_1.1.3        digest_0.6.31   
[37] stringi_1.7.12   processx_3.8.0   getPass_0.2-2    rprojroot_2.0.3 
[41] grid_4.2.2       here_1.0.1       cli_3.6.1        tools_4.2.2     
[45] magrittr_2.0.3   sass_0.4.5       crayon_1.5.2     whisker_0.4.1   
[49] pkgconfig_2.0.3  timechange_0.2.0 rmarkdown_2.21   httr_1.4.5      
[53] rstudioapi_0.14  R6_2.5.1         git2r_0.31.0     compiler_4.2.2

Methods