grenadine.Preprocessing package
Submodules
grenadine.Preprocessing.discretization module
This module discretizes gene expression datasets. It is mostly based on the scikit-learn library. The available discretization methods are EWD (equal width, uniform), EFD (equal frequency, quantile), kmeans, and bikmeans (Li et al., 2010).
grenadine.Preprocessing.discretization.bikmeans_original(data, nb_bins)
Discretize data into nb_bins intervals using the bikmeans method described by Li et al. (2010).
Parameters:
- data (pandas.DataFrame) – dataset to discretize
- nb_bins (int) – number of intervals in which to discretize data
Returns: dataframe of discretized data
Return type: pandas.DataFrame
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(3, 5),
...                     index=["gene1", "gene2", "gene3"],
...                     columns=["c1", "c2", "c3", "c4", "c5"])
>>> data
             c1        c2        c3        c4        c5
gene1  1.764052  0.400157  0.978738  2.240893  1.867558
gene2 -0.977278  0.950088 -0.151357 -0.103219  0.410599
gene3  0.144044  1.454274  0.761038  0.121675  0.443863
>>> discr_data = bikmeans_original(data=data, nb_bins=2)
>>> discr_data
        c1   c2   c3   c4   c5
gene1  1.0  0.0  0.0  1.0  1.0
gene2  0.0  1.0  0.0  0.0  0.0
gene3  0.0  1.0  0.0  0.0  0.0
grenadine.Preprocessing.discretization.bikmeans_simple(data, nb_bins)
Discretize data into nb_bins intervals using a simplified version of the bikmeans method from Li et al. (2010). See bikmeans_original() for the full implementation of bikmeans as described in the paper.
Parameters:
- data (pandas.DataFrame) – dataset to discretize
- nb_bins (int) – number of intervals in which to discretize data
Returns: dataframe of discretized data
Return type: pandas.DataFrame
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(3, 5),
...                     index=["gene1", "gene2", "gene3"],
...                     columns=["c1", "c2", "c3", "c4", "c5"])
>>> data
             c1        c2        c3        c4        c5
gene1  1.764052  0.400157  0.978738  2.240893  1.867558
gene2 -0.977278  0.950088 -0.151357 -0.103219  0.410599
gene3  0.144044  1.454274  0.761038  0.121675  0.443863
>>> discr_data = bikmeans_simple(data=data, nb_bins=2)
>>> discr_data
        c1   c2   c3   c4   c5
gene1  2.0  1.0  1.0  2.0  2.0
gene2  1.0  2.0  1.0  1.0  1.0
gene3  1.0  2.0  1.0  1.0  1.0
grenadine.Preprocessing.discretization.discretize_genexp(data, method, nb_bins=2, axis=0)
Discretize data into nb_bins intervals with the specified method, along the specified axis.
Parameters:
- data (pandas.DataFrame or pandas.Series) – dataset to discretize
- method (str) – discretization method, one of 'kmeans', 'bikmeans', 'ewd', 'efd'
- nb_bins (int) – (default 2) number of intervals in which to discretize data
- axis (int) – (default 0) whether discretization is done on each column (0) or each row (1) of data. Ignore this parameter if method is 'bikmeans'
Returns: dataframe or series of discretized data, depending on the dimension of the data passed
Return type: pandas.DataFrame or pandas.Series
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(3, 5),
...                     index=["gene1", "gene2", "gene3"],
...                     columns=["c1", "c2", "c3", "c4", "c5"])
>>> data
             c1        c2        c3        c4        c5
gene1  1.764052  0.400157  0.978738  2.240893  1.867558
gene2 -0.977278  0.950088 -0.151357 -0.103219  0.410599
gene3  0.144044  1.454274  0.761038  0.121675  0.443863
>>> discr_data = discretize_genexp(data=data, method='efd')
>>> discr_data
        c1   c2   c3   c4   c5
gene1  1.0  0.0  1.0  1.0  1.0
gene2  0.0  1.0  0.0  0.0  0.0
gene3  1.0  1.0  1.0  1.0  1.0
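As an illustration only, the equal-width (EWD) and equal-frequency (EFD) strategies named in this module can be sketched with plain NumPy on a single vector. The helper names equal_width_bins and equal_frequency_bins are hypothetical and are not part of grenadine, which relies on scikit-learn rather than on this code.

```python
import numpy as np


def equal_width_bins(values, nb_bins):
    """Hypothetical helper: split the value range into nb_bins equal-width intervals."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), nb_bins + 1)
    # Only the interior edges are needed; labels run from 0 to nb_bins - 1
    return np.digitize(values, edges[1:-1])


def equal_frequency_bins(values, nb_bins):
    """Hypothetical helper: use quantiles as edges so bins hold similar numbers of values."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, nb_bins + 1))
    return np.digitize(values, edges[1:-1])


values = [0, 1, 2, 3, 100]
print(equal_width_bins(values, 2))      # [0 0 0 0 1]
print(equal_frequency_bins(values, 2))  # [0 0 1 1 1]
```

The single large value stretches the equal-width range so that almost everything falls in the first bin, whereas the quantile-based edges keep the two bins roughly balanced.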
grenadine.Preprocessing.rnaseq_normalization module
This module normalizes RNAseq gene expression data.
grenadine.Preprocessing.rnaseq_normalization.DEseq2(raw_counts, col_data, rlog=True)
Apply the normalization of the R package DESeq2.
Parameters:
- raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
- col_data (pandas.DataFrame) – two-column dataframe: one column with the id of each condition (individuals) and one with the experiment id (if there are several repetitions)
Returns: Normalized counts
Return type: pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> raw_counts = pd.DataFrame(np.random.randint(0,1000,(20,10)),
...                           columns = ["Z"+str(i) for i in range(10)])
>>> col_data = pd.DataFrame([["Z0","1"], ["Z1","2"], ["Z2","3"], ["Z3","4"],
...                          ["Z4","5"], ["Z5","6"], ["Z6","7"], ["Z7","8"],
...                          ["Z8","9"], ["Z9","10"]],
...                         columns=["individuals","conditions"])
>>> raw_counts.columns = col_data["individuals"]
>>> col_data.index = col_data['individuals']
>>> DEseq2(raw_counts,col_data,rlog=False)
individuals          X0          X1  ...          X8          X9
0            408.025477  382.991634  ...    7.745300  611.474516
1            165.238388  516.593367  ...  270.224902  596.251084
2            289.912839  377.510537  ...  727.197585   60.893728
3            463.502625  627.585575  ...  385.543809  718.884285
4             59.056319  674.174898  ...  364.029087  243.574911
5            573.263865  181.561329  ...  129.948918  570.878697
6            304.229522  314.477925  ...  802.068816   44.824550
7            537.472156  376.825400  ...   36.144732  373.819828
8            323.914962  608.401737  ...  748.712307  100.643800
9            464.695682  294.608949  ...  781.414683  535.357356
10           559.543710   57.551516  ...  112.737140  822.065324
11           517.786716  123.324676  ...  618.763389  768.783312
12           222.505121  584.421939  ...   81.755942  166.612005
13           361.496256  175.395095  ...  333.047888  515.905193
14           330.476775  666.638390  ...  779.693506  312.926101
15           331.073304  653.620785  ...  493.978005  787.389729
16           437.851901   84.271862  ...  483.650938  347.601696
17           466.485268   28.090621  ...  750.433484    9.303208
18           459.326926  210.337087  ...  149.742461  468.543405
19           221.312064  126.065225  ...  662.653421  435.559302
grenadine.Preprocessing.rnaseq_normalization.RPK(raw_counts, seq_lengths, seq_in_kb=False)
Reads Per Kilobase normalization.
Parameters:
- raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
- seq_lengths (pandas.Series) – lengths of the DNA sequences
- seq_in_kb (bool) – True if the lengths are given in kb, False otherwise
Returns: Normalized counts
Return type: pandas.DataFrame
Examples
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> seq_lengths = np.random.randint(100,20000,nb_genes)
>>> seq_lengths = pd.Series(seq_lengths)
>>> rpk = RPK(raw_counts, seq_lengths)
>>> rpk.head()
               0              1              2              3              4
0  321202.997719   99612.577387  142010.101010   38433.365917  313911.697621
1   26853.843441  155566.114245   63431.417489   53620.768688   21611.248237
2   97195.319962   71353.390640   59624.960204  117133.237822  117212.034384
3  132006.796941   72465.590484  356436.703483  256785.896347  229981.733220
4   48384.227419   34354.424576   18889.143614   37956.492944   45220.490091
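The arithmetic behind reads per kilobase can be sketched as follows. This is a minimal illustration rather than grenadine's implementation: rpk_sketch is a hypothetical name, and the sketch assumes that seq_lengths shares its index with the rows of raw_counts.

```python
import pandas as pd


def rpk_sketch(raw_counts, seq_lengths, seq_in_kb=False):
    """Hypothetical helper: divide each gene's counts by its length in kilobases."""
    # Convert lengths from bases to kilobases unless they are already in kb
    lengths_kb = seq_lengths if seq_in_kb else seq_lengths / 1000.0
    # axis=0 aligns the lengths with the gene (row) index
    return raw_counts.div(lengths_kb, axis=0)


counts = pd.DataFrame({"cond1": [100, 300], "cond2": [50, 600]},
                      index=["geneA", "geneB"])
lengths = pd.Series([2000, 500], index=["geneA", "geneB"])  # in bases
rpk = rpk_sketch(counts, lengths)
print(rpk)
```

Here geneA is 2 kb long, so its counts are halved, while geneB is 0.5 kb long, so its counts are doubled.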
grenadine.Preprocessing.rnaseq_normalization.RPKM(raw_counts, seq_lengths, seq_in_kb=False)
Reads Per Kilobase Million normalization (also known as FPKM: Fragments Per Kilobase Million).
Parameters:
- raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
- seq_lengths (pandas.Series) – lengths of the DNA sequences
- seq_in_kb (bool) – True if the lengths are given in kb, False otherwise
Returns: Normalized counts
Return type: pandas.DataFrame
Examples
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> seq_lengths = np.random.randint(100,20000,nb_genes)
>>> seq_lengths = pd.Series(seq_lengths)
>>> rpkm = RPKM(raw_counts, seq_lengths)
>>> rpkm.head()
            0           1           2           3           4
0  649.733415  201.368439  291.638511   76.398582  628.676848
1   54.320288  314.479420  130.265692  106.588393   43.281252
2  196.607901  144.242035  122.448576  232.839698  234.742741
3  267.024989  146.490365  731.994898  510.443933  460.588733
4   97.872216   69.448026   38.791619   75.450645   90.563924
grenadine.Preprocessing.rnaseq_normalization.RPM(raw_counts)
Reads Per Million normalization.
Parameters:
- raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
Returns: Normalized counts
Return type: pandas.DataFrame
Examples
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> rpm = RPM(raw_counts)
>>> rpm.head()
             0            1            2            3            4
0  1994.031850   617.999738   895.038590   234.467249  1929.409246
1   308.104674  1783.727269   738.867008   604.569366   245.491264
2  1235.090833   906.128463   769.221953  1462.698984  1474.653899
3   628.576824   344.838319  1723.115991  1201.585019  1084.225878
4  1921.133736  1363.195297   761.440687  1481.020714  1777.679267
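Reads per million rescales each condition by its total read count. A minimal sketch under that definition, with rpm_sketch as a hypothetical name that is not part of grenadine:

```python
import pandas as pd


def rpm_sketch(raw_counts):
    """Hypothetical helper: scale each column so that it sums to one million."""
    # Column sums are the per-condition library sizes
    library_sizes = raw_counts.sum(axis=0)
    return raw_counts / library_sizes * 1e6


counts = pd.DataFrame({"cond1": [100, 300], "cond2": [50, 600]},
                      index=["geneA", "geneB"])
rpm = rpm_sketch(counts)
print(rpm)
```

After this scaling every column sums to one million, which makes conditions with different sequencing depths comparable.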
grenadine.Preprocessing.rnaseq_normalization.TPM(raw_counts, seq_lengths, seq_in_kb=False)
Transcripts Per Million normalization.
Parameters:
- raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
- seq_lengths (pandas.Series) – lengths of the DNA sequences
- seq_in_kb (bool) – True if the lengths are given in kb, False otherwise
Returns: Normalized counts
Return type: pandas.DataFrame
Examples
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> seq_lengths = np.random.randint(100,20000,nb_genes)
>>> seq_lengths = pd.Series(seq_lengths)
>>> tpm = TPM(raw_counts, seq_lengths)
>>> tpm.head()
             0            1            2            3            4
0  2455.468465   739.530213  1103.147117   265.510632  2397.256398
1   205.286894  1154.932887   492.740902   370.430324   165.039097
2   743.019352   529.732184   463.172003   809.195846   895.115733
3  1009.139172   537.989227  2768.832068  1773.963432  1756.306584
4   369.878069   255.049468   146.732550   262.216233   345.336316
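TPM and RPKM apply the same two corrections (gene length and sequencing depth) in opposite orders. The sketch below illustrates that difference under the usual definitions; tpm_sketch and rpkm_sketch are hypothetical names, not grenadine functions, and lengths are assumed to be given in kilobases already.

```python
import pandas as pd


def tpm_sketch(raw_counts, lengths_kb):
    """Hypothetical helper: length-normalize first, then scale columns to one million."""
    rpk = raw_counts.div(lengths_kb, axis=0)
    return rpk / rpk.sum(axis=0) * 1e6


def rpkm_sketch(raw_counts, lengths_kb):
    """Hypothetical helper: scale columns to one million first, then length-normalize."""
    rpm = raw_counts / raw_counts.sum(axis=0) * 1e6
    return rpm.div(lengths_kb, axis=0)


counts = pd.DataFrame({"cond1": [100, 300], "cond2": [50, 600]},
                      index=["geneA", "geneB"])
lengths_kb = pd.Series([2.0, 0.5], index=["geneA", "geneB"])
tpm = tpm_sketch(counts, lengths_kb)
rpkm = rpkm_sketch(counts, lengths_kb)
print(tpm.sum(axis=0))   # every column sums to one million
print(rpkm.sum(axis=0))  # column sums generally differ between conditions
```

Because the depth scaling comes last in TPM, every condition sums to the same total, which is not guaranteed for RPKM.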
grenadine.Preprocessing.rnaseq_normalization.log(X, base=10, pseudocount=1)
Add a pseudocount and apply the log transformation with a given base.
Parameters:
- X (pandas.DataFrame or numpy.array) – gene expression matrix
- base (float) – logarithm base
- pseudocount (float) – pseudocount value
Returns: log transformed gene expression matrix
Return type: pandas.DataFrame or numpy.array
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
...                     index=["c1", "c2", "c3", "c4", "c5"],
...                     columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> pseudocount = -np.min(data.values)+1
>>> log_data = log(data, pseudocount=pseudocount)
>>> log_data
       gene1     gene2     gene3     gene4     gene5
c1  0.725670  0.596943  0.656264  0.762970  0.734043
c2  0.410897  0.653509  0.531687  0.537790  0.598089
c3  0.567853  0.699600  0.634883  0.565218  0.601718
c4  0.589577  0.703039  0.524764  0.587268  0.431186
c5  0.000000  0.623932  0.645169  0.448834  0.765128
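The transformation described above amounts to log_base(X + pseudocount). A minimal sketch using the change-of-base identity, with log_sketch as a hypothetical name that is not part of grenadine:

```python
import numpy as np


def log_sketch(X, base=10, pseudocount=1):
    """Hypothetical helper: compute log_base(X + pseudocount) element-wise."""
    # Change of base: log_b(x) = ln(x) / ln(b)
    return np.log(X + pseudocount) / np.log(base)


values = np.array([0.0, 9.0, 99.0])
print(log_sketch(values))  # approximately [0, 1, 2]
```

The pseudocount keeps zero counts finite. In the example above it is chosen as one minus the minimum so that the smallest value maps exactly to 0.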
grenadine.Preprocessing.standard_preprocessing module
This module pre-processes gene expression data.
grenadine.Preprocessing.standard_preprocessing.cat_gene_expression_dfs(gene_expression_dfs)
Concatenate different gene expression datasets, based on gene id (rows).
Parameters:
- gene_expression_dfs (list of pandas.DataFrame) – list of expression datasets
Returns: concatenated gene expression datasets
Return type: pandas.DataFrame
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data1 = pd.DataFrame(np.random.randn(3, 3),
...                      index=["gene1", "gene2", "gene3"],
...                      columns=["c1", "c2", "c3"])
>>> data1
             c1        c2        c3
gene1  1.764052  0.400157  0.978738
gene2  2.240893  1.867558 -0.977278
gene3  0.950088 -0.151357 -0.103219
>>> data2 = pd.DataFrame(np.random.randn(3, 3),
...                      index=["gene2", "gene3", "gene4"],
...                      columns=["c4", "c5", "c6"])
>>> data2
             c4        c5        c6
gene2  0.410599  0.144044  1.454274
gene3  0.761038  0.121675  0.443863
gene4  0.333674  1.494079 -0.205158
>>> data=cat_gene_expression_dfs([data1, data2])
>>> data
             c1        c2        c3        c4        c5        c6
gene1  1.764052  0.400157  0.978738       NaN       NaN       NaN
gene2  2.240893  1.867558 -0.977278  0.410599  0.144044  1.454274
gene3  0.950088 -0.151357 -0.103219  0.761038  0.121675  0.443863
gene4       NaN       NaN       NaN  0.333674  1.494079 -0.205158
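The behavior shown in the example (union of gene ids, NaN where a dataset lacks a gene) is what an outer join on the row index produces. A minimal sketch with pandas.concat, which may or may not be how grenadine implements it; cat_sketch is a hypothetical name:

```python
import pandas as pd


def cat_sketch(gene_expression_dfs):
    """Hypothetical helper: place datasets side by side, aligned on gene id."""
    # axis=1 appends columns; join="outer" keeps the union of the row indexes
    return pd.concat(gene_expression_dfs, axis=1, join="outer").sort_index()


d1 = pd.DataFrame({"c1": [1.0, 2.0]}, index=["g1", "g2"])
d2 = pd.DataFrame({"c2": [3.0, 4.0]}, index=["g2", "g3"])
merged = cat_sketch([d1, d2])
print(merged)
```

Genes measured in only one dataset keep their row and get NaN in the columns of the other, so downstream steps need to handle missing values.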
grenadine.Preprocessing.standard_preprocessing.columns_matrix_OT_norm(X, reference=None, bins=None, **SinkhornTransport_para)
Use optimal transport to make the distributions of all conditions alike.
Parameters:
- X (pandas.DataFrame) – gene expression matrix
- reference (numpy.array) – reference distribution
- bins (numpy.array) – bins for percentiles computation
- SinkhornTransport_para – ot.da.SinkhornTransport parameters
Returns: Normalized matrix
Return type: pandas.DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>> a = pd.DataFrame(np.random.randn(10000,10))
>>> b = pd.DataFrame(np.random.randn(10000,10)*3+4)
>>> bins = list(range(1,100))
>>> b_ = columns_matrix_OT_norm(b,a.iloc[:,0],bins,reg_e=5e-1)
grenadine.Preprocessing.standard_preprocessing.mean_std_polishing(A, nb_iterations=5)
Iterative z-score on rows and columns.
Parameters:
- A (pandas.DataFrame or numpy.array) – matrix
- nb_iterations (int) – number of polishing iterations
Returns: Polished matrix
Return type: pandas.DataFrame or numpy.array
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
...                     index=["c1", "c2", "c3", "c4", "c5"],
...                     columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> norm_data = mean_std_polishing(data)
>>> norm_data
       gene1     gene2     gene3     gene4     gene5
c1  0.336095 -1.618781  0.187436  1.109617 -0.014367
c2 -0.321684  0.586608 -1.606905  0.484159  0.857821
c3  0.139260  0.860934  0.976541 -1.395814 -0.580921
c4  1.243263  0.421752 -0.585940  0.282319 -1.361394
c5 -1.363323 -0.161066  0.826375 -0.421998  1.120013
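One way to read "iterative z-score on rows and columns" is to alternate column-wise and row-wise standardization for a fixed number of rounds. The sketch below follows that reading; the column-then-row order and the sample standard deviation (ddof=1) are assumptions, and polish_sketch is a hypothetical name rather than grenadine's implementation.

```python
import numpy as np


def polish_sketch(A, nb_iterations=5):
    """Hypothetical helper: alternate column and row z-scores nb_iterations times."""
    A = np.asarray(A, dtype=float).copy()
    for _ in range(nb_iterations):
        # Standardize each column (assumed to come first)
        A = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)
        # Standardize each row (assumed to come second)
        A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, ddof=1, keepdims=True)
    return A


rng = np.random.default_rng(0)
polished = polish_sketch(rng.normal(size=(6, 4)))
print(polished.mean(axis=1))  # row means are numerically zero
```

Since the row step comes last in this sketch, rows are exactly standardized while columns are only approximately so.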
grenadine.Preprocessing.standard_preprocessing.median_outliers_filter(X, threshold=3)
Ensure that all the values of X lie within \(\mathrm{median}(X) \pm \tau \times \mathrm{MAD}(X)\).
Parameters:
- X (pandas.DataFrame or numpy.array) – gene expression matrix (for instance)
- threshold (float) – \(\tau\) threshold
Returns: X without outliers (outliers set to the extreme values allowed)
Return type: pandas.DataFrame or numpy.array
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
...                     index=["c1", "c2", "c3", "c4", "c5"],
...                     columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> median_outliers_filter(data)
       gene1     gene2     gene3     gene4     gene5
c1  1.764052  0.400157  0.978738  0.674682  1.867558
c2 -0.977278  0.950088 -0.653101 -0.103219  0.410599
c3  0.144044  1.454274  0.761038  0.121675  0.443863
c4  0.333674  1.494079 -0.653101  0.313068 -0.854096
c5 -2.552990  0.653619  0.864436 -0.674682  2.269755
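The interval stated in the description can be sketched as a per-column clip. This follows the formula as written, with the unscaled median absolute deviation and column-wise statistics as assumptions. It is not grenadine's implementation, and its results need not match the example output above. The name median_clip_sketch is hypothetical.

```python
import numpy as np


def median_clip_sketch(X, threshold=3):
    """Hypothetical helper: clip each column to median +/- threshold * MAD."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    # Median absolute deviation, computed per column and left unscaled
    mad = np.median(np.abs(X - med), axis=0)
    return np.clip(X, med - threshold * mad, med + threshold * mad)


column = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
print(median_clip_sketch(column).ravel())  # [1. 2. 3. 4. 6.]
```

In this toy column the median is 3 and the MAD is 1, so the allowed interval is [0, 6] and only the value 100 is pulled back to the upper bound.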
grenadine.Preprocessing.standard_preprocessing.z_score(A, axis=0)
Compute the z-score along the specified axis.
Parameters:
- A (pandas.DataFrame or numpy.array) – matrix
- axis (int) – 0 for columns and 1 for rows
Returns: Normalized matrix
Return type: pandas.DataFrame or numpy.array
Examples
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
...                     index=["c1", "c2", "c3", "c4", "c5"],
...                     columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> norm_data = z_score(data)
>>> norm_data
       gene1     gene2     gene3     gene4     gene5
c1  1.254757 -1.222682  0.914682  1.672581  0.828015
c2 -0.446591 -0.083589 -1.038607 -0.418644 -0.331945
c3  0.249333  0.960749  0.538403 -0.218012 -0.305461
c4  0.367024  1.043200 -1.131598 -0.047267 -1.338834
c5 -1.424523 -0.697678  0.717120 -0.988659  1.148225
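A z-score subtracts the mean and divides by the standard deviation along the chosen axis. The sketch below assumes the sample standard deviation (ddof=1), which appears consistent with the values printed in the example above; z_score_sketch is a hypothetical name, not the grenadine function.

```python
import pandas as pd


def z_score_sketch(A, axis=0):
    """Hypothetical helper: standardize columns (axis=0) or rows (axis=1)."""
    A = pd.DataFrame(A)
    # Statistics computed along `axis` must be broadcast along the other axis
    other = 1 - axis
    centered = A.sub(A.mean(axis=axis), axis=other)
    return centered.div(A.std(axis=axis, ddof=1), axis=other)


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
by_columns = z_score_sketch(df, axis=0)
by_rows = z_score_sketch(df, axis=1)
print(by_columns)
```

With axis=0 both columns become -1, 0, 1 even though their original scales differ by a factor of ten.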
Module contents
This package contains several pre-processing techniques for gene expression datasets (standardization, discretization, and RNAseq normalization).