grenadine.Preprocessing package

Submodules

grenadine.Preprocessing.discretization module

This module discretizes gene expression datasets. It is mostly based on the scikit-learn library. The available discretization methods are EWD (equal width, uniform), EFD (equal frequency, quantile), kmeans, and bikmeans (Li et al., 2010).
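
To make the difference between the two simplest methods concrete, here is a minimal sketch of equal-width and equal-frequency binning in plain Python. It is not grenadine's implementation (which relies on scikit-learn): the names equal_width_bins, equal_frequency_bins and quantile are hypothetical, and the treatment of values that fall exactly on a bin edge is an assumption.

```python
def equal_width_bins(values, nb_bins):
    """Assign each value to one of nb_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nb_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum falls on the last edge, so clip it into the last bin.
    return [min(int((v - lo) / width), nb_bins - 1) for v in values]


def quantile(sorted_values, p):
    """Linearly interpolated quantile of an already sorted list (0 <= p <= 1)."""
    pos = p * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (pos - low) * (sorted_values[high] - sorted_values[low])


def equal_frequency_bins(values, nb_bins):
    """Assign each value to one of nb_bins intervals holding similar numbers of points."""
    ordered = sorted(values)
    # Inner edges sit at the 1/nb_bins, 2/nb_bins, ... quantiles.
    edges = [quantile(ordered, i / nb_bins) for i in range(1, nb_bins)]
    # A value's bin is the number of inner edges it reaches or exceeds (assumption).
    return [sum(v >= e for e in edges) for v in values]


values = [0.0, 1.0, 2.0, 3.0, 10.0, 100.0]
print(equal_width_bins(values, 2))      # [0, 0, 0, 0, 0, 1]
print(equal_frequency_bins(values, 2))  # [0, 0, 0, 1, 1, 1]
```

With two bins, the outlier 100.0 stretches the equal-width range so that every other value lands in bin 0, whereas the equal-frequency split puts three values on each side of the median.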

grenadine.Preprocessing.discretization.bikmeans_original(data, nb_bins)[source]

Discretize data into nb_bins intervals with the bikmeans method, as published by Li et al. (2010).

Parameters:
  • data (pandas.DataFrame) – dataset to discretize
  • nb_bins (int) – number of intervals in which to discretize data
Returns:

dataframe of discretized data

Return type:

pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.discretization import bikmeans_original
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(3, 5),
                        index=["gene1", "gene2", "gene3"],
                        columns=["c1", "c2", "c3", "c4", "c5"])
>>> data
             c1        c2        c3        c4        c5
gene1  1.764052  0.400157  0.978738  2.240893  1.867558
gene2 -0.977278  0.950088 -0.151357 -0.103219  0.410599
gene3  0.144044  1.454274  0.761038  0.121675  0.443863
>>> discr_data = bikmeans_original(data=data, nb_bins=2)
>>> discr_data
        c1   c2   c3   c4   c5
gene1  1.0  0.0  0.0  1.0  1.0
gene2  0.0  1.0  0.0  0.0  0.0
gene3  0.0  1.0  0.0  0.0  0.0
grenadine.Preprocessing.discretization.bikmeans_simple(data, nb_bins)[source]

Discretize data into nb_bins intervals with a simplified version of the bikmeans method published by Li et al. (2010). See bikmeans_original() for the full implementation of bikmeans as described in the paper.

Parameters:
  • data (pandas.DataFrame) – dataset to discretize
  • nb_bins (int) – number of intervals in which to discretize data
Returns:

dataframe of discretized data

Return type:

pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.discretization import bikmeans_simple
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(3, 5),
                        index=["gene1", "gene2", "gene3"],
                        columns=["c1", "c2", "c3", "c4", "c5"])
>>> data
             c1        c2        c3        c4        c5
gene1  1.764052  0.400157  0.978738  2.240893  1.867558
gene2 -0.977278  0.950088 -0.151357 -0.103219  0.410599
gene3  0.144044  1.454274  0.761038  0.121675  0.443863
>>> discr_data = bikmeans_simple(data=data, nb_bins=2)
>>> discr_data
        c1   c2   c3   c4   c5
gene1  2.0  1.0  1.0  2.0  2.0
gene2  1.0  2.0  1.0  1.0  1.0
gene3  1.0  2.0  1.0  1.0  1.0
grenadine.Preprocessing.discretization.discretize_genexp(data, method, nb_bins=2, axis=0)[source]

Discretize data into nb_bins intervals with the specified method, along the specified axis.

Parameters:
  • data (pandas.DataFrame or pandas.Series) – dataset to discretize
  • method (str) – method used for discretization, amongst: ‘kmeans’, ‘bikmeans’, ‘ewd’, ‘efd’
  • nb_bins (int) – (default 2) number of intervals in which to discretize data
  • axis (int) – (default 0) whether to discretize each column (0) or each row (1) of data. Ignore this parameter if method is ‘bikmeans’
Returns:

dataframe or series of discretized data, depending on the dimension of passed data

Return type:

pandas.DataFrame or pandas.Series

Examples

>>> from grenadine.Preprocessing.discretization import discretize_genexp
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(3, 5),
                        index=["gene1", "gene2", "gene3"],
                        columns=["c1", "c2", "c3", "c4", "c5"])
>>> data
             c1        c2        c3        c4        c5
gene1  1.764052  0.400157  0.978738  2.240893  1.867558
gene2 -0.977278  0.950088 -0.151357 -0.103219  0.410599
gene3  0.144044  1.454274  0.761038  0.121675  0.443863
>>> discr_data = discretize_genexp(data=data, method='efd')
>>> discr_data
        c1   c2   c3   c4   c5
gene1  1.0  0.0  1.0  1.0  1.0
gene2  0.0  1.0  0.0  0.0  0.0
gene3  1.0  1.0  1.0  1.0  1.0

grenadine.Preprocessing.rnaseq_normalization module

This module normalizes RNAseq gene expression data.

grenadine.Preprocessing.rnaseq_normalization.DEseq2(raw_counts, col_data, rlog=True)[source]

Apply the R DESeq2 normalization.

Parameters:
  • raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
  • col_data (pandas.DataFrame) – two columns, one with the id of each condition (individuals) and one with the experiment id (if there are several repetitions)
Returns:

Normalized counts

Return type:

pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.rnaseq_normalization import DEseq2
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> raw_counts = pd.DataFrame(np.random.randint(0,1000,(20,10)),
                              columns = ["Z"+str(i) for i in range(10)])
>>> col_data = pd.DataFrame([["Z0","1"],
                             ["Z1","2"],
                             ["Z2","3"],
                             ["Z3","4"],
                             ["Z4","5"],
                             ["Z5","6"],
                             ["Z6","7"],
                             ["Z7","8"],
                             ["Z8","9"],
                             ["Z9","10"]
                             ],columns=["individuals","conditions"])
>>> raw_counts.columns = col_data["individuals"]
>>> col_data.index = col_data['individuals']
>>> DEseq2(raw_counts,col_data,rlog=False)
individuals          X0          X1     ...              X8          X9
0            408.025477  382.991634     ...        7.745300  611.474516
1            165.238388  516.593367     ...      270.224902  596.251084
2            289.912839  377.510537     ...      727.197585   60.893728
3            463.502625  627.585575     ...      385.543809  718.884285
4             59.056319  674.174898     ...      364.029087  243.574911
5            573.263865  181.561329     ...      129.948918  570.878697
6            304.229522  314.477925     ...      802.068816   44.824550
7            537.472156  376.825400     ...       36.144732  373.819828
8            323.914962  608.401737     ...      748.712307  100.643800
9            464.695682  294.608949     ...      781.414683  535.357356
10           559.543710   57.551516     ...      112.737140  822.065324
11           517.786716  123.324676     ...      618.763389  768.783312
12           222.505121  584.421939     ...       81.755942  166.612005
13           361.496256  175.395095     ...      333.047888  515.905193
14           330.476775  666.638390     ...      779.693506  312.926101
15           331.073304  653.620785     ...      493.978005  787.389729
16           437.851901   84.271862     ...      483.650938  347.601696
17           466.485268   28.090621     ...      750.433484    9.303208
18           459.326926  210.337087     ...      149.742461  468.543405
19           221.312064  126.065225     ...      662.653421  435.559302
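
DESeq2's default size-factor estimation is the median-of-ratios method. The sketch below illustrates that idea in plain Python, under the assumption that rlog=False returns raw counts divided by these size factors. It does not call R, and the names size_factors_sketch and median_of_ratios_sketch are hypothetical.

```python
import math
from statistics import median


def size_factors_sketch(counts):
    """counts[g][s] is the raw count of gene g in sample s."""
    nb_samples = len(counts[0])
    ratios = [[] for _ in range(nb_samples)]
    for gene_counts in counts:
        # A gene with any zero count has a zero geometric mean and is skipped.
        if min(gene_counts) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / nb_samples
        geo_mean = math.exp(log_geo_mean)
        for s, c in enumerate(gene_counts):
            ratios[s].append(c / geo_mean)
    # One size factor per sample: the median ratio over the retained genes.
    return [median(r) for r in ratios]


def median_of_ratios_sketch(counts):
    """Assumption: normalized count = raw count / size factor of its sample."""
    factors = size_factors_sketch(counts)
    return [[c / f for c, f in zip(gene_counts, factors)] for gene_counts in counts]


# Second sample sequenced twice as deeply as the first.
counts = [[10, 20], [20, 40], [30, 60]]
factors = size_factors_sketch(counts)
normalized = median_of_ratios_sketch(counts)
```

In this toy input the second sample has exactly twice the depth of the first, so its size factor is twice as large and both normalized columns coincide.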
grenadine.Preprocessing.rnaseq_normalization.RPK(raw_counts, seq_lengths, seq_in_kb=False)[source]

Reads Per Kilobase normalization.

Parameters:
  • raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
  • seq_lengths (pandas.Series) – DNA sequence length of each gene
  • seq_in_kb (bool) – True if the lengths are given in kb, False otherwise
Returns:

Normalized counts

Return type:

pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.rnaseq_normalization import RPK
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> seq_lengths = np.random.randint(100,20000,nb_genes)
>>> seq_lengths = pd.Series(seq_lengths)
>>> rpk = RPK(raw_counts, seq_lengths)
>>> rpk.head()
               0              1              2              3              4
0  321202.997719   99612.577387  142010.101010   38433.365917  313911.697621
1   26853.843441  155566.114245   63431.417489   53620.768688   21611.248237
2   97195.319962   71353.390640   59624.960204  117133.237822  117212.034384
3  132006.796941   72465.590484  356436.703483  256785.896347  229981.733220
4   48384.227419   34354.424576   18889.143614   37956.492944   45220.490091
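
The arithmetic behind RPK is a single division per gene. Here is a minimal plain-Python sketch, assuming that seq_in_kb=False means the lengths are given in bases; rpk_sketch is a hypothetical name and not the grenadine function.

```python
def rpk_sketch(raw_counts, seq_lengths, seq_in_kb=False):
    """raw_counts[g][c]: count of gene g in condition c; seq_lengths[g]: length of gene g."""
    result = []
    for gene_counts, length in zip(raw_counts, seq_lengths):
        # Assumption: lengths are in bases unless seq_in_kb is True.
        length_kb = length if seq_in_kb else length / 1000
        result.append([count / length_kb for count in gene_counts])
    return result


raw_counts = [[500, 1000], [300, 600]]
seq_lengths = [2000, 500]  # in bases
print(rpk_sketch(raw_counts, seq_lengths))  # [[250.0, 500.0], [600.0, 1200.0]]
```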
grenadine.Preprocessing.rnaseq_normalization.RPKM(raw_counts, seq_lengths, seq_in_kb=False)[source]

Reads Per Kilobase Million normalization (closely related to FPKM, Fragments Per Kilobase Million).

Parameters:
  • raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
  • seq_lengths (pandas.Series) – DNA sequence length of each gene
  • seq_in_kb (bool) – True if the lengths are given in kb, False otherwise
Returns:

Normalized counts

Return type:

pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.rnaseq_normalization import RPKM
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> seq_lengths = np.random.randint(100,20000,nb_genes)
>>> seq_lengths = pd.Series(seq_lengths)
>>> rpkm = RPKM(raw_counts, seq_lengths)
>>> rpkm.head()
            0           1           2           3           4
0  649.733415  201.368439  291.638511   76.398582  628.676848
1   54.320288  314.479420  130.265692  106.588393   43.281252
2  196.607901  144.242035  122.448576  232.839698  234.742741
3  267.024989  146.490365  731.994898  510.443933  460.588733
4   97.872216   69.448026   38.791619   75.450645   90.563924
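
RPKM combines a depth correction (reads per million) with a length correction (per kilobase). A minimal plain-Python sketch of the standard definition follows, assuming lengths in bases when seq_in_kb=False; rpkm_sketch is a hypothetical name.

```python
def rpkm_sketch(raw_counts, seq_lengths, seq_in_kb=False):
    """raw_counts[g][c]: count of gene g in condition c; seq_lengths[g]: length of gene g."""
    nb_conditions = len(raw_counts[0])
    # Total reads of each condition, expressed in millions.
    millions = [sum(row[c] for row in raw_counts) / 1e6 for c in range(nb_conditions)]
    result = []
    for gene_counts, length in zip(raw_counts, seq_lengths):
        # Assumption: lengths are in bases unless seq_in_kb is True.
        length_kb = length if seq_in_kb else length / 1000
        result.append([count / m / length_kb for count, m in zip(gene_counts, millions)])
    return result


# The second condition has twice the sequencing depth of the first.
raw_counts = [[250_000, 500_000], [750_000, 1_500_000]]
seq_lengths = [2000, 500]  # in bases
print(rpkm_sketch(raw_counts, seq_lengths))
# [[125000.0, 125000.0], [1500000.0, 1500000.0]]
```

Because the depth difference is divided out, both conditions end up with identical values in this toy input.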
grenadine.Preprocessing.rnaseq_normalization.RPM(raw_counts)[source]

Reads Per Million.

Parameters: raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
Returns: Normalized counts
Return type: pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.rnaseq_normalization import RPM
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> rpm = RPM(raw_counts)
>>> rpm.head()
            0            1            2            3            4
0  1994.031850   617.999738   895.038590   234.467249  1929.409246
1   308.104674  1783.727269   738.867008   604.569366   245.491264
2  1235.090833   906.128463   769.221953  1462.698984  1474.653899
3   628.576824   344.838319  1723.115991  1201.585019  1084.225878
4  1921.133736  1363.195297   761.440687  1481.020714  1777.679267
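
RPM rescales each condition so that its counts sum to one million. A minimal plain-Python sketch of that definition (rpm_sketch is a hypothetical name, not the grenadine function):

```python
def rpm_sketch(raw_counts):
    """raw_counts[g][c]: count of gene g in condition c."""
    nb_conditions = len(raw_counts[0])
    # Total number of reads in each condition (column sums).
    totals = [sum(row[c] for row in raw_counts) for c in range(nb_conditions)]
    return [[count / total * 1e6 for count, total in zip(row, totals)]
            for row in raw_counts]


raw_counts = [[250_000, 500_000], [750_000, 1_500_000]]
print(rpm_sketch(raw_counts))  # [[250000.0, 250000.0], [750000.0, 750000.0]]
```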
grenadine.Preprocessing.rnaseq_normalization.TPM(raw_counts, seq_lengths, seq_in_kb=False)[source]

Transcripts Per Million normalization.

Parameters:
  • raw_counts (pandas.DataFrame) – raw RNAseq counts where rows are genes and columns are conditions
  • seq_lengths (pandas.Series) – DNA sequence length of each gene
  • seq_in_kb (bool) – True if the lengths are given in kb, False otherwise
Returns:

Normalized counts

Return type:

pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.rnaseq_normalization import TPM
>>> import numpy as np
>>> np.random.seed(0)
>>> import pandas as pd
>>> nb_genes = 1000
>>> nb_conditions = 5
>>> raw_counts = np.random.randint(0,1e6,(nb_genes,nb_conditions))
>>> raw_counts = pd.DataFrame(raw_counts)
>>> seq_lengths = np.random.randint(100,20000,nb_genes)
>>> seq_lengths = pd.Series(seq_lengths)
>>> tpm = TPM(raw_counts, seq_lengths)
>>> tpm.head()
             0            1            2            3            4
0  2455.468465   739.530213  1103.147117   265.510632  2397.256398
1   205.286894  1154.932887   492.740902   370.430324   165.039097
2   743.019352   529.732184   463.172003   809.195846   895.115733
3  1009.139172   537.989227  2768.832068  1773.963432  1756.306584
4   369.878069   255.049468   146.732550   262.216233   345.336316
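
TPM applies the length correction first and the depth correction second, so every condition sums to one million. A minimal plain-Python sketch of the standard definition, assuming lengths in bases when seq_in_kb=False (tpm_sketch is a hypothetical name):

```python
def tpm_sketch(raw_counts, seq_lengths, seq_in_kb=False):
    """raw_counts[g][c]: count of gene g in condition c; seq_lengths[g]: length of gene g."""
    # Step 1: reads per kilobase for every gene and condition.
    rpk = []
    for gene_counts, length in zip(raw_counts, seq_lengths):
        # Assumption: lengths are in bases unless seq_in_kb is True.
        length_kb = length if seq_in_kb else length / 1000
        rpk.append([count / length_kb for count in gene_counts])
    # Step 2: rescale each condition so that its values sum to one million.
    nb_conditions = len(raw_counts[0])
    totals = [sum(row[c] for row in rpk) for c in range(nb_conditions)]
    return [[value / total * 1e6 for value, total in zip(row, totals)] for row in rpk]


raw_counts = [[100, 300], [300, 100]]
seq_lengths = [1000, 3000]  # in bases
tpm = tpm_sketch(raw_counts, seq_lengths)
```

In the first condition both genes have the same reads per kilobase (100), so each receives half of the million.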
grenadine.Preprocessing.rnaseq_normalization.log(X, base=10, pseudocount=1)[source]

Add a pseudocount and apply the log transformation with a given base.

Parameters:
  • X (pandas.DataFrame or numpy.array) – gene expression matrix
  • base (float) – logarithm base
  • pseudocount (float) – pseudocount value
Returns:

log transformed gene expression matrix

Return type:

pandas.DataFrame or numpy.array

Examples

>>> from grenadine.Preprocessing.rnaseq_normalization import log
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
                    index=["c1", "c2", "c3", "c4", "c5"],
                    columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> pseudocount = -np.min(data.values)+1
>>> log_data = log(data, pseudocount=pseudocount)
>>> log_data
       gene1     gene2     gene3     gene4     gene5
c1  0.725670  0.596943  0.656264  0.762970  0.734043
c2  0.410897  0.653509  0.531687  0.537790  0.598089
c3  0.567853  0.699600  0.634883  0.565218  0.601718
c4  0.589577  0.703039  0.524764  0.587268  0.431186
c5  0.000000  0.623932  0.645169  0.448834  0.765128
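
The transformation amounts to log_base(X + pseudocount). The plain-Python sketch below (log_sketch is a hypothetical name) also shows why the example above picks pseudocount = 1 - min(X): the smallest entry is shifted to 1 and therefore maps to 0.

```python
import math


def log_sketch(X, base=10, pseudocount=1):
    """Add the pseudocount to every entry, then take the logarithm in the given base."""
    return [[math.log(value + pseudocount, base) for value in row] for row in X]


X = [[1.764052, 0.400157], [-2.552990, 0.653619]]
smallest = min(min(row) for row in X)
# Shift the data so that the minimum becomes 1, whose logarithm is 0.
shifted_log = log_sketch(X, pseudocount=1 - smallest)
```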

grenadine.Preprocessing.standard_preprocessing module

This module pre-processes gene expression data.

grenadine.Preprocessing.standard_preprocessing.cat_gene_expression_dfs(gene_expression_dfs)[source]

Concatenate different gene expression datasets, based on gene id (rows).

Parameters: gene_expression_dfs (list of pandas.DataFrame) – Expression datasets list
Returns: concatenated gene expression datasets
Return type: pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.standard_preprocessing import cat_gene_expression_dfs
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data1 = pd.DataFrame(np.random.randn(3, 3),
                    index=["gene1", "gene2", "gene3"],
                    columns=["c1", "c2", "c3"])
>>> data1
             c1        c2        c3
gene1  1.764052  0.400157  0.978738
gene2  2.240893  1.867558 -0.977278
gene3  0.950088 -0.151357 -0.103219
>>> data2 = pd.DataFrame(np.random.randn(3, 3),
                    index=["gene2", "gene3", "gene4"],
                    columns=["c4", "c5", "c6"])
>>> data2
             c4        c5        c6
gene2  0.410599  0.144044  1.454274
gene3  0.761038  0.121675  0.443863
gene4  0.333674  1.494079 -0.205158
>>> data=cat_gene_expression_dfs([data1, data2])
>>> data
             c1        c2        c3        c4        c5        c6
gene1  1.764052  0.400157  0.978738       NaN       NaN       NaN
gene2  2.240893  1.867558 -0.977278  0.410599  0.144044  1.454274
gene3  0.950088 -0.151357 -0.103219  0.761038  0.121675  0.443863
gene4       NaN       NaN       NaN  0.333674  1.494079 -0.205158
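
The output above has the shape of an outer join on gene ids: every gene seen in any dataset gets a row, and conditions it was not measured in are filled with NaN. The plain-Python sketch below reproduces that behaviour with dictionaries; cat_expression_sketch and the dict-of-dicts layout are hypothetical stand-ins for the DataFrame-based function.

```python
import math


def cat_expression_sketch(datasets):
    """Each dataset maps a gene id to a {condition: value} dictionary."""
    genes, conditions = [], []
    # Collect gene ids and condition names in order of first appearance.
    for dataset in datasets:
        for gene, row in dataset.items():
            if gene not in genes:
                genes.append(gene)
            for condition in row:
                if condition not in conditions:
                    conditions.append(condition)
    merged = {}
    for gene in genes:
        # Start from NaN everywhere, then fill in the values that exist.
        merged[gene] = {condition: math.nan for condition in conditions}
        for dataset in datasets:
            merged[gene].update(dataset.get(gene, {}))
    return merged


data1 = {"gene1": {"c1": 1.0}, "gene2": {"c1": 2.0}}
data2 = {"gene2": {"c2": 3.0}, "gene3": {"c2": 4.0}}
merged = cat_expression_sketch([data1, data2])
```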
grenadine.Preprocessing.standard_preprocessing.columns_matrix_OT_norm(X, reference=None, bins=None, **SinkhornTransport_para)[source]

Use optimal transport to make the distributions of all conditions alike.

Parameters:
  • X (pandas.DataFrame) – gene expression matrix
  • reference (numpy.array) – reference distribution
  • bins (numpy.array) – bins for percentiles computation
  • SinkhornTransport_para – ot.da.SinkhornTransport parameters
Returns:

Normalized matrix

Return type:

pandas.DataFrame

Examples

>>> from grenadine.Preprocessing.standard_preprocessing import columns_matrix_OT_norm
>>> import numpy as np
>>> import pandas as pd
>>> a = pd.DataFrame(np.random.randn(10000,10))
>>> b = pd.DataFrame(np.random.randn(10000,10)*3+4)
>>> bins = list(range(1,100))
>>> b_ = columns_matrix_OT_norm(b,a.iloc[:,0],bins,reg_e=5e-1)
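
For intuition only: in one dimension, unregularized optimal transport with a convex cost reduces to matching quantiles. The sketch below implements that simpler quantile matching in plain Python. It is a swapped-in technique, not the entropic Sinkhorn transport used by this function, and match_to_reference is a hypothetical name.

```python
def match_to_reference(column, reference):
    """Replace each value by the reference quantile that has the same rank."""
    ordered_ref = sorted(reference)
    n = len(column)
    # Indices of the column sorted by value (ties keep their original order).
    order = sorted(range(n), key=lambda i: column[i])
    mapped = [0.0] * n
    for rank, i in enumerate(order):
        p = rank / (n - 1) if n > 1 else 0.0
        # Linearly interpolated quantile of the reference at level p.
        pos = p * (len(ordered_ref) - 1)
        low = int(pos)
        high = min(low + 1, len(ordered_ref) - 1)
        mapped[i] = ordered_ref[low] + (pos - low) * (ordered_ref[high] - ordered_ref[low])
    return mapped


print(match_to_reference([10, 30, 20], [1, 2, 3]))  # [1.0, 3.0, 2.0]
```

The column keeps its ordering but takes on the values of the reference distribution, which is the effect the normalization aims for.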
grenadine.Preprocessing.standard_preprocessing.mean_std_polishing(A, nb_iterations=5)[source]

Iterative z-score on rows and columns.

Parameters:
  • A (pandas.DataFrame or numpy.array) – matrix
  • nb_iterations (int) – number of polishing iterations
Returns:

Polished matrix

Return type:

pandas.DataFrame or numpy.array

Examples

>>> from grenadine.Preprocessing.standard_preprocessing import mean_std_polishing
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
                    index=["c1", "c2", "c3", "c4", "c5"],
                    columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> norm_data = mean_std_polishing(data)
>>> norm_data
       gene1     gene2     gene3     gene4     gene5
c1  0.336095 -1.618781  0.187436  1.109617 -0.014367
c2 -0.321684  0.586608 -1.606905  0.484159  0.857821
c3  0.139260  0.860934  0.976541 -1.395814 -0.580921
c4  1.243263  0.421752 -0.585940  0.282319 -1.361394
c5 -1.363323 -0.161066  0.826375 -0.421998  1.120013
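
A minimal plain-Python sketch of the idea, alternating z-scores on columns and rows. The order (columns first, then rows), the use of the sample standard deviation, and the names standardize and polish_sketch are assumptions, so the result is not expected to match grenadine's output exactly.

```python
from statistics import mean, stdev


def standardize(values):
    """z-score a list of values using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def polish_sketch(matrix, nb_iterations=5):
    """matrix is a list of rows; each iteration standardizes columns, then rows."""
    for _ in range(nb_iterations):
        # z-score every column, then transpose back to a list of rows.
        columns = [standardize(list(col)) for col in zip(*matrix)]
        matrix = [list(row) for row in zip(*columns)]
        # z-score every row.
        matrix = [standardize(row) for row in matrix]
    return matrix


polished = polish_sketch([[1.0, 2.0, 4.0], [3.0, 1.0, 0.0], [2.0, 5.0, 1.0]], nb_iterations=3)
```

A single z-score can only standardize one axis at a time; repeating the two steps moves the matrix toward a state where rows and columns are both close to mean 0 and standard deviation 1.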
grenadine.Preprocessing.standard_preprocessing.median_outliers_filter(X, threshold=3)[source]

Ensures that all the values of X are within \(median(X) \pm \tau \times MAD(X)\), where MAD is the median absolute deviation.

Parameters:
  • X (pandas.DataFrame or numpy.array) – gene expression matrix (for instance)
  • threshold (float) – \(\tau\) threshold
Returns:

X without outliers (outliers set to the extreme values allowed)

Return type:

pandas.DataFrame or numpy.array

Examples

>>> from grenadine.Preprocessing.standard_preprocessing import median_outliers_filter
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
                    index=["c1", "c2", "c3", "c4", "c5"],
                    columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> median_outliers_filter(data)
       gene1     gene2     gene3     gene4     gene5
c1  1.764052  0.400157  0.978738  0.674682  1.867558
c2 -0.977278  0.950088 -0.653101 -0.103219  0.410599
c3  0.144044  1.454274  0.761038  0.121675  0.443863
c4  0.333674  1.494079 -0.653101  0.313068 -0.854096
c5 -2.552990  0.653619  0.864436 -0.674682  2.269755
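
The bound in the formula can be sketched for a single list of values in plain Python. This follows the formula as written and clips outliers to the nearest allowed bound; median_filter_sketch is a hypothetical name, and grenadine's own output may differ in detail.

```python
from statistics import median


def median_filter_sketch(values, threshold=3):
    """Clip values to the interval median +/- threshold * MAD."""
    center = median(values)
    # Median absolute deviation: median distance to the median.
    mad = median([abs(v - center) for v in values])
    low, high = center - threshold * mad, center + threshold * mad
    return [min(max(v, low), high) for v in values]


print(median_filter_sketch([1, 2, 3, 4, 100]))  # [1, 2, 3, 4, 6]
```

Here the median is 3 and the MAD is 1, so the allowed interval is [0, 6] and only the value 100 is pulled back to 6.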
grenadine.Preprocessing.standard_preprocessing.z_score(A, axis=0)[source]

Compute the z-score along the specified axis.

Parameters:
  • A (pandas.DataFrame or numpy.array) – matrix
  • axis (int) – 0 for columns and 1 for rows
Returns:

Normalized matrix

Return type:

pandas.DataFrame or numpy.array

Examples

>>> from grenadine.Preprocessing.standard_preprocessing import z_score
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> data = pd.DataFrame(np.random.randn(5, 5),
                    index=["c1", "c2", "c3", "c4", "c5"],
                    columns=["gene1", "gene2", "gene3", "gene4", "gene5"])
>>> norm_data = z_score(data)
>>> norm_data
       gene1     gene2     gene3     gene4     gene5
c1  1.254757 -1.222682  0.914682  1.672581  0.828015
c2 -0.446591 -0.083589 -1.038607 -0.418644 -0.331945
c3  0.249333  0.960749  0.538403 -0.218012 -0.305461
c4  0.367024  1.043200 -1.131598 -0.047267 -1.338834
c5 -1.424523 -0.697678  0.717120 -0.988659  1.148225
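
For one column, the z-score is (x - mean) / standard deviation. The plain-Python sketch below uses the sample standard deviation (n - 1 denominator), which is the choice consistent with the gene1 column of the example above; z_score_column is a hypothetical name.

```python
from statistics import mean, stdev


def z_score_column(values):
    """Center the values on their mean and scale by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # n - 1 denominator
    return [(v - m) / s for v in values]


# gene1 column of the example data (axis=0 standardizes each column).
gene1 = [1.764052, -0.977278, 0.144044, 0.333674, -2.552990]
z = z_score_column(gene1)
```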

Module contents

This submodule contains several pre-processing techniques for gene expression datasets (standardizations, discretizations and RNAseq normalization).