1

I have a file with linked data each two columns ( in this example is just 3, but could be more), in a tab delimited file:

Names   SampleA   Names   SampleB   Names   SamplesC
Name1      5      Name3      7      Name1         8
Name2      9      Name2      1      Name2         2
Name4      4      Name4      8      Name3         8

And so on, what I want is to have a single column with the columns Names with not redundant data, and in this case 3 columns with samples; in those samples that are not values for x name will be fit with 0:

Names     SampleA     SampleB    SampleC
Name1          5          0        8
Name2         9           1        2
Name3         0           7        8
Name4         4           8         0

How can I approach this matrix with pandas ???, I jus have tried with R and Perl, but I think will be easer with python using Pandas !!!

Thanks so much !!!!

abraham
  • 161
  • 1
  • 3

1 Answers1

0

You can do it as follows:

  1. Make the column names unambiguous, so you have no columns with the same name (probably you could avoid this step also if you like, if you access the columns by index, but I would make them unambiguous): do something like:

    your_df.columns = ['NamesA', 'SampleA', 'NamesB', 'SampleB', 'Names', 'SamplesC']

  2. Create dataframes from the column pairs

  3. Join the dataframes with the column parts together and collapse the Names columns into one column

  4. Fill the na values

Testdata:

import pandas as pd
your_df= pd.DataFrame({
        'NamesA': ['Name1', 'Name2', 'Name4'], 
        'SampleA': [5, 9, 4], 
        'NamesB':['Name3', 'Name2', 'Name4'], 
        'SampleB': [5, 9, 4], 
        'NamesC':['Name1', 'Name2', 'Name3'], 
        'SampleC': [8, 2, 8]
        })

Here some example code (beginning with step2):

all_cols= list(your_df.columns)
joined_df= None
while all_cols:
    name_col, sample_col, *all_cols= all_cols
    # in case not all columns in your df are filled
    # you need to handle na values
    filled_indexer= ~your_df[name_col].isna()
    # Step 2:
    col_pair_df= your_df.loc[filled_indexer, [name_col, sample_col]]
    # Step 3:
    if joined_df is None:
        joined_df= col_pair_df
        joined_df.columns= ['Names', sample_col]
    else:
        joined_df= joined_df.merge(col_pair_df, how='outer', left_on='Names', right_on=name_col)
        # now we need to populate the one names column and remove the Names* columns
        names_na_indexer= joined_df['Names'].isna()
        joined_df.loc[names_na_indexer, 'Names']= joined_df[name_col]
        joined_df.drop(name_col, axis='columns', inplace=True)
# Step 4:
joined_df.fillna(0, inplace=True)
jottbe
  • 173