download as pdf

Short script to filter csv file using the first couple of columns using a different csv file

Example usecase

I have an output csv file from cell profiler called data.csv.

First column is an image number and second column is an object number, the rest are the different measurments.

Based on the visual control I want to:

  • remove measurments single object (object 1) from an image 1
  • remove measurements of all objects from image 2.
  1. I create a file in excel called filter.csv
  2. Put 1 in first column (image) and second column (object)
  3. Put 2 in first column (image), all in second column
  4. Save as csv.
  5. Run the script:
    • using alias csv-filter filter.csv data.csv (the alias must be set in you .bashrc)
    • using the script itself bash filter_csv.sh filter.csv data.csv
  6. Check the filt_date-time_data.csv

data.csv

filter.csv

filt_date-time_data.csv

Files

script
data.csv
filter.csv
filt_date-time_data.csv

Script

#! /bin/bash
time_stamp=$(date +'%d-%b_%Hh%Mm%Ss')
[ $# -lt 2 ] && echo """
Usage:
  $0 <filter> <data-to-filter>

Description:
  This script filters out lines from <data-to-filter> that match entries in <filter>
  Each line in <filter> should contain one or more comma-separated columns.
  Use 'ALL' (case-insensitive) in the second, third etc. column(s) to match any value in given column.
"""  \
	&& exit 1

sed -E 's/,([Aa][Ll][Ll])$/,[^,]*/' "$1" \ # convert the all, ALL to *
  | sed -E 's/^/^/; s/$/,/' \              # make sure the patterns are anchored to beginning and end with a comma
  | grep -E -v -f - "$2" > "filt_${time_stamp}_$2" 2>/dev/null # filter the data file

[ $? -ne 0 ] && echo "Something went wrong"

echo "Filtering finished"