Extracting relevant columns from csv files

A workflow for a quick look at the papers PubMed lists as “similar” or “cited by”

Download the list of citing or similar articles from PubMed in CSV format

Check the number of lines in the file

wc -l file.csv
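Note that `wc -l` also counts the header row, so the number of articles is one less than the line count. A minimal sketch of a helper that skips the header, assuming the export has exactly one header line and no line breaks inside quoted fields (`count_articles` is a hypothetical name, not part of csvkit):

```shell
# count_articles is a hypothetical helper, not a csvkit command.
# Assumes one header line and no newlines inside quoted fields.
count_articles() {
    # Count every record after the first (header) line; print 0 if there are none.
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Usage: count_articles file.csv
```

If titles or abstracts in the export can contain line breaks, the count from a CSV-aware tool would be more reliable than a plain line count.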

List the column names of the file, with their numbers

csvcut -n file.csv

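Create an output for a quick visual check with title and DOI, most recent first. In the commands below, column 7 is the sort key (reversed, so the most recent comes first) and columns 2 and 11 hold the title and DOI; adjust the numbers if `csvcut -n` shows a different layout.

Option 1: output in the terminal, comma-separated

cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l

Option 2: output in the terminal, formatted as a table

cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l | csvlook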

Option 3: create a table using pandoc in landscape format

cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l | csvlook | pandoc --variable geometry:"landscape, margin=1in" -o table_landscape.pdf

Option 4: create a table using pandoc in default (portrait) format

cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l | csvlook | pandoc -o table_portrait.pdf
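The column numbers used in the options above depend on the layout of the export. Both csvsort and csvcut also accept column names for `-c`, so the same selection can be written by name. This is only a sketch: the header names `Publication Year`, `Title` and `DOI` are assumptions about the PubMed export and should be checked with `csvcut -n file.csv`, and `quick_look_by_name` is a hypothetical helper name.

```shell
# quick_look_by_name is a hypothetical helper, not a csvkit command.
# Assumption: the CSV has headers "Publication Year", "Title" and "DOI";
# confirm with: csvcut -n file.csv
quick_look_by_name() {
    # Sort by year, newest first, then keep title and DOI with line numbers.
    csvsort -r -c "Publication Year" "$1" | csvcut -c "Title,DOI" -l
}

# Usage: quick_look_by_name file.csv | csvlook
```

Selecting by name keeps the pipeline working if the export gains or reorders columns, at the cost of breaking if a header is renamed.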

Outputs:

1. Articles count

2. Column names

3. Table in terminal (Option 1)

3. Table in terminal (Option 2)

3. Table in pdf landscape (Option 3)

3. Table in pdf portrait (Option 4)

TROUBLESHOOTING

The csvkit tutorial is very good

1. Set the correct encoding

When parsing the CSV files from PubMed, some characters are not recognized by default, so the proper encoding needs to be set. Otherwise an error like this appears:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0144' in position 110: character maps to <undefined>

Solution:
Enable Python's UTF-8 mode by exporting PYTHONUTF8 in the current terminal. csvkit is written in Python, so the setting applies to every csvkit command run afterwards in that shell session.

export PYTHONUTF8=1
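To confirm that the setting took effect, the interpreter can be asked directly. A small sketch, assuming Python 3.7 or later and that `python3` on the PATH is the same interpreter csvkit runs under (this may not hold for every installation):

```shell
# Enable Python's UTF-8 mode for this shell session (Python 3.7+).
export PYTHONUTF8=1

# Sanity check: prints 1 when UTF-8 mode is active.
# Assumption: python3 on PATH is the interpreter that csvkit uses.
python3 -c 'import sys; print(sys.flags.utf8_mode)'
```

The export lasts only for the current terminal session. To set it for a single command instead, the variable can be prefixed to that command, for example `PYTHONUTF8=1 csvcut -n file.csv`.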