Extracting relevant columns from CSV files
Workflow for a quick look at papers from PubMed's “Similar articles” or “Cited by” lists
Download the “Cited by” or “Similar articles” list from PubMed in CSV format
Check the number of lines in the file (the header row is included in the count)
wc -l file.csv
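Because wc -l also counts the header row, the number of articles is one less than the line count. A minimal sketch, assuming no field contains an embedded line break; sample.csv and its contents are made up for illustration:

```shell
# Create a tiny sample file (hypothetical data) so the sketch is self-contained.
printf 'PMID,Title\n1,First paper\n2,Second paper\n' > sample.csv

# Total number of lines, including the header row.
total=$(wc -l < sample.csv)

# Data rows only: subtract the header line.
articles=$((total - 1))
echo "$articles"
```

With a real export, replace sample.csv with file.csv.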
List the column names of the file
csvcut -n file.csv
Create an output for a quick visual check, with title and DOI, most recent first (in this export, column 2 is the title, column 7 the publication year, and column 11 the DOI)
Option 1: output in the terminal, comma-separated
cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l
Option 2: output in the terminal, formatted as a more readable table
cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l | csvlook
Option 3: create a table using pandoc in landscape format (by default, pandoc needs a LaTeX installation to produce PDF)
cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l | csvlook | pandoc --variable geometry:"landscape, margin=1in" -o table_landscape.pdf
Option 4: create a table using pandoc in default (portrait) format
cat file.csv | csvsort -r -c 7 | csvcut -c 2,11 -l | csvlook | pandoc -o table_portrait.pdf
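Column positions may shift if the export format changes, so it can be safer to address columns by header name; csvkit accepts names as well as indices for -c. A sketch under that assumption, using a made-up sample.csv whose headers (Title, Publication Year, DOI) are assumed to match what csvcut -n reports for the real file:

```shell
# Hypothetical three-column sample; a real PubMed export has more columns.
cat > sample.csv <<'EOF'
Title,Publication Year,DOI
Older paper,2019,10.1000/old
Newer paper,2023,10.1000/new
EOF

# Sort by year, most recent first, then keep title and DOI by header name.
csvsort -r -c "Publication Year" sample.csv | csvcut -c "Title,DOI" > title_doi.csv

# Show the result.
cat title_doi.csv
```

The same named columns can be dropped into any of the four options above in place of 7 and 2,11.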
Outputs:
1. Articles count
2. Column names
3a. Table in terminal (Option 1)
3b. Table in terminal (Option 2)
3c. Table in pdf landscape (Option 3)
3d. Table in pdf portrait (Option 4)
TROUBLESHOOTING
The tutorial is very good
1. Set the correct encoding
When parsing CSV files from PubMed, some characters cannot be handled with the default encoding, so the proper encoding needs to be set. A typical error message:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0144' in position 110: character maps to <undefined>
Solution:
Enable Python's UTF-8 mode in the current terminal session. csvkit is written in Python, so the exported variable applies to every csvkit command run afterwards from that shell.
export PYTHONUTF8=1
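To confirm that the setting is picked up, one can ask Python whether UTF-8 mode is active. A small check, assuming python3 is on the PATH and is version 3.7 or later (the first version that supports PYTHONUTF8):

```shell
# Enable UTF-8 mode for Python programs started from this shell.
export PYTHONUTF8=1

# Ask the interpreter for the flag; it reports 1 when UTF-8 mode is on.
utf8_mode=$(python3 -c 'import sys; print(sys.flags.utf8_mode)')
echo "$utf8_mode"
```

The export only lasts for the current session. To keep it, the line could be added to the shell startup file (for example ~/.bashrc, if bash is the shell in use).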