18.24 Rename File Based on PDF Contents

20201115

A simple use case begins with a collection of locally saved monthly bank statements, each named something like 20140913_kt_odbc_saver.pdf. The aim is to rename each file by appending the statement's final balance to the filename. For example, append _32k if the statement's final balance is something like $31,745.34, resulting in 20140913_kt_odbc_saver_32k.pdf.
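As a minimal sketch of the naming rule only, the target filename can be assembled from the original name and a suffix. The function name target_name is hypothetical and used here just for illustration; the suffix is supplied by hand rather than read from a statement.

```shell
# Hypothetical helper: build the new filename from an existing .pdf
# filename and a balance suffix such as 32k.
target_name() {
  # basename strips the directory part and the trailing .pdf extension.
  echo "$(basename "$1" .pdf)_$2.pdf"
}

target_name 20140913_kt_odbc_saver.pdf 32k
```

Nothing is renamed here: the function only prints the name that the file would be given.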

Using pdf2txt I noticed that the balance extracted from the PDF is the only dollar amount starting in column 1. Thus we can build a command to rename the files, beginning with a for loop bracketed by do and done. Within the loop, echo constructs the start of the rename command: mv, the original filename, and the base name of the target filename as extracted by basename. To this base name we append the dollar amount after some processing. The processing extracts just the line holding the dollar amount using egrep, collapses adjacent repeated lines into a single value using uniq, deletes the dollar sign and commas using tr, converts large numbers into SI format using numfmt, and prints the amount followed by the final extension using awk. Each line of output is then a fully formed mv statement, which is executed by passing it to sh:

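for f in *saver.pdf; do
  # Emit the start of the mv command: source file and target base name.
  echo -n "mv $f $(basename "$f" .pdf)_"
  # Extract the balance, strip $ and commas, convert to SI form (e.g. 32k).
  pdf2txt "$f" | egrep '^\$[1-9]' | uniq | tr -d '$,' |
    numfmt --to=si --round=nearest | tr 'K' 'k' |
    awk '{print $1".pdf"}'
done | sh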
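The processing of the dollar amount can be tried on its own before involving any PDF files. The following is a sketch only: the function name balance_suffix is hypothetical, and the amount is passed in as a string rather than extracted from a statement.

```shell
# Hypothetical helper: convert a balance string such as '$31,745.34'
# into a short filename suffix.
balance_suffix() {
  # Remove the dollar sign and commas, convert to SI format rounding
  # to nearest, then force the kilo prefix to lower case.
  printf '%s\n' "$1" |
    tr -d '$,' |
    numfmt --to=si --round=nearest |
    tr 'K' 'k'
}

balance_suffix '$31,745.34'
```

For the example balance this prints 32k, matching the suffix used in the target filename above.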

Before you run this command do check each step along the way. In particular, the pattern used to extract the dollar amount of interest will be different for different types of statements. Sometimes it might be embedded in a line that begins with the string , for example.
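One way to check the pattern is to run it over sample text first. The sketch below assumes some invented lines standing in for the output of pdf2txt, and a hypothetical function name, find_balance_lines. It uses grep -E, which is equivalent to egrep.

```shell
# Hypothetical helper: keep only lines that start with a dollar sign
# followed by a non-zero digit, as in the rename loop.
find_balance_lines() {
  grep -E '^\$[1-9]'
}

# Invented sample text for illustration, not taken from a real statement.
printf '%s\n' \
  'Opening balance $1,200.00' \
  '$31,745.34' \
  '  Interest paid $12.10' |
  find_balance_lines
```

Only the line $31,745.34 is printed, because it is the only one with a dollar amount starting in column 1. Similarly, dropping the final | sh from the rename loop prints the mv commands for inspection instead of executing them.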



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0