4 Pipelines

20201024

A general mlhub philosophy is that a command should write its output in a well-defined text format. Typically this is a consistent csv (comma separated value) format, so that follow-on processes within a pipeline, which might even be other mlhub models, can further process the results. Each mlhub command focuses on its specific task and on implementing that task well, rather than on solving all problems. We can then leave extra processing to other specialist tools, like sed, cut, and awk.

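This example uses the optical character recognition capability provided by the ocr command of the azcv model. Each line of output is a bounding box of eight numbers, then a comma, then the recognised text: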

$ ml ocr azcv ~/.mlhub/azcv/cache/images/mycat.png | head -2
51.0 43.0 668.0 51.0 667.0 85.0 51.0 77.0,My cats name is freckles. She like's to climb up
37.0 97.0 691.0 104.0 690.0 134.0 37.0 128.0,high. She is 2 years old. She likes to play a...

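The first comma on each line separates the bounding box from the text. With GNU sed it can be replaced with a tab:

$ ml ocr azcv ~/.mlhub/azcv/cache/images/mycat.png | head -2 | sed 's/,/\t/'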
51.0 43.0 668.0 51.0 667.0 85.0 51.0 77.0   My cats name is freckles. She like's to cl...
37.0 97.0 691.0 104.0 690.0 134.0 37.0 128.0    high. She is 2 years old. She likes to pla...
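
The \t escape in the sed replacement above is understood by GNU sed but not by every sed implementation. A minimal sketch of a more portable alternative follows. The function name first_comma_to_tab is hypothetical, introduced here only for illustration, and the input is one of the sample lines shown above rather than live output from ml.

```shell
# Replace only the first comma on each line with a literal tab character.
# first_comma_to_tab is a hypothetical helper name, not part of mlhub.
first_comma_to_tab() {
  tab=$(printf '\t')      # a real tab, so sed need not understand \t
  sed "s/,/$tab/"         # no g flag: only the first comma is replaced
}

# A sample line copied from the ocr output above stands in for ml.
printf '%s\n' '37.0 97.0 691.0 104.0 690.0 134.0 37.0 128.0,high. She is 2 years old. She likes to play a...' |
  first_comma_to_tab
```

In a real pipeline the function would simply take the place of the sed command, reading the output of ml ocr azcv from standard input.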

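If you do not care for the bounding boxes that are output by default from the ocr command then simply remove them using cut. The -f2- option keeps every field from the second onward, so any commas within the recognised text are preserved: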

$ ml ocr azcv ~/.mlhub/azcv/cache/images/mycat.png | head -2 | cut -d, -f2-
My cats name is freckles. She like's to climb up
high. She is 2 years old. She likes to play a lot of games.
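
The same output can also be handled with awk when part of the bounding box is wanted. The sketch below prints the first two numbers of each box followed by the text. It assumes that the eight numbers are four x y corner pairs and that the first pair is the top-left corner, which the sample values suggest but the text above does not state. The names sample_ocr and box_corner are hypothetical, and sample_ocr merely replays the two lines shown earlier in place of a call to ml.

```shell
# Stand-in for: ml ocr azcv ~/.mlhub/azcv/cache/images/mycat.png | head -2
# (hypothetical helper that replays the sample output shown above)
sample_ocr() {
cat <<'EOF'
51.0 43.0 668.0 51.0 667.0 85.0 51.0 77.0,My cats name is freckles. She like's to climb up
37.0 97.0 691.0 104.0 690.0 134.0 37.0 128.0,high. She is 2 years old. She likes to play a...
EOF
}

# Print "x,y<TAB>text" using the first two numbers of each bounding box.
# Assumption: those two numbers are the top-left corner of the box.
box_corner() {
  awk '{
    comma = index($0, ",")                      # first comma ends the box
    split(substr($0, 1, comma - 1), box, " ")   # the eight numbers
    text = substr($0, comma + 1)                # everything after it
    printf "%s,%s\t%s\n", box[1], box[2], text
  }'
}

sample_ocr | box_corner
```

Splitting at the first comma, rather than setting the awk field separator to a comma, keeps text that itself contains commas in one piece.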

We can process every jpg image file in a directory, even one holding several hundred files, saving the text from each image into a matching txt file. The following pipeline utilises a for loop, an ml model, and the cut command:

$ for f in images/*.jpg;
  do
    echo "=====> $f";
    ml ocr azcv "$f" |
    cut -d, -f2- > "$(dirname "$f")/$(basename "$f" .jpg).txt";
  done

Change the two instances of jpg to png to process png image files, and similarly for pdf files.
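
Rather than editing the loop for each file type, one loop can cover several extensions. The following is a sketch only: txt_path and ocr_dir are hypothetical names introduced here, and ocr_dir assumes that the ml command and the azcv model are installed and configured. The shell parameter expansion ${1%.*} strips the final extension whatever it is, so the extension no longer needs to be named twice.

```shell
# Derive the txt file name for an image path by stripping its final extension.
# txt_path is a hypothetical helper name, not part of mlhub.
txt_path() {
  printf '%s\n' "${1%.*}.txt"
}

# Run ocr over every jpg and png file in the directory given as $1.
# Assumes ml and the azcv model are installed and configured.
ocr_dir() {
  for f in "$1"/*.jpg "$1"/*.png; do
    [ -e "$f" ] || continue        # skip a pattern that matched no files
    echo "=====> $f"
    ml ocr azcv "$f" | cut -d, -f2- > "$(txt_path "$f")"
  done
}

# Show the name that would be used for one image.
txt_path images/mycat.png
```

Calling ocr_dir images would then process the images directory as in the loop above.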

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0