Raw data is often delivered in Excel sheets with a lot of noise and formatting around it. For analysis in R or other packages, the actual raw data is required. Scripting the “deformatting” in plain text / csv files using shell tools like SED, AWK or Perl to remove excess text in the datasheets makes it possible to rerun the procedure and to track down systematic errors.
Removing empty lines from a plain-text file (.csv, .html, .php, etc.) is very easy with SED in a UNIX / Mac OS shell, and is even possible in the Windows CMD (after installing SED). The following is a blockquote from ZoneO-tips for Mandriva Linux, which I found really useful and well written:
So, open up a konsole and move into the directory where your file resides (cd MyDirectory). And here we go with the two lines that’ll do the job:
sed '/^$/d' myFile > tt
mv tt myFile
Here is what happens:
sed '/^$/d' myFile
removes all empty lines from the file myFile and outputs the result in the console,
> tt
redirects the output into a temporary file called tt,
mv tt myFile
moves the temporary file tt to myFile.
Now, you may have 100 html files to correct at the same time. That’s where foreach comes in… Let’s say you want to correct all files ending with .html, here is what you should do:
Open up a konsole running csh or tcsh (foreach is not available in bash or sh), move into the directory where your html files reside, and type the following commands:
foreach file (*html)
sed '/^$/d' $file > tt
mv tt $file
end
Finished!
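For bash or another POSIX shell, the same batch job can be sketched with a for loop. This is only a minimal sketch and not part of the quoted tip: the function name strip_empty_lines and the temporary file name are made up for illustration. It also assumes that lines holding only spaces, tabs or a Windows carriage return should count as empty, which is common in files exported from Excel, so the pattern is widened from ^$ to ^[[:space:]]*$.

```shell
#!/bin/sh
# strip_empty_lines is a hypothetical helper name, not from the quoted tip.
strip_empty_lines() {
    # Per-file temporary name in the same directory (assumption: the
    # directory is writable), instead of the fixed name "tt".
    tmp="$1.tmp.$$"
    # Delete lines that are empty or contain only whitespace,
    # including a lone carriage return left over from Windows line endings.
    if sed '/^[[:space:]]*$/d' "$1" > "$tmp"; then
        mv "$tmp" "$1"
    else
        # sed failed: leave the original file untouched.
        rm -f "$tmp"
        return 1
    fi
}

for file in *.html; do
    # If no file matches, the pattern stays literal; skip it.
    [ -f "$file" ] || continue
    strip_empty_lines "$file"
done
```

To keep the stricter behaviour of the quoted tip (only truly empty lines are removed), replace the pattern with '/^$/d'. Running this on a copy of the data first is a good idea, since each file is overwritten in place.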