cleaning up email lists with ruby and regex
Posted by on Jan 22nd, 2007
lately, we have been getting a ton of email lists from clients. these usually consist of THOUSANDS of email addresses that have been gathered on their websites. in every case, the code that gathered the addresses did no validation, so we end up with a bunch of gobbledygook. the list management software that we use ("php list":http://www.phplist.com/) will choke if there are any invalid addresses in the text file, so i came up with this script. it reads the list one line at a time, checks each line against a regex, and writes the lines that match to a _cleaned.txt file and the rest to a _failed.txt file.
i am aware that create_txt_from_array, the function that writes the text file from the array, is clunky. i just cut and pasted it from another program i wrote to do something entirely different. if it turns out to be a bottleneck, i will look at it.
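for what it's worth, here is a minimal sketch of a leaner way to write an array of lines to a file. it is not the function used in the script; `write_lines` is a hypothetical name, and it assumes each element already ends with its own newline (which is how `IO.foreach` hands the lines over).

```ruby
require 'tmpdir'

# write_lines is a hypothetical stand-in for create_txt_from_array.
# Assumes every element of `lines` already carries its trailing newline.
def write_lines(lines, filename)
  # Array#join with no separator concatenates the lines as-is;
  # File.write opens, writes, and closes the file in one call.
  File.write(filename, lines.join)
end

# Small usage demo against a throwaway directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample_cleaned.txt')
  write_lines(["a@example.com\n", "b@example.org\n"], path)
  puts File.read(path)
end
```

for lists of a few thousand addresses either version should be fine; the sketch mostly just saves a few lines.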
anyway.. hope this helps someone out..
# expect exactly one argument: the file holding one address per line
if ARGV.length != 1
  puts "Usage: clean_addresses.rb <input_filename>"
  puts "Will output <input_filename>_cleaned.txt"
  exit
end
filename = ARGV[0]
# File.exist? replaces FileTest::exists?, which was removed in Ruby 3.2
unless File.exist?(filename)
  puts "Input file does not exist.."
  exit
end
successes = Array.new
failures = Array.new
# concatenate every element of the array and write the result to filename
def create_txt_from_array(array, filename)
  text_file = String.new
  array.each do |e|
    text_file << e
  end
  open(filename, 'w') { |f| f << text_file }
end
# strip only the extension, so paths such as ./lists/clients.txt keep their directory
base_name = filename.delete_suffix(File.extname(filename))
output_filename = base_name + '_cleaned.txt'
fails_filename = base_name + '_failed.txt'
# no File.new here: create_txt_from_array opens (and closes) both files itself
puts "Outputting to: " + output_filename
# sort each line into successes or failures, keeping its trailing newline
IO.foreach(filename) do |line|
  if line.chomp =~ /\A[\w\._%-]+@[\w\.-]+\.[a-zA-Z]{2,4}\z/
    successes << line
  else
    failures << line
  end
end
create_txt_from_array(successes, output_filename)
create_txt_from_array(failures, fails_filename)
puts "There were: " + successes.length.to_s + " successes."
puts "There were: " + failures.length.to_s + " failures."
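to get a feel for what the regex lets through, here is a small sketch that runs the same pattern against a handful of made-up addresses. `EMAIL_PATTERN`, `valid_address?` and the sample list are names and data invented for illustration; they are not part of the script. note that the pattern is stricter than real-world addressing: it rejects a `+` in the local part and any top-level domain longer than four letters, so some perfectly deliverable addresses will land in the failures file.

```ruby
# Same pattern as the script, pulled into a constant (hypothetical name).
EMAIL_PATTERN = /\A[\w\._%-]+@[\w\.-]+\.[a-zA-Z]{2,4}\z/

# valid_address? is a hypothetical helper mirroring the script's check:
# strip the trailing newline, then test the whole line against the pattern.
def valid_address?(line)
  line.chomp.match?(EMAIL_PATTERN)
end

# Made-up sample data, one address per entry.
samples = [
  "user@example.com\n",            # accepted
  "first.last@mail.example.org\n", # accepted
  "no-at-sign.example.com\n",      # rejected: no @
  "user@example\n",                # rejected: no top-level domain
  "two words@example.com\n",       # rejected: space in the local part
  "user+tag@example.com\n",        # rejected, although + is legal in practice
  "user@example.museum\n"          # rejected: TLD longer than 4 letters
]

samples.each do |line|
  verdict = valid_address?(line) ? 'ok  ' : 'fail'
  puts "#{verdict} #{line.chomp}"
end
```

if the failures file turns out to contain a lot of addresses like the last two, loosening the character class or the `{2,4}` bound would be the place to start.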