Hvern hefði grunað að örnefni í Bandaríkjunum innihéldu kynþáttaníð? Kannski flesta. En hversu mörg örnefni?
Um daginn barst á póstlista sem ég er á, beiðni um að komast að nákvæmlega því. Og þar sem ég hafði ekkert annað að gera þá beit ég á agnið.
Svona hljómaði beiðnin:
So, I’ve got a massive delimited text file of the names of every
geographic location in the US, a list of racial slurs, and I’m looking
for the names of locations that include a racial slur. Specifically, I’ll need a CSV with the following information for each controversially named location separated into columns: Feature ID, Feature Name, Feature Class, State Name Abbreviation, State Number, County Name, County Number, Primary Latitude DEC and Primary Longitude DEC.
Here are link to the geographic names and a list of racist terms.
PS. I’d like to see how you do this one so I can learn.
Og svona svaraði ég:
Here are the steps I took. I am fairly confident in the results, but
I’m also fairly confident it could be done different/better.
1. The file with geographic names was pipe (|) delimited and not commadelimited. Although not necessary I replaced the pipes with commas. In a terminal:
cat NationalFile_20120416.txt | sed -i ‘s/|/,/g’ > places.csv
2. I had a hunch that a file with 2.221.270 lines would have some errors, so I ran it through the awesome csvclean tool
(http://csvkit.readthedocs.org/en/0.4.4/scripts/csvclean.html). That stripped 645 lines from the file, most of them with the message “Expected 20 columns, found 21 columns”. The error lines get saved toa seperate file, places_err.csv, for inspection and the clean file in a new one, places_out.csv.
3. Next I wanted to remove the columns that were not needed to make the file a little bit more managable (it was at 301mb). So I ran csvcut (http://csvkit.readthedocs.org/en/0.4.4/scripts/csvcut.html)
csvcut -c 1,2,3,4,5,6,7,8,9,10,11 places_out.csv > places_out_cut.csv
but it gave a encoding error. A bit of googling around led me to conclude that the file was utf-8 with BOM. To strip the BOM from the file I used the shell script found here: http://thegreyblog.blogspot.com/2010/09/shell-script-to-find-and-remo… (if you are on OS X then you will have to install seq for it to work – instructions here: http://scruss.com/blog/2008/02/08/seq-for-os-x/
After the BOM cleaning then the csvcut command ran fine and the new file was down to 206 mb.
4. Then I wrote a short python script that processed each line from the file and if found a match from a slur csv file, writes a new file: matched.csv, adding a column with the slur word found.
Unique values: 83
5 most frequent values:
And here they are on a Google Fusion map: http://go.gogn.in/IXcX7m
The files are in a dropbox folder here: http://go.gogn.in/M28DEH
Hér eru niðurstöðurnar (athugið að þetta er algjörlega óyfirfarið – það eru eflaust tvítekningar, sem og staðir sem eiga bara alls ekki heima þarna. Spyrjandinn, Gerald Rich, hefur aðeins unnið með gögnin. Hann lýsir því hér.)
Hérna er svo hægt að skoða Dead Negro Hollow á Google Maps.