Web Scraping with Python for a Friend
Have you ever received a request from a friend asking if something could be done easier by writing custom code? I try to embrace these types of interactions as they usually provide an opportinuty to learn something about a different industry, and they offer a nice distraction from my normal projects.
Earlier today, I had one such opportunity when my realtor friend Fred reached out to me:
Hopefully you have an easy (quick) answer to this...
Hopefully you have an easy (quick) answer to this...
I am attempting to capture this list of Realtors to an Excel file. When I copy and Paste the entire list to Excel, each line prints to one cell with no spaces or commas between the entries. My only solution has been to manually enter commas between each entry, then run Text to Columns in Excel. There are only a little over 300 lines, so it's do-able but not fun.
Is there an easy way to copy and paste with keeping the columns intact? Can you point me to something to look at?
fyi... I googled "Internet list to Excel". Looks like there is help on-line for what I want to do.
I would still appreciate your input/direction if it's not too much work for you... but I also want you to know I haven't given up!
I will get back to it again this afternoon.
I see what you mean.
I looked at the html source and saw that they did not use a table element for the data. They used an unordered list, which is why you don't get any spaces with copying.
I could write a script which scrapes the data out. I wouldn't be able to write anything until later this evening though...
How long would that take? And how complicated is that?
Don't want to cause you too much work, but if I could use it repeatedly that would really be great!
I finished the script. Do you want the data in a CSV, TSV, or Excel file?
Excel is the desired final format, I think I can get there from CSV.
Is this a script that I could also run?
You could probably run it. It's a Python script, as a Jupyter notebook. Once you have Python & Jupyter installed, it's simply a matter of pressing a play button for each step.
Amazing!!!
Thank You very much!!!
I am absolutely amazed that you made that seem so easy!
No problem. It was a fun distraction.
Monroe MCAR Realtor Scraper¶
Import urllib2, lxml.html, and xlsxwriter
import urllib2
import lxml.html
import xlsxwriter
Retrieve the html page as a string¶
url = "http://www.priv.monroe.xmlsweb.com/public_roster.asp" # This is the url of the iframe
resp = urllib2.urlopen(url)
page = resp.read()
Parse the html¶
Get the tree structure from the page
tree = lxml.html.fromstring(page)
Using xpath, get all of the span nodes with a class attribute of box20l. In order to find this out, you'll need to use Chrome's element inspector in the Developer Tools. Or, Safari's Web Inspector in the Develop menu.
nodes = tree.xpath('//span[@class="box20l"]/node()')
Below is a function which chunks data into a specified size. This is needed since xpath returns all of the spans in a continuous sequence. In this case, we know there are 7 columns.
# see: http://stackoverflow.com/a/434328
def chunker(seq, size):
return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))
Chunk through the nodes in groups of 7 (since there are 7 columns). If a node is an anchor tag, return the value of it's href attribute. Otherwise, return the string representation of the node since it's the span's text.
rows = []
for node_group in chunker(nodes, 7):
row = []
for node in node_group:
if isinstance(node, lxml.html.HtmlElement):
row.append(node.xpath('@href')[0])
else:
row.append(str(node)) # lxml.etree._ElementStringResult
rows.append(row)
Save to an Excel file¶
workbook = xlsxwriter.Workbook('monroe-mcar-realtors.xlsx')
worksheet = workbook.add_worksheet('Realtors')
for row_index, row in enumerate(rows):
for column_index, column in enumerate(row):
worksheet.write(row_index, column_index, column)
workbook.close()