County Website Scraping


Hi there,

I am about to have a website scraping tool built, but I am not too sure of everything to consider here. I have given the basic requirements to the developer.

For example, the program should be able to work with different sites: one might be built on .NET, another in plain HTML, and the software should be able to handle these variations.

If anybody has any pointers, that would help; you can PM me as well.

Alternatively, if anyone knows of a good out-of-the-box product that I can purchase or subscribe to, I can do that as well.

Thanks,

David

I have done a lot of data mining for various RE sites like Redfin, Zillow, House Canary, Upnest, Homelight, Rematics, homes.com, and a few county probate sites. What were you looking for?

@Tony Zuanich That's awesome. Are you using APIs? And I'm assuming that you've done some work on Trulia, since I'm interested in data from all of the above. Grabbing data from County Probates and Tax Delinquencies is also on my Christmas wish list.

Please share with me what type of work you're doing and any use cases.

Sorry guys, didn't see this.

Counties are always tricky. I'm not saying it's not possible, but most of them throw up CAPTCHAs (including visual CAPTCHAs), so automating them can be pretty tough. Even if you find someone who can build the scraper, the problem is that county websites keep changing and then the program breaks. For county information, I just found a lot of data providers instead.

But one more thing: some counties have pretty basic websites, so the scraping becomes easier.

I use Zillow for collecting a lot of data. For example, when I'm researching a new area for flipping, I will collect the data and find properties that were bought and sold within a one-year time frame, which means they might have been flipped.

I use this to predict flip prices.

Other use cases (see the sketch after this list):

1. Find properties with the fastest percentage price drop

2. For a specific realtor, what is their list/sell price ratio? This tells you whether they consistently overprice their listings

3. Find properties that sold within a week of hitting the market. This gives me examples of good deals in that market

4. For multifamily (2-4 units), I use it to calculate rental income
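
Here is a minimal sketch of how a few of these metrics could be computed once the scraped listings are sitting in a CSV. The column names (list_price, sale_price, list_date, sale_date, prev_sale_date, agent) are my own assumptions about what you might capture, not Zillow's actual fields:

```python
import pandas as pd

# Hypothetical CSV of scraped listings -- the column names are assumptions.
df = pd.read_csv(
    "zillow_scrape.csv",
    parse_dates=["list_date", "sale_date", "prev_sale_date"],
)

# Likely flips: bought and resold within a one-year window.
hold_days = (df["sale_date"] - df["prev_sale_date"]).dt.days
flips = df[hold_days <= 365]

# Percentage price drop from list price to sale price (a rough motivation signal).
df["pct_drop"] = (df["list_price"] - df["sale_price"]) / df["list_price"]
biggest_drops = df.sort_values("pct_drop", ascending=False).head(20)

# List/sell ratio per agent: consistently above 1 suggests overpriced listings.
agent_ratio = (df["list_price"] / df["sale_price"]).groupby(df["agent"]).mean()

# Quick sales: on the market a week or less, i.e. likely good deals / a hot market.
days_on_market = (df["sale_date"] - df["list_date"]).dt.days
quick_sales = df[days_on_market <= 7]

print(flips[["list_price", "sale_price"]].describe())
print(agent_ratio.sort_values(ascending=False).head())
```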

============

APIs won't solve this problem, as they are restricted in terms of the number of calls you can make, meaning the number of records you can extract, the type of data, and so on.

============

Programming language: I think Selenium with Beautiful Soup makes it easy to build these. If you want a desktop application, then C++ or .NET.

Also keep in mind that you need to use proxies so you won't get banned.
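
A minimal sketch of that Selenium + Beautiful Soup setup, routed through a proxy. The proxy address, URL, and CSS selector are placeholders, and I'm assuming Chrome/chromedriver are available:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

PROXY = "123.45.67.89:8080"  # placeholder -- use a rotating proxy service
URL = "https://www.example-county-site.gov/search"  # placeholder target

options = Options()
options.add_argument("--headless")
options.add_argument(f"--proxy-server=http://{PROXY}")  # route traffic through the proxy

driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)
    # Hand the rendered page to Beautiful Soup for parsing.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for row in soup.select("table tr"):  # selector depends on the site's HTML
        print([cell.get_text(strip=True) for cell in row.find_all("td")])
finally:
    driver.quit()
```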

But once you have done it for one website, the process is essentially the same for the rest.

@David Des Thank you sir. Your response is very consistent with what I've been seeing and hearing as I do my research. But you also provided a few more gems in addition to what I have. So thank you for dropping them on us.

I like the logic that you use for Zillow. Very interested in the "fastest percent price drop" metric, since to me it would indicate a Measure of Motivation which I would not have thought of. I was studying a property a few weeks back and looked in the history and noticed that it had dropped from $240k to $180k in the past year. And it looked like the seller either changed agents or more agents picked up the listing. So I subconsciously interpreted that as motivation but did not bridge that over to something that could be measured and automated. So thanks for that.

Also, identifying an area as "hot" by measuring the quick sales is quite clever. 

As for the API, I was recently talking to a data geek and he advised me on the limitations there: one site I wanted to grab data from had a limit of 500 calls a day. That was disappointing given that I wanted to grab thousands of records... and that's per zip code.

@Will Morris Other than what @David Des just sent, I haven't seen too many responses to my inquiries. The "data science" and automating of the data pulls is not something that is in much practice around here. And I think we are all seeing that when it comes to Probate and Tax Delinquencies, the counties are "all over the map". Some counties have downloadable files... some don't. Some post to their websites, some post only to the local newspapers.

I think many in the Wholesaling community simply hire a VA to manually pull the data, or send Bird Dogs to the courthouses where they take pictures of the records or paperwork and have a VA transcribe them.

@David Des I’ve built a scraper for my county site (Snohomish County in Washington). Here’s how I use this:

I get my lists from title companies, public disclosure requests, the MLS, driving for dollars, etc. I keep all of the parcel numbers for those properties and can then upload them as a CSV file. An API endpoint at AWS is set up with Python to look up each parcel number, go to the county site for that parcel, scrape the HTML for all the data, then push that data to my Podio database. This happens to work amazingly well for my area because the county site's URLs are as simple as:

www.countysite.sample/?=12345  (12345 would be the parcel number)
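
For what it's worth, here is a rough sketch of that parcel-number loop. It assumes a parcels.csv with a parcel_number column and a URL pattern like the placeholder above; push_to_crm just stands in for whatever Podio/CRM API call or database insert you use:

```python
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL pattern modeled on the example above.
COUNTY_URL = "https://www.countysite.sample/?={parcel}"

def push_to_crm(record):
    # Stand-in for the real Podio/CRM push or database insert.
    print(record)

with open("parcels.csv", newline="") as f:
    for row in csv.DictReader(f):  # assumes a 'parcel_number' column
        parcel = row["parcel_number"]
        resp = requests.get(COUNTY_URL.format(parcel=parcel), timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        owner = soup.select_one("#owner-name")  # selector depends on the county page
        push_to_crm({
            "parcel": parcel,
            "owner": owner.get_text(strip=True) if owner else None,
        })
```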

For more advanced sites you could still use a tool like Python's Selenium to actually click through and submit web forms to get to the data you need.

Doing this is extremely powerful. I'm able to take a list of a couple thousand properties and have them all imported into my CRM within an hour... including owner name, owner address, back taxes due, sales history, bedrooms, bathrooms, images, floor plans, etc. If you need guidance on how to spec this out to a developer, just shoot me a message.

Hi All,

I come from a programming background and have built many custom web scrapers over the years. I just wanted to add some information and clarity that some might find useful.

Selenium (which has been mentioned a lot here) is an automated web browser that can be programmed to navigate to certain URLs, fill forms, click buttons, etc. In my experience, I would only use Selenium if the site produces dynamic content within a single page.

If you are looking to build a web scraper on your own, Python is probably the best choice -- especially if you are new to programming. It has packages built to use Selenium. It also has packages for making the web requests (requests) and for parsing the HTML of the webpages (BeautifulSoup, lxml, etc.).
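
For a static page where Selenium isn't needed, a requests + BeautifulSoup script can be very small. A minimal sketch, with a placeholder URL and selector:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/property/12345"  # placeholder

resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()

# "html.parser" ships with Python; lxml is a faster drop-in if installed.
soup = BeautifulSoup(resp.text, "html.parser")

# Pull whatever fields you identified while inspecting the page's HTML.
for cell in soup.select("table.property-details td"):  # placeholder selector
    print(cell.get_text(strip=True))
```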

I would become familiar with the webpage you are targeting by inspecting the HTML, which can be done in any web browser by pressing CTRL+U, and/or by using your browser's web developer tools -- in Chrome, you can open these by pressing CTRL+SHIFT+I.

Good luck and happy scraping!

-Dylan

If you are looking to get data from a specific county's website, it's fairly straightforward; if you want multiple counties, it's more complicated. Give me the county URL and what you are looking for.

I wrote a scraper for my county's site that pulled everything down into a database that I use for searching and filtering.

I also get the code violations and import those as well. Right now I mostly use it for generating lists of out-of-state owners in certain neighborhoods for direct mailers and Facebook ad targeting. I'm definitely going to expand this to other markets over time.
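
As a rough illustration of that filtering step (assuming the scraper has already loaded everything into a SQLite database with a hypothetical properties table), the out-of-state list can be a single query:

```python
import csv
import sqlite3

# Hypothetical schema: properties(parcel, site_address, site_zip,
# owner_name, owner_address, owner_state).
conn = sqlite3.connect("county.db")

rows = conn.execute(
    """
    SELECT parcel, site_address, owner_name, owner_address
    FROM properties
    WHERE owner_state != 'TX'              -- 'TX' is a placeholder for your own state
      AND site_zip IN ('75201', '75204')   -- placeholder target neighborhoods
    """
).fetchall()

# Dump to CSV for a direct-mail or ad-targeting list.
with open("out_of_state_owners.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["parcel", "site_address", "owner_name", "owner_address"])
    writer.writerows(rows)

conn.close()
```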

@David Des

My only advice is to narrow your goals. Know what you're looking for and your metric for success. Scrapers have a very short lifespan (websites change, data sets die out). You'll want to get some actionable data out of your first few days / weeks of research. Also, make sure whatever you get out of it is more valuable than the cost of the work itself.

Don't get me wrong. I understand the temptation to play around with cool toys and technology. It's just important not to get carried away. Focus on your business and your investment goals.

@Trevor Ewen What you said is 200 percent correct, especially "You'll want to get some actionable data out of your first few days / weeks of research."

This is what I learned after wasting a lot of time. Why didn't you come and post this last year? Ha ha, just kidding. But yes, that is solid advice, and the temptation to over-improve stuff and lose the big picture is a big deal.

@David Des Building a scraping tool that is flexible and does it all will be hard, since it looks at the HTML code behind websites, which will be extremely different across sources and may change over time. I think it is better to focus on just a few websites that have most of the info you need, scrape them, save the data into a database, and then build a separate analysis tool. My two cents.

I agree, most sites change. We focus on two programs per county site. The first one extracts all the data and hosts it; the second updates a few hundred records a day. It varies by county, but updating only the SFRs isn't bad, while updating all parcels takes a while. If the site changes, you still have all the data hosted and can rebuild/update the script without being down.
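
A minimal sketch of that second (incremental) program, assuming the full-extract program already filled a hypothetical properties table with a last_updated column; scrape_parcel stands in for whatever per-parcel routine the first program uses:

```python
import sqlite3
from datetime import datetime, timezone

BATCH_SIZE = 300  # roughly "a few hundred records a day"

def scrape_parcel(parcel):
    # Stand-in for the per-parcel scraping routine from the full-extract program.
    return {"parcel": parcel, "scraped_at": datetime.now(timezone.utc).isoformat()}

conn = sqlite3.connect("county.db")

# Re-scrape the stalest records first (hypothetical schema).
stale = conn.execute(
    "SELECT parcel FROM properties ORDER BY last_updated ASC LIMIT ?",
    (BATCH_SIZE,),
).fetchall()

for (parcel,) in stale:
    data = scrape_parcel(parcel)
    conn.execute(
        "UPDATE properties SET last_updated = ? WHERE parcel = ?",
        (data["scraped_at"], parcel),
    )

conn.commit()
conn.close()
```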

By hosting the data, you can create relationships with property owners in other counties and get a better picture of their motivations.

Just an idea ;)

@David Des

I think you've received this advice already. But as a developer who does a fair amount of data mining, I'd say be careful about anyone who offers you an 'all-in-one' solution at a reasonable price.

This is a hard thing to manage, because there are so many variations in county sites and the data they provide. If you are going county by county, imagine developing at least some custom code for each one. That cost scales linearly with the number of counties, and is therefore hard to justify unless you have explicit goals and can pre-filter counties. I wouldn't start with any more than 5 counties.

To echo @Trevor Ewen's advice... doing this for even two counties will take enough up-front time and maintenance that it's probably worth paying a company that has already done the hard part and using an API to connect to their data. I've built scrapers for both Snohomish and King County, WA, but will be using something like ATTOM Data when the time comes for the national CRM I'm building.