Big question - why hasn't anyone applied data science and machine learning in the real estate domain?
With the recent (past 7 years?) advances in natural language processing and image recognition, and the validation of various machine learning models, why haven't savvy investors started a mad race towards developing the ultimate valuation tool?
I've been dabbling in tackling this problem - and so far, it's not that bad.
The only big problems I see are that the data sources for national markets are not uniform,
that the given data might be incomplete or even contain errors (which will affect your model),
and that the sample set might be orders of magnitude smaller than your feature space (fixable by removing or combining linearly correlated features to yield an orthogonal feature set).
...So why hasn't anyone done this yet?
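To illustrate that last point, here's a rough sketch of one way to prune correlated features before modeling, assuming numpy; the feature names, threshold, and toy data are purely illustrative:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.95):
    """Greedily drop one feature from each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature |correlation|
    keep = []
    for j in range(X.shape[1]):
        # keep feature j only if it is not too correlated with one already kept
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Toy data: sqft and rooms are nearly collinear, year_built is independent.
rng = np.random.default_rng(0)
sqft = rng.uniform(50, 250, 200)
rooms = sqft / 30 + rng.normal(0, 0.1, 200)   # almost a linear function of sqft
year = rng.uniform(1950, 2020, 200)
X = np.column_stack([sqft, rooms, year])

X_reduced, kept = drop_correlated(X, ["sqft", "rooms", "year_built"])
print(kept)  # "rooms" is dropped as nearly collinear with "sqft"
```

This greedy pruning is cruder than a proper PCA, but it gets the idea across: the surviving columns carry nearly all the information of the original set.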
The fact that no one has bitten on this thread is a partial answer to your question. For the past few months, I have had a notification set up on BP for "machine learning," "artificial intelligence," and "singularity," but this is the first ping I've seen.
I'm no AI expert by any definition--just a concerned citizen of Earth trying to make sense of things. That you are already working on using these tools to your advantage puts you ahead of most other investors. How do you imagine deploying them for REI, and for which section of the market? SFRs? Commercial? Something else?
Of course, Redfin and Zillow have a leg up in the residential space, but ultimately, especially in residential RE, there is a human element that can't be replaced by robots. I'm not just talking about emotional buying, but also, it's hard to know (for example with Zillow's zestimates) what the inside of a house feels like and/or if there have been unpermitted updates etc.
Maybe you're talking about larger commercial property?
@Jason Ling - I make my living doing that. For REI it's a long shot; too many other things need to happen. With 700+ Facebook likes, AI knows you better than you know yourself.
@Severin Sadjina Isn't this what you've been working on?
Account Closed - Hey Dani, I haven't thought ahead to the point of deploying the tools just yet. I'm working on the base algorithms, and all the code is just a proof of concept. I've already tried a very simple method of price prediction, using the median price/sqft of SFHs, and it failed spectacularly. Although my mean error rate is 3%, the standard deviation of the error is 20%+... So now I'm starting to investigate whether machine learning will produce a good model.
>>How do you imagine deploying them for REI, and for which section of the market? SFRs? Commercial? Something else?
- As far as technical deployment goes - there are no plans, right now everything is working towards a proof of concept. Bells and whistles will be added if the proof of concept goes well.
- As far as deploying it for REI goes (not sure I understand the question fully), the objective is to identify on-market, public deals far faster than any human possibly could, and to do so analytically, without bias, and consistently. Since it is machine learning, the model would be updated iteratively, so as market conditions change, so does your model. I'm sure that if the idea pans out, many insights will be gleaned.
- Which section of the market? SFR is the only area I can remotely wrap my mind around. Commercial is a big scary monster to me - I don't even know where to start with Commercial. Plus, you must remember that the more data there is the better machine learning will generally work. There are a lot more SFR samples (think home sales) than commercial - therefore I think SFR would be easier to apply machine learning to.
If SFR works out well then I'd spend time on commercial.
- As far as Zillow and Redfin go - I'm honestly surprised at how off Zillow's estimates are. I was convinced they must be using machine learning. Also, as far as I'm aware, neither service alerts you to undervalued deals.
Another thing to note is that the idea can easily be modified to simply analyze the cash-flow potential of homes or even entire regions of the US (although scalability could be a problem).
In the end, if everything goes well, it might end up being a tool that gives you a shorter list of homes on which you'd have to go in and make the final call. But even if it reduces your workload by 80% or increases investment discoveries by 50%, that has to be worthwhile, right?
>> I make my living doing that. For REI it's a long shot; too many other things need to happen. With 700+ Facebook likes, AI knows you better than you know yourself.
@Vivek Khoche - Would you mind elaborating? What do you mean by "too many other things need to happen"? Any insight you'd be willing to share would be greatly appreciated.
@Kim M. - Thanks, I've reached out to him to see if he's willing to share any insights.
@Jason Ling - REI is still pretty much brick and mortar and depends on agents and their expertise. There is a digital transformation happening, with auction platforms like Hubzu and selling platforms like Roofstock. You could be a pioneer introducing the concepts of AI and machine learning to REI, but in my mind you are better off getting into areas that are mature and ready to adopt machine learning. I would be happy to talk about REI or machine learning/AI.
@Vivek Khoche - Thanks Vivek, it sounds like you're saying that the great deals are rarely made public and deal finding happens mostly in off market transactions? If that's the case then yes, machine learning would be of limited use.
Machine learning can only use publicly available and digitally published data to find deals. If it's true that no great deals ever make it to MLS listings or craigslist listings (or whatever) then yes, I agree, no amount of machine learning could ever help!
I would then turn my efforts on trying to predict when a homeowner is willing to sell their property in an effort to try to seek out off-market deals before anyone else is aware of them.
That however, seems like a much much harder problem to solve!
@Jason Ling - I am not saying machine learning can't help REI. But if, at the end of the day, it's about the money, look elsewhere: good data scientists are paid $1M a year these days.
I'm not quite good enough to be paid $1M a year (10 years of experience as a software engineer, but only about a month of studying machine learning), but I'm interested in machine learning enough to use REI as a "practice project".
If my practice yields something of significant value - then all the better!
If not, at least I'm acquiring real life practical experience when it comes to data science.
I have used machine learning (ML) to build models that estimate values/prices for apartments and to predict the expected gross monthly rental income. I did this for my local market (Ålesund) here in Western Norway, but also for Oslo. I also focused on only apartments because that was the most relevant to me. (And no one rents houses in Norway, everybody buys.)
There is a ton to talk about here, so I'll just dive right into it and try to give a somewhat concise overview. I hope this will spark an interesting conversation ;)
So, the biggest challenge has been data volume. It took me a long time to collect over 200 samples in my local market, which is part of the reason I resorted to Oslo (Norway's capital), where I have almost 2,500 samples. Still not a ton by ML standards, but useful.
I used a linear regression model (LR) and a neural network (NN) to model both values/prices and gross rental income. For Oslo, I am now down to a 6% median error rate (with the NN), and 90% of the test samples are within 18%-19% error at worst. And this is using only four features (input variables) for the model: living area, year built, and location (longitude and latitude). I am personally quite happy with that. The reason I only use those four is that the others (I have a total of 22 features collected, many very sparse) didn't improve my models, or even made their performance deteriorate.
I also use principal component analysis (PCA) as a very simple anomaly detection algorithm on the data set. I do this to help me automatically identify properties on the market which may be undervalued.
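Roughly, the idea looks like this; a simplified sketch with synthetic data and numpy, not my actual code. Listings with a large reconstruction error after projecting onto the top principal components don't fit the market's usual pattern:

```python
import numpy as np

def pca_anomaly_scores(X, n_components=2):
    """Reconstruction error after projecting onto the top principal components."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sigma                       # standardize features
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T                    # top principal directions
    Z_hat = Z @ V @ V.T                        # project and reconstruct
    return np.sqrt(((Z - Z_hat) ** 2).sum(axis=1))

# Toy market: price tracks living area, except one "mispriced" listing.
rng = np.random.default_rng(1)
sqft = rng.uniform(40, 150, 100)
price = 3000 * sqft + rng.normal(0, 5000, 100)
price[17] *= 0.5                               # suspiciously cheap listing
rooms_proxy = sqft * 0.9 + rng.normal(0, 1, 100)
X = np.column_stack([sqft, price, rooms_proxy])

scores = pca_anomaly_scores(X, n_components=1)
print(int(np.argmax(scores)))  # the mispriced listing stands out
```

In practice "anomalous" can of course also mean a data error rather than a deal, which is why I use it for data cleaning too.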
I know that Zillow's Zestimate gets butchered left, right and center. But truth be told, a 6% median error rate (also Zillow's national average) is quite good, and probably better than a lot of humans. On average, that is. Sure, it can be off by 20%. But humans can be too sometimes. At least I can. I would also like to point out that a 6% median error rate means that, on average, all the stuff that is NOT captured by the models' inputs (living area, location, year built) only accounts for 6% of the price.
But this still means that I can't blindly rely on just the models. I use them to help me identify and decide on potential deals. And I can happily report that I just bought my first investment property (a studio apartment right in the city center) using these methods. Personally, I think this is pretty remarkable, because before I started with any of this in March, I had NO clue about the local real estate market.
So, what's on the horizon? I have a lot of things I would love to try out. I would love to use image recognition methods to look at pictures of listings and detect things in them that influence the price (a nice kitchen, a pool, a pet elephant, etc.). It would also be great to use text recognition to do something similar. I also want to collect data on more markets to extend my investment horizon (Oslo is useless to me because it is way too hot and expensive, but Trondheim could be good).
Anyway, I would also definitely be up for a dedicated BP ML/AI group and to work together to improve our efforts and models! And I am more than willing to share more details. I'd like to point out, though, that I found ML and data science back in February, so I am far from an expert ;)
Oh, I forgot one thing I find very interesting personally:
As I mentioned, I have very little data on my local market (about 250 samples). That is really very little data by ML standards. I have experimented a bit with transfer learning, where I would train a NN on the much larger Oslo market (2,500 samples) and use part of the "pre-trained" network to look at the data from the local market. The reasoning is that, while prices and geography and construction years are obviously totally different, the NN may still learn interesting features and correlations which should also be useful elsewhere. So far, the testing has not been very conclusive, but I also didn't manage to implement everything 100% correctly.
So the point is: one could maybe pre-train on, say, ALL SFH in the entire US market, giving a huge data set, and then use that knowledge to further "specialize" on the local markets. I wouldn't be surprised if that is already part of what Zillow is doing by the way...
@Jason Ling I'm guessing it has something to do with data quality. When you start looking at sales prices, it's really hard to get insight into "condition". You could have an updated home sitting next to a non-updated home with extreme value differences. Those value differences are going to be hard to see from the outside (maybe not impossible), but they are known by appraisers, by agents who have been inside the property, and ultimately by the buyer as well. They could be great things like new A/C units, granite countertops, etc., or maybe someone got a "bargain" because the home had knob-and-tube wiring. If you look at older homes, how will an algorithm know if the electrical system has been replaced, if the plumbing is galvanized, etc.? Maybe you could "teach" your system to look at pictures, but that doesn't solve the problem if only a couple of pictures are posted.

That said, it would surprise me if the data (and resulting analysis) weren't far better if you were dealing with a relatively new subdivision, where you'd have comps, more similar wear-and-tear, and potentially less time for major updates/upgrades to have occurred. The catch-22 is that those are easy neighborhoods to comp out in the first place, so any algorithm is going to be less useful.
I don't know how you'd get access to the data, but I think an interesting exercise would be to look at the number of construction permits pulled, loans taken out, etc. over a given zip code. Then see if that can be correlated to some kind of "gentrification" process for an area. Somehow combine it with Google Maps to pull images and teach some AI engine how to identify a construction crew. It sounds crazy, but I'm guessing the latter is possible. Anyway, what do I know.
Originally posted by :
The catch-22 is that those are easy neighborhoods to comp out in the first place, so any algorithm is going to be less useful.
I'll have to disagree: a human can a.) only deal with a small number of data points and b.) only fathom very simple relationships between them (at least when it comes to the numbers).
An algorithm, on the other hand, can learn complicated mathematical relationships between all the variables and the price (including nonlinear ones that are especially hard for humans to deal with), can do so without bias (theoretically), and can take into account an enormous number of data points. For example, if your algorithm has learned exactly how prices change with location, there is no need to be restricted to only comps in the neighborhood. The same goes for year built, etc.
I do fully agree with everything else you've written, though!
@Severin Sadjina -Hey Severin, I'm just starting in ML and haven't gotten quite to neural networks yet. I was principally going to try regression techniques or ensemble methods for predicting home prices.
I do, however, have a pretty healthy data set. With little effort I was able to obtain a training set of 10k samples for about a 20 km^2 area. I believe I can get even more.
I tried a simple non-ML algorithm using median price/sqft, but my results were poor: although my median error rate was <5%, the standard deviation was 20%! (Assuming a normal distribution for the error rates, this is quite bad... almost useless!)
Is there a reason why you just stopped with linear regression (I don't know much about neural networks yet). Why not use ensemble methods?
Doesn't linear regression imply that there is a simple linear relationship between price and features?
@Andrew Johnson - I think that ML would be able to meet or beat blind appraisals (depending on the ML practitioner's ability). I think ML would be less useful if a significant portion of the information were hidden (e.g. required you to inspect the property visually).
But you do have a good point with the permitting, I had the same intuition.
That the age of the property and the number of permits may indicate the health of the property.
All of this information could be fed into a ML model.
The strength of ML is that the computer will be able to analyze thousands, if not tens of thousands, of deals per second, and it is possible to be notified of a deal within hours of it becoming available.
@Severin Sadjina You wrote:
>> So the point is: one could maybe pre-train on, say, ALL SFH in the entire US market, giving a huge data set, and then use that knowledge to further "specialize" on the local markets. I wouldn't be surprised if that is already part of what Zillow is doing by the way...
Yes, this is what I was thinking: subdividing regions into properties that behave the same way, maybe using some clustering technique.
Of course you would allow clusters or groups to evolve and change over time.
I think some powerful insights beyond the price of the property could be had! Something that human intuition alone could not guess at.
@Jason Ling I'll try to answer your questions:
1.) 10k samples sound pretty good! Is that only one type of real estate (SFH...?), or several? I have only looked at apartments, but I am pretty certain that other types of housing are sufficiently different to require using different models (or, at least, a more complicated one). But I'd maybe start with one area and one type anyhow. Better to go for the lowest hanging fruit first ;)
2.) What do you mean by "the median error was 5% but the standard deviation was 20%"?
3.) I didn't stop using linear regression; I still use both (LR and NN). I use both because sometimes one is better than the other, and because I can double-check and/or average.
4.) I don't know much about ensemble methods, and that's the only reason I haven't tried them yet. ...although I may already be doing something similar: I typically average the results of 30 or so NNs (same architecture but initialized with different weights) to give me a prediction. I also played around with averaging over different architectures and different parameters. But I haven't done enough testing really, and my implementations are probably a bit sloppy.
5.) Linear regression does imply that there is a linear relation between a feature and the output. However, you can construct your own features from the ones you have available (such as living area, location, ...), create nonlinear combinations etc. For example, I use the logarithm of the living area (because it follows a log distribution approximately), I may use the square root of the built year, and I use a few combinations of longitude and latitude including cosine, sine, and the Euclidean distance from the city center. As of today, I use a total of 10 constructed features from the four "original" ones (living area, year built, and coordinates). I found these through simply trying and/or by doing some statistical analysis on the raw sample data. I also always use the logarithm of the price as the label/output, because it too follows a log distribution approximately. If you haven't tried that yet, definitely do!
6.) The nice thing about NNs is that they are "automatic feature constructors" (they were in fact invented for that purpose). This means that they themselves learn which features and feature combinations are the most important, and how to use them. This doesn't always work flawlessly of course and you still need to know what you're doing (to choose the right dimensions and parameters etc.), but it's really pretty awesome! The NN architecture I use now has four direct inputs (log of living area, year built, coordinates), one hidden layer with 11 neurons, and a second hidden layer with 5 neurons. And a bit of regularization to make it generalize better. Again, I found this simply by playing around.
7.) And yes, I think some automatic clustering would be great to implement. It would probably reveal some cool insights, it would help deciding on which model to use for prediction, and it could also help with anomaly detection (for data cleaning and/or finding potential deals).
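To make point 5 concrete, here is a stdlib-only sketch of how such constructed features might look. The specific transforms and the city-center coordinates are illustrative, not my exact choices:

```python
import math

CITY_CENTER = (62.4722, 6.1549)  # illustrative lat/lon near Ålesund's center

def construct_features(living_area, year_built, lat, lon):
    """Turn the four raw inputs into nonlinear features for linear regression."""
    d_lat, d_lon = lat - CITY_CENTER[0], lon - CITY_CENTER[1]
    return [
        math.log(living_area),              # areas are roughly log-distributed
        math.sqrt(year_built),
        math.sin(math.radians(lat)), math.cos(math.radians(lat)),
        math.sin(math.radians(lon)), math.cos(math.radians(lon)),
        math.hypot(d_lat, d_lon),           # crude distance from the city center
    ]

features = construct_features(65, 1987, 62.47, 6.16)
# Train the linear model on these, with log(price) as the label.
```

With the label also in log space, the linear model effectively fits a multiplicative price model, which matches how housing prices actually behave.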
For anyone wanting to get some insight into what neural networks do and how feature designing/constructing works, definitely have a look at this: http://playground.tensorflow.org
Very well done and fun to play around with ;)
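As a hypothetical illustration of point 6 (synthetic data; the alpha value is just an example of L2 regularization): a plain linear regression on raw inputs cannot capture a nonlinear interaction, while a small NN of the architecture I described learns it by itself, with no hand-built features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (400, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1]   # nonlinear interaction of the raw inputs

lr = LinearRegression().fit(X, y)
nn = MLPRegressor(hidden_layer_sizes=(11, 5),  # two hidden layers: 11 and 5
                  alpha=1e-3,                  # L2 regularization strength
                  max_iter=3000, random_state=0).fit(X, y)

print(f"linear R^2: {lr.score(X, y):.2f}")  # near zero: no features to exploit
print(f"NN R^2:     {nn.score(X, y):.2f}")  # the NN builds the features itself
```

Of course, on real housing data the gap is smaller, because hand-constructed features (like the log and distance transforms above) give the linear model most of what it needs.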
>>10k samples sound pretty good! Is that only one type of real estate (SFH...?)
That's single family homes. In popular metro areas I am confident I can easily achieve 10k samples per 10 km^2, maybe more. But the older the sales data, the less likely it is to be described by the same model as homes sold today.
>>but I am pretty certain that other types of housing are sufficiently different to require using different models
My hope is to replace human intuition and bias with something more concrete. Right now we classify different residential assets based on a criteria of use and size but perhaps clustering will reveal classifications that people have not realized!
>>What do you mean by "the median error was 5% but the standard deviation was 20%"?
Say I created the model, and the model looks at a property whose price we are trying to estimate. It grabs all other properties sold within 6 months that are less than 0.5 miles away; I call this my comparison property set.
For each item in comparison property set I compute price/sqft and place these values into a list.
I pick the median from this list and multiply this median price/sqft against the sqft of the property that I am trying to estimate.
This yields a value H. I do this for all properties.
Then I compute the following, where H = predicted value and Y = actual value:
percentage_error = abs(1 - (H / Y))
I put these numerous percentage errors in a list.
I take the median and it yields ~5%
I perform a sample standard deviation on the list and it yields 20%.
I interpret this to mean that roughly half of my estimates are off by more than 5%, and the other half do better than 5%. How far off the worse half is, is hinted at by the 20% standard deviation.
I assumed that my error % was normally distributed (I did not check), and given that, my sigma (standard deviation) is horrendous. If it were something like 2% or 3%, I would be very happy with this model and would question my motivation for seeking a better model using ML.
Note: I did compute the error rate for various distances and history lengths (e.g. I used houses within 0.25 miles, 0.5 miles, ... 10 miles), and 5% was the best.
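The whole baseline above can be sketched in a few lines of stdlib Python. The comps selection by distance and date is skipped here (each property simply uses all the others as comps), and the sales figures are made up:

```python
import statistics

# Toy sales: (sqft, price). Varying price per sqft drives the error spread.
sales = [(800, 240_000), (1000, 310_000), (1200, 330_000),
         (950, 300_000), (1100, 352_000), (1300, 377_000)]

errors = []
for i, (sqft, actual) in enumerate(sales):
    comps = [p / s for j, (s, p) in enumerate(sales) if j != i]  # price per sqft
    predicted = statistics.median(comps) * sqft                  # the value H
    errors.append(abs(1 - predicted / actual))                   # |1 - H/Y|

print(f"median error: {statistics.median(errors):.1%}")
print(f"std dev:      {statistics.stdev(errors):.1%}")
```

Even on this tiny toy set you can see the pattern: the median error looks respectable while individual properties (the 1200 sqft one here) are off by far more, which is exactly the wide-sigma problem.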
I'll have to look more into NN but my first approach would probably be to use some ensemble method.
As I understand it, an ensemble uses multiple weak learners together to yield a strong model. I myself do not know much about it and have some studying to do.
My hope is that an ensemble would effectively create a custom hypothesis for me; right now the biggest challenge I see is picking a good hypothesis H(x)! I want a good way to pick H(x)!
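I haven't studied ensembles properly yet, but a first sketch with scikit-learn's gradient boosting might look like this; the data is synthetic and the feature choices and parameters are illustrative. Each shallow tree is a weak learner, and boosting sums their corrections into the final hypothesis:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
# Synthetic listings: sqft, year built, distance from the city center (miles).
X = np.column_stack([rng.uniform(600, 2500, 500),
                     rng.uniform(1900, 2020, 500),
                     rng.uniform(0, 10, 500)])
price = (150 * X[:, 0] + 500 * (X[:, 1] - 1900) - 8000 * X[:, 2]
         + rng.normal(0, 10_000, 500))

# 200 shallow trees (the weak learners), combined by boosting.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  random_state=0).fit(X[:400], price[:400])
r2 = model.score(X[400:], price[400:])   # R^2 on the held-out 100 listings
print(f"held-out R^2: {r2:.2f}")
```

The appeal for our problem is exactly what you describe: you never hand-pick a single H(x), the boosting procedure assembles one from many tiny ones.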
Also, did you write your own algorithms or did you end up using pre-canned ones, e.g. scikit-learn?
I'm probably going to use Octave to develop my approach, then rewrite it in Python for better performance.
If I really need to scale then I think I'll start looking into Go or maybe even C/C++ .
Wow... I didn't know there were this many IT and software engineers on the site. Anyway, I think there is a ton of potential for software to disrupt some things in real estate. Even the smart home concept is still in its very early stages. I would think whoever can solve some major problems within real estate via software will end up with a ton of money.

Zillow estimates are actually off 95% of the time, either overestimating or underestimating. But that's OK, because it means either they can fix it or someone will develop a program and take some market share from Zillow. Also, whoever can figure out a way to provide very reliable and solid sold comps, for either residential or commercial, will provide a ton of value as well. I'm actually surprised how many agents don't know how (or are just too lazy) to provide sold comps for people.

Which brings me to my next point: I honestly think that over time, maybe within 15 years, agents' job duties will be over 70% automated. If by some miracle someone can develop a program to see current pics of the insides of homes via VR or some kind of technology, I would be very scared if I were an agent. But then again, software/digital tech will replace a lot of jobs.
If you're interested in applying ML techniques to property valuation I'd highly recommend getting involved with this Kaggle competition:
Good clean data to work with plus a great community of folks to talk about the problem with.
(I am not associated with Kaggle)
Thanks! I'll keep an eye on the Kaggle comp.
>> If by some miracle someone can develop a program to see current pics of the insides of homes via VR or some kind of technology, I would be very scared if I were an agent.
Affordable VR is just starting to become available with Google Daydream. Commodity 360-degree cameras are available on the cheap these days; I think Samsung sells one for ~$150, and GoPro is releasing one soon.
Superficially, it seems like you might be able to use Project Tango to create a 3D model of the inside of the home... sort of.
You still have to provide textures and things - so I think real estate agents are safe for at least a decade.
>>But then again software/digital tech will replace a lot of jobs.
Yes, but truck drivers, taxi drivers, radiologists and grocery store clerks have a lot more to worry about in the short term.
Truck drivers/Taxi Drivers = Everyone and their mother is going for level 5 autonomous driving. I'd expect to see some big things in <7 years.
Radiologists - Seems like easy pickings for ML. Radiologists would still exist, you'd just need far fewer of them to do the same amount of work.
Grocery stores - Last I heard, Amazon set up an "employee-less" grocery store pilot in Seattle some time last year. You go in, pick up your groceries and leave, and you get billed for what you bought. Combine that with the fact that Amazon just bought Whole Foods. Honestly, Amazon scares me; they're definitely not the good guys, they are ruthless.
>> My hope is to replace human intuition and bias with something more concrete. Right now we classify different residential assets based on a criteria of use and size but perhaps clustering will reveal classifications that people have not realized!
Exactly, that is one of the very cool things about ML! But I think for starters you can assume that, for example, condos will need different modelling than multi-family units. I have also started thinking about even having different models for the low-end, mid-range, and high-end parts of any given market. Especially the high-end part often shows totally skewed prices! It seems to me that people basically pay double for the extra standard and views, etc.
>> I interpret this to mean that roughly half of my estimates are off by more than 5%, and the other half do better than 5%. How far off the worse half is, is hinted at by the 20% standard deviation.
...I do not think you can do it that way, but I can't quite tell you why right now. But let's say the median value of your relative error (as you compute it) is 5%. Now, what if the standard deviation of the list of errors gave you 0%? What would that mean?
Rather look at residuals and normalized residuals and make sure there is no systematic bias (there will definitely be with that simple model).
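A tiny illustration of that check, with made-up (predicted, actual) pairs: if the signed relative residuals all lean one way, the model is systematically biased, even when each individual error looks small.

```python
import statistics

# Hypothetical (predicted, actual) pairs from some valuation model.
pairs = [(310_000, 300_000), (420_000, 400_000), (515_000, 500_000),
         (262_000, 250_000), (612_000, 600_000)]

residuals = [(h - y) / y for h, y in pairs]   # signed, normalized residuals
bias = statistics.mean(residuals)
spread = statistics.stdev(residuals)
print(f"mean relative residual: {bias:+.1%}")  # > 0: consistent overestimation
print(f"spread:                 {spread:.1%}")
```

Taking absolute values (as in the median-error metric) throws the sign away, which is exactly why it can hide this kind of systematic overestimation.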
>> As I understand it ensemble is to use multiple weak learners in a network to yield a strong model. I myself do not know much about it and I have some studying to do.
Same here. I would guess that this is a great approach for our problem!
>> Also, did you write your own algorithms or did you end up using pre-canned ones, e.g. scikit-learn? I'm probably going to use Octave to develop my approach, then rewrite it in Python for better performance.
I am actually also using Octave (because of the Coursera ML course I started with). I feel that, as a theoretical physicist, I need to really understand what's going on. Hence, the low level algorithm design with Octave. But switching to Python and Tensorflow has been on my todo list for quite a while now. Especially with deep learning, it'll make things so much easier and faster!
>> I think there is a ton of potential for software to disrupt some things in real estate. Even the smart home concepts are still in it's very early stages. I would think whoever can solve some major problems within real estate via software will end up with a ton of money.
I agree 100%!
>> Zillow estimates are actually off 95% of the time, either overestimating or underestimating.
I am not sure I follow. I mean, let's say your model predicts a price 1 cent over the actual price, for a $500,000 house. Would you still count that as an overestimate?
Zillow cites a 5% median error rate (I believe nationwide), which means that half of all prices are off by only 5% or less (which is very good!), and the other half are off by more than 5%. Zillow's estimate seems pretty good for Phoenix, where two-thirds of all properties are predicted correctly within 5% (see https://www.zillow.com/zestimate/#acc). I honestly doubt that the average appraiser or agent can do better!