What I Learned From 25 Years of Machine Learning
Here is what I learned from practicing machine learning in business settings for over two decades, and prior to that in academia. Back in the nineties, it was known as computational statistics in some circles, and some problems, such as image analysis, were already popular. Of course, a lot of progress has been made since, thanks in part to the power of modern computers, the cloud, and large data sets now being ubiquitous. The trend has evolved towards more robust, model-free, data-driven techniques, sometimes designed as black boxes: for instance, deep neural networks. Text analysis (NLP) has also seen substantial progress. I hope that the advice I provide below will be helpful in your data science job.
11 pieces of advice
- The biggest achievement in my career was to automate most of the data cleaning / data massaging / outlier detection and exploratory analysis, allowing me to focus on tasks that truly justified my salary. I had to write a few reusable scripts to take care of that, but it was well worth the effort (see the first sketch after this list).
- Be friends with the IT department. In one company, much of my job consisted of producing and blending various reports for decision makers. I got it all automated (which required direct access via Perl code to sensitive databases) and I even told my boss about it. He said that I did not work a lot (compared to the hard workers), but he understood, and was happy to always receive the reports on time, automatically delivered to his mailbox, even when I was on vacation.
- Leverage APIs. In one company, a big project consisted of creating and maintaining a list of the top 95% of keywords searched for on the web, and attaching a value / yield to each of them. The list had about one million keywords. I started by querying internal databases, then scraping the web, and developing yield models. There was a lot of NLP involved. Until I found out that I could get all that information from Google and Microsoft by accessing their APIs. It was not free, but not expensive either, and initially I used my own credit card to pay for the services, which saved me a lot of time. Eventually my boss adopted my idea, and the company reimbursed me for these paid API calls. They continued to use them, under my own personal accounts, long after I was gone.
- Document your code, your models, and every core task you do, with enough detail, and in such a way that other people understand your documentation. Without it, you might not even remember what a piece of your own code is doing 3 years down the road, and may have to re-write it from scratch. Use simple English as much as possible. It is also good practice, as it will help you train your replacement when you leave.
- When blending data from different sources, adjust the metrics accordingly, for each data source; metrics are unlikely to be fully compatible, and some may be missing, as things are probably measured in different ways depending on the source. Even over time, the same metric in the same database can evolve to the point of no longer being compatible with historical data. I actually have a patent that addresses this issue.
- Be wary of job interviews for a supposedly wonderful data science job requiring a lot of creativity. I was misled quite a few times; the job eventually turned out to be a coding job. It can be a dead-end, boring job. I like doing the job of a software engineer, but only as long as it helps me automate and optimize my tasks.
- Working remotely can have many rewards, especially financial ones. Sometimes it also means less time spent in corporate meetings. I had to travel every single week between Seattle and San Francisco, for years. I did not like it, but I saved a lot of money (not least because there is no state income tax in Washington, and real estate is much less expensive). Also, walking from your hotel to your workplace is less painful than commuting, and it saves a lot of time. Nowadays telecommuting makes it even easier.
- Embrace simple models. Use synthetic or simulated data to test them. For instance, I implemented various statistical tests, and used artificial data (often from number theory experiments) to fine-tune and assess the validity of my tests / models on datasets for which the exact answer is known (see the second sketch after this list). It was a win-win: working on a topic I love (experimental and probabilistic number theory) and at the same time producing good models and algorithms with applications to real business processes.
- Being a generalist rather than a specialist offers more career opportunities, within your company (horizontal move) or anywhere else. You still need to be an expert in at least one or two areas. As a generalist, it will be easier for you to become a consultant or start your own company, should you decide to go that route. It may also help you understand the real problems that decision makers are facing in your company, and build a better, closer relationship with them, or with any department (sales, finance, marketing, IT).
- In data we trust. I disagree with that statement. I remember a job at Wells Fargo where I was analyzing user sessions of corporate clients doing online transactions. The sessions were extremely short. I decided to have my boss do a simulated session with multiple transactions, and analyze it the next day. It turned out that the session was broken down into multiple sessions, as the tracking service (powered by Tealeaf back then) started a new session any time an HTTP request (by the same user) came from a different server, that is, pretty much for every user request. The Tealeaf issue was fixed once Wells Fargo reported it, and I am sure this was my most valuable contribution at the bank. In a different company, reports from a third party were totally erroneous, missing most page views in their count: it turned out that their software was truncating every URL that contained a comma, a glitch caused by bad programming by some software engineer at that third-party company, combined with the fact that 95% of our URLs contained commas. If you miss these massive glitches (even though in some ways it is not your job to detect them), your analyses will be totally worthless. One way to detect such glitches is to rely on more than one single data source.
- Get very precise definitions of the metrics you are dealing with. The fact that there is so much fake news nowadays is probably because the concept of fake news has never been properly defined, rather than a data / modeling issue.
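Below is a minimal sketch of the kind of reusable data-cleaning / outlier-flagging script mentioned in the first bullet. It is illustrative only, not my actual code: the column names, the pandas-based approach, and the z-score threshold are all assumptions made for the example.

```python
# Minimal sketch of an automated data-quality pass (hypothetical column names).
import numpy as np
import pandas as pd

def clean_and_flag(df: pd.DataFrame, numeric_cols: list[str], z_thresh: float = 3.0) -> pd.DataFrame:
    """Basic automated cleaning: trim strings, drop exact duplicates,
    and flag numeric outliers with a simple z-score rule."""
    out = df.copy()
    # Normalize text columns: strip whitespace, unify missing-value markers.
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].str.strip().replace({"": np.nan, "N/A": np.nan})
    # Drop exact duplicate rows.
    out = out.drop_duplicates()
    # Flag (rather than silently drop) numeric outliers for later review.
    for col in numeric_cols:
        mu, sigma = out[col].mean(), out[col].std()
        out[f"{col}_outlier"] = (out[col] - mu).abs() > z_thresh * sigma
    return out

if __name__ == "__main__":
    raw = pd.DataFrame({
        "region": [" east", "west", "west", "north", "south"],
        "revenue": [100.0, 102.0, 98.0, 101.0, 9000.0],
    })
    # Low threshold only because this toy dataset is tiny.
    print(clean_and_flag(raw, numeric_cols=["revenue"], z_thresh=1.5))
```

Flagging outliers instead of dropping them keeps the exploratory analysis honest: a human (or a later step) still decides what to do with the flagged rows.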
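And here is a minimal sketch of testing a statistical procedure on simulated data for which the exact answer is known, as mentioned in the bullet about simple models. The one-sample t-test, the Gaussian data, and the parameter values are assumptions chosen for illustration, not my number-theory experiments; the point is simply to check the empirical type I error against the nominal level, and to estimate power under a known alternative.

```python
# Minimal sketch: validate a test on simulated data where the truth is known.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 50, 2000

# Under the null (true mean = 0) the rejection rate should be close to alpha.
null_rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    null_rejections += p < alpha
print(f"Empirical type I error: {null_rejections / n_sims:.3f} (nominal {alpha})")

# Under a known alternative (true mean = 0.5) the rejection rate estimates power.
alt_rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.5, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    alt_rejections += p < alpha
print(f"Empirical power at mean=0.5: {alt_rejections / n_sims:.3f}")
```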
About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at DataShaping.com, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). He recently opened Paris Restaurant, in Anacortes.