If you are learning data science from a book or a course, you most likely have a fairly narrow view of the field. You only get the views and experiences of the person or small group of people that produced the resource. Books and courses are fantastic, but don’t forget to broaden your horizons. Getting data science tips from a wide variety of data scientists is a must.
What useful things have you learnt on the job/in practice that you don’t learn from books and courses?
This is the question I recently asked the users of r/datascience on Reddit and boy did they deliver. The question took off and attracted responses from all kinds of practicing data scientists. I ended up picking up a fair few golden nuggets from the thread. So, without further ado, here are the top five data science tips from the lovely people at r/datascience.
DON’T ASK WHAT DATA PEOPLE WANT
If you’re helping someone solve a problem, don’t ask what data they want, ask how they intend to use your output. That will help guide your work. – rfix
This is something I wish I had known a little earlier. Blindly following instructions on what data someone wants is not an effective way to work. It’s much easier to direct your analysis if you know the end goal. Knowing only the steps that someone thinks are needed to get there isn’t much help, and those steps may be wrong.
EVERYTHING A USER ENTERS IS RUBBISH
When allowing a user to enter data, where and when possible, only allow entry through drop-downs or pop-ups. This is especially important with date entry. I’ve had a user enter 2/31/xxxx for several dates and the system ended up puking all over itself due to an error I didn’t account for.
Assume that everything that the user has entered is garbage and then proceed from there; Sanitize your data or you will have multiple entries with various spellings for the same vendor.
If you are gathering requirements for a project and the user tells you “That will never happen”, then you better program for it because IT WILL HAPPEN. – jeffrey_f
This is very true. My workplace had been collecting data when users signed up and let them freely enter a category. The result was a mass of unusable data. Some entries were just plain invalid, and tonnes of categories ended up with only one user in them. Not good, not good at all.
Where possible, have a set selection that the user can choose from. Also, users don’t like choice; choice is difficult. In terms of application usability, this makes sense too. Remember: Don’t Make Me Think!
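When free-text entry can't be avoided, the "assume it's garbage" advice can be sketched in a few lines of Python. This is only a minimal illustration, not what my workplace actually runs: the category list and the function names `parse_date` and `clean_category` are made up for the example, and I'm assuming dates arrive as MM/DD/YYYY strings.

```python
from datetime import datetime

# Hypothetical fixed list of categories the user may choose from.
ALLOWED_CATEGORIES = {"student", "professional", "hobbyist"}


def parse_date(text):
    """Return a date for MM/DD/YYYY input, or None if the date is invalid."""
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except ValueError:
        # Impossible dates such as 2/31 land here instead of crashing later.
        return None


def clean_category(text):
    """Normalise whitespace and case, then accept only known categories."""
    key = " ".join(text.split()).lower()
    return key if key in ALLOWED_CATEGORIES else None
```

Anything that comes back as `None` can be rejected or flagged for review before it ever reaches your dataset.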
DOMAIN KNOWLEDGE IS KEY
With domain knowledge, data scientists can craft useful features much quicker. – exthrash
Many times that I have been stuck on a problem, it turned out that I just didn’t understand the data properly because I was missing domain knowledge. As soon as I consulted with someone who knew the system, the business and the user base well, the path instantly became clear.
If there is someone you can talk to who knows a lot about the data or the system you are analysing, talk to them. You will be able to understand, and therefore handle, the data much more easily.
YOU MUST BE ABLE TO COMMUNICATE AND REPRODUCE RESULTS
Analysis is a waste of time if you can’t communicate and reproduce your results. GitHub and Jupyter are required tools of the trade. – morgango
You’re probably not the person who is going to be using the results you have found. This means that it is important to be able to communicate and transfer your results to someone else, often a non-technical person. Being able to simplify your lingo and clearly convey your findings to another person is key.
Also, of course, results must be reproducible. First off, if you can’t reproduce your results, you should be questioning whether they are actually valid or whether you made a mistake. Secondly, an analysis often has to be run again at a later time. Rerunning it lets you compare results, measure the progress of a company and see how things have changed.
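One small habit that helps with reproducibility is fixing the random seed whenever an analysis involves sampling. Here is a rough sketch using only the standard library. The function name `sample_mean` and its parameters are invented for illustration, and real projects will also need to pin data versions and package versions.

```python
import random


def sample_mean(values, sample_size, seed):
    """Mean of a random sample; the same seed always picks the same sample."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    sample = rng.sample(values, sample_size)
    return sum(sample) / sample_size


data = list(range(100))
first_run = sample_mean(data, sample_size=10, seed=42)
second_run = sample_mean(data, sample_size=10, seed=42)
print(first_run == second_run)  # the two runs agree because the seed is fixed
```

Record the seed alongside the results so that whoever reruns the analysis later can get the same sample.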
AIM FOR CLARITY OVER SPEED
Many times it’s just not worth the effort of using the correct algorithm in terms of performance. I usually use the simplest algorithm since clarity is almost always more important than speed. – log_2
As I am primarily a Python programmer this really hits home with me. Clarity in your code is so, so important. Sometimes scripts need to be changed or refactored and the like. If you can’t quickly read, navigate and understand your code a month after writing it, then in my opinion you have a problem.
A month after writing some code, it may as well have been written by someone else. So, when you are creating your programs and scripts, make sure you code in a way that would let any other programmer who sits down and looks at your code instantly understand it. Speed is good, but with modern computers it’s often not a problem. Obviously there are times when speed is needed, but usually, clarity is more important.
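To make that trade-off concrete, here is a small example of my own (not from the Reddit thread) that finds the largest few values in a list. Sorting everything does more work than strictly necessary, and `heapq.nlargest` from the standard library avoids the full sort, but the sorted version is the one most readers will understand at a glance. The function name `top_n` is just for illustration.

```python
import heapq


def top_n(values, n):
    """Return the n largest values, biggest first, by sorting everything."""
    return sorted(values, reverse=True)[:n]


scores = [3, 1, 4, 1, 5, 9, 2, 6]

# Clear version: sort, then slice.
print(top_n(scores, 3))

# Leaner version for large inputs: same result without a full sort.
print(heapq.nlargest(3, scores))
```

For a list this size the difference in speed is irrelevant, so I would reach for the first version and only switch if profiling showed a real problem.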
And there we have it, some really great tips on conducting data science that you don’t learn from books and courses. Thanks to all the people who responded to this question! You can view the original Reddit post here: Data Science with Python – What useful things have you learnt on the job/in practice that you don’t learn from books and courses?.
What’s your top data science tip? Comment below!