Thursday, March 24, 2016

Avro vs Parquet: Apple vs Orange!

Comparing Avro vs Parquet is like comparing apples vs oranges!
By definition, Avro is a data serialization system while Parquet is a data storage format.
The format in which data is stored on disk or sent over the network is different from the format in which it lives in memory. The process of converting data in memory into a format in which it can be stored on disk or sent over the network is called serialization. The reverse process is called deserialization.
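To make that concrete, here is a minimal sketch of serialization and deserialization using Python's built-in pickle module (just an illustration of the concept, not Avro itself):

```python
import pickle

# An in-memory Python object (a dict of native types).
record = {"name": "alice", "age": 30, "scores": [92.5, 88.0]}

# Serialization: convert the in-memory object into bytes
# that can be written to disk or sent over the network.
serialized = pickle.dumps(record)
print(type(serialized))  # <class 'bytes'>

# Deserialization: reconstruct the in-memory object from the bytes.
restored = pickle.loads(serialized)
print(restored == record)  # True
```

Avro does the same job, but with a language-neutral binary encoding and a schema, so data written from one program can be read by another.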
Data can be serialized using two main kinds of formats: text formats and binary formats. Examples of text formats are CSV, XML, and JSON. Avro is a binary format for data serialization.
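A quick way to see the difference between text and binary serialization is to encode the same numbers both ways. The sketch below uses JSON (text) and Python's struct module (raw binary); the specific values are made up for illustration:

```python
import json
import struct

values = [1.5, 2.25, 3.125, 4.0625]

# Text serialization: human-readable, but each digit costs a byte.
text_bytes = json.dumps(values).encode("utf-8")

# Binary serialization: each value packed as a 4-byte IEEE 754 float.
binary_bytes = struct.pack(f"{len(values)}f", *values)

print(len(text_bytes))    # 26 bytes of text
print(len(binary_bytes))  # 16 bytes of binary
```

Binary formats like Avro are generally more compact and faster to parse, at the cost of not being directly readable by humans.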
So if you're looking for a way to store data compactly in the Hadoop ecosystem, Parquet is a great option: its columnar layout groups similar values together, which compresses very well.
For more info, visit https://parquet.apache.org.

Friday, March 18, 2016

Some Hints on How to Prepare for a Big Data Data Scientist Job Interview

Here are some questions that may be asked in a Data Scientist (Big Data platform) job interview. They break down into two areas: technical knowledge/skills and communication skills:

Technical area:
What are the differences between regression and classification problems?
How do you do feature selection?
How do you assess the accuracy of your model?
What data would you use, and how would you build a model, for inventory prediction?
How did you come to know about and start using Spark?
What performance tuning have you tried for Spark?
What are the benefits/differences of Big Data file formats such as Avro, Parquet, ORC, etc.?
Why are small files bad for Hadoop?
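For the question about assessing model accuracy, it helps to be able to write the basic metrics from scratch. Here is a minimal sketch with made-up labels and predictions for a binary classifier:

```python
# Hypothetical true labels and predictions from a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: fraction of predictions that match the true labels.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Precision and recall, computed from true/false positives and negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy)   # 6 of 8 correct -> 0.75
print(precision)  # 3 / (3 + 1)    -> 0.75
print(recall)     # 3 / (3 + 1)    -> 0.75
```

In an interview, be ready to explain when accuracy alone is misleading (e.g. on imbalanced classes) and why precision/recall matter.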


Communication:
How do you explain the concept of Machine Learning to a first-grade kid?
How do you explain to your business partners that you will miss a deadline?
How do you handle the situation when your business partners reject your model?