code to live or live to code

Tuesday, April 12, 2016

Data Scientist Interview Notes

Problem solving skill:

What is binary search?
Write a function to return the next smallest node.
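
For the binary search question, a minimal sketch of one possible answer is shown below. The function name binary_search and the choice to return None for a missing value are my own assumptions, not something the interviewer specified. The input must already be sorted.

```python
from typing import Optional, Sequence


def binary_search(items: Sequence[int], target: int) -> Optional[int]:
    """Return an index of target in the sorted sequence items, or None if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1  # target can only be in the right half
        else:
            hi = mid - 1  # target can only be in the left half
    return None
```

Each step halves the remaining search range, which is why the lookup takes O(log n) comparisons instead of O(n).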

Statistics skill:

Given two datasets a (a1 ... an) and b (b1 ... bm),
how do you determine whether a and b come from the same population or from separate populations?
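
One possible way to approach this question (among several, such as a t-test or a Kolmogorov-Smirnov test) is a permutation test on the difference of means. The sketch below is only an illustration: the function name permutation_test, the number of rounds, and the fixed seed are my own assumptions. It compares means only, so two populations with equal means but different shapes would not be told apart.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(a: Sequence[float], b: Sequence[float],
                     rounds: int = 10_000, seed: int = 0) -> float:
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n = len(a)
    extreme = 0
    for _ in range(rounds):
        # Under the null hypothesis the labels a/b are interchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (rounds + 1)
```

A small p-value (say, below 0.05) suggests the two samples are unlikely to come from the same population; a large one means the data give no evidence of a difference.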

Thursday, March 24, 2016

Avro vs Parquet: Apples vs Oranges!

Comparing Avro and Parquet is like comparing apples and oranges!
By definition, Avro is a data serialization system, while Parquet is a data storage format.
The format in which data is stored on disk or sent over the network is different from the format in which it lives in memory. The process of converting data in memory into a format in which it can be stored on disk or sent over the network is called serialization. The reverse process is called deserialization.
Data can be serialized in two main kinds of format: text and binary. Examples of text formats are CSV, XML, and JSON. Avro is a binary format for data serialization.
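
To make the text vs binary distinction concrete, here is a small sketch using only the Python standard library. It does not use Avro itself: the record, its field names, and the fixed binary layout are made up for illustration, and Avro's real encoding is schema-driven and different from this.

```python
import json
import struct

# A hypothetical in-memory record
record = {"id": 42, "price": 19.99}

# Text format: human-readable JSON, encoded to bytes for disk or network
text_bytes = json.dumps(record).encode("utf-8")
restored_text = json.loads(text_bytes.decode("utf-8"))  # deserialization

# Binary format: little-endian 4-byte int followed by an 8-byte double.
# The reader must know this layout (the "schema") to decode the bytes.
binary_bytes = struct.pack("<id", record["id"], record["price"])
rid, price = struct.unpack("<id", binary_bytes)  # deserialization
restored_binary = {"id": rid, "price": price}

print(len(text_bytes), len(binary_bytes))
```

The binary form is more compact because the field names and punctuation are not stored with every record, at the cost of not being readable without the layout.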
So if you're looking for a way to store data compactly in the Hadoop ecosystem, then Parquet, a columnar storage format with built-in compression, is the better fit.
For more info, visit https://parquet.apache.org.

Friday, March 18, 2016

Some Hints on How to Prepare for a Big Data Data Scientist Job Interview

Here are some questions that may be asked in a job interview for a Data Scientist (Big Data platform) position. They break down into two areas: technical knowledge/skill and communication skill.

Technical area:
What is the difference between a regression problem and a classification problem?
How do you do feature selection?
How do you assess the accuracy of your model?
What data would you use, and how would you build a model, for inventory prediction?
How did you come to know and start using Spark?
What performance tuning have you tried for Spark?
What are the benefits of, and differences between, Big Data file formats such as Avro, Parquet, ORC, etc.?
Why are small files bad for Hadoop?

Communication:
How do you explain the concept of Machine Learning to a first grader?
How do you explain to your business partners when you miss the deadline?
How do you handle the situation when your business partners reject your model?

Tuesday, December 16, 2014

Ubuntu Install - Error invalid arch-independent ELF magic

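# List the partitions to find the boot partition
sudo fdisk -lu /dev/sda
# Mount it (this assumes /dev/sda1 is the boot partition)
sudo mount /dev/sda1 /mnt
# Reinstall GRUB on the disk, then reboot
sudo grub-install --root-directory=/mnt /dev/sda
sudo reboot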

Enjoy your newly installed Ubuntu!

Windows 7: get pc serial number from command line

To get your machine's serial number in Windows 7, you can try either of these commands:
wmic bios get serialnumber
wmic csproduct get identifyingnumber
Have fun without looking at the back of your PC to find the serial number!

Wednesday, November 12, 2014

Flume and rolling into a big file

It took me a while to figure out how to stop Flume from rolling small files and make it roll into one big file instead (e.g. 128 MB). I read two books on Flume and neither of them showed me how to accomplish that! Even the Flume docs don't show you how to do it.
I started Googling and had to put the pieces together. Here's the working config for Flume to roll at the file size you want. This config is for a JMS queue. The magic is "JmsAgent.sinks.HDFS.hdfs.minBlockReplicas = 1"!

JmsAgent.sources = JmsSrc
JmsAgent.channels = MemChannel
JmsAgent.sinks = HDFS

JmsAgent.sources.JmsSrc.type = jms
JmsAgent.sources.JmsSrc.initialContextFactory = ...
JmsAgent.sources.JmsSrc.connectionFactory = ...
JmsAgent.sources.JmsSrc.providerURL = ...
JmsAgent.sources.JmsSrc.destinationName = ...
# default batchSize = 100
JmsAgent.sources.JmsSrc.batchSize = 500
JmsAgent.sources.JmsSrc.destinationType = QUEUE

JmsAgent.sinks.HDFS.type = hdfs
JmsAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
JmsAgent.sinks.HDFS.hdfs.path = hdfs://host/path/%Y-%m-%d
JmsAgent.sinks.HDFS.hdfs.filePrefix = jms_sample
JmsAgent.sinks.HDFS.hdfs.fileType = DataStream
JmsAgent.sinks.HDFS.hdfs.writeFormat = Text
JmsAgent.sinks.HDFS.hdfs.batchSize = 10000
# 256 MB = 268435456
#JmsAgent.sinks.HDFS.hdfs.rollSize = 268435456
# 128 MB = 134217728
JmsAgent.sinks.HDFS.hdfs.rollSize = 134217728
# 0 disables rolling by event count
JmsAgent.sinks.HDFS.hdfs.rollCount = 0
# 0 disables rolling by time (the default rollInterval is 30 seconds)
JmsAgent.sinks.HDFS.hdfs.rollInterval = 0
JmsAgent.sinks.HDFS.hdfs.idleTimeout = 3600
JmsAgent.sinks.HDFS.hdfs.minBlockReplicas = 1

JmsAgent.channels.MemChannel.type = memory
JmsAgent.channels.MemChannel.capacity = 11000
JmsAgent.channels.MemChannel.transactionCapacity = 10000

JmsAgent.sources.JmsSrc.channels = MemChannel
JmsAgent.sinks.HDFS.channel = MemChannel
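
The rollSize values in the config are given in bytes. If you want a different size, a tiny helper like the one below (the name megabytes_to_bytes is just my own, it is not part of Flume) does the arithmetic:

```python
def megabytes_to_bytes(megabytes: int) -> int:
    """Convert megabytes (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


print(megabytes_to_bytes(128))  # 134217728
print(megabytes_to_bytes(256))  # 268435456
```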

Have fun with Flume!

Typesafe Activator behind proxy

As I started playing with Scala, I needed to run Typesafe Activator. I tried to run activator ui and got an error. It turned out that Activator needs to download dependencies from Maven repositories, and the download was blocked by the firewall.
Here is how to configure the proxy for Activator: https://typesafe.com/activator/docs
Now I can have some fun with Scala and Activator!