A local tech company, Tresata, recently sponsored a 24-hour hackathon in Charlotte focused on big data in the retail space. It was a ton of fun. I think I was the only person who brought along their own monitor, but I was quite prepared 🙂
My experience with statistical analysis, Python, and Jupyter was quite limited, so I decided to focus my attention on building a mobile-focused use case with some big data integration, and I ended up winning the innovation award: a pretty cool pair of Google Glasses 🙂
Here is the app I developed and presented:
And here are all my scribbled notes/commands/datasets from the event:
emacs test.scala
eval-tool test.scala --hdfs --input bsv%/sample/data_with_headers/hackathon_data_headers --output bsv%upc_counts
hadoop fs -ls upc_counts
hadoop fs -cat upc_counts/part-* > upc_counts
wc -l upc_counts
less upc_counts
python2.6 -m SimpleHTTPServer 8889
import com.twitter.scalding._
import com.tresata.scalding.Dsl._
import com.tresata.scalding.util.ScaldingUtil
import org.joda.time.format.DateTimeFormat

(args: Args) => {
  new Job(args) {
    val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
    // per-product purchase stats for a single household
    val likelyProducts = ScaldingUtil.sourceFromArg(args("input"))
      // keep only the transactions for the requested household
      .filter('HHID) { HHID: String => HHID == args("homeid") }
      .groupBy('HHID, 'UPC_NUMBER, 'ITEM_DESCRIPTION) { group =>
        group.size('PurchaseCount)                      // times purchased
          .average('ITEM_QUANTITY -> 'AvgQtyOrdered)    // average quantity per purchase
          .max('TRANSACTION_DATETIME -> 'LastPurchased) // most recent purchase date
          .average('DISCOUNT_QUANTITY)                  // average coupons applied
          .average('EXPRESS_LANE)                       // share of express-lane purchases
      }
      .write(ScaldingUtil.sourceFromArg(args("output")))
  }
}
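Note that this job reads a homeid argument, so running it needs one more flag than the eval-tool command above. A sketch of the full invocation, assuming Scalding's usual --key value argument parsing (the output name and household id here are made up for illustration):

eval-tool test.scala --hdfs --input bsv%/sample/data_with_headers/hackathon_data_headers --output bsv%likely_products --homeid 12345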
User Creation
http://dev.hackathonclt.org:5000/
- Note: user creation takes ~5 min to propagate across the cluster
- If you have any DNS issues, the direct IPs are:
- slave01 204.15.96.198
- slave02 204.15.96.195
- slave03 204.15.96.196
- slave04 204.15.96.197
- And user creation at: 192.168.1.115:5000
Machines
Machines available to work on:
> slave01.hackathonclt.org
> slave02.hackathonclt.org
> slave03.hackathonclt.org
> slave04.hackathonclt.org
Please spread yourselves out across the machines
NOTE: if you have any DNS issues, speak to staff
Getting Started
SSH into one of the servers, where you can access the retail data stored on the hackathon HDFS cluster:
> ssh username@slave01.hackathonclt.org
and enter the password you specified during user creation.
We made Hive, Spark, and pySpark command-line interfaces available, and included a tool to compile and run simple Scalding scripts on-the-fly.
Hive
Give Hive a whirl and run a sample query:
> hive
Try pasting the following query into the hive command-line interface:
hive> select UPC_NUMBER, ITEM_DESCRIPTION, DEPARTMENT_DESCRIPTION, EXTENDED_PRICE_AMOUNT from hackathon_sample limit 10;
This will launch a (map-only) MapReduce job and return the specified fields for the first ten rows of the ‘hackathon_sample’ table.
Spark
Now give the Spark-shell a test:
> spark-shell
Read in the data and run a simple query that calculates the number of purchases for each upc in the sample data:
scala> val dataRDD = sc.textFile("hdfs://master.hackathonclt.org:8020/sample/data_with_headers/hackathon_data_headers")
scala> val upcs = dataRDD.flatMap(line => line.split("\\|").take(1))
scala> val wordCounts = upcs.map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCounts.collect()
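Since collect() pulls every (upc, count) pair back to the driver, a small follow-up worth trying (my own addition, not from the event handout) is to sort by count and keep only the ten most purchased UPCs:

scala> val top10 = wordCounts.map(_.swap).sortByKey(false).take(10)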
pySpark
You can also run the same query using the Python version of the Spark shell.
> pyspark
>>> dataRDD = sc.textFile("hdfs://master.hackathonclt.org:8020/sample/data_with_headers/hackathon_data_headers")
>>> upcs = dataRDD.map(lambda line: line.split('|')[0])
>>> wordCounts = upcs.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> wordCounts.take(10)
Scalding
In addition to the Hive and Spark shells, we’re also packaging Eval-tool, a tool to compile and run Scalding scripts without having to create a project. If you create a file called test.scala with the following contents:
import com.twitter.scalding._
import com.tresata.scalding.Dsl._
import com.tresata.scalding.util.ScaldingUtil

(args: Args) => {
  new Job(args) {
    // count the number of rows for each UPC and write out the totals
    ScaldingUtil.sourceFromArg(args("input"))
      .groupBy('UPC_NUMBER) { _.size }
      .write(ScaldingUtil.sourceFromArg(args("output")))
  }
}
you can run a query on the data set sample from the command-line:
> eval-tool test.scala --hdfs --input bsv%/sample/data_with_headers/hackathon_data_headers --output bsv%upc_counts
This will generate a bar-separated file called ‘upc_counts’ in your HDFS home directory, containing the UPC numbers along with their total counts.
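To pull the results out of HDFS for a quick look, the same commands from my scribbled notes above apply:

> hadoop fs -ls upc_counts
> hadoop fs -cat upc_counts/part-* > upc_counts
> wc -l upc_counts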
Job Tracker
http://master.hackathonclt.org:50030
Spark Job Tracker
http://master.hackathonclt.org:8080
Namenode information
http://master.hackathonclt.org:50070
Data Dictionary
UPC_NUMBER int unique product code of the item
MASTER_UPC_NUMBER int master UPC number that individual UPC numbers roll up under
ITEM_DESCRIPTION string describes the item
DEPARTMENT_NUMBER int department number
DEPARTMENT_DESCRIPTION string describes the department
CATEGORY_NUMBER int category number of the item
CATEGORY_DESCRIPTION string describes the category of the item
SUBCATEGORY_NUMBER int subcategory number of the item
SUBCATEGORY_DESCRIPTION string describes the subcategory of the item
RECEIPT_NUMBER string receipt number of the purchase
ITEM_QUANTITY int how many items were bought
EXTENDED_PRICE_AMOUNT float actual sale amount per swipe
DISCOUNT_QUANTITY float number of coupons applied
EXTENDED_DISCOUNT_AMOUNT float amount discounted
TENDER_AMOUNT float amount tendered by the customer for the transaction
TRANSACTION_DATETIME string date of the transaction
EXPRESS_LANE int flag for whether the purchase went through the express lane, tagged to the receipt number; 1 means yes, 0 means no
HHID string household id
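For reference, the transactions are bar-separated (which is what the split("\\|") calls in the Spark examples assume), so the dictionary maps onto a small Scala record. The case class and parser below are my own illustrative sketch, not part of the event tooling, and assume the fields appear in exactly the order listed above:

case class Transaction(
  upcNumber: Long,                // UPC_NUMBER
  masterUpcNumber: Long,          // MASTER_UPC_NUMBER
  itemDescription: String,        // ITEM_DESCRIPTION
  departmentNumber: Int,          // DEPARTMENT_NUMBER
  departmentDescription: String,  // DEPARTMENT_DESCRIPTION
  categoryNumber: Int,            // CATEGORY_NUMBER
  categoryDescription: String,    // CATEGORY_DESCRIPTION
  subcategoryNumber: Int,         // SUBCATEGORY_NUMBER
  subcategoryDescription: String, // SUBCATEGORY_DESCRIPTION
  receiptNumber: String,          // RECEIPT_NUMBER
  itemQuantity: Int,              // ITEM_QUANTITY
  extendedPriceAmount: Double,    // EXTENDED_PRICE_AMOUNT
  discountQuantity: Double,       // DISCOUNT_QUANTITY
  extendedDiscountAmount: Double, // EXTENDED_DISCOUNT_AMOUNT
  tenderAmount: Double,           // TENDER_AMOUNT
  transactionDatetime: String,    // TRANSACTION_DATETIME
  expressLane: Int,               // EXPRESS_LANE
  hhid: String)                   // HHID

// split a bar-separated line into the 18 dictionary fields;
// the sample files have a header row, which should be skipped before parsing
def parse(line: String): Transaction = {
  val f = line.split("\\|", -1) // -1 keeps trailing empty fields
  Transaction(f(0).toLong, f(1).toLong, f(2), f(3).toInt, f(4), f(5).toInt,
    f(6), f(7).toInt, f(8), f(9), f(10).toInt, f(11).toDouble, f(12).toDouble,
    f(13).toDouble, f(14).toDouble, f(15), f(16).toInt, f(17))
}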