Tuesday, September 13, 2016

Homework 1 - Features and Classification

Overview
This is the first programming-based homework assignment of the course.  It is meant to prepare you for the class project and give you some practical experience with the material we've been discussing in lecture.  You will need to implement feature transformations and then perform some classification on those features.  The homework has two parts:

1. Rubine Features
2. Weka Classification

Instructions
There is no required programming language for this homework.  The only requirements are that you implement Rubine's features and use Weka.  So that your performance on these tasks can be measured, you will need to follow some data format guidelines -- use the data provided, and make sure your feature extractor outputs, or can easily be made to output, a CSV file.  These details are discussed below, but keep them in mind when choosing how to begin the homework.

Also, a "Homework 1" directory has been added inside the shared class Google Drive.  It contains all the downloadable materials for this assignment.

1. Rubine Features

a. Data

First, you'll need the data.  There are two data sets.  One is the small sample set which we have seen in quizzes and in discussion.  The sample CSV output is also available for this data set so that you may check your feature extractor before applying it to the second data set.  The second data set is a collection of alphabet letters.  There are 20 samples for each letter.

Data is available to you in three ways.

i. TXT Files

The TXT version of the data is the most rudimentary way of storing a sketch.  These files list only the points, with each line containing x, y, and t for a single point.  Reading the raw data from these files should be relatively simple, but none of the sketch's structure is saved.
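
As a minimal sketch of reading this format in Python, the snippet below parses lines of x, y, and t into tuples.  It assumes the fields are separated by commas and/or whitespace (the exact delimiter is an assumption, so check the actual files), and "parse_points" is a hypothetical helper name, not part of any provided code.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t)


def parse_points(text: str) -> List[Point]:
    """Parse lines of 'x y t' into a list of (x, y, t) tuples.

    Assumption: fields are separated by commas and/or whitespace.
    Blank lines are skipped; a line with fewer than three fields raises ValueError.
    """
    points: List[Point] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [f for f in re.split(r"[,\s]+", line) if f]
        x, y, t = (float(f) for f in fields[:3])
        points.append((x, y, t))
    return points


# Made-up example data, not taken from the homework files.
example = "10,20,0\n13,24,15\n\n13 30 31\n"
print(parse_points(example))
```

To read a real file, the same function could be called on the result of open(path).read().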

ii. JSON files

The JSON version of the data is saved in the SketchML::JSON format, developed by the Sketch Recognition Lab and based on MIT's original XML-based SketchML format.  This format is still fairly lightweight compared to XML, although not as compact as the TXT format.  It provides structure to the data, giving the sketch object a collection of points, strokes, and more.  JSON is very easy to work with in many modern languages, but you may wish to reference the SketchML::JSON specification document if you decide to parse the JSON yourself.  The specification is saved alongside the data in the same folder.

iii. Sketch Recognition Library API

The Sketch Recognition Library (srlib) is a collection of sketches from different domains that have been gathered into a single format (SketchML::JSON) and made available through a RESTful API.  All the data used for this assignment may be accessed from the srlib API, but the API itself is not on the Google Drive.  It is live at http://srl-prod1.cs.tamu.edu:7750/.  You must be on campus or on the VPN to access it.  That link will direct you to the documentation.  All you need to know to get started is that the "sample" data set is available with the domain "rubine" and the "letters" data set is available with the domain "letters".  An example is included in the starter code discussed below.

b. Feature Extraction

Regardless of how you choose to get the data, you must implement all 13 of Rubine's features.  The sample data set is provided to assist you in debugging, while the larger data set of letters will be used in the next stage with Weka.
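
To make the shape of a feature function concrete, here is a minimal Python sketch of four of the simpler Rubine features: F3 (length of the bounding-box diagonal), F5 (distance between the first and last points), F8 (total path length), and F13 (duration).  It assumes the points are (x, y, t) tuples in drawing order; the function name is hypothetical.  Treat it as a starting point to check against "sample.csv", not as a reference implementation, and note that the remaining nine features are not shown.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, t)


def simple_rubine_features(points: List[Point]) -> Dict[str, float]:
    """Compute F3, F5, F8, and F13 for a sketch given as (x, y, t) points."""
    if not points:
        raise ValueError("a sketch needs at least one point")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    # F3: length of the bounding-box diagonal.
    f3 = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    # F5: straight-line distance from the first point to the last point.
    f5 = math.hypot(xs[-1] - xs[0], ys[-1] - ys[0])
    # F8: total path length, summed over consecutive point pairs.
    f8 = sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0, _), (x1, y1, _) in zip(points, points[1:])
    )
    # F13: duration, i.e. last timestamp minus first timestamp.
    f13 = points[-1][2] - points[0][2]
    return {"F3": f3, "F5": f5, "F8": f8, "F13": f13}


# Made-up three-point stroke: right 3 units, then up 4 units.
stroke = [(0.0, 0.0, 0.0), (3.0, 0.0, 10.0), (3.0, 4.0, 25.0)]
print(simple_rubine_features(stroke))
```

For the made-up stroke above, the bounding-box diagonal and the first-to-last distance are both 5, the path length is 7, and the duration is 25.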

To provide further assistance with this step, starter code has been included in the shared folder.  The starter code does not implement any of the feature transforms.  It is actually intended to be a viewer for you as another means of debugging.  To use it as a viewer, download all the files in the starter folder and open "index.html" in your browser.

Because it uses the srlib database and includes a special build of the srlib JavaScript toolkit (data management functions only), it is essentially the same code you would write to access the data from srlib in JavaScript on your own.  For that reason, I added a single line: a callback to an empty "getFeatures()" function, where you may implement your feature extractor if you wish to work with the starter code.  Again, you may use any language you want, so there's no requirement to use the starter code.  Even if you do wish to work in JavaScript, you may want to look at the data in its raw TXT or JSON form to see how it looks.  But because this homework is concerned with feature extraction and classification, not data handling and plotting, the viewer and JavaScript tools are available as a resource.

c. Output

The language doesn't matter because what counts is the output.  Your program should generate a CSV file, or log console output which can be readily saved as CSV, where each row represents a sketch.  Thus, for the sample data set, you will have 8 rows of features.  The letters data set will have 20 * 26 = 520 rows.

Each row should contain the sketch's class/interpretation followed by the 13 Rubine feature values in order of F1 through F13.  For the sample data set, the class can just be the name or ID; it doesn't really matter.  For the letters data set, you should save the letter as the class.  The letter is saved in the SketchML::JSON data under the top-level shape's "interpretation" field.
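
The row layout described above could be written with Python's standard csv module, as in the minimal sketch below.  The function name is hypothetical, and the header row ("class", "F1" ... "F13") is an assumption; match whatever "sample.csv" does, and pass header=False if it has no header.

```python
import csv
import io
from typing import Iterable, Sequence, Tuple


def write_feature_rows(out, rows: Iterable[Tuple[str, Sequence[float]]],
                       header: bool = True) -> None:
    """Write one CSV row per sketch: the class label, then F1 through F13."""
    writer = csv.writer(out, lineterminator="\n")
    if header:
        # Assumed header names; adjust to match sample.csv.
        writer.writerow(["class"] + [f"F{i}" for i in range(1, 14)])
    for label, features in rows:
        if len(features) != 13:
            raise ValueError(f"expected 13 features for {label!r}, got {len(features)}")
        writer.writerow([label] + list(features))


# Made-up feature values, only to show the layout.
buf = io.StringIO()
write_feature_rows(buf, [("a", [0.0] * 13), ("b", [1.5] * 13)])
print(buf.getvalue())
```

To write a real file, open it with open("letters.csv", "w", newline="") and pass the file object in place of the StringIO buffer.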

The file "sample.csv" provides the Rubine feature values for the sample data set to assist in your evaluation.  You will need to generate "letters.csv" using your feature extractor.

2. Weka Classification

The second half of the assignment will not require any additional programming.  For this part, you'll be using Weka to build some classifiers with the features you extracted in the previous part.  Weka can import CSV directly, which is one of the primary reasons that your feature extractor should support saving CSV features... the other being ease of grading.

Once you have a "letters.csv" which has the class label and set of 13 features for all 520 sketches, import the data into the Weka Explorer.  From Explorer mode, you can test out many different functions in Weka, including classification, dimensionality reduction, and visualization.

Try at least 5 different classifiers on the data.  Consider trying k-fold cross-validation as well as other data-splitting methods, such as a percentage split.  In your report, which is submitted with the code and CSV files, list the classifiers and settings you chose along with the results.

I've tried to include all the materials you'll need either online or in the shared class folder.

By the end of both parts of the homework, you'll have gained some familiarity with gesture-based sketch features and with classification methods through Weka.  Later, we'll be investigating programming-based recognition a bit more, so you will probably have the chance to implement classifiers, segmenters, and other sketch algorithms in upcoming assignments.

Obtaining Credit
You will need to email the grader all of your files.

For simplicity, place all your files inside a single folder named "HW1_<last name>_<first name>" that will be compressed and submitted as a ZIP.  You should include the following files:

+ If you decided to download a data set (JSON or TXT), include these files in a "data" directory.

+ Include the source code which reads the data and generates the features in a "source" directory.  This can be a single file or multiple files, even a combination of scripts.  If your source is more than a single file, please add a small README that tells which one to run.

+ Include your CSV files for both the sample and letters data sets in a "results" directory.

+ Include the output of the Weka console after testing each of your 5 different classifiers and configurations in the same "results" directory with the CSV files.  This can just be a TXT file that you make by copying and pasting from the Weka window.

+ Include your report at the top level.  It should (briefly) discuss the options you tried out in Weka and report your findings.  You should have at least a small discussion of your findings.  Were you pleased with the results?  Did the exercise help you think of different features which you feel would be more important?  Any general impressions you had during either part of the assignment could be excellent material for a discussion.  (e.g. "First, I tried a Random Forest with 10-fold cross validation and obtained an F-measure of 86.4%.  Next, I wanted to see how a neural network would perform, so I ran a multilayer perceptron with...")

To recap, a submission might look like this:

HW1_Polsley_Seth/
|-- data/
|---- sample-json/ (contains all the sample json files)
|---- letters-json/ (contains all the letter json files)
|-- source/
|---- rubine.py
|-- results/
|---- sample.csv
|---- letters.csv
|---- wekalog.txt
|-- report.txt

This layout isn't an exact science.  I mainly want to make sure you include everything to demonstrate that you completed each part, and the structure is intended to make it easier for you to show that.  If you used the Sketch Recognition Library starter code, you can have far fewer files.  If you wish to consolidate the Weka log into your report, that is also fine.  So another submission may be:

HW1_Polsley_Seth/
|-- index.html
|-- require.js
|-- srlib.js
|-- sample.csv
|-- letters.csv
|-- report.pdf

This is also fine.  I mostly just need your code, the code output (CSV files), Weka output (Weka log), and a short write-up (discussion/report).  Include all of these in some obvious manner, and it's ok.

Again, please ZIP all your files together into a single file submission titled "HW1_<last name>_<first name>.zip", matching the folder name.
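
If you would rather script this step, the following is a minimal Python sketch using the standard library's shutil.make_archive.  The helper name "zip_submission" is hypothetical, and the folder name is taken from the example layout above; the demo builds a throwaway folder in a temporary directory rather than touching real files.

```python
import shutil
import tempfile
from pathlib import Path


def zip_submission(folder: Path) -> Path:
    """Create <folder>.zip next to the folder, with the folder as the top-level entry."""
    archive = shutil.make_archive(
        base_name=str(folder),        # output path without the ".zip" suffix
        format="zip",
        root_dir=str(folder.parent),  # paths in the archive are relative to this
        base_dir=folder.name,         # only this folder is archived
    )
    return Path(archive)


# Demo on a placeholder folder; replace with your real submission folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "HW1_Polsley_Seth"
    folder.mkdir()
    (folder / "report.txt").write_text("placeholder report\n")
    print(zip_submission(folder).name)
```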

As always, contact me if you have any questions.

Due Date
Sep. 26, Monday @ Midnight
25% deducted per day late
