Archive for Technical

… and logistic regression

A follow-up on the collaborative filter, it occurs to me that it should be possible to add a logistic regression layer in Keras.  The following is a regularized multinomial logistic regression model:

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.regularizers import l2

model = Sequential()
model.add(Dense(numClasses, input_shape=(numFeatures, ), \
    init='zero', W_regularizer=l2(0.01)))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='sgd')

As with the collaborative filter, you can easily modify the code to use multiple regularizers and different learning algorithms, say Adamax or adam instead of SGD.

Neat.

Logistic regression model in Keras

Comments off

Collaborative filtering in Keras

Ten years ago, Netflix started the Netflix challenge.  A contest to see if the community could come up with a movie recommendation approach that beat their own by 10%.  One of the primary modeling techniques that came out of the contest was a set of sparse matrix factoring models whose earliest description can be found at Simon Funk’s website.  The basic idea is that the actual ratings of movies for each user can be represented by a matrix, say of users on the rows and movies along the columns.  We don’t have the full rating matrix, instead, we have a very sparse set of entries.  But if we could factor the rating matrix into two separate matrices, say one that was Users by Latent Factors, and one that was Latent Factors by Movies, then we could find the user’s rating for any movie by taking the dot product of the User row and the Movie column.

One thing that is somewhat frustrating about coding Funk’s approach is that it uses Stochastic Gradient Descent as the learning mechanism and it uses L2 regularization which has to be coded up as well.  Also, it’s fairly loop-heavy.  In order to get any sort of performance, you need to implement it in C/C++.  It would be great if we could use a machine learning framework that already has other learning algorithms, multiple types of regularization, and batch training built in.  It would also be nice if the framework used Python on the front end, but implemented most of the tight loops in C/C++.  Sort of like coding the algorithm in Python and compiling with Cython.

My current favorite framework is Keras.  It uses the Theano tensor library for the heavy lifting which also allows the code to run on a GPU, if available.  So, here’s a question, can we implement a sparse matrix factoring algorithm in Keras?  It turns out that we can:

from keras.layers import Embedding, Reshape, Merge
from keras.models import Sequential, 
from keras.optimizers import Adamax
from keras.callbacks import EarlyStopping, ModelCheckpoint

factors = 20
left = Sequential()
left.add(Embedding(numUsers, factors,input_length=1))
left.add(Reshape(dims=(factors,)))
right = Sequential()
right.add(Embedding(numMovies, factors, input_length=1))
right.add(Reshape(dims=(factors,)))
model = Sequential()
model.add(Merge([left, right], mode='dot'))
model.compile(loss='mse', optimizer='adamax')
callbacks = [EarlyStopping('val_loss', patience=2), \ 
    ModelCheckpoint('movie_weights.h5', save_best_only=True)]
model.fit([Users, Movies], Ratings, batch_size=1000000, \
    validation_split=.1, callbacks=callbacks)

Ta da!  We’ve just created a left embedding layer that creates a Users by Latent Factors matrix and a right embedding layer that creates a Movies by Latent Factors matrix.  When the input to these is a user id and a movie id, then they return the latent factor vectors for the user and the movie, respectively.  The Merge layer then takes the dot product of these two things to return rating.  We compile the model using MSE as the loss function and the AdaMax learning algorithm (which is superior to Sparse Gradient Descent).  Our callbacks monitor the validation loss and we save the model weights each time the validation loss has improved.

The really nice thing about this implementation is that we can model hundreds of latent factors quite quickly.  In my particular case, I can train on nearly 100 million ratings, half a million users and almost 20,000 movies in roughly 5 minutes per epoch, 30 epochs – 2.5 hours.  But if I use the GPU (GeForce GTX 960), the epoch time decreases to 90s for a total training time of 45 minutes.

A collaborative filter model for the Keras machine learning framework.

Comments (1)

The Sounds of Science

Last Friday, hsarik pointed out an interesting web site: Echo Nest.  They provide a web service that allows you to analyze and remix music.  The API also can provide information (meta-data) about music, artists, songs etc.  and has Python bindings.  If you’ve seen the “More Cowbell” website where you can upload an mp3 and have more cowbell (and more Christopher Walken) added to it, well that site uses Echo Nest and if you download the python bindings for their API, you can see the script that adds the sounds.  Personally, I’m fond of “Ob-la-di, Ob-la-da” with 80% cowbell and 20% Christopher Walken.

I started playing with the API and as a first cut thought it would be neat to use the “get_similar” function.  So for each artist, you can get the top N similar artists.  Now where can I get a list of artists I like?  Well, I could type ’em in, but that sucks.  So I wrote a small program which:

  1. Opens the database on my iPod (or a directory of mp3 files)
  2. Finds each artist by either reading the iPod db or looking at the id3 tags in all of the files
  3. For each artist, add a node to a graph where the area of the node is proportional to the number of songs that artist has on the iPod (or in the music folder)
  4. For each artist, finds the top 50 similar artists
  5. For all of the similar artists that are in my collection of artists, add a graph edge between the two nodes
  6. Plot the graph

What can I say, I’ve been working on a fair amount of graph-theory at work recently.  So after processing my iPod, I came up with the following graph of my current music (click to embiggen):

Okay, that’s pretty cool.  Almost completely illegible, but cool.  FWIW, the graph has 15 connected components, unfortunately, 13 of them are “singles” (not connected to anything), with one pair (Louis Armstrong paired with Louis Armstrong and Duke Ellington).  Fortunately, the graphing tool I use (igraph), has built in tools for doing community analysis (using the leading eigenvector method), i.e., we can automatically find tightly coupled subgraphs.  A few examples from the 25 or so communities:

which arguably correspond to “Indie,”  “Classic Rock,”  “Jam Bands,”  “Guitar Gods,” and “Alternative.”  If I processed my complete music database, I suspect we would wind up with several other communities, e.g., Blues.  But since Robert Johnson is the only blues I’ve got on there right now… he’s in a class by himself.

I suppose it goes w/o saying, that my musical tastes aren’t everyone’s and that if you don’t like my musical tastes, you can keep it to yourself or go DIAF 🙂

So, what’s next?  I was talking with M from my office and we’ve come up with another interesting project for the Echo Nest API.  This one a) uses the audio analysis functions, and b) if we do it right might cause someone to send us a cease and desist.  So, win all the way around.

Comments off

Model update (updated!, updated again)

Earlier, I posted my current model for predicting the NCAA tournament.  Since the whole thing is probabilistic, I figured that I would test it out against the current NCAA standings.  I considered four models:

  1. The one that I described
  2. A random selection of which team would win (50/50 chance)
  3. Always picking the top seeded team
  4. A model suggested by a colleague at work

For each model, I ran 10,000 tests and compared them to the current NCAA tournament results, counting the scores for each test.  Results are:

The X axis is the score (0-64 at this point), the Y axis is the number of test runs (out of 10k) that achieved that score.  The number in the legend is the expected value (score) for each model.  As you can see, my model had the [second] highest expected value.  Choosing the top seeded team was the worst (guaranteed 10 points) best [see the 2nd update].  Choosing randomly was better than selecting the top seed the worst [see update] and my colleague’s model (cyan) was between my model and the random model.  Not bad.  I’ll update after the next two rounds of the tournament.

Update: one interesting thing is that this suggests that there was still a lot of luck in my ESPN pick.  Only about 0.5% of my model runs were as good as that one.

Update 2: So, I’m lying in bed when it occurs to me that I’m an idiot… the team with the *lowest* seed wins a game in Model 3.  This is why I say I don’t really know basketball.

Comments off

They laughed at my theories!

They laughed at my theories.  They threw tomatoes when I presented my paper at the academy!  Tomatoes I tell you!  My minions cower in terror, shrinking in fright from the very ideas contained herein!  But I will show them!  I will PROVE IT TO THEM ONCE AND FOR ALL.  The FOOLS, I WILL DESTROY THEM!! MWAHAHAHAAAA! (ask me how)

Oh, sorry.  Where was I?  Apparently, there’s this basketball thing going on.  Some sort of NCAA tournament that will prove who has the best basketball team.  But what if it doesn’t?  What if it’s all just arbitrary?  Could it be that the chances of any team winning a game are not deterministic, but rather stochastic?  I’ll admit that I don’t know that much about basketball.  I mean, I played the sport in junior high.  I do know the rules.  And I even think that it’s a pretty game.  But I don’t follow the ins and outs of a particular season.

So what’s a guy to do when he doesn’t really follow basketball, but you live in NC where bball is life and it’s bracket time?

You model it.   Which is exactly what I did.

The basic model:

  1. Compute a team’s wins minus their losses, I’m sure there’s a word for this, but let’s call it demonstrated strength (D)
  2. For a given match-up, take a draw from a Beta distribution parameterized by each team’s demonstrated strength (D1 and D2)
  3. The resulting draw is the probability that the team representing the first parameter wins
  4. Draw from a uniform random variable to predict if that team actually will win

There are some flaws with the model, the two obvious ones:

  1. Different teams have different schedules, so one team with a 30-5 record might be a lot better than another with a 30-5 record in a different conference (I’m looking at you SEC)
  2. It’s not clear that you should parameterize directly on the demonstrated strengths.  There should probably be a scaling factor in there.  So that rather than drawing from Beta(D1, D2), you should draw from Beta(alpha*D1, alpha*D2)

But this is close enough.  The nice features of the model are:

  1. The expected probability that a team will win is proportional to D1/(D1+D2).  So, a team whose wins outnumber their losses by 10, will have an expected probability of winning of 50% when playing against another team with D2=10.  And only a 33% chance of winning when playing against someone with a D2=20
  2. The closer two teams’ demonstrated strength is to zero, the broader the probability distribution is.  This reflects added uncertainty for two teams who win only slightly more often than they lose.
  3. The larger two team’s demonstrated strength is, the narrower the probability distribution is.  For example, D1=20, D2=40 has the same expected probability as D1=10, D2=20; but because this is a more common pattern for the two teams, we don’t have the same variance.
  4. This is actually pretty rigorous in Bayesian terms.  Throughout the season, we can update the posterior distribution of the probability of winning based on the prior distribution and the most recent game.

So, how well does the model work?  Good question.  I used it on ESPN, and it’s currently ranked in the 92.9th percentile, i.e., better than almost 93% of all ESPN brackets.  All of my final four teams are still alive, and in general, the model predicted several of the biggest upsets in the tournament (e.g., Murray State vs Vanderbilt!).  That said, this is just one random draw from the model.  To test it further, I would like to go through a whole season of games and figure out if the probabilities of winning correspond to the statistics of a Beta distribution for the game’s D1 and D2.  Moreover, I would like to infer the alpha parameter that I mention above.

If the model appears accurate, and we can properly infer alpha, then we get a probabilistic assessment of how feasible it is to even pick tournament champions.  It may just be that at the end of the day, it comes down to luck.

Comments (1)

“Hacking” predator drones

This just makes me sad.  Two articles, one in the WSJ, the other on CNN, describing how insurgents in Iraq are hacking predator drones and receiving the video feeds that the drones are sending back to U.S. ground stations.   First things first, let’s fix the headlines.  Both are running something like “Iraqi insurgents hacked Predator drone feeds.”  That should more clearly read:  “Iraqi insurgents watching the videos that the Predator drone sends out unencrypted.”  Or maybe “Iraqi insurgents watch Predator drone feeds on TV.”

If you look into the article, you find that insurgents are apparently using a $26 piece of software that let takes satellite data and saves parts of it that might not be intended for your computer.  Essentially, it monitors the data that is sent and when it sees a file transferred will save it to your hard drive, regardless of whether or not your computer was the intended destination.

Now, I’ve been doing computer security work for over a decade.  I was the first person at my university to implement anti-virus in email, I was the first to require a department to use all-encrypted communication for transmitting passwords.  I discovered one of the earliest IRC-based botnets.  I’ve found vulnerabilities in financial systems.  I’ve seen … [a]ttack ships on fire off the shoulder of Orion. I’ve watched C-beams glitter in the dark near the Tannhauser Gate.  Er, some of that last bit may have been someone else, but you get the idea.

This stuff isn’t that hard.  SSL is over 15 years old, we know how to do encryption.  Hell, back in the 90s when we were developing the Predator, the U.S. was treating encryption as a munition – you had to get the government’s blessing to use decent encryption.  Is it too much to ask that an actual weapon include the munition that was encryption?  And this from the WSJ article strikes me as BS:

Predator drones are built by General Atomics Aeronautical Systems Inc. of San Diego. Some of its communications technology is proprietary, so widely used encryption systems aren’t readily compatible, said people familiar with the matter.

In an email, a spokeswoman said that for security reasons, the company couldn’t comment on “specific data link capabilities and limitations.”

Or more  to the point, entirely irrelevant.  First, the communication system can’t be *that* proprietary, since the commercial (if somewhat sketchy) SkyGrabber software can read the transmissions.  Second, you developed a proprietary communication system in the mid to late 90s and didn’t include encryption?  That’s the sort of thing that makes the baby Bruce Schneier cry.

On the other hand, this from CNN seems far more likely:

A senior defense official who was not authorized to speak about the security breach said, “This was an old issue for us and it has been taken care of,” but he would not elaborate on what specifically had been taken care of.

The official said that many of the UAV feeds need to be sent out live to numerous people at one time, and encryption was found to slow the real-time link. The encryption therefore was removed from many feeds.

Removing the encryption, however, allowed outsiders with the correct tools to gain unauthorized access to these feeds.

I’ll buy that.   There are certainly a few encryption schemes that will send encrypted data to multiple parties, hell at the very least, you could use symmetric encryption with shared keys.  But that kinda sucks.  Most commercial communication encryption technology assumes point to point transfers.  If you wanted to send the same data to many people… you send it multiple times.

Regardless, this is just embarrassing.  These days I’m doing security modelling work and if this is the sort of thing that we’ll have to consider, I’m going to sink into


Comments off

Trash heap of programming history

When we put in the massive wall-eating bookcase, I thought that we had enough book space for the next 10,000 years! Or at least the next 10. Unfortunately, things were filling up a bit too fast, so I grabbed some of the books that I know I’ll never use again and will send them to recycling. Most are completely out of date (from bottom to top, these are from the mid-80s to the late 90s). Linux Application Development is probably still relevant, but a little too basic. Linux Device Drivers goes all the way through the 2.0 kernel series (and may be relevant for the experimental 2.1 series)! Switched LANs (snicker) and In Search of Clusters were given to me by vendors. And if I ever have to program native X-Windows again, I’ll kill myself. I kept the books on C/C++, Python, Perl and PHP – though I’ll probably never buy another programming book. Maybe Celeste is right, I should just get a Safari account… Google works pretty well too. If there’s anything you want in here, let me know…
10258

Comments (1)

Great moments in . . .

Minor notes, none worth their own post.

  • Traffic management: I get a call from K around 5:30. She’s stuck behind an accident and the cops on the scene, a) don’t tell people to take a detour until they’ve been there for a half hour; and b) once the ambulance has left the scene, don’t direct traffic around the one remaining open lane. So, after waiting a half hour, K has to take a 20+ minute detour home.
  • Memory: Once she gets in, K and I are fixing leftovers for dinner. C: “Hey, where are the mashed potatoes?” K: “Where did you put them?” “In the fridge, but I can’t find them.” “Maybe they’re in the freezer.” “Nope, not there either.” Ten minutes of looking for the potatoes. Did we throw them out on Sunday? Nope, not in the trash. Did C put them in the pantry? Nope. Can’t find ’em, can’t find ’em. Finally, K says, “wait, we fixed rice on Sunday.” There weren’t any potatoes. I would attribute it to getting old, but I’ve always been this way.
  • FUD (fear, uncertainty and doubt): we’re testing some things at the office – will our authentication system (active directory) honor password failure lockouts when using LDAP authentication? I ask our windows consultant to either a) answer the question, or b) enable an account lockout policy so we can test. He responds back that he can do that, but with the warning that “many Linux services aren’t well-designed for this, and repeatedly try a cached or user-provided password, so that users or service accounts may be mysteriously locked out after one attempt or at some future time when passwords change.” Which is complete and utter B.S. Signs that it’s BS? He references Linux services as opposed to open source, i.e. attempted linux dig. And I used to “own” identity management services, including authentication at a large university and if this was the case, things would have blown up within 10 minutes. I thanked him for the advice and noted that I’ve never seen this, but that it’s why we test.
  • OS Performance: we’re looking into some new ideas at the office. Things that could be useful as a preprocessor for a host based intrusion detection system. As part of my testing, I told my laptop to audit all syscalls made to the kernel, by all processes on the system. CPU load spiked, system performance went through the floor, the windowing system became almost completely non-responsive. In the two minutes it took to get access to a terminal, I logged 150 MB of audit logs. On the plus side, all of the information we need can be collected. Now I just need to figure out how to keep a usable system.
  • Self aggrandizement: talking to my technical manager, we need to write up two journal papers based on our recent work. Cool!

I hope everyone had a good Veteran’s Day and remembered to thank the veterans in their lives.

Comments off

Facebook security vulnerabilities

and this is why I like cross-posting to facebook from my blog.  It’s a healthy reminder that nothing on fb is actually private.  If it’s online – it’ll be exposed eventually, whether through a new exploit, or just because you “friend” someone in the future that you had written about in the past.


h/t hsarik

Comments off

Toy!

For a guy who uses computers as much as I do, you would think that I would have better machines at home.  But really, no.  Our last desktop was eight years old before we replaced it.  My laptop was an old Thinkpad X31 – maybe six or seven years old.  So, for a combination birthday/Christmas present, I got a new (sort of) laptop.  I picked up a refurbished thinkpad X200s from the Lenovo outlet.  So far it’s a nice machine: good processor, lots of high speed memory, and a big LED backlit screen.  Lenovo sent me notice that they were shipping the laptop today and the next thing I knew, it arrived at the office.  The advantages of working 5 miles from the outlet.  The only problem is that they didn’t ship all of the parts.  I’ve contacted them and they’ll send the dock/dvd.

When I got the machine home, I spent about a minute trying to decide if I would keep a partition with Windows on it.  But then I remembered that I don’t play video games, so I didn’t really need Vista.  I’ve installed Fedora 11 and moved all my files.

Comments (2)

« Previous entries Next Page » Next Page »