Analyzing 2018 World Cup match data with Clojure

So, the 2018 World Cup in Russia is over. There were lots of good moments, and some worse ones (especially for us here in Poland...). But is there something we can learn from this event as programmers? Recently, I had a few free evenings and played a bit with a World Cup API. In this post I want to show you the Clojure way of dealing with REST APIs and where it beats other languages in this field.

TL;DR: The code is on GitHub. The setup is based on Atom and the proto-repl package, which makes the interactive development experience really pleasant. I highly recommend following its installation guide while working through my solution.

As a problem to solve, I wanted something non-trivial that would show off some of Clojure's superpowers. Finally, I came up with this:

The problem: find all matches in which the winning team made a come-back, i.e. they conceded the first goal but still managed to win.

For example, Sweden vs Germany in the group stage:

OK, so a good starting point is fetching all the match data for further processing. I'm using clj-http as an HTTP client (with cheshire for JSON parsing support):

(ns worldcup.matches
  (:require
    [clj-http.client :as http]))

(def api-root "https://worldcup.sfg.io")

(defn get-all-matches []
  (->
    (http/get (str api-root "/matches") {:as :json})
    :body))

The Clojure philosophy relies heavily on dealing with built-in data structures. In statically typed languages you’d probably start with modelling response JSON as some POJO classes; here you just get a map and can start to investigate it right from the beginning. Huge difference.
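To make "you just get a map" concrete, here is a minimal sketch of poking at such a value. The sample-match map is a made-up, heavily trimmed stand-in for one element of the response (team names and times are placeholders, not real API output); only the key names are taken from this post.

```clojure
;; Hypothetical, trimmed match map; the real response carries many more keys.
(def sample-match
  {:home_team_country "Team A"
   :away_team_country "Team B"
   :winner            "Team A"
   :home_team_events  [{:type_of_event "goal" :time "48'"}]
   :away_team_events  [{:type_of_event "goal" :time "32'"}]})

;; Plain core functions are all you need to look around.
(:winner sample-match)                                 ; => "Team A"
(-> sample-match :away_team_events first :time)        ; => "32'"
(map :type_of_event (:home_team_events sample-match))  ; => ("goal")
```

No classes, no mapping layer: keywords act as accessor functions on the parsed data.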

One thing that may not be clear in the above code is the -> macro. It is really helpful for nested operations. Instead of nesting (xxx (yyy (zzz … calls, you can use the -> and ->> macros (-> threads the value in as the first argument of each form, ->> as the last) to make the code read like, e.g., the Java equivalent:

(->> lst
  (map op1)
  (map op2)
  (filter p1)
  first)

lst.stream()
    .map(op1)
    .map(op2)
    .filter(p1)
    .findFirst()
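As a small sketch of what the two threading macros do, each threaded form below is paired with its nested equivalent. The response map is a made-up placeholder, not real API output.

```clojure
;; Hypothetical response-like map, for illustration only.
(def response {:status 200 :body {:matches 64}})

;; -> puts the threaded value in FIRST argument position of each step.
(-> response :body :matches inc)             ; => 65
(inc (:matches (:body response)))            ; same thing, written inside-out

;; ->> puts it in LAST position, where sequence functions expect the collection.
(->> [1 2 3 4] (map inc) (filter even?) first)  ; => 2
(first (filter even? (map inc [1 2 3 4])))      ; same thing, nested
```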

OK, let's start investigating the API response. Using proto-repl, it is trivial to examine the value of any data structure straight from your editor. Let's see the match data using the Proto REPL: Execute Block action (ctrl+alt+,s by default):

It's common practice to wrap interactive invocations in the comment macro, so they are ignored when the whole file is evaluated at once.
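A minimal sketch of that habit, using a throwaway square function (not part of the World Cup code). The forms inside comment must be readable Clojure, but they are never evaluated when the file loads, so even a call to a function that does not exist is harmless there.

```clojure
(defn square [x] (* x x))

(comment
  ;; Send these to the REPL one at a time from the editor.
  (square 4)
  ;; Not evaluated on file load, so this undefined function causes no error.
  (some-function-not-written-yet 1))
```

The comment form itself evaluates to nil, which is why it is usually kept at the top level of a file rather than inside other expressions.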

Ok, so we have some basic attributes of the match, the number of goals for each team and some team events. Let’s investigate these:

Looks promising — we have all the times of the goals scored by each team. Having this, our algorithm may look like this:

  1. Get times of goals for each team
  2. Sort goals by times
  3. Get the team with the first goal
  4. Check if the winner is the other team

Let's start by filtering the goals out of a team's events. First, we need to find out which event types represent goals:
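One way to list the event types is sketched below. The sample-matches data and the event-types helper are hypothetical, made up for illustration; with real data you would pass in the result of get-all-matches instead.

```clojure
;; Hypothetical sample; only the :type_of_event key matters here.
(def sample-matches
  [{:home_team_events [{:type_of_event "goal"} {:type_of_event "yellow-card"}]
    :away_team_events [{:type_of_event "substitution-in"}]}
   {:home_team_events [{:type_of_event "goal-own"}]
    :away_team_events [{:type_of_event "goal-penalty"} {:type_of_event "goal"}]}])

(defn event-types
  "Returns the set of distinct event types across both teams of all matches."
  [matches]
  (->> matches
       (mapcat #(concat (:home_team_events %) (:away_team_events %)))
       (map :type_of_event)
       set))

(event-types sample-matches)
```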

Ok, so we have goal, goal-penalty and goal-own:

(defn goal? [e]
  ;; Regular goals, penalties and own goals all count as goals.
  (let [type (:type_of_event e)]
    (or (= "goal" type)
        (= "goal-penalty" type)
        (= "goal-own" type))))

Looks good.

We'll also need the time when the goal was scored, and it comes in a really strange format, e.g. 90'+4'. We need to split on the + character, strip the ' and trim, and then add the additional time to get the exact minute of the goal:

(require '[clojure.string :as string])

(defn goal-time [t]
  (->> 
    (string/split t #"\+")
    (map string/trim)
    (map #(string/replace % #"'" ""))
    (map #(Integer/parseInt %))
    (reduce +)))
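A quick REPL check of goal-time. The function is repeated here only so the snippet runs on its own.

```clojure
(require '[clojure.string :as string])

(defn goal-time [t]
  (->> (string/split t #"\+")
       (map string/trim)
       (map #(string/replace % #"'" ""))
       (map #(Integer/parseInt %))
       (reduce +)))

(goal-time "90'+4'")  ; => 94
(goal-time "38'")     ; => 38
```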

EDIT:

Bartek Tartanus (thanks!) found a bug in simply adding the additional time to the base minute: suppose we had a late goal in the first half and an early goal in the second (45'+5' and 46'). My algorithm would treat the second-half goal as the earlier one. We need to somehow pass the half (or part, for extra time) along with the time when sorting goals. Fortunately, Clojure sorts collections of pairs well:

We need to calculate both the half (part) and the exact time of each goal:
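A small sketch of the difference, using the two goals from the example. Vectors of equal length compare element by element, so a [part minute] pair sorts by part first and by minute second.

```clojure
;; Summed minutes: 45'+5' becomes 50, so the 46th-minute goal wrongly sorts first.
(sort [50 46])           ; => (46 50)

;; [part minute] pairs: the first-half goal now comes first.
(sort [[2 46] [1 50]])   ; => ([1 50] [2 46])
(compare [1 50] [2 46])  ; negative, i.e. [1 50] sorts before [2 46]
```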

(defn part [goal]
  (let [time (first goal)]
    (cond
      (<= time 45) 1
      (<= time 90) 2
      (<= time 105) 3   
      (<= time 120) 4))) ;; parts 3 and 4 are for extra-time

(defn parts [time-str]
  (->>
    (string/split time-str #"\+")
    (map string/trim)
    (map #(string/replace % #"'" ""))
    (map #(Integer/parseInt %))))

(defn part-and-time [time-str]
  (let [parts (parts time-str)]
    [(part parts) (reduce + parts)]))

Now we can test the improved sorting:

Seems to work! So let's try to find the team that scored the first goal. My idea was to mix both :home and :away goal times in one collection:
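A sketch of such a test at the REPL. The three functions are copies of the ones above (with the local names changed to ps and minute for readability) so that the snippet runs standalone.

```clojure
(require '[clojure.string :as string])

(defn parts [time-str]
  (->> (string/split time-str #"\+")
       (map string/trim)
       (map #(string/replace % #"'" ""))
       (map #(Integer/parseInt %))))

(defn part [ps]
  (let [minute (first ps)]
    (cond
      (<= minute 45) 1
      (<= minute 90) 2
      (<= minute 105) 3
      (<= minute 120) 4)))

(defn part-and-time [time-str]
  (let [ps (parts time-str)]
    [(part ps) (reduce + ps)]))

(part-and-time "45'+5'")                     ; => [1 50]
(part-and-time "46'")                        ; => [2 46]
(sort (map part-and-time ["46'" "45'+5'"]))  ; => ([1 50] [2 46])
```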

([:home [1 38]] [:home [1 39]] [:away [1 28]])

and then sort by times:

([:away [1 28]] [:home [1 38]] [:home [1 39]])

to know the side that scored the first goal (:away in this case).

(defn first-scored-side [match]
  (let [goal-times
          (concat
            (team-goal-times (:home_team_events match) :home)
            (team-goal-times (:away_team_events match) :away))]
   (->> goal-times (sort-by second) first first)))

I use first to get the first pair from the sorted collection and then first again to get the first element of that pair:

To create the list of goal times tagged with a side, I used this function (in the source file it has to be defined before first-scored-side, which calls it):
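Step by step, on the sample collection from above:

```clojure
(def goal-times [[:home [1 38]] [:home [1 39]] [:away [1 28]]])

(sort-by second goal-times)
;; => ([:away [1 28]] [:home [1 38]] [:home [1 39]])

(->> goal-times (sort-by second) first)        ; => [:away [1 28]]
(->> goal-times (sort-by second) first first)  ; => :away

;; With no goals at all, both calls to first return nil.
(->> [] (sort-by second) first first)          ; => nil
```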

(defn team-goal-times [events side]
  (->> events
    (filter goal?)
    (map :time)
    (map part-and-time)
    (map vector (repeat side))))

What's happening in this last line? repeat returns an infinite lazy sequence of the given element, and vector just creates an indexed collection from the provided elements.

And map can also take two collections, applying the provided function to pairs of elements: <c1_first, c2_first>, <c1_second, c2_second>, and so on, stopping when the shorter collection runs out (which is why the infinite repeat is safe here).

Which leads us easily to this:
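A REPL sketch of the pieces, with two sample goal times:

```clojure
(take 3 (repeat :home))                      ; => (:home :home :home)
(vector :home [1 38])                        ; => [:home [1 38]]

;; map over two collections: pairs each :home with one goal time.
(map vector (repeat :home) [[1 38] [1 39]])  ; => ([:home [1 38]] [:home [1 39]])

;; The same call as ->> builds it in team-goal-times (collection goes last).
(->> [[1 38] [1 39]] (map vector (repeat :home)))
```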

OK, now we are ready to write the come-back? function:

(defn winner-side [match]
  (let [winner (:winner match)]
    (cond
      (= winner (:home_team_country match)) :home
      (= winner (:away_team_country match)) :away)))

(defn come-back? [match]
  (let [first-scored-side (first-scored-side match)
        winner-side (winner-side match)]
    (and (not (nil? winner-side))
         (not= winner-side first-scored-side))))

Let's see if it works. To remove noise, we keep just the teams' data from each match:

(defn teams [match]
  (select-keys match [:home_team_country :away_team_country :winner]))
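As a quick illustration of what teams keeps, here it is applied to a made-up match map (placeholder team names; the :venue key is only there as an example of noise to drop).

```clojure
(defn teams [match]
  (select-keys match [:home_team_country :away_team_country :winner]))

;; Hypothetical match map with extra keys we do not care about.
(def sample-match
  {:home_team_country "Team A"
   :away_team_country "Team B"
   :winner            "Team B"
   :venue             "Somewhere"
   :home_team_events  []})

(teams sample-match)
;; => {:home_team_country "Team A", :away_team_country "Team B", :winner "Team B"}
```

Combined with the predicate, the whole query could then be written as (->> (get-all-matches) (filter come-back?) (map teams)).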

Ok, we got some results! But, hmm… Morocco — Iran?

0:1 with a last-minute own goal? This definitely doesn't look like a come-back... Let's look at this match's events:

OK, now I can see the problem. Own goals are stored in the wrong side's events! This goal should be in Iran's events, otherwise our solution won't work...

We can either special-case own goals or just move them to the correct side's events. I decided to go with the second solution, knowing that Clojure is really good at transforming nested data structures, e.g. with the update and update-in functions:

So, update takes a map, a key and a function, and applies the function to the value under that key, leaving the rest of the map untouched. update-in works the same way, but you can pass a whole path to some nested element in your data structure.

Let's use update to move all the own goals from :home_team_events to :away_team_events and vice versa:
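A minimal sketch of both functions on toy maps (the keys are arbitrary and chosen only for illustration):

```clojure
;; update: apply a function to the value under one key.
(update {:goals 1 :cards 2} :goals inc)         ; => {:goals 2, :cards 2}

;; Extra arguments are passed to the function after the current value.
(update {:events [1 2]} :events concat [3])     ; => {:events (1 2 3)}

;; update-in: same idea, but with a path into nested maps.
(update-in {:home {:goals 1}} [:home :goals] inc)  ; => {:home {:goals 2}}
```

The original maps are not modified; each call returns a new map.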

(defn own-goal? [event]
  (= "goal-own" (:type_of_event event)))

(defn own-goals [events]
  (->> events (filter own-goal?)))

(defn remove-own-goals [events]
  (remove own-goal? events))

(defn fix-match [match-to-fix]
  (let [own-home-goals (own-goals (:home_team_events match-to-fix))
        own-away-goals (own-goals (:away_team_events match-to-fix))]
   (-> match-to-fix
     (update :home_team_events remove-own-goals)
     (update :away_team_events remove-own-goals)
     (update :home_team_events concat own-away-goals)
     (update :away_team_events concat own-home-goals))))
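To see fix-match at work, here is a sketch on a made-up match whose only goal is a home-side own goal (the event data is hypothetical). The functions are repeated so the snippet runs standalone.

```clojure
(defn own-goal? [event]
  (= "goal-own" (:type_of_event event)))

(defn own-goals [events]
  (filter own-goal? events))

(defn remove-own-goals [events]
  (remove own-goal? events))

(defn fix-match [match-to-fix]
  ;; Capture the own goals before they are removed from either side.
  (let [own-home-goals (own-goals (:home_team_events match-to-fix))
        own-away-goals (own-goals (:away_team_events match-to-fix))]
    (-> match-to-fix
        (update :home_team_events remove-own-goals)
        (update :away_team_events remove-own-goals)
        (update :home_team_events concat own-away-goals)
        (update :away_team_events concat own-home-goals))))

;; Hypothetical match: the own goal sits in the home side's events.
(def raw
  {:home_team_events [{:type_of_event "goal-own" :time "90'+5'"}
                      {:type_of_event "yellow-card" :time "10'"}]
   :away_team_events []})

(def fixed (fix-match raw))

(map :type_of_event (:home_team_events fixed))  ; => ("yellow-card")
(map :type_of_event (:away_team_events fixed))  ; => ("goal-own")
```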

And finally, let's fix the come-back? function:

(defn come-back? [match-to-fix]
  (let [match (fix-match match-to-fix)
        first-scored-side (first-scored-side match)
        winner-side (winner-side match)]
    (and (not (nil? winner-side))
         (not= winner-side first-scored-side))))

Check the results again:

Yes, we have it! We found 9 such games, with the overall leader Croatia coming back 3 times (2 of them on penalties).
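Counting come-backs per team could be sketched like this. The come-backs vector is a hypothetical stand-in with placeholder team names for the output of (filter come-back? (get-all-matches)), not the real result.

```clojure
;; Hypothetical stand-in for the filtered matches.
(def come-backs
  [{:winner "Team A"} {:winner "Team B"} {:winner "Team A"}])

(->> come-backs
     (map :winner)
     frequencies        ; => {"Team A" 2, "Team B" 1}
     (sort-by val >))   ; => (["Team A" 2] ["Team B" 1])
```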

You May Also Like

33rd Degree day 1 review

33rd Degree is over. After last year's edition, my expectations were very high, but Grzegorz Duda once again proved he's more than able to deliver. With up to five tracks (most of the time: four presentations + one workshop) and ~650 attendees, there was a lot to see and a lot to do, so everyone will probably have a slightly different story to tell. Here is mine.

Twitter: From Ruby on Rails to the JVM

Raffi Krikorian talking about Twitter and JVM
The conference started with Raffi Krikorian from Twitter, talking about their use of the JVM. Twitter was built with Ruby, but as part of their performance work a lot of the backend was moved to Scala, Java and Clojure. Raffi noted that for Ruby programmers Scala was easier to grasp than Java, more natural, which is quite interesting considering how many PHP guys move to Ruby these days for the same reasons. Perhaps the path of learning Jacek Laskowski once described (Java -> Groovy -> Scala/Clojure) may be on par with PHP -> Ruby -> Scala. It definitely feels like Scala is the holy grail of languages these days.

Raffi also noted that while the JVM delivered speed and a concurrency model to the Twitter stack, it wasn't enough, and they've built/customized their own garbage collector. My guess is that Scala/Clojure could also have been chosen for their nice concurrency solutions (STM, immutables and so on).

Raffi pointed out that at the scale of Twitter you easily get 3 million hits per second, and that means you probably have 3 edge cases every second. I'd love to hear the lessons they've learned from this.

 

Complexity of Complexity


The second keynote of the first day was Ken Sipe talking about complexity. He made a good point that there is a difference between complex and complicated, and that we often recognize things as complex only because we are less familiar with them. This gets more interesting the moment you notice the shift over the last 20 years of computer languages, from the "Less is more" paradigm (think Java, ASM) to "More is better" (Groovy/Scala/Clojure), where you have a more complex language, with more powerful and less verbose syntax, that is actually not more complicated; it just looks less familiar.

So while 10 years ago I really liked Java as a general-purpose language for its small set of rules that could get you everywhere, it turned out that to do most of the real-world stuff, a lot of code had to be written. The situation got better thanks to libraries, frameworks and so on, but that's just patching. New languages have a lot of stuff built in, which makes their set of rules and syntax much more complex, but once you get familiar with them, real-world usage is simpler, faster and better, with fewer traps lying around, waiting for you to fall into.

Ken also pointed out that while an Enterprise Service Bus looks really simple on diagrams, it's usually very difficult and complicated to use from the programmer's perspective. And that's probably why it gets chosen so often: the guys selling and buying it look no deeper than the diagram.

 

Pointy haired bosses and pragmatic programmers: Facts and Fallacies of Software Development

Venkat Subramaniam with Dima
Dima got lucky. Or maybe not.

Venkat Subramaniam is the kind of speaker who talks about very simple things in a way that makes everyone either laugh or reflect. Yes, he is a showman, but hey, that's actually good, because even if you know the subject quite well, his talks are still very entertaining.
This talk was very generic (here's my thesis: the longer the title, the more generic the talk will be), interesting and fun, but in the end I can't point to anything new I learned, apart from the distinction between Dynamic vs Static and Strong vs Weak typing, which I saw last year but managed to forget. This may be a very interesting argument for all those who are afraid of Groovy/Ruby after a bad experience with PHP or Perl.

Build Trust in Your Build to Deployment Flow!


Frederic Simon talked about DevOps and deployment, and that was a miss in my schedule, for two reasons. First, the talk was aimed at DevOps specifically, and while the subject is trendy lately, without big-scale problems deployment is a process I usually set up and forget about. It just works, mostly because I only have to deal with one (current) project at a time.
Not much love for Dart.
Second, while Frederic has a fabulous accent and a nice, loud voice, he tends to start each sentence loud and fade out at the end. This, together with the mics failing him badly, made half of the presentation hard to follow unless you were sitting in the first row.
I'm not saying the presentation was bad, far from it; it just clearly wasn't for me.
I left a few minutes before the end to see how many people came to the Dart presentation by Mike West. I was kind of interested, since I follow the Warsaw Google Technology User Group and had heard a few voices about why I should pay attention to that new Google language. As you can see from the picture on the right, the majority tends to disagree with that opinion.

 

Non blocking, composable reactive web programming with Iteratees

Sadek Drobi's talk about Iteratees in Play 2.0 was very refreshing. Perhaps because I've never used Play before, but the presentation was flawless, with well explained problems, concepts and solutions.
Sadek started with a reflection on how much CPU we waste waiting for IO in web development, then moved to Play's Iteratees to explain the concept and implementation, which, while very different from the overused Request/Servlet model, looked really nice and simple. I'm not sure, though, how much of a problem this is when you have a simple service serving static content in front of your app server. Think Apache (or something faster) in front of Tomcat. That won't fix the upload/download issue, though, which is beautifully solved in Play 2.0.

The Future of the Java Platform: Java SE 8 & Beyond


Simon Ritter is an intriguing fellow. If you take a glance at his work history (AT&T UNIX System Labs -> Novell -> Sun -> Oracle), you can easily see he's a heavyweight player.
His presentation was rich in content, no corpo-bullshit. He started with a bit of history of the JCP and what it looks like right now, then moved to the most interesting stuff: changes. Now, I could give you a summary here, but there is really no point: you'd be much better off taking a look at the slides. There are only 48 of them, but everything is self-explanatory.
While I'm very disappointed with the speed of changes, especially when compared to the C# world, I'm happy with the direction and the fact that they finally want to BREAK compatibility with the broken stuff (generics, etc.). Having moved to other languages, I guess I won't be the one to scream "My god, finally!" somewhere in 2017, though. All the changes together look very promising, it's just that I'd like to have them like... now? Next year max, not near the heat death of the universe.

Simon also revealed one of the great mysteries of Java, to me:
The original idea behind JNI was to make it hard to write, to discourage people from using it.
On a side note, did you know Tegra3 actually has 5 cores? You use 4 of them, and then switch to the other one when your battery gets low.

BOF: Spring and CloudFoundry


With most of my folks off to see "Typesafe stack 2.0", fabulously organized by Rafał Wasilewski and Wojtek Erbetowski (with both of whom I had the pleasure of travelling to the conference), and knowing it would be recorded, I decided to see what Josh Long had to say about CloudFoundry, a subject I find very intriguing after the de facto fiasco of Google App Engine.

The audience was small but vibrant, mostly users of Amazon EC2, and while it turned out that Josh didn't have much to share, with pricing and details not yet public, the fact that SpringSource has already created their own competition (Cloud Foundry is both an open source app and a service) eases a lot of my anxiety.

For the review of the second day of the conference, go here.