Clojure集合管道函数练习

发表于 2016-09-12 更新于 2016-10-06 分类于 Clojure

起源

TDD讨论组里的申导最近在B站直播了Martin Fowler的经典文章*Refactoring with Loops and Collection Pipelines中谈到的利用集合管道对循环进行函数式重构。视频地址在这里，申导的翻译在这里。组织者小波（Seaborn Lee）趁机出了一道关于集合管道函数题目。我就想啊，论函数式编程，舍Clojure其谁？而且我在Clojure*很少能写出loop... recur这样偏底层的循环代码。话不多说，撸起袖子开工。

题目

一家澡堂有 m 个房间，每个房间有 n 个时段，现在要给用户推荐「最早的可以预约的时段」。

例子：

rooms: [
  {
     room_id: 1,
     periods: [
        {
           time: '17:00-18:00',
           status: 'available'
        },
        {
           time: '18:00-19:00',
           status: 'occupied'
        }
     ]
  }, {
     room_id: 2,
     periods: [
        {
           time: '17:00-18:00',
           status: 'occupied'
        },
        {
           time: '18:00-19:00',
           status: 'available'
        }
     ]
  }
]

期望返回：

{
  room_id: 1,
  time: '17:00-18:00'
}

解析

题目很简单，基本思路：首先过滤出每个房间periods中status为available的时间段，然后取第一个也就是最早的时间段（默认为递增排序的），接着将room_id和这个时间段以期望返回的形式合并。再然后对所有合并的结果依据时间段进行一次排序（sort），最后取第一个结果即可。

1. Clojure 解法

转换数据格式

原题中给的是json的格式，不适合在Clojure中处理，所以我们手工转换成需要的形式，如下：
清单1-1 数据定义

(def rooms
  [{:room-id 1
    :periods [{:time "17:00-18:00"
               :status :available}
              {:time "18:00-19:00"
               :status :occupied}]}
   {:room-id 2
    :periods [{:time "17:00-18:00"
               :status :occupied}
              {:time "18:00-19:00"
               :status :available}]}])

代码

清单1-2 房间最早可预约时间段

(defn the-earliest-available-room [rooms]
  (->> rooms
       (map
        (juxt first (fn [{:keys [periods]}]
                      (->> periods
                           (filter #(= (:status %) :available))
                           (ffirst)))))
       (map #(into {} %))
       (sort-by :time)
       (first)))
       
(the-earliest-available-room rooms)
-> {:room-id 1, :time "17:00-18:00"}

这段代码和上面的解析是一一对应的关系。为了让程序清晰，符合管道的用法，这里使用了thread last宏（->>），它的作用是把前面一个form作为后一个form的最后一个参数。与之呼应的是thread first宏（->），它的作用类似，不过会传成第一个参数。

我们先看(map (juxt ...) ...)这一段代码。juxt是一个非常有意思的函数，而且超级实用。它的文档描述如下：

(doc juxt)
->
clojure.core/juxt
 [f]
 [f g]
 [f g h]
 [f g h & fs]
Added in 1.1
  Takes a set of functions and returns a fn that is the juxtaposition
  of those fns.  The returned fn takes a variable number of args, and
  returns a vector containing the result of applying each fn to the
  args (left-to-right).
  ((juxt a b c) x) => [(a x) (b x) (c x)]

它的神奇之处在于可以对同一个参数应用不同的函数，而且还能将应用的结果全部收集起来。想想题目的解析中提及的以期望返回的形式合并，如果我们应用juxt函数，就能得到[(:room-id 1) (:time "17:00-18:00")]这样的中间结果。

(juxt first (fn ...))中first用于提取:room-id，而后面的lambda表达式则用于提取:time。解法很直观，筛选出:status是:available的时间段，然后使用(ffirst)取第一个map的首个entry。如：{:time "17:00-18:00" :status :available}，那么应用(ffirst)的结果就是[:time "17:00-18:00"]。

接下来，又进行了一次map操作，这次的目的是把元组的entries，转换为map。举个例子：[[:room-id 1] [:time "17:00-18:00"]] => {:room-id 1 :time "17:00-18:00"}。转换成map之后，方便以:time对结果进行排序(sort-by :time)，最后取出第一个元素(first)，即我们期望的返回。

写完之后，我很想再写个TDD版本的。话不多说，继续撸袖子。

2. Clojure TDD 解法

环境准备

生成工程

进入命令行，输入lein new midje the-earliest-available-period-of-bathroom，leiningen会生成基于midje这个测试框架的工程。

Git
1
2
3
4
5
6
7
8
git init
> .gitignore
.lein*
.nrep*
target/
这里ctrl-c退出
git add .
git commit --message "init commit"
我使用了zsh和oh-my-zsh，自带了很多git操作的alias，可以通过alias |grep git查看。后续的git操作都会使用alias。
自动测试

输入lein repl，然后(use 'midje.repl)，最后输入(autotest)。这样一旦文件修改保存，测试就会自动触发。

Emacs

用来写代码的。

Tasking（任务拆分）

先不急着敲代码，我们先从测试的角度看看完成这个需求需要哪几步？

单间澡堂有一个可用时间段
单间澡堂有多个可用时间段
所有澡堂（包含输入为空）没有可用时间段
多间澡堂都有可用时间段
多间澡堂中有的有可用时间段，有的没有可用时间段

第1个任务

单间澡堂有一个可用时间段

1. 写测试

(def room-1 {:room-id 1
    :periods [{:time "17:00-18:00"
               :status :available}
              {:time "18:00-19:00"
               :status :occupied}]})

(def room-2 {:room-id 2
             :periods [{:time "17:00-18:00"
                        :status :occupied}
                       {:time "18:00-19:00"
                        :status :available}]})
                        
(facts "about `the-earliest-avaible-period-of-bathroom`"
  (fact "should recommand if there is only one room with available period"
    ;; 1号
    (the-earliest-available-recommand [room-1]) => {:room-id 1 :time "17:00-18:00"})
    ;; 2号
    (the-earliest-available-recommand [room-2]) => {:room-id 2 :time "18:00-19:00"}))

2. 写实现

1 2	(defn the-earliest-available-recommand [rooms] {:room-id 1 :time "17:00-18:00"})

针对1号测试，这个实现有点“荒诞”，术语hard code说的就是这个，但是眼下足够了。不过此时，应该再写一个类似的测试来去除hard code，即2号测试。
相应地，我们要修改实现。

defn the-earliest-available-recommand [rooms]
  (let [{:keys [room-id periods]} (first rooms)
        available-periods (filter #(#{:available} (:status %)) periods)]
    (merge {:room-id room-id}
           (select-keys (first available-periods) [:time]))))

3. 关闭并提交

单间澡堂有一个可用时间段

1 2	ga . gcmsg "one available room"

第2个任务

单间澡堂有多个可用时间段

1. 写测试

(def room-3 {:room-id 3
             :periods [{:time "17:00-18:00"
                        :status :occupied}
                       {:time "18:00-19:00"
                        :status :available}
                       {:time "19:00-20:00"
                        :status :available}]})
...                        
(fact "should recommand the earliest one if there is only one room with multiple available periods"
    (the-earliest-available-recommand [room-3]) => {:room-id 3 :time "18:00-19:00"})

保存，发现测试还是跑过了。原因在于我们默认了period是递增排序的。我们看看有没有重构点？实现太简单了暂时找不到，那就欢欢喜喜地跳过实现步骤。

2. 关闭并提交

单间澡堂有多个可用时间段

1 2	ga . gcmsg "one room with multiple available periods"

第3个任务

所有澡堂（包含输入为空）没有可用时间段

1. 写测试

(def non-available-room {:room-id 4
             :periods [{:time "17:00-18:00"
                        :status :occupied}
                       {:time "18:00-19:00"
                        :status :occupied}
                       {:time "19:00-20:00"
                        :status :occupied}]})
                        
(fact "should show `:no-available-room` if there is no available room"
    (the-earliest-available-recommand []) => :no-available-room
    (the-earliest-available-recommand [non-available-room]) => :no-available-room))

这回肯定挂掉。

2. 写实现

(defn the-earliest-available-recommand [rooms]
  (let [{:keys [room-id periods]} (first rooms)
        available-periods (filter #(#{:available} (:status %)) periods)]
    (if (seq available-periods)
      (merge {:room-id room-id}
             (select-keys (first available-periods) [:time]))
      :no-available-room)))

这里使用了Clojure中判断集合是否为空较为常用的手法(seq )，如果集合非空，那么返回集合本身；反之，返回nil，nil在逻辑上是false。测试通过。

3. 关闭并提交

所有澡堂（包含输入为空）没有可用时间段

1 2	ga . gcmsg "no available room"

第4个任务

多间澡堂都有可用时间段

1. 写测试

(fact "should recommand the earliest if there has more than one room and each has available periods"
   (the-earliest-available-recommand [room-1 room-2]) => {:room-id 1 :time "17:00-18:00"}
   (the-earliest-available-recommand [room-2 room-1]) => {:room-id 1 :time "17:00-18:00"}
   (the-earliest-available-recommand [room-2 room-3]) => {:room-id 2 :time "18:00-19:00"}
   (the-earliest-available-recommand [room-1 room-2 room-3]) => {:room-id 1 :time "17:00-18:00"})

2. 写实现

(defn the-earliest-available-recommand [rooms]
  (if (seq rooms)
    (first (sort-by :time
                    (map (fn [room]
                           (let [{:keys [room-id periods]} room
                                 available-periods (filter #(#{:available} (:status %)) periods)]
                             (if (seq available-periods)
                               (merge {:room-id room-id}
                                      (select-keys (first available-periods) [:time]))
                               :no-available-room)))
                         rooms)))
    :no-available-room))

到这里，我们开始使用(map )函数处理多个房间的内容。注意，当输入房间是空集合的时候，这里需要相应地做(seq rooms)判空处理，否则会返回nil，而不是我们想要的:no-available-room。

3. 关闭并提交

多间澡堂都有可用时间段

1 2	ga . gcmsg "more than one room"

4. 重构

代码写到这里，再不重构就说不过去了。另外，管道没看到，倒是看到一堆括号。
我们使用thread last(->> )做一次重构：

(defn the-earliest-available-recommand [rooms]
  (if (seq rooms)
    (->> rooms
         (map (fn [room]
                (let [{:keys [room-id periods]} room
                      available-periods (filter #(#{:available} (:status %)) periods)]
                  (if (seq available-periods)
                    (merge {:room-id room-id}
                           (select-keys (first available-periods) [:time]))
                    :no-available-room))))
         (sort-by :time)
         first)
    :no-available-room))

还行，至少没那么多嵌套了。提交一次。

1 2	ga . gcmsg "[refactor] use macro thread-last ->> to pipe"

继续重构，使用我们的juxt函数。

(defn the-earliest-available-recommand [rooms]
  (letfn [(period [{:keys [periods]}]
            (->> periods
                 (filter #(#{:available} (:status %)))
                 ffirst
                 (#(or % [:time ::non-available]))))]
    (->> rooms
         (map (fn [room]
                (apply conj {} ((juxt first period) room))))
         (remove #(#{::non-available} (:time %)))
         (sort-by :time)
         first
         (#(or % :no-available-room)))))

看上去还行，不过不爽的是(#(or % [:time ::non-available]))。为了迎合(->> )宏，我们给(or )包了一层。原因是(->> )会让前面的结果出现在最后一个参数的位置，而我们需要将结果放到(or )的第一个参数的位置。有没有什么好看的解决方法呢？当然有！我们可以使用(-> )来做到这点。

(defn the-earliest-available-recommand [rooms]
  (letfn [(period [{:keys [periods]}]
            (-> periods
                (->> (filter #(#{:available} (:status %)))
                     ffirst)
                (or [:time ::non-available])))]
    (-> rooms
        (->> (map (fn [room]
                    (apply conj {} ((juxt first period) room))))
             (remove #(#{::non-available} (:time %)))
             (sort-by :time)
             first)
        (or :no-available-room))))

顿时觉得世界干净了不少。再提交一次。

1 2	ga . gcmsg "[refactor] use juxt to extract needed fields"

第5个任务

多间澡堂中有的有可用时间段，有的没有可用时间段

1. 写测试

1
2
3

(fact "should recommand the earliest available room even if there has non available room"
    (the-earliest-available-recommand [room-1 non-available-room]) => {:room-id 1 :time "17:00-18:00"}
    (the-earliest-available-recommand [room-2 non-available-room]) => {:room-id 2 :time "18:00-19:00"})

测试直接通过，又可以跳过实现代码了。不过，这也预示着我们的测试是有覆盖的，也需要花时间整理这些测试用例。在那之前，先提交一下。

2. 关闭并提交

多间澡堂中有的有可用时间段，有的没有可用时间段

1 2	ga . gcmsg "mixed non-available and available rooms"

为第3个任务补上测试用例

所有（包含多个）澡堂（包含输入为空）没有可用时间段

(fact "should show `:no-available-room` if there is no available room"
   (the-earliest-available-recommand []) => :no-available-room
   (the-earliest-available-recommand [non-available-room]) => :no-available-room
   (the-earliest-available-recommand [non-available-room non-available-room]) => :no-available-room))

这里的第3个用例包含第2个用例，我们待会整理掉。不过现在先提交一下。

1 2	ga . gcmsg "multiple non-available rooms"

整理测试

在前面进行的任务当中，我们发现有两次没有写实现测试就通过的情况。这说明测试用例是有覆盖的。

第2个任务的测试用例其实覆盖了第1个任务的测试用例，所以可以直接删去后者；
第5个任务的测试用例覆盖了第4个任务的部分测试用例，所以可以合并到一起。

整理下来，最终的测试变成下面这样：

(facts "about `the-earliest-avaible-period-of-bathroom`"
  (fact "should recommand the earliest one if there is only one room with multiple available periods"
    (the-earliest-available-recommand [room-3]) => {:room-id 3 :time "18:00-19:00"})
  
  (fact "should show `:no-available-room` if there is no available room"
    (the-earliest-available-recommand []) => :no-available-room
    (the-earliest-available-recommand [non-available-room non-available-room]) => :no-available-room)
 
  (fact "should recommand the earliest if there has more than one room and each may have available periods"
    (the-earliest-available-recommand [room-1 room-2]) => {:room-id 1 :time "17:00-18:00"}
    (the-earliest-available-recommand [room-2 room-1]) => {:room-id 1 :time "17:00-18:00"}
    (the-earliest-available-recommand [room-2 room-3]) => {:room-id 2 :time "18:00-19:00"}
    (the-earliest-available-recommand [room-1 room-2 room-3]) => {:room-id 1 :time "17:00-18:00"}
    (the-earliest-available-recommand [room-1 non-available-room]) => {:room-id 1 :time "17:00-18:00"}))

文档

The final goal of any engineering activity is some type of documentation.

更新README.md文件，其中描述程序解决的问题以及运行步骤，当然包含设计思路那更好了。提交一下。

1 2	ga . gcmsg "update readme"

美化代码

代码是诗行 - by lambeta

什么是好看的代码？除了清晰明了，格式也必须产生美感。

(defn the-earliest-available-recommand [rooms]
  (letfn [(period [{:keys [periods]}]
            (-> periods
                (->> (filter #(#{:available} (:status %))))
                (ffirst) ; 统一套上括号
                (or [:time ::non-available])))]
    (-> rooms
        (->> (map (juxt first period))
             (map #(into {} %)) ; 合并单独提出来
             (remove #(#{::non-available} (:time %)))
             (sort-by :time)
             (first)) ; 统一套上括号
        (or :no-available-room))))

顺眼不少，最后提交一下。

1 2	ga . gcmsg "[refactor] beautify pipe format"

这篇文章发出来一天，TDD讨论群的一位麦姓朋友@我道：

core=> (first {:a 1 :b 2})
[:a 1]
core=> (first {:b 2 :a 1})
[:b 2]
@lambeta map的元素应该是无序的，用first来获得key value pair是不可靠的。

看到这个建议的时候，我心里一阵欣喜——又有一员Clojurians，可以切磋技艺了！冷静下来，发现自己确实忽略了map中的entries可能是无序的。所以我做了如下的验证：

1 2	(type {}) -> clojure.lang.PersistentArrayMap

看到PersistentArrayMap的时候，我明白这些entries是保持插入顺序的，也就是说，(first {:a 1 :b 2})的求值结果一定是[:a 1]。照这个思路，在我的程序当中使用(first )取map的第一个元素并不会出错。不过，本着谨慎的心态，我查了一下clojure的array-map，发现一个有趣的例子：

(defn make-map [count] (zipmap (range count) (range count)))
(type (make-map 9))
;; => clojure.lang.PersistentArrayMap
(type (make-map 10))
;; => clojure.lang.PersistentHashMap

这表明当map中的entries数量超过一定数量（不一定是9，例外见：PersistentArrayMap’s assoc doesn’t respect HASHTABLE_THRESHOLD）时，PersistentArrayMap就变成了PersistentHashMap，那也就意味着，(first )取出来的值可能是随机的。举个例子：

1 2	(first {7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 0 0, 8 8, 5 5}) -> [0 0]

返回的结果并不是相当然的[7 7]，而是[0 0]。那么(first )到底干了些什么呢？Cognitect公司的alexmiller回答我说：(first )会把它的参数强制转换（coerce）成了一个序列，然后取第一个值。我们试着用(seq )转换一下：

(type { 7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 0 0, 8 8, 5 5})
-> clojure.lang.PersistentHashMap
(seq { 7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 0 0, 8 8, 5 5})
-> ([0 0] [7 7] [1 1] [4 4] [6 6] [3 3] [2 2] [9 9] [8 8])

果然，[0 0]出现在序列的首位。至于为什么是这样的顺序，需要深入Clojure的hash算法和数据结构当中，有时间另起一篇博客解释。我们再试试PersistentArrayMap的情况：

(type { 7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 0 0})
-> clojure.lang.PersistentArrayMap
(seq { 7 7, 1 1, 4 4, 6 6, 3 3, 2 2, 9 9, 0 0})
-> ([7 7] [1 1] [4 4] [6 6] [3 3] [2 2] [9 9] [0 0])

顺序确实和原来的一致。

我们的程序当中是不应该假设map是有序的，所以需要修改实现代码。

(defn the-earliest-available-recommand [rooms]
  (letfn [(period [{:keys [periods]}]
            (-> periods
                (->> (filter (comp #{:available} :status)))
                (first)
                (find :time)))
          (room-id [room]
            (find room :room-id))]
    (-> rooms
        (->> (map (comp (partial into {}) (juxt room-id period)))
             (filter :time)
             (sort-by :time)
             (first))
        (or :no-available-room))))

(find )函数，用于从map中获取包含该键值的entry，如果找不到，返回nil。这样就避免了潜在无序的entries对程序的干扰。另外，(partial into {})和Currying很像，它通过接收into函数及其首个参数，构造出一个接收后续参数的函数。当然也可以直接使用#(into {} %)这样的形式。

下面是麦姓朋友的另一种解法，和我的解法思路不完全一样，值得学习借鉴。

(defn the-earliest-available-recommand [rooms]
  (letfn [(earliest-available-time [periods]
            (->> periods
                 (filter (comp #{:available} :status))
                 (map :time)
                 (sort)
                 (first)))]
    (rename-keys
     (->> rooms
          (map #(update-in % [:periods] earliest-available-time))
          (filter :periods)
          (sort-by :periods)
          (first))
     {:periods :time})))

真诚欢迎大家继续点评。

本文样例