Vega Transform Features

Vega Study Notes

#BI/BigData

Transform

aggregate

Aggregation. The fields and ops arrays correspond element by element, and the as property names the output fields. Given the input data:

[
  {"foo": 1, "bar": 1},
  {"foo": 1, "bar": 2},
  {"foo": null, "bar": 3}
]

Apply the aggregate transform:

{
  "type": "aggregate",
  "fields": ["foo", "bar", "bar"],
  "ops": ["valid", "sum", "median"],
  "as": ["v", "s", "m"]
}

Result:

[{"v": 2, "s": 6, "m": 2}]

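The example above aggregates the entire data set, so the result is a single row. Use groupby to produce one row per group:

[
  {"foo": "a", "bar": 1},
  {"foo": "a", "bar": 2},
  {"foo": "b", "bar": 3}
]

Transform:

{
  "type": "aggregate",
  "groupby": ["foo"]
}

Result:

[
  {"foo": "a", "count": 2},
  {"foo": "b", "count": 1}
]

Note: aggregate can group by nested properties. In the output, "parent.child" is a plain string field name, not a nested property reference, so later references to this field must escape the dot as "parent\\.child".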
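To make the pairing of fields, ops, and as concrete, here is a minimal sketch in plain Python that emulates the behaviour shown in the two examples above. It is not Vega's implementation: the aggregate helper and the small OPS table (covering only count, valid, sum, and median) are assumptions made for illustration.

```python
from statistics import median

# Hypothetical helper table: only the ops that appear in the examples above.
OPS = {
    "count": lambda vals: len(vals),
    "valid": lambda vals: sum(v is not None for v in vals),
    "sum": lambda vals: sum(v for v in vals if v is not None),
    "median": lambda vals: median(v for v in vals if v is not None),
}

def aggregate(rows, fields=(), ops=(), as_=(), groupby=()):
    """Emulate an aggregate transform: one output row per group."""
    if not ops:
        # With no ops given, fall back to a per-group record count.
        fields, ops, as_ = [None], ["count"], ["count"]
    groups = {}
    for row in rows:
        key = tuple(row.get(g) for g in groupby)
        groups.setdefault(key, []).append(row)
    out = []
    for key, members in groups.items():
        result = dict(zip(groupby, key))
        # fields[i], ops[i] and as_[i] belong together.
        for field, op, name in zip(fields, ops, as_):
            vals = [m.get(field) for m in members] if field else members
            result[name] = OPS[op](vals)
        out.append(result)
    return out

data = [{"foo": 1, "bar": 1}, {"foo": 1, "bar": 2}, {"foo": None, "bar": 3}]
summary = aggregate(data, ["foo", "bar", "bar"], ["valid", "sum", "median"], ["v", "s", "m"])

grouped = aggregate(
    [{"foo": "a", "bar": 1}, {"foo": "a", "bar": 2}, {"foo": "b", "bar": 3}],
    groupby=["foo"],
)
```

Here summary holds the single-row result of the first example and grouped holds the per-group counts of the second.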

collect

Operates on the whole data stream (as an array) and is used mainly for sorting. The sort property configures the ordering:

Data:

[
  {"a": 3, "b": 1},
  {"a": 2, "b": 2},
  {"a": 1, "b": 4},
  {"a": 1, "b": 3}
]

Configuration:

{
  "type": "collect",
  "sort": {
    "field": ["a", "b"],
    "order": ["descending", "ascending"]
  }
}

Result:

[
  {"a": 3, "b": 1},
  {"a": 2, "b": 2},
  {"a": 1, "b": 3},
  {"a": 1, "b": 4}
]

Several fields can be sorted at once, each with its own order.
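The multi-field ordering can be sketched in plain Python as follows. This only illustrates the semantics, not how Vega implements collect; the collect_sort name is hypothetical.

```python
def collect_sort(rows, fields, orders):
    """Sort rows by several fields, each ascending or descending."""
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant
    # field first yields the combined multi-field order.
    for field, order in reversed(list(zip(fields, orders))):
        result.sort(key=lambda r: r[field], reverse=(order == "descending"))
    return result

data = [{"a": 3, "b": 1}, {"a": 2, "b": 2}, {"a": 1, "b": 4}, {"a": 1, "b": 3}]
sorted_rows = collect_sort(data, ["a", "b"], ["descending", "ascending"])
```

The two rows that tie on a are ordered by b, which matches the result listed above.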

CountPattern

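Matches the text of a field against a regular expression and counts how often each distinct match occurs. Matches listed in stopwords are excluded. Transform configuration:

{
  "type": "countpattern",
  "field": "comment",
  "pattern": "\\d+",
  "stopwords": "13"
}

Data:

[
  {"comment": "between 12 and 12.43"},
  {"comment": "43 minutes past 12 o'clock (and 13 seconds)"}
]

Result:

[
  {"text": "12", "count": 3},
  {"text": "43", "count": 2}
]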
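The counting logic can be approximated with Python's re module. This is a sketch under assumptions: count_pattern is a hypothetical name, and treating stopwords as a regular expression that must match the whole token is my reading of the example rather than a statement about Vega internals.

```python
import re
from collections import Counter

def count_pattern(rows, field, pattern, stopwords=None):
    """Count each distinct regex match found in row[field]."""
    counts = Counter()
    for row in rows:
        for match in re.finditer(pattern, row[field]):
            token = match.group(0)
            # Assumption: a token is dropped when the stopwords regex
            # matches it in full.
            if stopwords and re.fullmatch(stopwords, token):
                continue
            counts[token] += 1
    return [{"text": text, "count": count} for text, count in counts.items()]

data = [
    {"comment": "between 12 and 12.43"},
    {"comment": "43 minutes past 12 o'clock (and 13 seconds)"},
]
result = count_pattern(data, "comment", r"\d+", stopwords="13")
```

The token "13" is matched by the pattern but filtered out by stopwords, which is why it is absent from the result.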

contour

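Contour transform: models a spatial distribution of data values as a set of discrete levels. It is commonly used to show density estimates for two-dimensional point data, which scales better than drawing a large number of individual points. The output is a new set of GeoJSON data; the resulting shapes can be passed to the geoshape or geopath transforms.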

cross

Computes the cross-product of a data stream with itself.
[{"v": 1}, {"v": 2}, {"v": 3}] becomes:

[
  {"a": {"v": 1}, "b": {"v": 1}},
  {"a": {"v": 1}, "b": {"v": 2}},
  {"a": {"v": 1}, "b": {"v": 3}},
  {"a": {"v": 2}, "b": {"v": 1}},
  {"a": {"v": 2}, "b": {"v": 2}},
  {"a": {"v": 2}, "b": {"v": 3}},
  {"a": {"v": 3}, "b": {"v": 1}},
  {"a": {"v": 3}, "b": {"v": 2}},
  {"a": {"v": 3}, "b": {"v": 3}}
]
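The same pairing can be reproduced with itertools.product. This is only a sketch of the semantics; the cross function name is hypothetical, and the "a" and "b" output names follow the example above.

```python
from itertools import product

def cross(rows, as_=("a", "b")):
    """Pair every row with every row, including itself."""
    left, right = as_
    return [{left: x, right: y} for x, y in product(rows, repeat=2)]

pairs = cross([{"v": 1}, {"v": 2}, {"v": 3}])
```

Three input rows yield 3 x 3 = 9 output pairs, in the same order as the listing above.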

crossfilter

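This "transform" does not perform any actual filtering (question 1: is that right?); instead it maintains a multi-dimensional filter. It can then be combined with resolvefilter to run fast, interactive filter queries over large data sets.

  • fields property: the fields to filter on; the same field may appear more than once
  • query property: a list of range filters, one per field. Each entry is a two-element array giving the minimum and maximum values allowed for that field (a half-open interval: minimum inclusive, maximum exclusive)
  • signal property: the name of the signal that stores the computed filter mask

Example:

In the example below, the delay, time, and distance fields of the flights data set are cross-filtered. The initial crossfilter transform sets up the filters. The three derived data sets then use resolvefilter to perform the filtering, each ignoring the filter on one field.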

{
  "signals": [
    { "name": "delayRange", "value": [-60, 180] },
    { "name": "timeRange", "value": [0, 24] },
    { "name": "distanceRange", "value": [0, 2400] }
  ],
  "data": [
    {
      "name": "flights",
      "url": "data/flights-200k.json",
      "transform": [
        {
          "type": "crossfilter",
          "signal": "xfilter",
          "fields": ["delay", "time", "distance"],
          "query": [
            {"signal": "delayRange"},
            {"signal": "timeRange"},
            {"signal": "distanceRange"}
          ]
        }
      ]
    },
    {
      "name": "filterTimeDistance",
      "source": "flights",
      "transform": [
        {
          "type": "resolvefilter",
          "filter": {"signal": "xfilter"},
          "ignore": 1
        },
        ...
      ]
    },
    {
      "name": "filterDelayDistance",
      "source": "flights",
      "transform": [
        {
          "type": "resolvefilter",
          "filter": {"signal": "xfilter"},
          "ignore": 2
        },
        ...
      ]
    },
    {
      "name": "filterDelayTime",
      "source": "flights",
      "transform": [
        {
          "type": "resolvefilter",
          "filter": {"signal": "xfilter"},
          "ignore": 4
        },
        ...
      ]
    }
  ]
}

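Question: shouldn't the last ignore be 4? It should: ignore is a bit mask over the fields array, so ignoring the third field (distance, index 2) means 1 << 2 = 4, not 3.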
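The bit-mask idea behind crossfilter and resolvefilter can be sketched in plain Python. This is an assumption-laden illustration, not Vega's code: the function names mirror the transform names, the three sample flights are made up, and setting a bit when a row fails a query is my assumption about how the mask is encoded.

```python
def crossfilter(rows, fields, query):
    """Return one bit mask per row; bit i is set when the row fails query i."""
    masks = []
    for row in rows:
        mask = 0
        for i, (field, (lo, hi)) in enumerate(zip(fields, query)):
            if not (lo <= row[field] < hi):  # half-open range [lo, hi)
                mask |= 1 << i
        masks.append(mask)
    return masks

def resolvefilter(rows, masks, ignore=0):
    """Keep rows that pass every query whose bit is not set in ignore."""
    return [row for row, mask in zip(rows, masks) if (mask & ~ignore) == 0]

# Made-up sample rows for illustration only.
flights = [
    {"delay": 10, "time": 9, "distance": 500},    # passes all three queries
    {"delay": 300, "time": 9, "distance": 500},   # fails delay only
    {"delay": 10, "time": 9, "distance": 5000},   # fails distance only
]
masks = crossfilter(
    flights, ["delay", "time", "distance"], [(-60, 180), (0, 24), (0, 2400)]
)
strict = resolvefilter(flights, masks)                      # apply every filter
ignore_delay = resolvefilter(flights, masks, ignore=1)      # 1 << 0
ignore_distance = resolvefilter(flights, masks, ignore=4)   # 1 << 2
```

Because the masks are computed once, each derived view only needs a cheap bitwise test per row, which is what makes this pattern suitable for interactive filtering.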

density

Generates a new data stream of uniformly spaced samples drawn from a one-dimensional probability density function (pdf) or cumulative distribution function (cdf). @TODO

dotbin

@TODO

extent

Computes the minimum and maximum values of the specified field and stores them in a signal. The input data stream is left unchanged.

{"type": "extent", "field": "value", "signal": "extent"}

filter

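Filters the data stream, removing the objects for which the filter expression (expr) evaluates to false.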

flatten

Expands array-valued properties into individual data objects, one per array element.

Fold

"Folds" one or more properties into two properties, key and value. Configuration:

{"type": "fold", "fields": ["gold", "silver"]}

Data:

[
  {"country": "USA", "gold": 10, "silver": 20},
  {"country": "Canada", "gold": 7, "silver": 26}
]

Output:

[
  {"key": "gold", "value": 10, "country": "USA", "gold": 10, "silver": 20},
  {"key": "silver", "value": 20, "country": "USA", "gold": 10, "silver": 20},
  {"key": "gold", "value": 7, "country": "Canada", "gold": 7, "silver": 26},
  {"key": "silver", "value": 26, "country": "Canada", "gold": 7, "silver": 26}
]
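A plain Python sketch of the fold semantics shown above. The fold function name is hypothetical, and the key and value output names are taken from the example.

```python
def fold(rows, fields, as_=("key", "value")):
    """Emit one output row per (input row, folded field) pair."""
    key_name, value_name = as_
    return [
        # The original properties are kept alongside the new key/value pair.
        {key_name: field, value_name: row[field], **row}
        for row in rows
        for field in fields
    ]

data = [
    {"country": "USA", "gold": 10, "silver": 20},
    {"country": "Canada", "gold": 7, "silver": 26},
]
folded = fold(data, ["gold", "silver"])
```

Two input rows folded over two fields give four output rows, matching the listing above.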

force

Computes a force-directed layout. @TODO

formula

Adds a new calculated property to each data object.

{"type": "formula", "as": "logx", "expr": "log(datum.x) / LN10"}

{"type": "formula", "as": "hr", "expr": "hours(datum.date)"}

GeoJSON

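Generates new GeoJSON data and stores it in the specified signal. For example, the resulting GeoJSON signal can be used as the fit parameter of a projection.

GeoPath

Maps GeoJSON data to SVG paths (the result depends on the cartographic projection in use). This transform is similar to geoshape; the difference is that it generates the SVG path string immediately rather than a shape instance.

GeoPoint

Projects longitude and latitude coordinates to (x, y) coordinates according to a cartographic projection.

{
  "type": "geopoint",
  "projection": "myprojection",
  "fields": ["lon", "lat"]
}

The x and y coordinates are computed from the longitude and latitude and stored in the x and y properties by default (use as to write them to other properties).

GeoShape

Maps GeoJSON data to a shape instance.

graticule

Generates grid lines (a graticule) for a map.

identifier

Generates a new globally unique ID for each data object.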

impute

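Fills in missing data values.

  • field: the field (property) whose values should be imputed
  • key: the field that uniquely identifies data objects within a group
  • keyvals
  • method: the algorithm used to compute imputed values
  • groupby: the grouping fields
  • value: the constant used to fill in missing values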

joinaggregate

Extends the input data objects with aggregate values. It resembles the aggregate transform, but instead of producing a new data stream it writes the results back onto the input data objects.

Data:

[
  {"foo": 1, "bar": 1},
  {"foo": 1, "bar": 2},
  {"foo": null, "bar": 3}
]

Transform configuration:

{
  "type": "joinaggregate",
  "fields": ["foo", "bar", "bar"],
  "ops": ["valid", "sum", "median"],
  "as": ["v", "s", "m"]
}

Output:

[
  {"foo": 1, "bar": 1, "v": 2, "s": 6, "m": 2},
  {"foo": 1, "bar": 2, "v": 2, "s": 6, "m": 2},
  {"foo": null, "bar": 3, "v": 2, "s": 6, "m": 2}
]
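The difference from aggregate is easiest to see in code: the summary values are computed once and then merged into every input row. The sketch below is plain Python under assumptions; joinaggregate here is a hypothetical helper that supports only the three ops from the example and no groupby.

```python
from statistics import median

def joinaggregate(rows, fields, ops, as_):
    """Compute aggregates over all rows and copy them onto each row."""
    funcs = {
        "valid": len,      # number of non-null values
        "sum": sum,
        "median": median,
    }
    summary = {}
    for field, op, name in zip(fields, ops, as_):
        # Null values are skipped before each op is applied.
        vals = [r[field] for r in rows if r[field] is not None]
        summary[name] = funcs[op](vals)
    # Unlike aggregate, the row count is preserved.
    return [{**row, **summary} for row in rows]

data = [{"foo": 1, "bar": 1}, {"foo": 1, "bar": 2}, {"foo": None, "bar": 3}]
joined = joinaggregate(data, ["foo", "bar", "bar"], ["valid", "sum", "median"], ["v", "s", "m"])
```

The output has as many rows as the input, each carrying the same v, s, and m values.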

KDE transform

@TODO kernel density estimation

LinkPath transform

Creates the visual links (paths) between two nodes.

{
  "type": "linkpath",
  "orient": "radial",
  "sourceX": "source.angle",
  "sourceY": "source.radius",
  "targetX": "target.angle",
  "targetY": "target.radius",
  "shape": "orthogonal",
  "as": "linkpath"
}

Loess transform

Locally-estimated scatterplot smoothing, commonly used to produce trend lines.

lookup

Extends a primary data stream by looking up values in a secondary data stream. When a match is found, the matching data from the secondary stream is added to the primary stream.

  • from: the name of the secondary data stream, in which values are looked up
  • key: the key field of the secondary data stream
  • values: the fields of the secondary stream to copy back to the primary stream; if not specified, the whole object is copied
  • fields: the fields of the primary stream used to perform the lookup
  • as: the output field names under which the returned data is stored

Example:

"data": [
  {
    "name": "names",
    "values": [
      {"id": "A", "name": "label A"},
      {"id": "B", "name": "label B"},
      {"id": "C", "name": "label C"}
    ]
  },
  {
    "name": "values",
    "values": [
      {"foo": "A", "bar": 28},
      {"foo": "B", "bar": 55},
      {"foo": "C", "bar": 43},
      {"foo": "C", "bar": 91},
      {"foo": "D", "bar": 81}
    ],
    "transform": [
      {
        "type": "lookup",
        "from": "names",
        "key": "id",
        "fields": ["foo"],
        "as": ["obj"]
      }
    ]
  }
]

Result:

{"foo": "A", "bar": 28, "obj": {"id": "A", "name": "label A"}},
{"foo": "B", "bar": 55, "obj": {"id": "B", "name": "label B"}},
{"foo": "C", "bar": 43, "obj": {"id": "C", "name": "label C"}},
{"foo": "C", "bar": 91, "obj": {"id": "C", "name": "label C"}},
{"foo": "D", "bar": 81, "obj": null}

Another example:

{
  "type": "lookup",
  "from": "names",
  "key": "id",
  "fields": ["foo"],
  "values": ["name"],
  "as": ["obj"],
  "default": "some label"
}

Result:

{"foo": "A", "bar": 28, "obj": "label A"},
{"foo": "B", "bar": 55, "obj": "label B"},
{"foo": "C", "bar": 43, "obj": "label C"},
{"foo": "C", "bar": 91, "obj": "label C"},
{"foo": "D", "bar": 81, "obj": "some label"}
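Both lookup variants above can be mimicked with a dictionary index over the secondary stream. The sketch below is plain Python, not Vega's implementation; the lookup helper is hypothetical and simplified to a single lookup field.

```python
def lookup(primary, secondary, key, field, as_, values=None, default=None):
    """Extend each primary row with data found in the secondary rows."""
    index = {row[key]: row for row in secondary}
    out = []
    for row in primary:
        match = index.get(row[field])
        extended = dict(row)
        if values is None:
            # No values listed: attach the whole matching object (or the default).
            extended[as_[0]] = match if match is not None else default
        else:
            # Copy only the listed fields, one output name per value.
            for name, value_field in zip(as_, values):
                extended[name] = match[value_field] if match is not None else default
        out.append(extended)
    return out

names = [
    {"id": "A", "name": "label A"},
    {"id": "B", "name": "label B"},
    {"id": "C", "name": "label C"},
]
rows = [{"foo": "A", "bar": 28}, {"foo": "D", "bar": 81}]

whole = lookup(rows, names, key="id", field="foo", as_=["obj"])
labels = lookup(rows, names, key="id", field="foo", as_=["obj"],
                values=["name"], default="some label")
```

The row with foo equal to "D" has no match in names, so it receives null (None) in the first call and the default label in the second.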

nest transform

Generates hierarchical (tree) data from the input data objects by assigning them to nested groups. The transform produces a set of tree node objects that can then serve as input to the tree, treemap, pack, and partition transforms.

nest works much like a groupby that produces a hierarchy, although its output format is rather unusual. Original data:

[
  {"id": "A", "job": "Doctor", "region": "East"},
  {"id": "B", "job": "Doctor", "region": "East"},
  {"id": "C", "job": "Lawyer", "region": "East"},
  {"id": "D", "job": "Lawyer", "region": "East"},
  {"id": "E", "job": "Doctor", "region": "West"},
  {"id": "F", "job": "Doctor", "region": "West"},
  {"id": "G", "job": "Lawyer", "region": "West"},
  {"id": "H", "job": "Lawyer", "region": "West"}
]

Transform configuration:

{
  "type": "nest",
  "keys": ["job", "region"]
}

Result:

[
  // original input nodes
  {"id": "A", "job": "Doctor", "region": "East"},
  {"id": "B", "job": "Doctor", "region": "East"},
  {"id": "C", "job": "Lawyer", "region": "East"},
  {"id": "D", "job": "Lawyer", "region": "East"},
  {"id": "E", "job": "Doctor", "region": "West"},
  {"id": "F", "job": "Doctor", "region": "West"},
  {"id": "G", "job": "Lawyer", "region": "West"},
  {"id": "H", "job": "Lawyer", "region": "West"},

  // generated internal nodes
  // for the root node, key is undefined
  // values arrays contain nested groups of objects
  {"values": [ ... ]},
  {"key": "Doctor", "values": [ ... ]},
  {"key": "Lawyer", "values": [ ... ]},
  {"key": "East", "values": [ ... ]},
  {"key": "West", "values": [ ... ]},
  {"key": "East", "values": [ ... ]},
  {"key": "West", "values": [ ... ]}
]

In short, the input must be flat data; nest does not apply to data whose hierarchy is already expressed through nested arrays.

pack transform

Computes an enclosure diagram (circle packing) to represent a hierarchy. It takes the output of the nest or stratify transform as its input.

Pivot transform

Row-to-column conversion, commonly used to turn multiple rows of data into multiple columns. Data:

[
  {"country": "Norway",  "type": "gold",   "count": 14},
  {"country": "Norway",  "type": "silver", "count": 14},
  {"country": "Norway",  "type": "bronze", "count": 11},
  {"country": "Germany", "type": "gold",   "count": 14},
  {"country": "Germany", "type": "silver", "count": 10},
  {"country": "Germany", "type": "bronze", "count":  7},
  {"country": "Canada",  "type": "gold",   "count": 11},
  {"country": "Canada",  "type": "silver", "count":  8},
  {"country": "Canada",  "type": "bronze", "count": 10}
]

Transform configuration:

{
  "type": "pivot",
  "groupby": ["country"],
  "field": "type",
  "value": "count"
}

Result:

[
  {"country": "Norway",  "gold": 14, "silver": 14, "bronze": 11},
  {"country": "Germany", "gold": 14, "silver": 10, "bronze":  7},
  {"country": "Canada",  "gold": 11, "silver":  8, "bronze": 10}
]
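The row-to-column conversion can be sketched in plain Python as below. The pivot helper is hypothetical, and summing the value field when several rows fall into the same cell is an assumption of this sketch (in the example each cell receives exactly one row, so the choice does not affect the result).

```python
def pivot(rows, groupby, field, value):
    """Turn the distinct values of `field` into columns, one row per group."""
    columns = []
    for row in rows:
        if row[field] not in columns:
            columns.append(row[field])
    groups = {}
    for row in rows:
        key = tuple(row[g] for g in groupby)
        if key not in groups:
            # Start every pivoted column at zero for a new group.
            groups[key] = {**dict(zip(groupby, key)), **{c: 0 for c in columns}}
        groups[key][row[field]] += row[value]
    return list(groups.values())

medals = [
    {"country": "Norway", "type": "gold", "count": 14},
    {"country": "Norway", "type": "silver", "count": 14},
    {"country": "Norway", "type": "bronze", "count": 11},
    {"country": "Germany", "type": "gold", "count": 14},
    {"country": "Germany", "type": "silver", "count": 10},
    {"country": "Germany", "type": "bronze", "count": 7},
]
wide = pivot(medals, groupby=["country"], field="type", value="count")
```

Six long-format rows collapse into two wide rows, one per country.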

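project transform

Performs a relational algebra projection operation. It is commonly used to select a subset of the fields in a data stream or to rename existing fields. Not to be confused with cartographic projections.

resolvefilter transform

Efficiently filters data using the filter mask produced by crossfilter. The flights example in the crossfilter section shows the two transforms working together.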

Sample transform

Randomly samples the input data stream to produce a smaller output stream.

Sequence transform

Generates a data stream containing a sequence of numbers.

{"type": "sequence", "start": 0, "stop": 5}

Produces:

[
  {"data": 0},
  {"data": 1},
  {"data": 2},
  {"data": 3},
  {"data": 4}
]

Example 2:

{"type": "sequence", "start": 1, "stop": 10, "step": 2, "as": "value"}

Produces:

[
  {"value": 1},
  {"value": 3},
  {"value": 5},
  {"value": 7},
  {"value": 9}
]
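For integer arguments the behaviour matches Python's range, with stop excluded. A small sketch follows; the sequence function is a hypothetical stand-in and handles integers only.

```python
def sequence(start, stop, step=1, as_="data"):
    """Emit one object per value in [start, stop), advancing by step."""
    return [{as_: value} for value in range(start, stop, step)]

first = sequence(0, 5)
second = sequence(1, 10, step=2, as_="value")
```

Both calls reproduce the two listings above, including the default "data" field name in the first.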

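Stack transform

Computes the layout for stacked charts (for example, stacked bar charts). This transform adds two properties to each datum, giving the start and end values of its stack segment.

  • field: the field that determines the stack heights
  • groupby: the fields that partition the data into separate stacks
  • sort: the sort order of the items within each stack
  • offset: the baseline offset mode (zero, center, or normalize)
  • as: the output names for the computed start and end stack values; defaults to [y0, y1]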
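The start and end values can be illustrated with a running total per group. The sketch below is plain Python covering only the zero-offset case with non-negative values; the stack helper and the sample data are made up for illustration.

```python
def stack(rows, field, groupby=(), as_=("y0", "y1")):
    """Zero-offset stacking: each row starts where the previous row in its group ended."""
    start_name, end_name = as_
    totals = {}
    out = []
    for row in rows:
        key = tuple(row[g] for g in groupby)
        start = totals.get(key, 0)
        end = start + row[field]
        totals[key] = end
        out.append({**row, start_name: start, end_name: end})
    return out

# Made-up sample data: two segments in the "Jan" stack, one in "Feb".
data = [
    {"month": "Jan", "kind": "a", "amount": 3},
    {"month": "Jan", "kind": "b", "amount": 5},
    {"month": "Feb", "kind": "a", "amount": 2},
]
stacked = stack(data, field="amount", groupby=["month"])
```

A bar mark can then bind its two edges to y0 and y1, so the segments in each group sit on top of one another.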

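stratify transform

Converts the input data into a tree data structure using a key field and a parent key field. The transformation generates a set of tree node objects that later feed the tree, treemap, pack, and partition transforms.

Properties:

  • key
  • parentKey

tree transform

Computes a tree layout, generating x and y coordinates for each node.

  • field: the numeric field of each node
  • sort: the sort order
  • method: the tree layout method, either cluster or tidy
  • separation
  • size: the size of the layout area
  • as: the output properties, [x, y, depth, children] by default; use as to rename them

treelinks transform

Generates a new data stream representing the links between tree nodes. Each output data object has source and target fields referring to the two nodes that the link connects.

window transform

Performs calculations over sorted groups of data objects, including ranking, lead/lag, sums, and averages. The results are written back to the input objects.

Similar in meaning to window functions (as in SQL).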
