Reading Notes

Strongly recommended: read these books on Safari Books Online.

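Storm Real-time Processing Cookbook (2013). Overall: useful only as a reference for worked examples; a love-hate book. Rating: 3/5. Summary: example-driven and not suitable for beginners; its explanations of basic concepts are embarrassingly weak. Complete code samples are included.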

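Chapter 3: computes TF-IDF. The author intends this chapter to introduce Trident (Storm's high-level abstraction), but apart from brief comments on the code, the basic concepts are never explained, and the code mixes in many of Trident's more complex concepts. On top of Trident programming, DRPC is also mixed in without any explanation, which leaves the reader confused.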
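Since the chapter leans on TF-IDF without explaining it, here is a minimal sketch of the textbook formula in plain Python. It is not the book's Trident code; the function name `tf_idf`, the whitespace tokenizer, and the sample documents are assumptions made for illustration, and the book may use a different weighting variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of terms in the document
    idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample documents.
docs = ["storm processes streams", "trident builds on storm", "drpc calls storm"]
scores = tf_idf(docs)
```

A term that occurs in every document (here "storm") gets weight 0, while a term unique to one document gets the highest weight in that document.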

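Storm Blueprints: Patterns for Distributed Real-time Computation (Packt, 2014). Overall: highly recommended, and the first choice for getting started with Storm. The one flaw is that the code for two chapters is missing. Summary: covers much the same ground as the Storm Cookbook, but explains the basic concepts thoroughly.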

Chapter : introduces the basics of graph analysis.

Graph Databases, 2nd Edition (O'Reilly, 2015)

Chapter 2: Options for Storing Connected Data: in relational databases and ordinary NoSQL stores, relationships between tables are missing (they are only implicit); graph databases are good at representing relationships.
Chapter 3: Data Modeling with Graphs: introduces Cypher (a graph query language) and compares relational modeling (tables, normalization, denormalization) with graph modeling.

Cassandra Design Patterns (Packt, 2014)

Microsoft Application Architecture Guide, 2nd Edition

Big Data

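Big data is essentially about mastering tools and frameworks. After the first steps, progress still depends on accumulated programming-language experience, so it comes down to algorithms plus data structures.
Recommended learning order for any big data tool:
Standalone --> run the examples --> modify the examples --> develop your own examples --> Cluster mode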

Storm

Standalone mode Cluster mode

Apache Hadoop

Apache Oozie

Apache Kafka

console and java driver

Pitfalls

Dependency issues with DataStax's cassandra core

Apache Spark

Apache Cassandra

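Distributed column-based database

Pitfalls

Never put leading spaces in front of entries in the configuration file; otherwise it fails with an error.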

Apache Flume

Google Cloud Dataflow

Apache Mesos

Mesos: An open-source cluster manager. Mesos is a cluster manager aiming for improved resource utilization by dynamically sharing resources among multiple frameworks.
Improves resource utilization
Shared Cluster

Apache Hama

Apache Giraph

DevOps

Vagrant

Docker

Puppet

Data Visualization

D3.js

Map

Mapbox

Advantages: high precision, high resolution

Leaflet.js

LeafletSlider https://github.com/dwilhelm89/LeafletSlider

Draw a circle
The coordinates and radius can be changed.
http://stackoverflow.com/questions/22467177/draw-a-circle-of-constant-size-for-all-zoom-levels-leaflet-js

Remove a circle

add button
Build the button on top of the L.Control.extend function.
https://github.com/CliffCloud/Leaflet.EasyButton
Simplified version: https://github.com/jerroydmoore/leaflet-button/blob/master/L.Control.Button.js
Minimal version: http://www.coffeegnome.net/control-button-leaflet/

popup with jquery
http://stackoverflow.com/questions/1328723/how-to-generate-a-simple-popup-using-jquery
http://jsfiddle.net/SRw67/

popup with css
Change the CSS style: set display to inline or none.
http://stackoverflow.com/questions/9220141/how-to-position-a-popup-div-based-on-the-position-of-where-the-cursor-clicks https://jsfiddle.net/hwsamuel/fsp5pgqw/

Vertical centering with CSS
https://www.qianduan.net/css-to-achieve-the-vertical-center-of-the-five-kinds-of-methods/

Horizontal centering of text and images


jQuery UI: create tabs  
http://jsfiddle.net/queryj/CnEUh/1/  
http://jsfiddle.net/yeoupooh/4tchxazf/

## Database
### Graph Database
#### Cypher
Cypher: graph database query language  
examples

(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)

```
MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
      (a)-[:KNOWS]->(c)
RETURN b, c

MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
WHERE a.name = 'Jim'
RETURN b, c
```
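To see what the two equivalent queries return, the pattern can be emulated over a plain edge set. This is only a sketch: the function name `friends_of_friends_also_known` and the edge-set representation are assumptions, and it ignores Cypher's rule that a relationship is matched at most once per pattern (which makes no difference for this tiny graph).

```python
# Directed KNOWS edges taken from the pattern
# (emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)
knows = {
    ("Jim", "Emil"),
    ("Jim", "Ian"),
    ("Ian", "Emil"),
}


def friends_of_friends_also_known(edges, start):
    """Emulate: MATCH (a {name: start})-[:KNOWS]->(b)-[:KNOWS]->(c),
                      (a)-[:KNOWS]->(c)
                RETURN b, c
    """
    results = []
    for a, b in sorted(edges):
        if a != start:
            continue
        for b2, c in sorted(edges):
            # b must be the same node on both hops, and a must also know c.
            if b2 == b and (start, c) in edges:
                results.append((b, c))
    return results


matches = friends_of_friends_also_known(knows, "Jim")
print(matches)  # [('Ian', 'Emil')]
```

The only match is b = Ian, c = Emil: Jim knows Ian, Ian knows Emil, and Jim also knows Emil directly.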

Neo4j

configuration on vagrant. https://github.com/bretcope/vagrant-neo4j

py2neo

max-min fairness
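A minimal sketch of max-min fairness for a single resource, assuming the usual definition: small demands are satisfied in full, and whatever they leave unused is split equally among the users that still want more. The function name `max_min_fair` and the sample numbers are illustrative assumptions.

```python
def max_min_fair(capacity, demands):
    """Return a max-min fair allocation of `capacity` for the given demands."""
    allocation = [0.0] * len(demands)
    remaining = float(capacity)
    # Serve users in order of increasing demand.
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    for position, i in enumerate(order):
        users_left = len(order) - position
        fair_share = remaining / users_left
        # A user never receives more than it asked for; the surplus
        # stays in `remaining` and raises the share of later users.
        allocation[i] = min(demands[i], fair_share)
        remaining -= allocation[i]
    return allocation


# Hypothetical example: 10 units shared by three users.
shares = max_min_fair(10, [2, 8, 8])
print(shares)  # [2.0, 4.0, 4.0] up to float formatting
```

The first user needs less than an equal third, so it gets its full demand of 2, and the two larger users split the remaining 8 equally.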

DRF (Dominant Resource Fairness): sharing incentive, strategy-proofness, envy-freeness, Pareto efficiency
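DRF extends max-min fairness to several resource types: each user's dominant share is the largest fraction it holds of any single resource, and the scheduler keeps serving the user whose dominant share is smallest. The sketch below is a simplified greedy version under that assumption; the function name `drf_allocate`, the data layout, and the stop-when-the-pick-does-not-fit rule are choices made for illustration, and the numbers are the commonly used two-user example (9 CPUs and 18 GB in total).

```python
def drf_allocate(capacity, task_demands):
    """Launch one task at a time for the user with the smallest dominant share.

    capacity:      total amount of each resource, e.g. [cpus, memory_gb]
    task_demands:  {user: per-task demand vector in the same order}
    """
    used = [0.0] * len(capacity)
    tasks = {user: 0 for user in task_demands}
    dominant_share = {user: 0.0 for user in task_demands}

    while True:
        user = min(dominant_share, key=dominant_share.get)
        demand = task_demands[user]
        if any(u + d > c for u, d, c in zip(used, demand, capacity)):
            break  # the chosen user's next task does not fit: stop
        used = [u + d for u, d in zip(used, demand)]
        tasks[user] += 1
        # Dominant share = largest fraction of any one resource held by the user.
        dominant_share[user] = max(
            tasks[user] * d / c for d, c in zip(demand, capacity)
        )
    return tasks, dominant_share


# User A runs tasks of <1 CPU, 4 GB>; user B runs tasks of <3 CPU, 1 GB>.
tasks, shares = drf_allocate([9, 18], {"A": [1, 4], "B": [3, 1]})
print(tasks)  # {'A': 3, 'B': 2}
```

A ends up with 3 tasks (12 of 18 GB) and B with 2 tasks (6 of 9 CPUs), so both hold a dominant share of 2/3.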

Classic papers

Interesting project: https://github.com/cloudera/cdh-twitter-example

Using machine learning to predict the traitors in "A Song of Ice and Fire"

Blog

Analyzing Twitter Data with Apache Hadoop

- How-to: Analyze Twitter Data with Apache Hadoop
- Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume
- Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive
- Source Code: https://github.com/cloudera/cdh-twitter-example
- Flume pipeline

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Introduces four NRT (Near Real-Time) architecture patterns for Hadoop:

Stream ingestion
Near Real-Time (NRT) Event Processing with External Context
NRT Event Partitioned Processing
Complex Topology for Aggregations or ML

How Spotify Scales Apache Storm

High-level description of scaling Storm; it mainly covers a large number of practical tuning techniques.

Source and Sink Tuning
- Kafka Tuning
- Cassandra Tuning
Concurrency Issues
- OutputCollector in Storm is not thread-safe
- Parallelism Tuning
Caching for Bolts: Guava's Expirable Cache
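The idea behind caching in a bolt is to keep recently fetched values for a limited time so that repeated tuples do not hit the external store. The sketch below shows an expire-after-write cache in plain Python to illustrate the concept; it does not reproduce Guava's API, and the class name `ExpiringCache`, the injectable `clock`, and the sample key are assumptions. Like the OutputCollector caveat above, it is not thread-safe.

```python
import time


class ExpiringCache:
    """Tiny expire-after-write cache (not thread-safe)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # key -> (value, expiry time)

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it so the caller falls back to the slow lookup.
            del self._entries[key]
            return default
        return value


# Hypothetical usage with a 60-second time-to-live.
cache = ExpiringCache(ttl_seconds=60)
cache.put("user:42", {"country": "SE"})
profile = cache.get("user:42")
```

A bolt would call `get` first and only query the external system, followed by `put`, when the result is `None`.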

Scaling Apache Storm - Strata + Hadoop World 2014

Ways to improve Storm performance

Consider CPU, I/O, and disk performance separately.
Key Settings
- topology.max.spout.pending: when this limit is reached, Storm temporarily stops emitting data from the spout(s)
- topology.message.timeout.secs: if tuples time out at a bottleneck, increase the timeout and/or increase component parallelism
Externalize Configuration: do not hard-code parallelism values; read them from configuration, e.g. props.get("num.workers")
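Externalized configuration boils down to reading sizing values from a file instead of compiling them into the topology. A real Storm topology would do this in Java; the sketch below only illustrates the idea in Python. Apart from `num.workers`, every name here (`load_properties`, `spout.parallelism`, the file contents) is a hypothetical example.

```python
def load_properties(text):
    """Parse simple `key=value` lines; blank lines and # comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Hypothetical file contents; in practice this would be read from disk
# and differ per environment (dev, staging, production).
PROPERTIES = """
# topology sizing
num.workers=4
spout.parallelism=2
"""

props = load_properties(PROPERTIES)
# Fall back to a conservative default when a key is missing.
num_workers = int(props.get("num.workers", "1"))
spout_parallelism = int(props.get("spout.parallelism", "1"))
bolt_parallelism = int(props.get("bolt.parallelism", "1"))
```

Changing the degree of parallelism then means editing the properties file and resubmitting the topology, with no code change.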

Latency Numbers Every Programmer Should Know

| Operation | ns | us | ms | Notes |
|---|---|---|---|---|
| L1 cache reference | 0.5 | | | |
| Branch mispredict | 5 | | | |
| L2 cache reference | 7 | | | 14x L1 cache |
| Mutex lock/unlock | 25 | | | |
| Main memory reference | 100 | | | 20x L2 cache, 200x L1 cache |
| Compress 1K bytes with Zippy | 3,000 | 3 | | |
| Send 1K bytes over 1 Gbps network | 10,000 | 10 | | |
| Read 4K randomly from SSD | 150,000 | 150 | | ~1GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000 | 250 | | |
| Round trip within same datacenter | 500,000 | 500 | | |
| Read 1 MB sequentially from SSD | 1,000,000 | 1,000 | 1 | ~1GB/sec SSD, 4X memory |
| Disk seek | 10,000,000 | 10,000 | 10 | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 20,000,000 | 20,000 | 20 | 80x memory, 20X SSD |
| Send packet CA->Netherlands->CA | 150,000,000 | 150,000 | 150 | |
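The multipliers quoted next to the figures above are simple ratios of the nanosecond values. A short sketch that recomputes a few of them (the dictionary and the `ratio` helper are illustrative names; the numbers are copied from the figures above):

```python
# Latencies in nanoseconds, copied from the figures above.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Read 1 MB sequentially from SSD": 1_000_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
}


def ratio(slow, fast):
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


memory_vs_l1 = ratio("Main memory reference", "L1 cache reference")
seek_vs_roundtrip = ratio("Disk seek", "Round trip within same datacenter")
disk_vs_memory = ratio("Read 1 MB sequentially from disk",
                       "Read 1 MB sequentially from memory")
disk_vs_ssd = ratio("Read 1 MB sequentially from disk",
                    "Read 1 MB sequentially from SSD")
ssd_vs_memory = ratio("Read 1 MB sequentially from SSD",
                      "Read 1 MB sequentially from memory")
```

These come out to 200x, 20x, 80x, 20x, and 4x respectively, matching the notes.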

Online Courses

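Free and paid courses

Udacity: in my opinion the best set of courses for beginners
edX: the king of no-shows among MOOC platforms; courses are chronically postponed
Big Data University
Coursera: courses lean theoretical overall yet do not go deep, with little hands-on content
慕课网 (imooc)
麦子学院: offers courses that touch on architecture; the instructors are strong overall
宅客学院
MIT OpenCourseWare
极客学院: courses are fairly up to date, but the instructors' skill and Mandarin proficiency vary widely. Overall, though, a membership at 260 RMB per year is worth more than it costs.
DataStax: focuses on Cassandra; mostly exercise-based
Safari Books Online:
学堂在线: mostly Tsinghua University courses; the content is fairly detailed, but video loading speed is not ideal
Hortonworks course series: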