I have a project in which I am trying out different things, of course in Clojure.
I stumbled upon Cascalog. It is by nathanmarz, the same person who created twitter-storm.
Cascalog is a "querying" library that runs over Hadoop. You can set up a Hadoop node and query your data with it. The queries look similar to SQL queries ("in philosophy, not in syntax"). So I thought of giving it a try. It would also be my first attempt at Hadoop :) Wanna have fun...
I stumbled upon many Hadoop tutorials but finally found one sent by God and written by an angel, at:
Hadoop-Set-Up
The only problem you are likely to encounter when running Cascalog is that it refuses to connect to Hadoop with an SSH failure. To fix that, install sshd (the SSH daemon) with:
sudo apt-get install openssh-server
After that, it was straightforward with the Cascalog library. I ran it in the REPL and it worked like a charm. To get started, go to: Getting Started with Cascalog
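As a quick taste, here is the first query from my REPL session (transcribed further below). This is a minimal sketch: it assumes a `lein repl` with Cascalog 2.x on the classpath, and it uses the `sentence` dataset that ships in `cascalog.playground`.

```clojure
(use 'cascalog.api)        ; ?-, <-, stdout, the :> keyword, etc.
(use 'cascalog.playground) ; brings in the in-memory `sentence` dataset

;; Emit every ?line from the `sentence` generator to stdout.
;; (?- executes the query; <- defines it, SQL-SELECT-style.)
(?- (stdout)
    (<- [?line]
        (sentence :> ?line)))
```

Running this kicks off a local Hadoop job and prints the lines of the Gettysburg Address, as you can see in the transcript below.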
I just want to add a note about Cascalog:
defmapcatop (defmapcatfn in the newer cascalog.logic.def API used below) is used for splitting one row into multiple rows (speaking in SQL terms).
The example works on a string of words: defmapcatop takes each line and generates the list of words in it, transposing horizontal into vertical.
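The idea can be sketched like this (taken from the REPL session later in this post, but with the tokenising regex simplified to plain whitespace for readability; the actual session also strips punctuation):

```clojure
(use 'cascalog.api)
(use 'cascalog.playground)
(require '[cascalog.logic.def :as def]
         '[cascalog.logic.ops :as c])

;; One input row (a line) expands into many output rows (its words) --
;; the "one row into multiple rows" behaviour described above.
(def/defmapcatfn tokenise [line]
  (clojure.string/split line #"\s+"))

;; Word count: tokenise each line, group by ?word, count per group.
(?- (stdout)
    (<- [?word ?count]
        (sentence :> ?line)
        (tokenise :< ?line :> ?word)
        (c/count :> ?count)))
```

The `c/count` aggregator implicitly groups by the remaining output variable (`?word`), much like `GROUP BY word` with `COUNT(*)` in SQL.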
Here is my bash history of my war :) :
467 cp shared/hadoop-2.4.1.tar.* hadoop/
468 ls
469 ls hadoop/
470 cd hadoop/
471 tar -xvf hadoop-2.4.1.tar.gz
472 tar -xvzf hadoop-2.4.1.tar.gz
473 ls
474 emacs &
475 vim set-env-vars.sh
476 cat set-env-vars.sh
477 ls
478 chmod a+x set-env-vars.sh
479 ls -l
480 chmod 755 set-env-vars.sh
481 ls
482 ls -l
483 chmod 766 set-env-vars.sh
484 ls
485 ls -l
486 ls
487 ln -s /home/XYZ/hadoop/hadoop-2.4.1 hadoop
488 ls
489 bash set-env-vars.sh
490 echo $HADOOP-INSTALL
491 cat set-env-vars.sh
492 echo $HADOOP_INSTALL
493 vim set-env-vars.sh
494 cat set-env-vars.sh
495 ./set-env-vars.sh
496 printev
497 printenv
498 printenv $HADOOP_INSTALL
499 echo $HADOOP_INSTALL
500 cat set-env-vars.sh
501 printenv OLDPWD
502 printenv HADOOP_INSTALL
503 bash set-env-vars.sh
504 printenv HADOOP_INSTALL
505 set $HADOOP_INSTALL
506 echo $HADOOP_INSTALL
507 ./set-env-vars.sh
508 echo $HADOOP_INSTALL
509 cat set-env-vars.sh
510 echo HADOOP_INSTALL
511 $HADOOP_INSTALL
512 sudo bash set-env-vars.sh
513 echo $HADOOP_INSTALL
514 export HADOOP=/home/XYZ/hadoop
515 ls
516 echo $HADOOP
517 ./set-env-vars.sh
518 cat set-env-vars.sh
519 echo $HADOOP_INSTALL
520 source set-env-vars.sh
521 echo $HADOOP_INSTALL
522 script help-script.script
523 cat help-script.script | grep "/default"
524 # edit the core-site.xml
525 # edit the yarn-site.xml
526 # edit the mapred-site.xml
527 # create namenode and datanode for hadoop
528 mkdir ../my-store/hdfs/namenode
529 ls
530 mkdir my-store
531 mkdir -r my-store/hdfs/namenode
532 man mkdir
533 mkdir -f my-store/hdfs/namenode
534 mkdir --help
535 mkdir -p my-store/hdfs/namenode
536 mkdir -p my-store/hdfs/datanode
537 # edit the hdfs-site.xml
538 # format the new hadoop filesystem
539 hdfs namenode -format
540 ls
541 source ~/.bashrc
542 hdfs
543 hdfs namenode -format
544 # formatting of namenode need to be done only once.
545 # otherwise it would wipe away the data
546 # do it again before starting hadoop only
547 start-dfs.sh
548 ssh-keyget -t rsa -P ''
549 ssh-keygen -t rsa -P ''
550 cat ~/.ssh/hadoop_rsa.pub >> ~/.ssh/authorized_keys
551 start-dfs.sh
552 jps
553 which ssh
554 which sshd
555 sudo apt-get install sshd
556 sudo apt-get install openssh-server
557 # install the openssh-server for sshd i.e. daemon of ssh
558 which ssh
559 which sshd
560 start-dfs.sh
561 jps
562 script help-script.script
563 cat help-script.script
564 man script
565 cat ~/.bash_history
566 cat ~/.bash_history | less
567 cat ~/.bash_history | grep "hadoop"
568 cat ~/.bash_history | grep "only"
569 cat ~/.bash_history | grep "once"
570 cat ~/.bash_history | grep "#"
571 history -a
572 history
573 history | less
574 man history
575 history --help
576 history
577 history > help-script.script
578 script -a help-script.script
579 history > help-script.script
Script started on Tuesday 29 July 2014 05:17:58 PM IST
XYZ@XYZ-VirtualBox:~/hadoop$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/XYZ/hadoop/hadoop-2.4.1/logs/yarn-XYZ-resourcemanager-XYZ-VirtualBox.out
localhost: starting nodemanager, logging to /home/XYZ/hadoop/hadoop-2.4.1/logs/yarn-XYZ-nodemanager-XYZ-VirtualBox.out
XYZ@XYZ-VirtualBox:~/hadoop$ jps
6082 NodeManager
5178 DataNode
4981 NameNode
6117 Jps
5439 SecondaryNameNode
5883 ResourceManager
XYZ@XYZ-VirtualBox:~/hadoop$ exit
exit
Script done on Tuesday 29 July 2014 05:19:56 PM IST
Script started on Tuesday 29 July 2014 05:20:27 PM IST
XYZ@XYZ-VirtualBox:~/hadoop$ watch -n 5 free -m
XYZ@XYZ-VirtualBox:~/hadoop$ stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
XYZ@XYZ-VirtualBox:~/hadoop$ stop-dfs.sh
14/07/29 17:27:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
14/07/29 17:28:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
XYZ@XYZ-VirtualBox:~/hadoop$ exit
exit
Script done on Tuesday 29 July 2014 05:30:36 PM IST
// Cascalog bash history: the word-counting example:
Script started on Wednesday 30 July 2014 04:14:01 PM IST
XYZ@XYZ-VirtualBox:~/clojure/cascalog$ lein repl
nREPL server started on port 50449 on host 127.0.0.1 - nrepl://127.0.0.1:50449
REPL-y 0.3.1
Clojure 1.5.1
Docs: (doc function-name-here)
(find-doc "part-of-name-here")
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e
user=> (use 'cascalog.api)
nil
user=> (use 'cascalog.playground)
nil
user=> (require '[cascalog.logic.def :as def])
nil
user=>
user=> (def/defmapcatfn tokenise [line]
  #_=>   (clojure.string/split line #"[\[\]\\\(\),.)\s+]"))
#'user/tokenise
user=>
user=>
user=> (require '[cascalog.logic.ops :as c])
nil
user=>
user=>
user=> (?- (stdout)
  #_=>     (<- [?line]
  #_=>         (sentence :> ?line)))
14/07/30 16:15:09 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
14/07/30 16:15:09 INFO planner.HadoopPlanner: using application jar: /home/XYZ/.m2/repository/cascading/cascading-hadoop/2.5.3/cascading-hadoop-2.5.3.jar
14/07/30 16:15:09 INFO property.AppProps: using app.id: C9B7157B563647F4A5B9DD8C2B4CC9DB
14/07/30 16:15:10 INFO util.Version: Concurrent, Inc - Cascading 2.5.3
14/07/30 16:15:10 INFO flow.Flow: [] starting
14/07/30 16:15:10 INFO flow.Flow: [] source: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/b7d6055f-3789-487d-946f-4b5ac88a4f51"]
14/07/30 16:15:10 INFO flow.Flow: [] sink: StdoutTap["SequenceFile[[UNKNOWN]->['?line']]"]["/tmp/temp81179071450888796715696517475394"]
14/07/30 16:15:10 INFO flow.Flow: [] parallel execution is enabled: false
14/07/30 16:15:10 INFO flow.Flow: [] starting jobs: 1
14/07/30 16:15:10 INFO flow.Flow: [] allocating threads: 1
14/07/30 16:15:10 INFO flow.FlowStep: [] starting step: (1/1) ...1450888796715696517475394
14/07/30 16:15:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/30 16:15:11 INFO flow.FlowStep: [] submitted hadoop job: job_local_0001
14/07/30 16:15:11 INFO flow.FlowStep: [] tracking url: http://localhost:8080/
14/07/30 16:15:11 INFO util.ProcessTree: setsid exited with exit code 0
14/07/30 16:15:12 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@fc824e
14/07/30 16:15:12 INFO mapred.MapTask: numReduceTasks: 0
14/07/30 16:15:12 INFO hadoop.FlowMapper: cascading version: 2.5.3
14/07/30 16:15:12 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
14/07/30 16:15:12 INFO hadoop.FlowMapper: sourcing from: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/b7d6055f-3789-487d-946f-4b5ac88a4f51"]
14/07/30 16:15:12 INFO hadoop.FlowMapper: sinking to: StdoutTap["SequenceFile[[UNKNOWN]->['?line']]"]["/tmp/temp81179071450888796715696517475394"]
14/07/30 16:15:12 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/07/30 16:15:12 INFO mapred.LocalJobRunner:
14/07/30 16:15:12 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
14/07/30 16:15:12 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/tmp/temp81179071450888796715696517475394
14/07/30 16:15:12 INFO mapred.LocalJobRunner:
14/07/30 16:15:12 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
14/07/30 16:15:12 INFO mapred.FileInputFormat: Total input paths to process : 1
RESULTS
-----------------------
Four score and seven years ago our fathers brought forth on this continent a new nation
conceived in Liberty and dedicated to the proposition that all men are created equal
Now we are engaged in a great civil war testing whether that nation or any nation so
conceived and so dedicated can long endure We are met on a great battlefield of that war
We have come to dedicate a portion of that field as a final resting place for those who
here gave their lives that that nation might live It is altogether fitting and proper
that we should do this
But in a larger sense we can not dedicate we can not consecrate we can not hallow
this ground The brave men living and dead who struggled here have consecrated it
far above our poor power to add or detract The world will little note nor long remember
what we say here but it can never forget what they did here It is for us the living rather
to be dedicated here to the unfinished work which they who fought here have thus far so nobly
advanced It is rather for us to be here dedicated to the great task remaining before us
that from these honored dead we take increased devotion to that cause for which they gave
the last full measure of devotion that we here highly resolve that these dead shall
not have died in vain that this nation under God shall have a new birth of freedom
and that government of the people by the people for the people shall not perish
from the earth
-----------------------
14/07/30 16:15:12 INFO util.Hadoop18TapUtil: deleting temp path /tmp/temp81179071450888796715696517475394/_temporary
nil
user=>
user=> (?- (stdout)
  #_=>     (<- [?word]
  #_=>         (sentence :> ?line)
  #_=>         (tokenise :< ?line :> ?word)))
14/07/30 16:15:14 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
14/07/30 16:15:14 INFO planner.HadoopPlanner: using application jar: /home/XYZ/.m2/repository/cascading/cascading-hadoop/2.5.3/cascading-hadoop-2.5.3.jar
14/07/30 16:15:14 INFO flow.Flow: [] starting
14/07/30 16:15:14 INFO flow.Flow: [] source: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/840fe467-c73b-4b4c-80bc-727a8ea52ce7"]
14/07/30 16:15:14 INFO flow.Flow: [] sink: StdoutTap["SequenceFile[[UNKNOWN]->['?word']]"]["/tmp/temp273045733006808956615702592648076"]
14/07/30 16:15:14 INFO flow.Flow: [] parallel execution is enabled: false
14/07/30 16:15:14 INFO flow.Flow: [] starting jobs: 1
14/07/30 16:15:14 INFO flow.Flow: [] allocating threads: 1
14/07/30 16:15:14 INFO flow.FlowStep: [] starting step: (1/1) ...3006808956615702592648076
14/07/30 16:15:14 INFO flow.FlowStep: [] submitted hadoop job: job_local_0002
14/07/30 16:15:14 INFO flow.FlowStep: [] tracking url: http://localhost:8080/
14/07/30 16:15:14 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1737be7
14/07/30 16:15:14 INFO mapred.MapTask: numReduceTasks: 0
14/07/30 16:15:15 INFO hadoop.FlowMapper: cascading version: 2.5.3
14/07/30 16:15:15 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
14/07/30 16:15:15 INFO hadoop.FlowMapper: sourcing from: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/840fe467-c73b-4b4c-80bc-727a8ea52ce7"]
14/07/30 16:15:15 INFO hadoop.FlowMapper: sinking to: StdoutTap["SequenceFile[[UNKNOWN]->['?word']]"]["/tmp/temp273045733006808956615702592648076"]
14/07/30 16:15:15 INFO mapred.Task: Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
14/07/30 16:15:15 INFO mapred.LocalJobRunner:
14/07/30 16:15:15 INFO mapred.Task: Task attempt_local_0002_m_000000_0 is allowed to commit now
14/07/30 16:15:15 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0002_m_000000_0' to file:/tmp/temp273045733006808956615702592648076
14/07/30 16:15:15 INFO mapred.LocalJobRunner:
14/07/30 16:15:15 INFO mapred.Task: Task 'attempt_local_0002_m_000000_0' done.
14/07/30 16:15:15 INFO mapred.FileInputFormat: Total input paths to process : 1
RESULTS
-----------------------
Four
score
and
seven
years
ago
our
fathers
brought
forth
on
this
continent
a
new
nation
conceived
in
Liberty
and
dedicated
to
the
proposition
that
all
men
are
created
equal
Now
we
are
engaged
in
a
great
civil
war
testing
whether
that
nation
or
any
nation
so
conceived
and
so
dedicated
can
long
endure
We
are
met
on
a
great
battlefield
of
that
war
We
have
come
to
dedicate
a
portion
of
that
field
as
a
final
resting
place
for
those
who
here
gave
their
lives
that
that
nation
might
live
It
is
altogether
fitting
and
proper
that
we
should
do
this
But
in
a
larger
sense
we
can
not
dedicate
we
can
not
consecrate
we
can
not
hallow
this
ground
The
brave
men
living
and
dead
who
struggled
here
have
consecrated
it
far
above
our
poor
power
to
add
or
detract
The
world
will
little
note
nor
long
remember
what
we
say
here
but
it
can
never
forget
what
they
did
here
It
is
for
us
the
living
rather
to
be
dedicated
here
to
the
unfinished
work
which
they
who
fought
here
have
thus
far
so
nobly
advanced
It
is
rather
for
us
to
be
here
dedicated
to
the
great
task
remaining
before
us
that
from
these
honored
dead
we
take
increased
devotion
to
that
cause
for
which
they
gave
the
last
full
measure
of
devotion
that
we
here
highly
resolve
that
these
dead
shall
not
have
died
in
vain
that
this
nation
under
God
shall
have
a
new
birth
of
freedom
and
that
government
of
the
people
by
the
people
for
the
people
shall
not
perish
from
the
earth
-----------------------
14/07/30 16:15:15 INFO util.Hadoop18TapUtil: deleting temp path /tmp/temp273045733006808956615702592648076/_temporary
nil
user=>
user=> (?- (stdout)
  #_=>     (<- [?word ?count]
  #_=>         (sentence :> ?line)
  #_=>         (tokenise :< ?line :> ?word)
  #_=>         (c/count :> ?count)))
14/07/30 16:15:17 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
14/07/30 16:15:17 INFO planner.HadoopPlanner: using application jar: /home/XYZ/.m2/repository/cascading/cascading-hadoop/2.5.3/cascading-hadoop-2.5.3.jar
14/07/30 16:15:17 INFO flow.Flow: [] starting
14/07/30 16:15:17 INFO flow.Flow: [] source: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/80b7d1fb-93bf-4947-9026-9ccb40f44326"]
14/07/30 16:15:17 INFO flow.Flow: [] sink: StdoutTap["SequenceFile[[UNKNOWN]->['?word', '?count']]"]["/tmp/temp128925885307038237815705521554123"]
14/07/30 16:15:17 INFO flow.Flow: [] parallel execution is enabled: false
14/07/30 16:15:17 INFO flow.Flow: [] starting jobs: 1
14/07/30 16:15:17 INFO flow.Flow: [] allocating threads: 1
14/07/30 16:15:17 INFO flow.FlowStep: [] starting step: (1/1) ...5307038237815705521554123
14/07/30 16:15:17 INFO flow.FlowStep: [] submitted hadoop job: job_local_0003
14/07/30 16:15:17 INFO flow.FlowStep: [] tracking url: http://localhost:8080/
14/07/30 16:15:17 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1726ac1
14/07/30 16:15:17 INFO mapred.MapTask: numReduceTasks: 1
14/07/30 16:15:17 INFO mapred.MapTask: io.sort.mb = 100
14/07/30 16:15:17 INFO mapred.MapTask: data buffer = 79691776/99614720
14/07/30 16:15:17 INFO mapred.MapTask: record buffer = 262144/327680
14/07/30 16:15:17 INFO hadoop.FlowMapper: cascading version: 2.5.3
14/07/30 16:15:17 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
14/07/30 16:15:18 INFO hadoop.FlowMapper: sourcing from: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/80b7d1fb-93bf-4947-9026-9ccb40f44326"]
14/07/30 16:15:18 INFO hadoop.FlowMapper: sinking to: GroupBy(d56a05ef-4117-4b3c-9a92-70d8e8a341e0)[by:[{1}:'?word']]
14/07/30 16:15:18 INFO assembly.AggregateBy: using threshold value: 10000
14/07/30 16:15:18 INFO mapred.MapTask: Starting flush of map output
14/07/30 16:15:18 INFO mapred.MapTask: Finished spill 0
14/07/30 16:15:18 INFO mapred.Task: Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO mapred.Task: Task 'attempt_local_0003_m_000000_0' done.
14/07/30 16:15:18 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1e6356d
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO mapred.Merger: Merging 1 sorted segments
14/07/30 16:15:18 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2422 bytes
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO hadoop.FlowReducer: cascading version: 2.5.3
14/07/30 16:15:18 INFO hadoop.FlowReducer: child jvm opts: -Xmx200m
14/07/30 16:15:18 INFO hadoop.FlowReducer: sourcing from: GroupBy(d56a05ef-4117-4b3c-9a92-70d8e8a341e0)[by:[{1}:'?word']]
14/07/30 16:15:18 INFO hadoop.FlowReducer: sinking to: StdoutTap["SequenceFile[[UNKNOWN]->['?word', '?count']]"]["/tmp/temp128925885307038237815705521554123"]
14/07/30 16:15:18 INFO mapred.Task: Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO mapred.Task: Task attempt_local_0003_r_000000_0 is allowed to commit now
14/07/30 16:15:18 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0003_r_000000_0' to file:/tmp/temp128925885307038237815705521554123
14/07/30 16:15:18 INFO mapred.LocalJobRunner: reduce > reduce
14/07/30 16:15:18 INFO mapred.Task: Task 'attempt_local_0003_r_000000_0' done.
14/07/30 16:15:18 INFO mapred.FileInputFormat: Total input paths to process : 1
RESULTS
-----------------------
4
But 1
Four 1
God 1
It 3
Liberty 1
Now 1
The 2
We 2
a 7
above 1
add 1
advanced 1
ago 1
all 1
altogether 1
and 6
any 1
are 3
as 1
battlefield 1
be 2
before 1
birth 1
brave 1
brought 1
but 1
by 1
can 5
cause 1
civil 1
come 1
conceived 2
consecrate 1
consecrated 1
continent 1
created 1
dead 3
dedicate 2
dedicated 4
detract 1
devotion 2
did 1
died 1
do 1
earth 1
endure 1
engaged 1
equal 1
far 2
fathers 1
field 1
final 1
fitting 1
for 5
forget 1
forth 1
fought 1
freedom 1
from 2
full 1
gave 2
government 1
great 3
ground 1
hallow 1
have 5
here 8
highly 1
honored 1
in 4
increased 1
is 3
it 2
larger 1
last 1
little 1
live 1
lives 1
living 2
long 2
measure 1
men 2
met 1
might 1
nation 5
never 1
new 2
nobly 1
nor 1
not 5
note 1
of 5
on 2
or 2
our 2
people 3
perish 1
place 1
poor 1
portion 1
power 1
proper 1
proposition 1
rather 2
remaining 1
remember 1
resolve 1
resting 1
say 1
score 1
sense 1
seven 1
shall 3
should 1
so 3
struggled 1
take 1
task 1
testing 1
that 13
the 9
their 1
these 2
they 3
this 4
those 1
thus 1
to 8
under 1
unfinished 1
us 3
vain 1
war 2
we 8
what 2
whether 1
which 2
who 3
will 1
work 1
world 1
years 1
-----------------------
14/07/30 16:15:18 INFO util.Hadoop18TapUtil: deleting temp path /tmp/temp128925885307038237815705521554123/_temporary
nil
user=>
user=> (?- (stdout)
  #_=>     (<- [?word ?count]
  #_=>         (sentence :> ?line)
  #_=>         (tokenise :< ?line :> ?word)
  #_=>         (c/count :> ?count)))
14/07/30 16:15:20 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
14/07/30 16:15:20 INFO planner.HadoopPlanner: using application jar: /home/XYZ/.m2/repository/cascading/cascading-hadoop/2.5.3/cascading-hadoop-2.5.3.jar
14/07/30 16:15:20 INFO flow.Flow: [] starting
14/07/30 16:15:20 INFO flow.Flow: [] source: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/6d53a4d0-849d-40d6-a577-36f3cfa732d1"]
14/07/30 16:15:20 INFO flow.Flow: [] sink: StdoutTap["SequenceFile[[UNKNOWN]->['?word', '?count']]"]["/tmp/temp203622615933442583315708535973833"]
14/07/30 16:15:20 INFO flow.Flow: [] parallel execution is enabled: false
14/07/30 16:15:20 INFO flow.Flow: [] starting jobs: 1
14/07/30 16:15:20 INFO flow.Flow: [] allocating threads: 1
14/07/30 16:15:20 INFO flow.FlowStep: [] starting step: (1/1) ...5933442583315708535973833
14/07/30 16:15:20 INFO flow.FlowStep: [] submitted hadoop job: job_local_0004
14/07/30 16:15:20 INFO flow.FlowStep: [] tracking url: http://localhost:8080/
14/07/30 16:15:20 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@caf446
14/07/30 16:15:20 INFO mapred.MapTask: numReduceTasks: 1
14/07/30 16:15:20 INFO mapred.MapTask: io.sort.mb = 100
14/07/30 16:15:20 INFO mapred.MapTask: data buffer = 79691776/99614720
14/07/30 16:15:20 INFO mapred.MapTask: record buffer = 262144/327680
14/07/30 16:15:20 INFO hadoop.FlowMapper: cascading version: 2.5.3
14/07/30 16:15:20 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
14/07/30 16:15:21 INFO hadoop.FlowMapper: sourcing from: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/6d53a4d0-849d-40d6-a577-36f3cfa732d1"]
14/07/30 16:15:21 INFO hadoop.FlowMapper: sinking to: GroupBy(c11a686a-603d-4108-9d34-3f3f04e9ca37)[by:[{1}:'?word']]
14/07/30 16:15:21 INFO assembly.AggregateBy: using threshold value: 10000
14/07/30 16:15:21 INFO mapred.MapTask: Starting flush of map output
14/07/30 16:15:21 INFO mapred.MapTask: Finished spill 0
14/07/30 16:15:21 INFO mapred.Task: Task:attempt_local_0004_m_000000_0 is done. And is in the process of commiting
14/07/30 16:15:21 INFO mapred.LocalJobRunner:
14/07/30 16:15:21 INFO mapred.Task: Task 'attempt_local_0004_m_000000_0' done.
14/07/30 16:15:21 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1c82eaa
14/07/30 16:15:21 INFO mapred.LocalJobRunner:
14/07/30 16:15:21 INFO mapred.Merger: Merging 1 sorted segments
14/07/30 16:15:21 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2422 bytes
14/07/30 16:15:21 INFO mapred.LocalJobRunner:
14/07/30 16:15:21 INFO hadoop.FlowReducer: cascading version: 2.5.3
14/07/30 16:15:21 INFO hadoop.FlowReducer: child jvm opts: -Xmx200m
14/07/30 16:15:21 INFO hadoop.FlowReducer: sourcing from: GroupBy(c11a686a-603d-4108-9d34-3f3f04e9ca37)[by:[{1}:'?word']]
14/07/30 16:15:21 INFO hadoop.FlowReducer: sinking to: StdoutTap["SequenceFile[[UNKNOWN]->['?word', '?count']]"]["/tmp/temp203622615933442583315708535973833"]
14/07/30 16:15:21 INFO mapred.Task: Task:attempt_local_0004_r_000000_0 is done. And is in the process of commiting
14/07/30 16:15:21 INFO mapred.LocalJobRunner:
14/07/30 16:15:21 INFO mapred.Task: Task attempt_local_0004_r_000000_0 is allowed to commit now
14/07/30 16:15:21 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0004_r_000000_0' to file:/tmp/temp203622615933442583315708535973833
14/07/30 16:15:21 INFO mapred.LocalJobRunner: reduce > reduce
14/07/30 16:15:21 INFO mapred.Task: Task 'attempt_local_0004_r_000000_0' done.
14/07/30 16:15:21 INFO mapred.FileInputFormat: Total input paths to process : 1
RESULTS
-----------------------
4
But 1
Four 1
God 1
It 3
Liberty 1
Now 1
The 2
We 2
a 7
above 1
add 1
advanced 1
ago 1
all 1
altogether 1
and 6
any 1
are 3
as 1
battlefield 1
be 2
before 1
birth 1
brave 1
brought 1
but 1
by 1
can 5
cause 1
civil 1
come 1
conceived 2
consecrate 1
consecrated 1
continent 1
created 1
dead 3
dedicate 2
dedicated 4
detract 1
devotion 2
did 1
died 1
do 1
earth 1
endure 1
engaged 1
equal 1
far 2
fathers 1
field 1
final 1
fitting 1
for 5
forget 1
forth 1
fought 1
freedom 1
from 2
full 1
gave 2
government 1
great 3
ground 1
hallow 1
have 5
here 8
highly 1
honored 1
in 4
increased 1
is 3
it 2
larger 1
last 1
little 1
live 1
lives 1
living 2
long 2
measure 1
men 2
met 1
might 1
nation 5
never 1
new 2
nobly 1
nor 1
not 5
note 1
of 5
on 2
or 2
our 2
people 3
perish 1
place 1
poor 1
portion 1
power 1
proper 1
proposition 1
rather 2
remaining 1
remember 1
resolve 1
resting 1
say 1
score 1
sense 1
seven 1
shall 3
should 1
so 3
struggled 1
take 1
task 1
testing 1
that 13
the 9
their 1
these 2
they 3
this 4
those 1
thus 1
to 8
under 1
unfinished 1
us 3
vain 1
war 2
we 8
what 2
whether 1
which 2
who 3
will 1
work 1
world 1
years 1
-----------------------
14/07/30 16:15:21 INFO util.Hadoop18TapUtil: deleting temp path /tmp/temp203622615933442583315708535973833/_temporary
nil
user=> 14/07/30 16:15:41 INFO util.Update: newer Cascading release available: 2.5.5
user=> (quit)
Bye for now!
XYZ@XYZ-VirtualBox:~/clojure/cascalog$ exit
exit
Script done on Wednesday 30 July 2014 04:29:33 PM IST
I stumbled upon Cascalog. It is by the same person nathanmarz who has created twitter-storm.
Cascalog is a "querying" library that runs over hadoop. You can set up a hadoop node and query it with your data. Your queries would look up similar ("in philosophy not in syntax") to the SQL queries. So, i thought of giving it a try. Also it would be my first attempt at hadoop too :) Wanna have fun...
I stumbled upon many hadoop tutorials but finally found one send by God and written by Angel - at -
Hadoop-Set-Up
The only problem that you can encounter is when running cascalog that it refuses to connect to the hadoop with ssh failure. Install sshd (ssh daemon) for that with command :
sudo apt-get install openssh-server
After that, it was straight forward with Cascalog library.. i ran it on the REPL and it was working like a charm.. To start with go to : Getting Started with Cascalog
I just want to add that in cascalog :
defmapcatop is used for partitioning your one row into multiple rows ( speaking in terms of sql).
The used example has a string of words. You can use a defmapcatop to generate a list of words. This is transposing from horizontal to vertical.
here my bash history about my war :) :
467 cp shared/hadoop-2.4.1.tar.* hadoop/
468 ls
469 ls hadoop/
470 cd hadoop/
471 tar -xvf hadoop-2.4.1.tar.gz
472 tar -xvzf hadoop-2.4.1.tar.gz
473 ls
474 emacs &
475 vim set-env-vars.sh
476 cat set-env-vars.sh
477 ls
478 chmod a+x set-env-vars.sh
479 ls -l
480 chmod 755 set-env-vars.sh
481 ls
482 ls -l
483 chmod 766 set-env-vars.sh
484 ls
485 ls -l
486 ls
487 ln -s /home/XYZ/hadoop/hadoop-2.4.1 hadoop
488 ls
489 bash set-env-vars.sh
490 echo $HADOOP-INSTALL
491 cat set-env-vars.sh
492 echo $HADOOP_INSTALL
493 vim set-env-vars.sh
494 cat set-env-vars.sh
495 ./set-env-vars.sh
496 printev
497 printenv
498 printenv $HADOOP_INSTALL
499 echo $HADOOP_INSTALL
500 cat set-env-vars.sh
501 printenv OLDPWD
502 printenv HADOOP_INSTALL
503 bash set-env-vars.sh
504 printenv HADOOP_INSTALL
505 set $HADOOP_INSTALL
506 echo $HADOOP_INSTALL
507 ./set-env-vars.sh
508 echo $HADOOP_INSTALL
509 cat set-env-vars.sh
510 echo HADOOP_INSTALL
511 $HADOOP_INSTALL
512 sudo bash set-env-vars.sh
513 echo $HADOOP_INSTALL
514 export HADOOP=/home/XYZ/hadoop
515 ls
516 echo $HADOOP
517 ./set-env-vars.sh
518 cat set-env-vars.sh
519 echo $HADOOP_INSTALL
520 source set-env-vars.sh
521 echo $HADOOP_INSTALL
522 script help-script.script
523 cat help-script.script | grep "/default"
524 # edit the core-site.xml
525 # edit the yarn-site.xml
526 # edit the mapred-site.xml
527 # create namenode and datanode for hadoop
528 mkdir ../my-store/hdfs/namenode
529 ls
530 mkdir my-store
531 mkdir -r my-store/hdfs/namenode
532 man mkdir
533 mkdir -f my-store/hdfs/namenode
534 mkdir --help
535 mkdir -p my-store/hdfs/namenode
536 mkdir -p my-store/hdfs/datanode
537 # edit the hdfs-site.xml
538 # format the new hadoop filesystem
539 hdfs namenode -format
540 ls
541 source ~/.bashrc
542 hdfs
543 hdfs namenode -format
544 # formatting of namenode need to be done only once.
545 # otherwise it would wipe away the data
546 # do it again before starting hadoop only
547 start-dfs.sh
548 ssh-keyget -t rsa -P ''
549 ssh-keygen -t rsa -P ''
550 cat ~/.ssh/hadoop_rsa.pub >> ~/.ssh/authorized_keys
551 start-dfs.sh
552 jps
553 which ssh
554 which sshd
555 sudo apt-get install sshd
556 sudo apt-get install openssh-server
557 # install the openssh-server for sshd i.e. daemon of ssh
558 which ssh
559 which sshd
560 start-dfs.sh
561 jps
562 script help-script.script
563 cat help-script.script
564 man script
565 cat ~/.bash_history
566 cat ~/.bash_history | less
567 cat ~/.bash_history | grep "hadoop"
568 cat ~/.bash_history | grep "only"
569 cat ~/.bash_history | grep "once"
570 cat ~/.bash_history | grep "#"
571 history -a
572 history
573 history | less
574 man history
575 history --help
576 history
577 history > help-script.script
578 script -a help-script.script
579 history > help-script.script
Script started on Tuesday 29 July 2014 05:17:58 PM IST
XYZ@XYZ-VirtualBox: ~/hadoop XYZ@XYZ-VirtualBox:~/hadoop$ start-yar n.sh
starting yarn daemons
starting resourcemanager, logging to /home/XYZ/hadoop/hadoop-2.4.1/logs/yarn-XYZ-resourcemanager-XYZ-VirtualBox.out
localhost: starting nodemanager, logging to /home/XYZ/hadoop/hadoop-2.4.1/logs/yarn-XYZ-nodemanager-XYZ-VirtualBox.out
]0;XYZ@XYZ-VirtualBox: ~/hadoop XYZ@XYZ-VirtualBox:~/hadoop$ jps
6082 NodeManager
5178 DataNode
4981 NameNode
6117 Jps
5439 SecondaryNameNode
5883 ResourceManager
]0;XYZ@XYZ-VirtualBox: ~/hadoop XYZ@XYZ-VirtualBox:~/hadoop$ exit
exit
Script done on Tuesday 29 July 2014 05:19:56 PM IST
Script started on Tuesday 29 July 2014 05:20:27 PM IST
XYZ@XYZ-VirtualBox: ~/hadoop XYZ@XYZ-VirtualBox:~/hadoop$ watch -n 5 free -m
XYZ@XYZ-VirtualBox/hadoop$ stop-ya rn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
]0;XYZ@XYZ-VirtualBox: ~/hadoop XYZ@XYZ-VirtualBox:~/hadoop$ stop [K-s [Kdfs .sh
14/07/29 17:27:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
14/07/29 17:28:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
]0;XYZ@XYZ-VirtualBox: ~/hadoop XYZ@XYZ-VirtualBox:~/hadoop$ exit
exit
Script done on Tuesday 29 July 2014 05:30:36 PM IST
// Cacalog bash history : Counting the word example :
Script started on Wednesday 30 July 2014 04:14:01 PM IST
]0;XYZ@XYZ-VirtualBox: ~/clojure/cascalog XYZ@XYZ-VirtualBox:~/clojure/cascalog$ lein repl
nREPL server started on port 50449 on host 127.0.0.1 - nrepl://127.0.0.1:50449
REPL-y 0.3.1
Clojure 1.5.1
Docs: (doc function-name-here)
(find-doc "part-of-name-here")
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e
user=> (use 'cascalog.api)
nil
user=> (use 'cascalog.playground)
nil
user=> (require '[cascalog.logic.def :as def])
nil
user=>
user=> (def/defmapcatfn tokenise [line]
  #_=> (clojure.string/split line #"[\[\]\\\(\),.)\s]+"))
#'user/tokenise
user=>
user=>
user=> (require '[cascalog.logic.ops :as c])
nil
user=>
user=>
user=> (?- (stdout)
  #_=> (<- [?line]
  #_=> (sentence :> ?line)))
14/07/30 16:15:09 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
14/07/30 16:15:09 INFO planner.HadoopPlanner: using application jar: /home/XYZ/.m2/repository/cascading/cascading-hadoop/2.5.3/cascading-hadoop-2.5.3.jar
14/07/30 16:15:09 INFO property.AppProps: using app.id: C9B7157B563647F4A5B9DD8C2B4CC9DB
14/07/30 16:15:10 INFO util.Version: Concurrent, Inc - Cascading 2.5.3
14/07/30 16:15:10 INFO flow.Flow: [] starting
14/07/30 16:15:10 INFO flow.Flow: [] source: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/b7d6055f-3789-487d-946f-4b5ac88a4f51"]
14/07/30 16:15:10 INFO flow.Flow: [] sink: StdoutTap["SequenceFile[[UNKNOWN]->['?line']]"]["/tmp/temp81179071450888796715696517475394"]
14/07/30 16:15:10 INFO flow.Flow: [] parallel execution is enabled: false
14/07/30 16:15:10 INFO flow.Flow: [] starting jobs: 1
14/07/30 16:15:10 INFO flow.Flow: [] allocating threads: 1
14/07/30 16:15:10 INFO flow.FlowStep: [] starting step: (1/1) ...1450888796715696517475394
14/07/30 16:15:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/30 16:15:11 INFO flow.FlowStep: [] submitted hadoop job: job_local_0001
14/07/30 16:15:11 INFO flow.FlowStep: [] tracking url: http://localhost:8080/
14/07/30 16:15:11 INFO util.ProcessTree: setsid exited with exit code 0
14/07/30 16:15:12 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@fc824e
14/07/30 16:15:12 INFO mapred.MapTask: numReduceTasks: 0
14/07/30 16:15:12 INFO hadoop.FlowMapper: cascading version: 2.5.3
14/07/30 16:15:12 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
14/07/30 16:15:12 INFO hadoop.FlowMapper: sourcing from: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/b7d6055f-3789-487d-946f-4b5ac88a4f51"]
14/07/30 16:15:12 INFO hadoop.FlowMapper: sinking to: StdoutTap["SequenceFile[[UNKNOWN]->['?line']]"]["/tmp/temp81179071450888796715696517475394"]
14/07/30 16:15:12 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/07/30 16:15:12 INFO mapred.LocalJobRunner:
14/07/30 16:15:12 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
14/07/30 16:15:12 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/tmp/temp81179071450888796715696517475394
14/07/30 16:15:12 INFO mapred.LocalJobRunner:
14/07/30 16:15:12 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
14/07/30 16:15:12 INFO mapred.FileInputFormat: Total input paths to process : 1
RESULTS
-----------------------
Four score and seven years ago our fathers brought forth on this continent a new nation
conceived in Liberty and dedicated to the proposition that all men are created equal
Now we are engaged in a great civil war testing whether that nation or any nation so
conceived and so dedicated can long endure We are met on a great battlefield of that war
We have come to dedicate a portion of that field as a final resting place for those who
here gave their lives that that nation might live It is altogether fitting and proper
that we should do this
But in a larger sense we can not dedicate we can not consecrate we can not hallow
this ground The brave men living and dead who struggled here have consecrated it
far above our poor power to add or detract The world will little note nor long remember
what we say here but it can never forget what they did here It is for us the living rather
to be dedicated here to the unfinished work which they who fought here have thus far so nobly
advanced It is rather for us to be here dedicated to the great task remaining before us
that from these honored dead we take increased devotion to that cause for which they gave
the last full measure of devotion that we here highly resolve that these dead shall
not have died in vain that this nation under God shall have a new birth of freedom
and that government of the people by the people for the people shall not perish
from the earth
-----------------------
14/07/30 16:15:12 INFO util.Hadoop18TapUtil: deleting temp path /tmp/temp81179071450888796715696517475394/_temporary
nil
user=>
user=> (?- (stdout)
  #_=> (<- [?word]
  #_=> (sentence :> ?line)
  #_=> (tokenise :< ?line :> ?word)))
14/07/30 16:15:14 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
14/07/30 16:15:14 INFO planner.HadoopPlanner: using application jar: /home/XYZ/.m2/repository/cascading/cascading-hadoop/2.5.3/cascading-hadoop-2.5.3.jar
14/07/30 16:15:14 INFO flow.Flow: [] starting
14/07/30 16:15:14 INFO flow.Flow: [] source: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/840fe467-c73b-4b4c-80bc-727a8ea52ce7"]
14/07/30 16:15:14 INFO flow.Flow: [] sink: StdoutTap["SequenceFile[[UNKNOWN]->['?word']]"]["/tmp/temp273045733006808956615702592648076"]
14/07/30 16:15:14 INFO flow.Flow: [] parallel execution is enabled: false
14/07/30 16:15:14 INFO flow.Flow: [] starting jobs: 1
14/07/30 16:15:14 INFO flow.Flow: [] allocating threads: 1
14/07/30 16:15:14 INFO flow.FlowStep: [] starting step: (1/1) ...3006808956615702592648076
14/07/30 16:15:14 INFO flow.FlowStep: [] submitted hadoop job: job_local_0002
14/07/30 16:15:14 INFO flow.FlowStep: [] tracking url: http://localhost:8080/
14/07/30 16:15:14 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1737be7
14/07/30 16:15:14 INFO mapred.MapTask: numReduceTasks: 0
14/07/30 16:15:15 INFO hadoop.FlowMapper: cascading version: 2.5.3
14/07/30 16:15:15 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
14/07/30 16:15:15 INFO hadoop.FlowMapper: sourcing from: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/840fe467-c73b-4b4c-80bc-727a8ea52ce7"]
14/07/30 16:15:15 INFO hadoop.FlowMapper: sinking to: StdoutTap["SequenceFile[[UNKNOWN]->['?word']]"]["/tmp/temp273045733006808956615702592648076"]
14/07/30 16:15:15 INFO mapred.Task: Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
14/07/30 16:15:15 INFO mapred.LocalJobRunner:
14/07/30 16:15:15 INFO mapred.Task: Task attempt_local_0002_m_000000_0 is allowed to commit now
14/07/30 16:15:15 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0002_m_000000_0' to file:/tmp/temp273045733006808956615702592648076
14/07/30 16:15:15 INFO mapred.LocalJobRunner:
14/07/30 16:15:15 INFO mapred.Task: Task 'attempt_local_0002_m_000000_0' done.
14/07/30 16:15:15 INFO mapred.FileInputFormat: Total input paths to process : 1
RESULTS
-----------------------
Four
score
and
seven
years
ago
our
fathers
brought
forth
on
this
continent
a
new
nation
conceived
in
Liberty
and
dedicated
to
the
proposition
that
all
men
are
created
equal
Now
we
are
engaged
in
a
great
civil
war
testing
whether
that
nation
or
any
nation
so
conceived
and
so
dedicated
can
long
endure
We
are
met
on
a
great
battlefield
of
that
war
We
have
come
to
dedicate
a
portion
of
that
field
as
a
final
resting
place
for
those
who
here
gave
their
lives
that
that
nation
might
live
It
is
altogether
fitting
and
proper
that
we
should
do
this
But
in
a
larger
sense
we
can
not
dedicate
we
can
not
consecrate
we
can
not
hallow
this
ground
The
brave
men
living
and
dead
who
struggled
here
have
consecrated
it
far
above
our
poor
power
to
add
or
detract
The
world
will
little
note
nor
long
remember
what
we
say
here
but
it
can
never
forget
what
they
did
here
It
is
for
us
the
living
rather
to
be
dedicated
here
to
the
unfinished
work
which
they
who
fought
here
have
thus
far
so
nobly
advanced
It
is
rather
for
us
to
be
here
dedicated
to
the
great
task
remaining
before
us
that
from
these
honored
dead
we
take
increased
devotion
to
that
cause
for
which
they
gave
the
last
full
measure
of
devotion
that
we
here
highly
resolve
that
these
dead
shall
not
have
died
in
vain
that
this
nation
under
God
shall
have
a
new
birth
of
freedom
and
that
government
of
the
people
by
the
people
for
the
people
shall
not
perish
from
the
earth
-----------------------
14/07/30 16:15:15 INFO util.Hadoop18TapUtil: deleting temp path /tmp/temp273045733006808956615702592648076/_temporary
nil
user=>
user=> (?- (stdout)
  #_=> (<- [?word ?count]
  #_=> (sentence :> ?line)
  #_=> (tokenise :< ?line :> ?word)
  #_=> (c/count :> ?count)))
14/07/30 16:15:17 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
14/07/30 16:15:17 INFO planner.HadoopPlanner: using application jar: /home/XYZ/.m2/repository/cascading/cascading-hadoop/2.5.3/cascading-hadoop-2.5.3.jar
14/07/30 16:15:17 INFO flow.Flow: [] starting
14/07/30 16:15:17 INFO flow.Flow: [] source: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/80b7d1fb-93bf-4947-9026-9ccb40f44326"]
14/07/30 16:15:17 INFO flow.Flow: [] sink: StdoutTap["SequenceFile[[UNKNOWN]->['?word', '?count']]"]["/tmp/temp128925885307038237815705521554123"]
14/07/30 16:15:17 INFO flow.Flow: [] parallel execution is enabled: false
14/07/30 16:15:17 INFO flow.Flow: [] starting jobs: 1
14/07/30 16:15:17 INFO flow.Flow: [] allocating threads: 1
14/07/30 16:15:17 INFO flow.FlowStep: [] starting step: (1/1) ...5307038237815705521554123
14/07/30 16:15:17 INFO flow.FlowStep: [] submitted hadoop job: job_local_0003
14/07/30 16:15:17 INFO flow.FlowStep: [] tracking url: http://localhost:8080/
14/07/30 16:15:17 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1726ac1
14/07/30 16:15:17 INFO mapred.MapTask: numReduceTasks: 1
14/07/30 16:15:17 INFO mapred.MapTask: io.sort.mb = 100
14/07/30 16:15:17 INFO mapred.MapTask: data buffer = 79691776/99614720
14/07/30 16:15:17 INFO mapred.MapTask: record buffer = 262144/327680
14/07/30 16:15:17 INFO hadoop.FlowMapper: cascading version: 2.5.3
14/07/30 16:15:17 INFO hadoop.FlowMapper: child jvm opts: -Xmx200m
14/07/30 16:15:18 INFO hadoop.FlowMapper: sourcing from: MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/80b7d1fb-93bf-4947-9026-9ccb40f44326"]
14/07/30 16:15:18 INFO hadoop.FlowMapper: sinking to: GroupBy(d56a05ef-4117-4b3c-9a92-70d8e8a341e0)[by:[{1}:'?word']]
14/07/30 16:15:18 INFO assembly.AggregateBy: using threshold value: 10000
14/07/30 16:15:18 INFO mapred.MapTask: Starting flush of map output
14/07/30 16:15:18 INFO mapred.MapTask: Finished spill 0
14/07/30 16:15:18 INFO mapred.Task: Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO mapred.Task: Task 'attempt_local_0003_m_000000_0' done.
14/07/30 16:15:18 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1e6356d
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO mapred.Merger: Merging 1 sorted segments
14/07/30 16:15:18 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2422 bytes
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO hadoop.FlowReducer: cascading version: 2.5.3
14/07/30 16:15:18 INFO hadoop.FlowReducer: child jvm opts: -Xmx200m
14/07/30 16:15:18 INFO hadoop.FlowReducer: sourcing from: GroupBy(d56a05ef-4117-4b3c-9a92-70d8e8a341e0)[by:[{1}:'?word']]
14/07/30 16:15:18 INFO hadoop.FlowReducer: sinking to: StdoutTap["SequenceFile[[UNKNOWN]->['?word', '?count']]"]["/tmp/temp128925885307038237815705521554123"]
14/07/30 16:15:18 INFO mapred.Task: Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting
14/07/30 16:15:18 INFO mapred.LocalJobRunner:
14/07/30 16:15:18 INFO mapred.Task: Task attempt_local_0003_r_000000_0 is allowed to commit now
14/07/30 16:15:18 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0003_r_000000_0' to file:/tmp/temp128925885307038237815705521554123
14/07/30 16:15:18 INFO mapred.LocalJobRunner: reduce > reduce
14/07/30 16:15:18 INFO mapred.Task: Task 'attempt_local_0003_r_000000_0' done.
14/07/30 16:15:18 INFO mapred.FileInputFormat: Total input paths to process : 1
RESULTS
-----------------------
4
But 1
Four 1
God 1
It 3
Liberty 1
Now 1
The 2
We 2
a 7
above 1
add 1
advanced 1
ago 1
all 1
altogether 1
and 6
any 1
are 3
as 1
battlefield 1
be 2
before 1
birth 1
brave 1
brought 1
but 1
by 1
can 5
cause 1
civil 1
come 1
conceived 2
consecrate 1
consecrated 1
continent 1
created 1
dead 3
dedicate 2
dedicated 4
detract 1
devotion 2
did 1
died 1
do 1
earth 1
endure 1
engaged 1
equal 1
far 2
fathers 1
field 1
final 1
fitting 1
for 5
forget 1
forth 1
fought 1
freedom 1
from 2
full 1
gave 2
government 1
great 3
ground 1
hallow 1
have 5
here 8
highly 1
honored 1
in 4
increased 1
is 3
it 2
larger 1
last 1
little 1
live 1
lives 1
living 2
long 2
measure 1
men 2
met 1
might 1
nation 5
never 1
new 2
nobly 1
nor 1
not 5
note 1
of 5
on 2
or 2
our 2
people 3
perish 1
place 1
poor 1
portion 1
power 1
proper 1
proposition 1
rather 2
remaining 1
remember 1
resolve 1
resting 1
say 1
score 1
sense 1
seven 1
shall 3
should 1
so 3
struggled 1
take 1
task 1
testing 1
that 13
the 9
their 1
these 2
they 3
this 4
those 1
thus 1
to 8
under 1
unfinished 1
us 3
vain 1
war 2
we 8
what 2
whether 1
which 2
who 3
will 1
work 1
world 1
years 1
-----------------------
14/07/30 16:15:18 INFO util.Hadoop18TapUtil: deleting temp path /tmp/temp128925885307038237815705521554123/_temporary
nil
user=> 14/07/30 16:15:41 INFO util.Update: newer Cascading release available: 2.5.5
user=> (quit)
Bye for now!
XYZ@XYZ-VirtualBox:~/clojure/cascalog$ exit
exit
Script done on Wednesday 30 July 2014 04:29:33 PM IST
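A quick sanity check you can run without Hadoop or Cascalog: the heart of the `defmapcatfn` in the session is just a regex split, so the one-row-to-many-rows behaviour can be tried in any plain Clojure REPL. This is a sketch: `tokenise-line` is a made-up name for illustration, and the pattern is the one reconstructed from the session transcript.

```clojure
(require '[clojure.string :as s])

;; Split a line on brackets, backslashes, parens, commas,
;; periods and whitespace -- one row in, many words out,
;; which is exactly what defmapcatop/defmapcatfn models.
(defn tokenise-line [line]
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(tokenise-line "Four score and seven years ago")
;; => ["Four" "score" "and" "seven" "years" "ago"]
```

Inside a query, Cascalog calls this once per `?line` tuple and emits one `?word` tuple per element of the returned sequence, which is why the second query above prints one word per row.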