Quicken to GnuCash Migration 2016 - Fail

In March of 2016, I tried again to migrate from Quicken (Windows) to GnuCash (Mac), this time with GnuCash 2.6.11. The motivator is that I use Quicken 2014, the last version that runs on XP; when Intuit expires my transaction downloads in 2017, I’ll have to purchase Windows 10 and keep running VM software.

The good:

  • They’ve added support for SQL back-ends, including MySQL and sqlite3.

The bad:

  • They treat the SQL back-end as if it were a big file. They read in the entire database at startup, just as if it were the XML data store. It loads slowly if you have 25 years of data.
  • If you download the binary version, you get SQL support, but you don’t get the Python API. If you install from MacPorts, you get the Python API, but not sqlite3. (No, you can’t successfully add sqlite3 to the MacPorts edition. I tried.) You can try a full build-from-source, but that is epically difficult and it will fail if you have MacPorts for something else. I use Homebrew, and Homebrew doesn’t play well with MacPorts. Note that piecash is an alternative python interface to some GnuCash data. (piecash is NOT the same as the GnuCash python interface!)
    • If you run “brew cask install gnucash”, you just get the same binary installed as if you downloaded it from gnucash.org. “brew cask” fetches and installs binaries, not source.
  • The database schema isn’t fully documented. The Python API isn’t fully documented.
  • Custom reporting requires learning Scheme and Guile or writing custom code against the database.
  • You have to exit GnuCash to update prices or run external reports. Only one process at a time can touch the database. Security price updates are a separate, stand-alone app.
  • It doesn’t use the price at the time of a security purchase/sale to update the price history for that security.
  • There is no way (short of the API or SQL hacking) to import security price history from Quicken to GnuCash.
  • Reports for portfolio tracking are… quite limited.
  • There is nothing comparable to the Quicken retirement planner.
  • After you import your Quicken QIF export, all of your securities have a price of $0 in the account overview. This means you can’t reconcile your account balances with your old Quicken balances until you figure out how to import your Quicken price history (which could include manual entry of the latest price for each security).
  • You practically can’t build from source on a real OS X installation. Building relies on gtk-osx. Building uses something called jhbuild. The doc for gtk-osx and jhbuild says: If you have HomeBrew, MacPorts, or Fink installed, you must remove all traces of them from your environment before you try to run jhbuild.
    • If you want a SQL other than sqlite3, you’re going to have to make jhbuild run (in order to build the DB driver), and jhbuild doesn’t like Homebrew, MacPorts, or Fink. Only the sqlite3 DB driver is included in the pre-built binary.
  • Whether you use GnuCash python bindings or piecash, GnuCash doc says, “Python-scripts should not be executed while GnuCash runs. GnuCash is designed as a single user application with only one program accessing the data at one time.”

Python info:

  • GnuCash: https://code.gnucash.org/docs/MASTER/python_bindings_page.html
  • GnuCash: https://wiki.gnucash.org/wiki/Python_Bindings
  • PieCash: https://media.readthedocs.org/pdf/piecash/latest/piecash.pdf
  • GnuCash says “Until GnuCash supports simultaneous multiuser use almost all users are better off with the XML backend” but piecash requires a SQL back-end.
  • GnuCash says “… you must not write to the GnuCash database except through the GnuCash libraries.”
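
For what it’s worth, here is a minimal piecash sketch for the setup described above. It assumes piecash is installed, that the book was saved with the sqlite3 back-end, and that GnuCash itself is not running; the file name is hypothetical.

# Minimal piecash sketch (hypothetical file name); run only while GnuCash is closed.
import piecash

# readonly=True so nothing is written to the book outside of GnuCash.
with piecash.open_book("MyBook.gnucash", readonly=True) as book:
    for account in book.accounts:
        print(account.fullname, account.get_balance())
    print(len(book.prices), "price entries")   # the price history discussed above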

Building on Mac:

  • https://wiki.gnucash.org/wiki/MacOSX/Quartz - it says “For virtually all users it is more appropriate to download the binary rather than to use the procedure described here.”
    • However: “Creating a separate user for building and packaging GnuCash is the easiest work-around if you’ve used any of those systems in your regular userid.”

Options for next time:

  • My advice if you’re on Mac is to start out with piecash and the pre-built GnuCash for OS X. If you find that you really need the python bindings, set up a linux VM, build from source there, and use that VM solely for your accounting.
  • This guy ( http://www.matt-thornton.net/tech/databases/gnucash-mysql-os-x-10-10-getting-it-running ) says “My advice is if you really want gnucash + mysql running on your Mac then build an Ubuntu virtual machine and run it in VMware fusion (or other). That way you can use the package managers to just build and install everything for you.”
    • After a brief review, I conclude he’s right. If you want to use GnuCash on Mac (or Windows) without the python bindings and with XML or sqlite3 data store, use the pre-built binaries. Otherwise, build and run it in a Linux VM.

Outline of things to address the next time:

  • Install GnuCash
  • Make transaction downloads work
  • Figure out how to update stock prices
  • Make my portfolio re-balancer work
  • Create a Quicken Retirement Planner equivalent

Uninstall a Perl Module

To uninstall a Perl module (delete, remove, etc.):

cpanp uninstall module_name

or maybe

sudo cpanp uninstall module_name

Python: Multi-level if statement on one line

I saw an answer for this on StackExchange, but nobody seemed to understand why it was desirable… (I don’t have enough StackExchange reputation points to post a comment there.)

How can I combine multiple if statements onto a single line? I want

if x > 89:
    y = 'A'
elif x > 79:
    y = 'B'
...
else:
    y = 'F'

on a single line.

Yes, it can be done. Yes, doing so does serve a purpose. Consider the lambda.

lambda x: 'A' if x>89 else 'B' if x>79 else 'C' if x > 69 else 'D' if x > 59 else 'F'
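
For example, binding the lambda to a (hypothetical) name and calling it:

grade = lambda x: 'A' if x > 89 else 'B' if x > 79 else 'C' if x > 69 else 'D' if x > 59 else 'F'
print(grade(85))   # 'B', because 85 > 79 but not > 89
print(grade(42))   # 'F'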

Weather and Running and Clothing

My experience:

Temp (F) | Wind (mph) | Wind Chill (F) | Clothing                                          | Assessment
66       | 5-10       | 65             | Sweats, T-shirt                                   | Shorts would have been OK, except during gusts
60       | 7          | 59             | Shorts, T-shirt                                   | Adequate
60       | 7          | 59             | Sweats, T-shirt                                   | Comfortable
58       | 5          | 58             | Sweats, T-shirt                                   | OK
58       | 10         | 56             | Sweats, T-shirt                                   | Barely OK
56       | 5          | 55             | Sweats, T-shirtx2                                 | A little too warm
50       | 5          | 48             | Sweats, T-shirtx2                                 | Adequate
45       | 9          | 40             | Sweats, T-shirtx2, hoodie, ski cap, cotton gloves | OK
40       | 7          | 35             | Sweats, T-shirtx2, hoodie, ski cap, cotton gloves | Adequate
  • Note that T-shirtx2 = a long-sleeve T over a short-sleeve T
  • Note that “Sweats” = Sweatpants

Wind chill temperature = 35.74 + 0.6215*T - 35.75*V^0.16 + 0.4275*T*V^0.16

In the formula, V is the wind speed in statute miles per hour, and T is the temperature in degrees Fahrenheit.

Or calculate it at http://www.csgnetwork.com/windchillcalc.html
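
Or compute it yourself; a quick Python version of the formula above:

def wind_chill(T, V):
    # T in degrees Fahrenheit, V in statute miles per hour
    return 35.74 + 0.6215 * T - 35.75 * V**0.16 + 0.4275 * T * V**0.16

print(round(wind_chill(45, 9)))   # about 40, matching the 45 F / 9 mph row above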

ggplot examples

ggplot a data frame with two series mixed together:

ggplot(df, aes('x-column-name','y-column-name',color='column-with-value-that-distinguishes-the-two-series')) + geom_point() + geom_line() + ggtitle('title') + xlab('xlab') + ylab('ylab')

e.g. For a data frame with a farmer’s counts of apples and pears grown in each year, with columns year, count, fruit:

ggplot(df, aes('year','count',color='fruit')) + geom_point() + geom_line() + ggtitle('McDonald Farm Fruits') + xlab('Year') + ylab('Bushels')

The transfer transaction is to an account that no longer exists in Quicken

I have a retirement account and a checking account (and more) in Quicken. I have many transfers from the checking account to the retirement account. When I right-click a transaction in the retirement account and select “Go To Transfer”, Quicken 2014 tells me “The transfer transaction is to an account that no longer exists in Quicken.” It is wrong. It is a bug in Quicken.

  • I can “Go To” from the transactions in the checking to the retirement account.
  • I cannot “Go To” from the transactions in the retirement account to the checking account.
  • If I rename the checking account, the renamed account shows up in the transactions in the retirement account, so the transactions are still linked.
  • I tried re-entering the transactions, and the situation still persists – “Go To” works in one direction but not the other.

The issue was reportedly acknowledged as a bug back in 2014 and as of 2016 it still isn’t fixed. I’m betting it won’t be fixed. When Intuit attempts to compel me to upgrade at the end of 2016 (they start blocking transaction downloads on a 3-year cycle), I’m going to have to consider moving to a real accounting package. The big things to look for when I do that will be:

  • Download of transactions from financial institutions
  • High-fidelity import of my existing Quicken data going back to 1990.
  • Ability to report on long-term spending/saving/investing (20-year historical reporting)

Pandas (Python)

I’m learning the Pandas library for Python. Here are some notes.

Series

  • General form:
    • pandas.Series(data=None, index=None)
  • Example:
    • pandas.Series(['big', 'small', 'red', 42])
  • Example:
    • pandas.Series(data=['big', 'small', 'red', 42], index=['desired_wealth', 'actual_wealth', 'status', 'answer'])
  • Getting data out of a Series:
    • x = series['status']
  • Getting multiple data out:
    • y = series[['status','answer']]
  • Get a sub-series, where the value is transformed (in this case: multiplied by 3). Returns a series:
    • z = seriesName * 3
  • Get sub-series from a series where value meets a condition (in this case: != 3). Returns a series:
    • zz = seriesName[seriesName != 3]
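
The Series notes above, combined into one runnable sketch (the values here are arbitrary):

import pandas

s = pandas.Series(data=[3, 1, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s['c'])          # one value: 4
print(s[['a', 'e']])   # a sub-series with just 'a' and 'e'
print(s * 3)           # every value multiplied by 3
print(s[s != 1])       # only the values that are not 1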

DataFrame

  • General form:
    • pandas.DataFrame(data=None)
  • Example:
    • pandas.DataFrame({ 'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012], 'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'], 'wins': [11, 8, 10, 15, 11, 6, 10, 4], 'losses': [5, 8, 6, 1, 5, 10, 6, 12]})
  • Functions/selectors you can call on a DataFrame named d:
    • d.dtypes - what is the data type of each column
    • d.describe() - show some aggregated metrics about the values in the data frame
    • d.head() - show the first few rows
    • d.tail() - show the last few rows
    • d.head(2) - show the first 2 rows
  • Load a CSV file:
    • d = pandas.read_csv('./file1.csv')
  • Create a data frame from a literal:
      my_cars = ['Prius', 'Corolla', 'Matchbox']
      horsepower = [120, 130, 0]
      colors = ['gold', 'silver', 'rainbow']
      t = DataFrame({
          'my_cars':Series(my_cars),
          'horsepower':Series(horsepower),
          'color':Series(colors)
          })
    
  • Create a data frame from a literal, with an index:
      d = DataFrame({
          'one':Series([1, 2, 3], index=['a','b','c']),
          'two':Series([1, 2, 3], index=['a','b','c'])
          })
    
  • Get a column from a data frame:
    • Treat it like a dictionary:
      • x = d['columnName']
    • Or treat it like a field:
      • x = d.columnName
  • Get multiple columns from a data frame:
    • x = d[['colName1', 'colName2']]
  • Get a row from a data frame:
    • d.loc[index]
    • d.loc[3]
    • d.loc['namedIndex']
  • Select rows from a data frame that meet a criterion:
    • d[d['horsepower'] >= 120]
  • Get rows 3-5 (note: the end of the slice is exclusive)
    • d[3:6]
  • Select and project (one column)
    • d['color'][d['horsepower'] >= 120]
  • Select and project (multiple columns)
    • d[['color','horsepower']][d['horsepower'] >= 120]
      • The boolean test returns all the indices meeting the criterion. Those indices are then used to select rows from the color+horsepower intermediate data frame.
  • Apply a function to every column:
    • d.apply(numpy.median)
  • Apply a function to each value in one column:
    • d['colName'].map(lambda x: x+1)
  • Apply a function to every value in every row and column:
    • d.applymap(lambda x: x+1)
  • Apply an aggregate function to a column:
    • numpy.mean(d['horsepower'])
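
The DataFrame notes above, combined into one runnable sketch that reuses the toy car data:

import numpy
from pandas import DataFrame, Series

t = DataFrame({
    'my_cars': Series(['Prius', 'Corolla', 'Matchbox']),
    'horsepower': Series([120, 130, 0]),
    'color': Series(['gold', 'silver', 'rainbow'])
    })

print(t.dtypes)                                            # type of each column
print(t[t['horsepower'] >= 120])                           # select rows
print(t[['color', 'horsepower']][t['horsepower'] >= 120])  # select and project
print(numpy.mean(t['horsepower']))                         # aggregate one column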

Matrix Multiply Example From Udacity Course

import numpy
from pandas import DataFrame, Series

def numpy_dot():
    '''
    4 points for gold, 2 points per silver, one point per bronze.
    Using the numpy.dot function, return a new dataframe with
        a) a column called 'country_name' with the country name
        b) a column called 'points' with the total number of points the country earned.
    '''
    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']
    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

    d = DataFrame({
        'country_name':Series(countries), 
        'gold':Series(gold),
        'silver':Series(silver),
        'bronze':Series(bronze)
        })[['gold','silver','bronze']]
    v = numpy.dot(d,[4,2,1])
    return DataFrame({
        'country_name':Series(countries),
        'points':Series(v)  })

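As a sanity check on the example (hypothetical usage): the first row should be Russian Fed. with 4*13 + 2*11 + 1*9 = 83 points.

print(numpy_dot().head())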

Numpy (Python)

I’m learning Numpy. Here are some notes:

Dot Product

import numpy

a = [1, 2, 3]
b = [10, 20, 30]
numpy.dot(a,b)
  • Reminder: The dot product of a,b is 1*10 + 2*20 + 3*30 = 140

Getting Started With AWS EMR 4.x

There have been some significant changes in Amazon Web Services (AWS) Elastic MapReduce (EMR) with the 4.x release. The Amazon doc hasn’t been rewritten for 4.x; only a summary of changes was published, which makes it tough to follow older tutorials. Here are some tips for getting started with AWS EMR. I assume you are using Linux or a Mac. Windows users will have to make some minor adjustments, but the basic ideas apply.

## Sign-up for AWS

  • Visit http://aws.amazon.com/
  • If you are a customer of Amazon’s web store, I recommend setting up a separate account for AWS. (See next bullet.)
  • You will need to provide a credit card, even if you are using the ‘free’ edition. AWS might charge you more than you expect, and AWS does not provide a way to put a limit on charges.
    • Some credit cards allow you to create a single-use card number with a billing limit and a specified expiration date. For example, Bank of America calls this “ShopSafe,” and Citibank calls it “Virtual Account Numbers.” Use one of these to cap how much AWS can charge you.
  • You should still care about rates even when using ‘AWS Free’: the more hardware you use, the quicker you come to the end of your free period.

## Check Your Bill

## Create an SSH Key Pair

  • You will need an SSH key pair in order to open a command-line prompt on your cluster.
  • Choose a data center.
    • In the upper-right corner, you’ll see your name. To the right of it you’ll see a location name. This identifies the data center you will be using.
    • Click on the data center name and pick the one you want to use.
      • You might have to change data centers, even if you prefer not to. I found that some things worked in Oregon that didn’t work in Virginia. You may find the reverse. It depends on demand, upgrades in progress, etc.
      • Stick with one data center to the extent possible. Moving around complicates things.
  • Go to the EC2 Console and click the Create Key Pair link in the left-side panel of links.
    • If you already have a key pair, the link will be Key Pairs
  • Follow the wizard. Save the .pem file in your home directory on your laptop/desktop (or store it elsewhere and create a link to it in your home directory).

## Install the AWS Command-line Interface

  • See http://docs.aws.amazon.com/cli/latest/userguide/installing.html#install-bundle-other-os
  • Run “aws help” after installing, to confirm it is installed.
  • Visit https://us-west-2.console.aws.amazon.com/console/home.
    • Click Identity and Access Management.
    • Click Users
    • Create a user (and generate an access key and download it).
  • Run “aws configure” and answer the questions.
    • You get some of the answers (the access key ID and secret access key) from Identity and Access Management.
    • If you need the region name, you can visit the EC2 Console and copy it from the end of the URL.
    • Default output format should be json.
  • You won’t be able to do much unless you adjust security policies. The user you entered in response to a prompt from “aws configure” doesn’t have permission to do anything. You’ll have to grant permission or you’ll get AccessDenied errors. See Grant Permission to a User elsewhere on this page.
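
If you also plan to script AWS from Python, here is a minimal check that the credentials and region from “aws configure” are being picked up. This assumes the boto3 package is installed; boto3 reads the same configuration the CLI stores under ~/.aws.

import boto3

print(boto3.client('sts').get_caller_identity())   # which account/user you are acting as
print(boto3.Session().region_name)                 # the default region in effect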

## Grant Permission to a User

  • There are many policies/permissions. I’ll document one (or a few) here.
  • Permit access to S3
    • Visit Identity and Access Management
    • Click the Policies link
    • Select Amazon S3 Full Access
    • Push the Attach button
    • Attach the user you created/entered for aws
  • Grant Administrator access
    • I dunno which policies are actually necessary for which aws command. The easy thing to do for a learning project is to grant yourself access to everything.
    • Visit Identity and Access Management
    • Click the Policies link
    • Select AdministratorAccess
    • Push the Attach button
    • Attach the user you created/entered for aws
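
The same grants can be scripted. Here is a hedged boto3 sketch (the user name is a placeholder, the ARNs are the AWS-managed AmazonS3FullAccess and AdministratorAccess policies, and it must run with credentials that are already allowed to modify IAM):

import boto3

iam = boto3.client('iam')
user = 'your-aws-user'   # placeholder: the user you created above

# Attach the AWS-managed policies mentioned above to that user.
iam.attach_user_policy(UserName=user,
                       PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess')
iam.attach_user_policy(UserName=user,
                       PolicyArn='arn:aws:iam::aws:policy/AdministratorAccess')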

## Create and Start an AWS Cluster for EMR

  • Be sure to Create an SSH Key Pair first. (See elsewhere on this page.)
  • Log in to the EMR Cluster Console
  • Press the “Create Cluster” button to begin the cluster creation wizard.
    • Note: If you have no clusters yet, EMR will present a special, “You don’t have any clusters. Create one now,” screen with slightly different content. That’s OK. Click its cluster-creation button.
    • Stay out of the Advanced Options
  • Populate the following fields. (Any field not mentioned should be left at its default.)
    • Name = any_name_you_like_1
    • Logging = enabled.
    • S3 Cluster = (don’t change it)
    • Launch Mode = Cluster
    • Vendor = Amazon
    • Release = emr-4.2.0
    • Applications = All Applications
      • You could get by with just Core Hadoop, but some of the added apps for ‘All’ are helpful to monitor your cluster.
    • Instance Type = m1.medium
      • This defines the virtual hardware for each node of your cluster. (CPU speed and RAM.)
      • Pick the smallest “General Purpose (Previous Generation)” you can find, in order to minimize cost while learning. The bigger your cluster, the higher the hourly rate.
      • The pricing page lists smaller sizes, but none were available via the wizard.
      • You may find that you choose a cluster size and, when AWS actually creates the cluster, you get a message that clusters of that size are not available in your chosen data center. Pick a different size or a different data center and try again. I found that I could get a medium cluster at some hours but not at others, and in some data centers, but not in others.
    • Number of Instances: 1
      • Start with a cluster size of one node, to minimize expense. Grow to more nodes when you’re familiar with the platform and when you need more processing power.
    • EC2 Key Pair = (use the name of the key pair you created per instructions elsewhere on this page)
    • Permissions = Default
    • Press Create Cluster.
    • Starting a single node cluster:
      • It will say “Starting Provisioning Amazon EC2 capacity” for about 5 minutes.
      • You’ll see Provisioning for about 5 minutes, then “Running”.
      • Once “Master: Provisioning 1 m1.medium” turns from “Provisioning” to “Bootstrapping” or “Running”, you start getting billed.
    • Starting a 10 node cluster took 16 minutes on a Sunday afternoon.
    • If you lose this BROWSER TAB, you can get back to the cluster from your EMR Cluster List.
    • This will create a cluster which permits you to run small Hadoop jobs, run a Pig command line, run all sorts of Linux commands, etc.
    • If you would like to be able to re-create this cluster from your laptop/desktop command line,
      • Visit the Cluster Details for your cluster and click the AWS CLI Export button to get the command line. You can paste it into a bash script.
      • Before you run the bash script for the first time, run:
        • aws emr create-default-roles
      • When you start the cluster from the command line, it will print the cluster ID.
      • It will not show up in your EMR Cluster List web page until the Provisioning step completes (about 5 minutes). In the meantime, track it with:
        • aws emr list-clusters
      • This might be helpful (if you substitute your cluster ID for j-26AKLJTZ614D):
        • https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-26AKLJTZ614D
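
If you would rather script the cluster creation than export it from the console, here is a hedged boto3 sketch that mirrors the wizard settings above (emr-4.2.0, one m1.medium node, your key pair, and the default roles created by “aws emr create-default-roles”). The names are placeholders.

import boto3

emr = boto3.client('emr')
response = emr.run_job_flow(
    Name='any_name_you_like_1',
    ReleaseLabel='emr-4.2.0',
    Applications=[{'Name': 'Hadoop'}],        # add more if you chose "All Applications"
    Instances={
        'MasterInstanceType': 'm1.medium',
        'SlaveInstanceType': 'm1.medium',
        'InstanceCount': 1,                   # single-node cluster to minimize cost
        'Ec2KeyName': 'your-key-pair-name',   # the key pair created earlier
        'KeepJobFlowAliveWhenNoSteps': True,  # keep the cluster up until you terminate it
    },
    JobFlowRole='EMR_EC2_DefaultRole',        # created by "aws emr create-default-roles"
    ServiceRole='EMR_DefaultRole',
    VisibleToAllUsers=True,
)
print(response['JobFlowId'])                  # the cluster ID, e.g. j-XXXXXXXXXXXX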

## Terminating a Cluster

  • You get billed for every minute your cluster is running. You’ll want to shut it down when you aren’t using it.
    • Note that you can’t really shut down a cluster – you destroy it. (See Recreating a Cluster, elsewhere on this page.)
    • Metadata about the cluster will remain, so you’ll see it in your list of clusters, but the cluster itself gets deleted.
  • Open your EMR Cluster List
  • Select the desired cluster.
  • Press the Terminate button.
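
The same thing from Python, given the cluster ID (a hedged boto3 sketch; the ID is a placeholder):

import boto3

# Terminating destroys the cluster; only its metadata remains in the cluster list.
boto3.client('emr').terminate_job_flows(JobFlowIds=['j-XXXXXXXXXXXX'])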

## Recreating a Cluster

  • After you Terminate a cluster, you might want to recreate it.
  • Open your EMR Cluster List
  • Select the desired cluster.
  • Press the Clone button.
  • It will ask whether you want to include Steps. If you’re following my instructions, this choice is inconsequential.
    • Press the Clone button (again).

## Open an SSH Command-line Prompt On a Running EMR Cluster

  • You need a running cluster in order to do this!
  • When AWS creates a cluster, they choose secure defaults. You need to enable TCP/IP connections to your cluster before you can use SSH.
    • Open a NEW BROWSER tab to the EC2 Console and select Security Groups from the left-side list of links.
    • Select the ElasticMapReduce-master group
    • Select the Inbound tab and press Edit
    • Change an All TCP rule for all ports to Source = Anywhere.
      • This is insecure. For a production setup, you’ll want to add a rule to permit TCP connections from your laptop/desktop (or your company/workgroup) and not the whole world.
  • Open your EMR Cluster List
  • Click the name of the cluster.
  • There will be an SSH link on the page (to the right of the Master Public DNS Name heading). Click it.
  • This will provide a command line which you should copy and then paste into your Terminal window to run SSH and connect to your cluster.
    • I recommend adding “-D 1080” to the command line. (See the section of this page on using a SOCKS proxy.)

## Create a Bucket

  • A bucket is a directory into which you can store files using the Amazon S3 storage service.
  • One method is to use the S3 Console
    • Visit the S3 Console.
    • Click Create Bucket
    • Give it a name that nobody on the planet has used. (Bucket names are global to all S3 users.) Names can contain alphanumeric characters and periods. Maybe use com.yourdomain.bucket1.
      • You want a name which can be stuck on the front of “.s3.amazonaws.com” to create a valid DNS name.
  • Or use aws command line:
    • aws s3 mb s3://bucket-name

## Delete a Bucket and All Its Contents

  • You can do this via the S3 Console. If you have a million files, it will be very slow because you must delete each file before you can delete the bucket via the console.
  • Easy way, using aws cli (deletes the bucket and all files in it):
    • aws s3 rb s3://bucket-name --force
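
A boto3 equivalent of “rb --force” (assuming boto3 is installed; the bucket name is a placeholder): delete every object first, then the now-empty bucket.

import boto3

bucket = boto3.resource('s3').Bucket('bucket-name')
bucket.objects.all().delete()   # remove every object in the bucket
bucket.delete()                 # then remove the empty bucket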

## List All of Your Buckets

  • aws s3 ls

## Copy a File From Your Laptop/Desktop to S3 Using the Console

  • Visit the S3 Console
  • Select the desired bucket. (See Create a Bucket elsewhere on this page if you do not have a bucket.)
  • Press the Upload button and follow the wizard.

## Copy a Folder/Directory From Your Laptop/Desktop to S3 Using the Command Line

  • aws s3 cp MyLocalFolder s3://bucket-name --recursive
    • To the default region.
  • aws s3 cp MyLocalFolder s3://bucket-name --recursive --region us-west-2
    • To the specified region.

## Copy a File From Cluster/Laptop/Desktop to S3

  • aws s3 cp myFile s3://mybucket
  • aws s3 cp myFile s3://mybucket/myfolder

## Move a Bucket to a Different Region

  • aws s3 sync s3://oldbucket s3://newbucket --source-region us-west-1 --region us-west-2

## Create a Folder on DFS to Store Data That Won’t Fit on the Cluster Server

  • hadoop dfs -mkdir /user/hadoop
    • I assume that you can use any name, and not just hadoop.
  • You can check the content of this folder via: “hadoop dfs -ls /user/hadoop”

## Save a File to S3 Before You Terminate a Cluster

  • When you Terminate a cluster, all files on the cluster are lost. You might have a file you want to save. Note that you get billed for Amazon storage space for saved files, but not as much as for a running cluster. If it is a small file, you might want to copy it to your laptop/desktop (see instructions elsewhere on this page). If it is a large file, you might want to store it on Amazon S3, to avoid consuming the time/bandwidth to download the file.
    • Amazon S3 Pricing
    • You need to use a ‘Bucket’. See Create a Bucket elsewhere on this page if you do not have a bucket.
    • See one of the related “Copy…” sections on this page to copy the file/folder to or from S3.

## Set Up a SOCKS Proxy

  • You need to connect your browser to several different web servers running on your cluster, in order to manage/monitor your cluster. Unfortunately, AWS sets up these servers such that they only accept connections from localhost (the cluster server – not your laptop/desktop). You either need to tunnel your browser traffic via SSH or via a SOCKS proxy. If you know all of the ports for all of the servers, you can specify ports on the SSH command line. I don’t know all of them, so I’m setting up to use SOCKS, so I can proxy all traffic destined for my cluster via SOCKS, regardless of which port.
  • Create a file named ~/etc/proxy.pac containing:

function FindProxyForURL(url, host) {
    if (shExpMatch(host, "*.compute.amazonaws.com"))
        return "SOCKS localhost:1080";
    return "DIRECT";
}
  • Configure your browser to use this PAC file. (It is in your proxy configuration settings.) For Safari:
    • Safari » Preferences » Advanced » Proxies (Change Settings) » Automatic Proxy Configuration
      • URL = file:///Users/your_user_id/etc/proxy.pac
    • Right away: Confirm that Safari still works for normal web sites (e.g. Google.com or msn.com). If it doesn’t, disable the SOCKS proxy setting and fix your PAC file.
  • When you want to connect to a web server on your cluster, open a Terminal window and:
    • ssh -D 1080 Master_public_DNS_address
      • e.g. ssh -D 1080 ec2-12-34-567-890.us-west-2.compute.amazonaws.com
    • You can reach any web server running on your master cluster server by entering its URL (including the port). Because that URL uses the Master public DNS name, the PAC file will send it through your SSH SOCKS proxy.

## List Files In HDFS File System

  • Deprecated method: hadoop dfs -ls /user
  • Modern method: hdfs dfs -ls /user
  • Note that “hadoop dfs” is what is deprecated. You can get help on the replacement hdfs commands by running
    • hdfs
      • Lists all hdfs subcommands
    • hdfs subcommand
      • e.g. hdfs dfs
      • Shows help for the subcommand

## Copy a File to/from the Cluster to/from Your Laptop/Desktop

  • scp -i ~/Coursera-Data-Sci-Oregon.pem hadoop@master-public-dns-name:src_file local_dest_file
  • scp -i ~/Coursera-Data-Sci-Oregon.pem local_src_file hadoop@master-public-dns-name:dest_file
  • I set up a local script named scpi to do the “scp -i”, so I just type the src and the dest.

## How Many MapReduce Jobs Are Running?

  • I can’t find it. Here’s what I tried:
    • AWS 3.x Hadoop ResourceManager http://master-public-dns-name:9026/ is not running
    • AWS 3.x Hadoop HDFS NameNode http://master-public-dns-name:9101/ is not running
    • AWS 3.x Ganglia Metrics Reports http://master-public-dns-name/ganglia/ is running, but I can’t find MapReduce jobs there
    • AWS 3.x HBase Interface http://master-public-dns-name:60010/master-status is not running
    • AWS 3.x Hue Web Application http://master-public-dns-name:8888/ is running. What is it?
    • AWS 3.x Impala Statestore http://master-public-dns-name:25000 is not running.
    • AWS 3.x Impalad http://master-public-dns-name:25010 is not running
    • AWS 3.x Impala Catalog http://master-public-dns-name:25020 is not running
    • AWS 4.x: Amazon says this shows a metric called JobsRunning, but when I look there, the metric does not exist:
      • https://console.aws.amazon.com/cloudwatch/
      • Click EMR in the left-side nav panel
      • Maybe it is now AppsRunning?
      • Maybe it is MRActiveNodes?
    • On the Cluster Details page (select the cluster from EMR cluster list), there is a “Connections:” header. To the right of that is a list of web servers. The Resource Manager link takes you to a Hadoop Cluster Metrics page that is interesting.
      • It lists each running application. (It calls them “entries.”) Over near the right side, there is a progress bar. If you hover over the progress bar, it will give you a percent complete in the flyover text.
      • You can get to this without using a SOCKS or other SSH tunnel!
      • You might have to refresh this page manually.