Technical Newsletter: Ambari, Amazon EMR, and Other News


Ambari Views Now Available on HDInsight

Ambari is an Apache project for provisioning, managing, and monitoring Hadoop clusters through a GUI and an API. Previously, Ambari was only available through a plugin to the Ambari View Framework. It is now available on HDInsight, where it supports the deployment and management of Linux clusters. Two of the predefined views in Ambari are the Pig and Hive views, both of which can be launched from the Ambari portal.

The Hive view lets users browse databases, write and execute Hive queries, review job history, set Hive query execution parameters, and debug Hive queries. An Ambari Views link and tab have been added to the portal to make this option easier to find. The portal also supports running both Hive and Pig queries, changing settings, viewing a visual explanation of queries, adding UDFs, and monitoring and debugging Tez jobs.

Issues with Extensible Web Resource Loading

Ilya Grigorik has published an essay on some of the issues he has found with resource loading on the extensible web. How a resource is loaded typically depends on how it is requested: the parser detects a tag with a resource URL, JavaScript initiates a dynamic request, or the resource is discovered through CSS. Each path has its own loading behavior. Browser vendors often determine the order in which resources are loaded onto a page. For example, “HTML, CSS, and Javascript are considered critical; images and other asset types are less so; some browsers limit number of concurrent image downloads; and CSS assets are lazyloaded by default.”

This approach works well for most applications and webpages, but not for an extensible, performance-friendly platform. For such a platform to function as developers need and users expect, the developer must be able to:

    • Modify default fetch settings of all requests initiated via JS, CSS, and HTML.
    • Define the preloader policy for any resource declared in CSS and HTML.
    • Define the fetch dispatch policy for any resource declared in CSS and HTML.
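
To make the idea of a developer-controlled dispatch policy concrete, here is a toy model in Python. It is not a browser API: the priority table is a hypothetical simplification of the default behavior quoted above, and the `overrides` argument is an invented stand-in for the kind of per-resource control the essay asks for.

```python
# Hypothetical default priorities loosely modeled on the quoted browser
# behavior: lower number = dispatched earlier. Real browsers differ.
DEFAULT_PRIORITY = {"html": 0, "css": 0, "js": 0, "font": 1, "image": 2}
UNKNOWN_PRIORITY = 3


def dispatch_order(resources, overrides=None):
    """Return URLs in the order a toy scheduler would dispatch them.

    resources: list of (url, kind) tuples in discovery order.
    overrides: optional {url: priority} mapping that stands in for a
               developer-defined dispatch policy (an assumption of this
               sketch, not an existing platform feature).
    """
    overrides = overrides or {}

    def priority(item):
        url, kind = item
        return overrides.get(url, DEFAULT_PRIORITY.get(kind, UNKNOWN_PRIORITY))

    # sorted() is stable, so resources with equal priority keep their
    # discovery order.
    return [url for url, _ in sorted(resources, key=priority)]


page = [
    ("hero.jpg", "image"),
    ("app.js", "js"),
    ("site.css", "css"),
]

default_order = dispatch_order(page)
boosted_order = dispatch_order(page, overrides={"hero.jpg": -1})
```

With the defaults, the image is dispatched last; with the override, the developer moves it to the front without changing how the other resources are treated.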

This enhanced functionality would allow developers to address issues that commonly occur when fetching images, fonts, and payloads. To begin to address these concerns, the author suggests that the Fetch API expose the reasons for resource fetching in the web platform, and that a declarative mechanism matching the JavaScript Fetch API be developed along with an API for interfacing with the preload scanner. Finally, there should be “control over resource dispatch policies.” The availability of these functions would be a start on the road to optimizing extensible web resource loading.

FLARE’s pykd Project

FireEye Labs Advanced Reverse Engineering (FLARE) has built a new debugging tool: a scripting library that runs on top of pykd for WinDbg. Analysts typically deobfuscate strings from malware using either a self-decoding or a manual programming approach. In self-decoding, any library calls must be emulated consistently and persistently, which is both necessary and challenging.

In self-decoding, every call to the string decoder function must be found, and the arguments passed at each call must be recorded. Ideally, this process would occur semi-automatically. To understand the inputs and outputs of this function as well as its arguments, Vivisect, a Python binary analysis framework, can be used. It supports heuristics, cross-referencing, and emulating and disassembling series of opcodes.

flare-dbg, which runs on top of pykd, aims to make scripting in WinDbg simple through the functions of its DebugUtils class. These functions use Vivisect and handle memory and register manipulation, stack operations, debugger execution, breakpoints, and function calling. With these functions working together, once the call_list is generated, all associated strings and arguments are located, and string_decoder is invoked through the DebugUtils call function. Once all strings are decoded, the utils script can be used to create an IDAPython script that adds the comments to the IDB, and the script can be fully debugged.
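
The workflow above (collect call sites, run the decoder for each, emit comments for IDA) can be sketched in plain Python. This sketch does not use pykd, Vivisect, or flare-dbg: `xor_decode` is a hypothetical stand-in for the malware's own decoder, which the real tool would invoke inside the debugged process, and the addresses are made up. The generated script assumes `idc.set_cmt` as found in IDAPython 7.x; older releases used a different call.

```python
def xor_decode(blob: bytes, key: int) -> str:
    """Hypothetical decoder: single-byte XOR, standing in for the
    sample's real string decoder function."""
    return bytes(b ^ key for b in blob).decode("ascii")


def decode_all(call_list, decoder):
    """Run the decoder once per recorded call site.

    call_list: list of (call_site_address, args_tuple) pairs, i.e. every
               place the decoder is called and the arguments it receives.
    Returns {call_site_address: decoded_string}.
    """
    return {addr: decoder(*args) for addr, args in call_list}


def make_idapython_script(decoded):
    """Build the text of an IDAPython script that comments each call site."""
    lines = ["import idc"]
    for addr, text in sorted(decoded.items()):
        # set_cmt(ea, comment, repeatable) is assumed to be available.
        lines.append(f"idc.set_cmt({addr:#x}, {text!r}, 0)")
    return "\n".join(lines)


# Illustrative input: one obfuscated string and a made-up call site.
obfuscated = bytes(b ^ 0x5A for b in b"cmd.exe")
call_list = [(0x401000, (obfuscated, 0x5A))]

decoded = decode_all(call_list, xor_decode)
script = make_idapython_script(decoded)
```

Keeping the call list separate from the decoder means the same loop works whether the decoder is reimplemented by hand, as here, or called through a debugger.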

Launching Clusters in VPC Subnets Supported by Amazon EMR

Amazon EMR 4.2.0 now supports launching clusters in Amazon VPC private subnets, with “Hadoop ecosystem applications, Spark, and Presto in the subnet” of the client’s choice. Clusters can be launched without public IP addresses or Internet gateways, can have direct access to data in S3, and can use a Network Address Translation (NAT) instance to interact with other AWS services.

To launch Amazon EMR clusters in a private subnet, the permissions in the EMR service role and the EC2 instance profile must be updated. A route to the required S3 buckets must also be established so that the clusters can initialize. A NAT is not needed to reach public endpoints if the cluster will only use S3 within AWS.
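
As a rough illustration of the launch step, the following Python sketch builds the request that would be passed to `boto3.client("emr").run_job_flow(**request)`. The cluster name, subnet ID, instance types, and instance count are placeholders chosen for this example, and the role names are the EMR defaults, which may differ in a given account. The sketch only constructs the request; it does not call AWS.

```python
def build_private_subnet_request(subnet_id,
                                 service_role="EMR_DefaultRole",
                                 instance_profile="EMR_EC2_DefaultRole"):
    """Build keyword arguments for the EMR RunJobFlow API.

    subnet_id is the ID of the private subnet; the roles must already
    carry the permissions described above.
    """
    return {
        "Name": "private-subnet-cluster",      # placeholder name
        "ReleaseLabel": "emr-4.2.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "Ec2SubnetId": subnet_id,          # places the cluster in the subnet
            "MasterInstanceType": "m3.xlarge",  # placeholder instance type
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": service_role,           # EMR service role
        "JobFlowRole": instance_profile,       # EC2 instance profile
    }


# Hypothetical subnet ID, for illustration only.
request = build_private_subnet_request("subnet-0123456789abcdef0")
```

Separating request construction from the API call makes the configuration easy to review before anything is launched.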

There are three methods for encrypting data at rest, covering the input and output data as well as the “Hadoop Distributed Filesystem (HDFS) distributed across [the] cluster and the Local Filesystem on each instance.” First, data in Amazon S3 can be accessed through the EMR Filesystem (EMRFS), which works seamlessly with encrypted data in S3. Second, HDFS transparent encryption can be used, with Hadoop KMS installed on the master node of the EMR cluster.

Finally, the local filesystem on each slave instance can be encrypted. Encryption in transit for Hadoop and Spark covers three data paths: Hadoop MapReduce Shuffle, which can be encrypted by providing SSL certificates to each node; HDFS rebalancing, which sends blocks between DataNode processes; and Spark Shuffle, which moves data between nodes during a job.

The API calls that a cluster can make can be limited through AWS Identity and Access Management (IAM) when the cluster is created with both an EMR service role and an EC2 instance profile. This restricts what the cluster can do, and the calls it makes can be monitored through AWS CloudTrail. Other existing security features can also be used. “Amazon EMR was also added to the AWS Business Associates Agreement (BAA) for running workloads which process PII data (including eligibility for HIPAA workloads).”
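
To illustrate how IAM can limit what a cluster is allowed to call, here is a minimal Python sketch that builds a policy document restricting an instance profile to a single S3 bucket. The bucket name and the chosen actions are assumptions for this example, and the policy is deliberately incomplete: a working cluster generally needs further permissions.

```python
import json


def build_instance_profile_policy(bucket):
    """Return an IAM policy document limited to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading and writing apply to objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# Hypothetical bucket name, for illustration only.
policy = build_instance_profile_policy("example-emr-data")
policy_json = json.dumps(policy, indent=2)
```

Anything not explicitly allowed by the attached policies is denied, so a narrow policy like this keeps the cluster from calling unrelated services.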
