Candidate's Name
EMAIL AVAILABLE | PHONE NUMBER AVAILABLE | LinkedIn
Senior Big Data Engineer

BACKGROUND SUMMARY
10+ years of IT experience across a variety of industries working on Big Data platforms, primarily the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
Expertise in Python, Scala, and Java for the design, development, administration, and support of large-scale distributed systems.
Experience working with NoSQL databases such as MongoDB, Cassandra, and HBase, including developing real-time read/write access to very large datasets via HBase.
Extensive usage of Azure Portal, Azure PowerShell, Storage Accounts, Certificates, and Azure Data Management.
Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce.
Scheduled nightly batch jobs using Oozie to perform schema validation and IVP transformation at scale, taking advantage of Hadoop's parallelism.
Experience in designing, architecting, and implementing scalable cloud-based web applications on AWS.
Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
Strong experience creating end-to-end data pipelines on the Hadoop platform.
Experience using various Hadoop distributions (Cloudera, Hortonworks, Amazon EMR) to fully implement and utilize Hadoop services.
Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS; used Oozie and Zookeeper operational services for cluster coordination and workflow scheduling.
Used Hive extensively for data analytics required by business teams.
Solid experience working with various data formats such as Parquet, ORC, Avro, and JSON.
Imported customer data into Python using Pandas and performed data analysis, finding patterns that informed key decisions.
Hands-on experience in data structure design and analysis using machine learning techniques and modules in Python and R.
Experience in architecting, designing, and building distributed data pipelines.
Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
Experience in moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop.
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Designed and developed Hive and HBase data structures and Oozie workflows.
Experience developing Kafka producers and consumers for streaming millions of events per second.
Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into OLTP systems through Sqoop.
Good experience designing and implementing end-to-end data security and governance within the Hadoop platform using Kerberos.
Hands-on experience developing end-to-end Spark applications using Spark APIs such as RDD, DataFrame, Spark MLlib, Spark Streaming, and Spark SQL.
Good experience working with data analytics and big data services in AWS such as EMR, Redshift, S3, Athena, and Glue.
Good understanding of Spark ML algorithms such as classification, clustering, and regression.
Experienced in migrating data warehousing workloads into Hadoop-based data lakes using MapReduce, Hive, Pig, and Sqoop.
Developed Spark applications in Python (PySpark) on a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
Worked on the implementation of a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka and Zookeeper based log collection platform.
Experience in maintaining an Apache Tomcat, MySQL, LDAP, and web service environment.
Designed ETL workflows on Tableau and deployed data from various sources to HDFS.
Experience in software design, development, and implementation of client/server and web-based applications.

TECHNICAL SKILLS:
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, Zookeeper, Hue, Ambari Server
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/RStudio, SAS Enterprise Guide, SAS, R (caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans
Public Cloud: EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL*Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris

WORK EXPERIENCE:

Client: Experian, Costa Mesa, CA    Oct 2023 - till date
Role: Senior Big Data Engineer
Responsibilities:
Used Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation and queries, writing data back into OLTP systems through Sqoop.
Involved in HBase setup and stored data in HBase for further analysis.
Worked closely with the Kafka admin team to set up the Kafka cluster, and implemented Kafka producer and consumer applications on a Kafka cluster coordinated with Zookeeper.
Developed and maintained automated workflows for continuous collection and aggregation of credit data, increasing operational efficiency.
Used Zookeeper and Oozie operational services to coordinate clusters and schedule workflows.
Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
Hands-on experience capturing data from existing relational databases (Oracle, MySQL, SQL Server, and Teradata) through their SQL interfaces using Sqoop.
Used PySpark and Pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse (see the illustrative sketch below).
Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
Used HBase/Phoenix to support front-end applications that retrieve data using row keys.
Designed and executed Oozie workflows that scheduled Sqoop and Hive job actions to extract, transform, and load data.
Explored Spark to improve performance and optimize existing Hadoop algorithms using Spark context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
Collaborated with business analysts and SMEs across departments to gather business requirements and identify workable items for further development.
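A minimal illustration of the moving-average/RSI calculation described above, using Pandas. The column name, window size, and the simplified (non-Wilder) RSI formula are assumptions for this sketch, not details of the original pipeline:

    # Illustrative only: the "close" column and 14-period window are hypothetical.
    import pandas as pd

    def add_indicators(prices: pd.DataFrame, window: int = 14) -> pd.DataFrame:
        """Add a simple moving average and a simplified RSI to a price frame."""
        out = prices.copy()
        out["sma"] = out["close"].rolling(window=window).mean()

        delta = out["close"].diff()
        gains = delta.clip(lower=0).rolling(window=window).mean()
        losses = (-delta.clip(upper=0)).rolling(window=window).mean()
        out["rsi"] = 100 - 100 / (1 + gains / losses)
        return out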
Developed complex Talend ETL jobs to migrate data from flat files into databases.
Pulled files from the mainframe into the Talend execution server using multiple FTP components.
Used Sqoop to channel data between HDFS and RDBMS sources.
Architected and designed serverless application CI/CD using the AWS serverless application model (Lambda).
Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
Partnered with ETL developers to ensure data was well cleaned and the data warehouse was kept up to date for reporting, using Pig.
Selected and generated data into CSV files, stored them in AWS S3 from EC2, and then structured and loaded them into AWS Redshift (illustrative sketch below).
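A simplified sketch of that S3-to-Redshift load path. The bucket, cluster endpoint, credentials, table, and IAM role below are hypothetical placeholders, and the COPY statement is a generic example rather than the project's actual load job:

    # Hypothetical names throughout; illustrative pattern only.
    import boto3
    import psycopg2

    def load_csv_to_redshift(local_path: str) -> None:
        # 1) Upload the generated CSV file to S3.
        s3 = boto3.client("s3")
        s3.upload_file(local_path, "example-data-bucket", "exports/daily.csv")

        # 2) COPY the staged file from S3 into a Redshift table.
        conn = psycopg2.connect(
            host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
            port=5439, dbname="analytics", user="etl_user", password="...",
        )
        with conn, conn.cursor() as cur:
            cur.execute("""
                COPY analytics.daily_export
                FROM 's3://example-data-bucket/exports/daily.csv'
                IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
                FORMAT AS CSV IGNOREHEADER 1;
            """)
        conn.close()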
Responsible for importing data from Postgres into HDFS and Hive using Sqoop.
Experienced in migrating HiveQL to Impala to minimize query response time.
Involved in functional, integration, regression, smoke, and performance testing.
Tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics.
Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
Performed data preprocessing and feature engineering for downstream predictive analytics using Python Pandas.
Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python, as well as to NoSQL databases such as HBase and Cassandra (see the sketch below).
Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
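A minimal PySpark sketch of the Kafka-to-HDFS pattern referenced above, written with Structured Streaming for brevity. The broker address, topic, and paths are hypothetical, and the spark-sql-kafka connector is assumed to be available on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Subscribe to a hypothetical "transactions" topic.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "transactions")
              .load())

    # Kafka delivers key/value as binary; cast the payload to string before storing.
    events = stream.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/streams/transactions")
             .option("checkpointLocation", "hdfs:///checkpoints/transactions")
             .start())
    query.awaitTermination()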
Created functions and assigned roles in AWS Lambda to run Python scripts, and used Lambda with Java for event-driven processing (see the sketch below); created Lambda jobs and configured roles using the AWS CLI.
Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka.
Worked on reading multiple data formats from HDFS using Scala.
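An illustrative Python Lambda handler for that style of event-driven processing. The S3 trigger, bucket layout, and "processed/" prefix are assumptions made for this sketch only:

    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Triggered by an S3 put event; copies each new object to a processed/ prefix."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            s3.copy_object(
                Bucket=bucket,
                Key=f"processed/{key}",
                CopySource={"Bucket": bucket, "Key": key},
            )
        return {"statusCode": 200, "body": json.dumps("ok")}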
Utilized Agile and Scrum methodology for team and project management.
Expertise in using Docker to run and deploy applications in multiple containers, using Docker Swarm and Docker Wave.
Environment: HDFS, Hive, Spark, Kafka, Linux, Python, NumPy, Pandas, Tableau, GitHub, AWS EMR/EC2/S3/Redshift, Lambda, Pig, MapReduce, Cassandra, Snowflake, Unix, Shell Scripting, Git.

Client: Homesite Insurance, Boston, MA    Apr 2021 - Sep 2023
Role: Big Data Engineer
Responsibilities:
Responsible for running Hadoop streaming jobs to process terabytes of XML data, utilizing cluster coordination services through Zookeeper.
Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
Used the Oozie workflow engine to manage independent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, and Sqoop, as well as system-specific jobs.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, including write-back.
Experienced in building a data warehouse on the Azure platform using Azure Databricks and Data Factory (a simplified PySpark sketch of this load pattern appears at the end of this list).
Experience leveraging Hadoop ecosystem components, including Pig and Hive for data analysis, Sqoop for data migration, Oozie for scheduling, and HBase as a NoSQL data store.
Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services; knowledge of U-SQL.
Worked with results from Kafka server output successfully.
Developed Sqoop scripts to import and export data from relational sources, handling incremental loads of customer transaction data by date.
Troubleshot Azure development, configuration, and performance issues.
Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala, as well as to NoSQL databases such as HBase and Cassandra.
Configured Oozie workflows to run multiple Hive and Pig jobs independently based on time and data availability.
Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python, and used the resulting engine to increase user lifetime by 45% and triple user conversations for target categories.
Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
Transformed business problems into big data solutions and helped define the big data strategy and roadmap.
Installed, configured, and maintained data pipelines.
Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it.
Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala.
Effectively used Sqoop to transfer data from databases (SQL Server, Oracle) to HDFS and Hive.
Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for dashboard reporting.
Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
Designed and developed architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies.
Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into target systems from multiple sources.
Involved in unit testing the code and provided feedback to the developers; performed unit testing of the application using NUnit.
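A simplified, Databricks-style PySpark sketch of the Blob-storage-to-warehouse load pattern mentioned in the ADF and Databricks bullets above. The storage account, container, column name, and target table are hypothetical, and storage access is assumed to be configured on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blob-to-warehouse").getOrCreate()

    # Hypothetical Blob storage path; credentials/mounts assumed to be configured.
    source_path = "wasbs://raw@examplestorageacct.blob.core.windows.net/policies/*.csv"

    policies = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(source_path))

    # Basic cleanup before loading into the warehouse layer.
    clean = policies.dropDuplicates(["policy_id"]).na.drop(subset=["policy_id"])

    clean.write.mode("overwrite").saveAsTable("warehouse.policies")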
Environment: Hadoop, Azure, Kafka, Spark, Sqoop, Docker, Docker Swarm, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper.

Client: Official Recharge, Houston, TX    Jun 2019 - Mar 2021
Role: Big Data Engineer
Responsibilities:
Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
Experience configuring Zookeeper to coordinate the servers in clusters and to maintain the data consistency needed for decision making in the process.
Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
Saved HUM packet data in HBase for future analytics purposes.
Involved in configuring the Hadoop cluster and load balancing across the nodes.
Developed MapReduce programs in Java for parsing raw data and populating staging tables.
Experience writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
Worked closely with AWS EC2 infrastructure teams to troubleshoot complex issues.
Expertise in writing Scala code using higher-order functions for iterative algorithms in Spark with performance in mind.
Involved in requirements gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
Involved in Sqoop implementations that load data from various RDBMS sources into Hadoop systems and vice versa.
Developed Python scripts to extract data from web server output files and load it into HDFS.
Worked with the CloudHealth tool to generate AWS reports and dashboards for cost analysis.
Wrote a Python script that automates launching the EMR cluster and configuring the Hadoop applications (see the boto3 sketch at the end of this list).
Experience developing Kafka producers and consumers for streaming millions of events per second; defined Kafka/Zookeeper offset storage.
Experienced in writing live real-time processing using Spark Streaming with Kafka.
Involved in managing and monitoring the Hadoop cluster using Cloudera Manager.
Used Python and shell scripting to build pipelines.
Developed a data pipeline using Sqoop, HQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.
Developed workflows in Oozie and Airflow to automate loading data into HDFS and pre-processing it with Pig and Hive.
Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
Converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.
Monitored and troubleshot Hadoop jobs using the YARN Resource Manager and EMR job logs using Genie and Kibana.
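An illustrative boto3 sketch of automating an EMR cluster launch, as described above. The cluster name, release label, instance types, roles, and log bucket are placeholders, not the original configuration:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Launch a small, hypothetical Spark/Hive cluster.
    response = emr.run_job_flow(
        Name="example-spark-cluster",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://example-emr-logs/",
    )
    print(response["JobFlowId"])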
Environment: HDFS, Hive, Java, Sqoop, Spark, YARN, Cloudera Manager, CloudHealth, Splunk, Oracle, Elasticsearch, Kerberos, Impala, Jira, Confluence, Shell/Perl Scripting, Python, Avro, Zookeeper, AWS (EC2, S3, EMR, VPC, RDS, Lambda, CloudWatch, etc.), Ranger, Git, Airflow.

Client: Northfield Bank, Staten Island, NY    Jan 2017 - May 2019
Role: Data Engineer
Responsibilities:
Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
Configured Zookeeper to restart failed jobs without human intervention.
Working knowledge of cluster security components such as Kerberos, Sentry, and SSL/TLS.
Implemented data streaming capability using Kafka and Talend for multiple data sources.
Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
Migrated MapReduce jobs to Spark jobs to achieve better performance.
Used the Sqoop utility to transform data and populate it into the Hadoop ecosystem.
Implemented UNIX scripts to define the use-case workflow, process the data files, and automate the jobs.
Involved in converting MapReduce programs into Spark transformations using Spark RDDs with Scala and Python.
Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
Utilized Pandas to create data frames.
Built machine learning models to showcase big data capabilities using PySpark and MLlib.
Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
Worked extensively with AWS components such as Elastic MapReduce (EMR).
Imported a CSV dataset into a data frame using Pandas.
Used Oracle collections in conjunction with analytical functions for efficient data aggregation, improving reporting and analytics processes.
Wrote Python code to manipulate and organize data frames so that all attributes in each field were formatted identically.
Utilized Matplotlib to graph the manipulated data frames for further analysis; the graphs provided the data visualization needed to present information in a simple form.
Exported manipulated data frames to Microsoft Excel and utilized its choropleth map feature.
Leveraged Oracle collections to reduce data retrieval and manipulation times, leading to significant improvements in query performance for complex database operations.
Environment: Hadoop, MapReduce, HDFS, Pig, HiveQL, MySQL, UNIX Shell Scripting, Tableau, Java, Spark, SSIS.

Client: Osmosys Software Solutions, India    Nov 2013 - Sep 2016
Role: Data Engineer
Responsibilities:
Imported legacy data from SQL Server and Teradata into Amazon S3.
Created consumption views on top of metrics to reduce the running time of complex queries.
Exported data into Snowflake by creating staging tables to load files of different types from Amazon S3.
As part of the data migration, wrote many SQL scripts to detect data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
Worked on retrieving data from the file system into S3 using Spark commands.
Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
Created metric tables and end-user views in Snowflake to feed data for Tableau refreshes.
Generated custom SQL to verify the dependencies of the daily, weekly, and monthly jobs.
Using Nebula metadata, registered business and technical datasets for the corresponding SQL scripts.
Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files (see the sketch below).
Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.
Environment: AWS S3, GitHub, ServiceNow, HP Service Manager, EMR, Nebula, Teradata, SQL Server, Apache Spark, Sqoop
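A simplified PySpark sketch of the Spark SQL-on-CSV checks referenced in the Osmosys section above. The bucket, file layout, and column names are hypothetical, and S3 access is assumed to be configured for the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("legacy-load-check").getOrCreate()

    # Read hypothetical legacy extracts that were staged in S3 as CSV.
    legacy = (spark.read
              .option("header", "true")
              .csv("s3a://example-migration-bucket/legacy/customers/*.csv"))

    legacy.createOrReplaceTempView("legacy_customers")

    # Example reconciliation query of the kind used to spot load mismatches.
    spark.sql("""
        SELECT load_date, COUNT(*) AS records
        FROM legacy_customers
        GROUP BY load_date
        ORDER BY load_date
    """).show()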