Senior Big Data Engineer
PAVAN KUMAR.K
EMAIL AVAILABLE
PHONE NUMBER AVAILABLE

Overall Summary:
- Over 9 years of strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in DevOps environments using Scrum, Waterfall, and Agile methodologies.
- Strong interpersonal skills with the ability to work independently or in a group; able to learn quickly and adapt easily to the working environment.
- Excellent programming skills with experience in Java, SQL, and Python.
- Evaluate innovative technologies and best practices for the team.
- Applied knowledge of modern software delivery methods such as TDD, BDD, and CI/CD, as well as Infrastructure as Code (IaC).
- Expertise in data warehousing and business intelligence tools, developing various reports using QlikView/Qlik Sense.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Experience with AWS serverless services such as Fargate, SNS, SQS, and Lambda.
- Experience with data modeling tools such as Erwin and Toad Data Modeler.
- Experience writing distributed Scala code for efficient big data processing.
- Experience using Kafka and Kafka brokers with Spark Streaming contexts to process live data streams.
- Proficient in using Fivetran for automating data integration from various sources into cloud data warehouses.
- Utilized Python, including Pandas, NumPy, PySpark, and Impala, for data manipulation and analysis, resulting in streamlined processes and data-driven recommendations.
- Worked on FACETS configuration setup and maintenance.
- Developed interactive dashboards and reports using Power BI, Tableau, and Qlik Sense to visualize complex datasets and communicate insights to stakeholders effectively.
- Conducted data analysis and extraction on Hadoop, contributing to improved data accessibility and accuracy.
- Collaborated with cross-functional teams to implement data integration solutions, leveraging tools such as Apache NiFi, Azure Data Factory, and Pentaho.
- Worked with enterprise business intelligence and data platforms to support data-driven decision-making at an organizational level.
- Developed a framework for converting existing PowerCenter mappings into PySpark (Python and Spark) jobs.
- Conducted hands-on data modeling, programming, and querying, handling large volumes of granular data to deliver custom reporting solutions.
- Applied data mining and machine learning algorithms to identify patterns and trends, providing valuable insights to drive business strategies.
- Assisted in collecting, standardizing, and summarizing data while identifying inconsistencies and suggesting data quality improvements.
- Utilized strong SQL skills for data retrieval and analysis.
- Hands-on experience developing web applications, RESTful web services, and APIs using Python, Flask, and Django.
- Experienced in big data integration and analytics based on Hadoop, PySpark, and NoSQL databases such as HBase and MongoDB.
- Familiar with key data science concepts (statistics, data visualization, machine learning, etc.).
- Experienced in Python, R, MATLAB, SAS, and PySpark programming for statistical and quantitative analysis.
- Working experience with the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES, Fargate, Glue, and Athena.
- Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive, including HiveQL queries for data extraction and join operations, writing custom UDFs as required, and optimizing Hive queries.
- Performed analysis of Facets for benefits and medical definitions.
- Experience importing and exporting data with Sqoop between HDFS and relational database systems, and loading the data into partitioned Hive tables.
- Experience in Microsoft Azure cloud services such as Azure SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Data Lake Storage Gen2, Azure Blob Storage, Azure Data Factory, Azure Cosmos DB, SQL Server, SSIS, and SSRS.
- Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
- Extensive knowledge of RDBMSs such as Oracle, Microsoft SQL Server, and MySQL, along with DevOps practices.
- Extensive experience working on various databases and developing database scripts using SQL and PL/SQL.
- Hands-on experience writing MapReduce programs in Java to handle different data sets using Map and Reduce tasks.
- TriZetto Facets Hosted Platform experience, including troubleshooting batch job issues.
- Created PySpark data frames to bring data from DB2 into Amazon S3.
- Developed custom Kafka producers and consumers for publishing to and subscribing from Kafka topics (see the sketch following this summary).
- Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
- Worked on various machine learning algorithms, including linear regression, logistic regression, decision trees, random forests, K-means clustering, support vector machines, and XGBoost, based on client requirements.
- Good experience in software development in Python (libraries used: Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, Pandas DataFrames, network, urllib2, and MySQLdb for database connectivity).
- Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
- Excellent understanding and knowledge of job workflow scheduling and coordination tools/services such as Oozie and Zookeeper.
- Hands-on experience with the data ingestion tools Kafka and Flume and the workflow management tool Oozie.
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data, and used DataFrame operations to perform the required data validations.
- Good understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra, as well as PostgreSQL.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files.
- Well versed in columnar file formats such as RCFile, ORC, and Parquet.
- Good understanding of compression techniques used in Hadoop processing, such as gzip, Snappy, and LZO.
- Experience with ETL concepts using Informatica PowerCenter and Ab Initio.
- Leveraged Apache Iceberg's capabilities to handle petabyte-scale data efficiently within Snowflake, ensuring high performance and scalability.
- Built and deployed modular data pipeline components such as Apache Airflow DAGs, AWS Glue jobs, and AWS Glue crawlers through a CI/CD process.
- Integrated Facets applications to meet business requirements.
- Experience developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HiveQL, and used UDFs from the Piggybank UDF repository.
- Hands-on experience with Apache Airflow as a pipeline tool.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Experience with data analytics, data reporting, ad-hoc reporting, graphs, scales, pivot tables, and OLTP reporting.
- Worked in various programming languages using IDEs and tools such as Eclipse, NetBeans, IntelliJ, PuTTY, and Git.
- Proficient in data modeling techniques and concepts to support data consumers.
- Used AWS ECS, ECR, and Fargate to scale and organize containerized workloads.
- Worked on GitLab to implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.
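A minimal Python sketch of a custom Kafka producer and consumer of the kind described in the summary, assuming the kafka-python client; the broker list, topic name, consumer group, and payload fields are illustrative placeholders rather than project details.

```python
# Minimal Kafka producer/consumer sketch using the kafka-python client.
# Broker address, topic name, and payload fields are assumed placeholders.
import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["localhost:9092"]      # assumed broker list
TOPIC = "clickstream-events"      # assumed topic name


def publish_events(events):
    """Serialize dicts to JSON and publish them to the topic."""
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for event in events:
        producer.send(TOPIC, value=event)
    producer.flush()
    producer.close()


def consume_events():
    """Read events from the topic and yield deserialized dicts."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="clickstream-readers",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        yield message.value


if __name__ == "__main__":
    publish_events([{"user_id": 1, "action": "view"}])
    for event in consume_events():
        print(event)
        break
```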
Technical Skills:
Languages: Shell scripting, SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Regular Expressions
Hadoop Distributions: Cloudera CDH, Hortonworks HDP, Apache, AWS
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala
Python Libraries/Packages: NumPy, SciPy, Boto, Pickle, PySide, PyTables, DataFrames, Pandas, Matplotlib, SQLAlchemy, HTTPLib2, Urllib2, Beautiful Soup, PyQuery
Databases: Oracle 10g/11g/12c, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB), DynamoDB, RDS, T-SQL
Cloud Technologies: Amazon Web Services (AWS), Microsoft Azure
Version Control: Git, GitHub, GitLab
IDEs, Tools & Design: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL Workbench, Tableau
Operating Systems: Windows 98/2000/XP/7/10, Mac OS, Unix, Linux
Data Engineer/Big Data Tools / Cloud / ETL / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure Cosmos DB, Azure HDInsight, Linux, Bash Shell, Unix, Tableau, Power BI, SAS, Crystal Reports, Dashboard Design, Glue, Athena, Lake Formation
Project Experience:

Client: Merck Pharma, Boston, MA    Feb 2022 - Present
Role: Senior Big Data Engineer
Responsibilities:
- Developed Spark code using Scala and Spark SQL/Streaming for faster processing of data.
- Developed automated regression scripts for validating ETL processes between multiple databases, including Azure Synapse Analytics, Oracle, MongoDB, Cosmos DB, Snowflake, PostgreSQL, Azure SQL Database, and SQL Server, using Python.
- Experience in Python, SQL, YAML, and Spark programming (PySpark).
- Applied knowledge of modern software delivery methods such as TDD, BDD, and CI/CD, and of Infrastructure as Code (IaC).
- Installed and configured Qlik GeoAnalytics to connect Qlik Sense and QlikView.
- Served as the primary on-site ETL developer during the analysis, planning, design, development, and implementation stages of projects using IBM WebSphere software (QualityStage v9.1, Web Service, Information Analyzer, ProfileStage).
- Experience working with Facets batches, the HIPAA Gateway, and EDI processing.
- Experience with Azure serverless services such as Azure Functions, Event Grid, Azure Queue Storage, and Azure Logic Apps.
- Proficient in creating, scheduling, and monitoring batch jobs using AutoSys, ensuring timely and accurate execution of scheduled tasks according to business requirements.
- Utilized Python, including Pandas, NumPy, PySpark, and Impala, for data manipulation and analysis, resulting in streamlined processes and data-driven recommendations.
- Experience building distributed high-performance systems using Spark and Scala.
- Developed interactive dashboards and reports using Power BI, Tableau, and Qlik Sense to visualize complex datasets and communicate insights to stakeholders effectively.
- Conducted data analysis and extraction on Hadoop, contributing to improved data accessibility and accuracy.
- Leveraged Apache Iceberg's capabilities to handle petabyte-scale data efficiently within Snowflake, ensuring high performance and scalability.
- Experienced in using Fivetran for automating data integration from various sources into cloud data warehouses.
- Collaborated with cross-functional teams to implement data integration solutions, leveraging tools such as Apache NiFi, Azure Data Factory, and Pentaho.
- Experienced in working with SAP tables, including understanding table structures, data relationships, and data extraction methods.
- Experienced in ABAP development: skilled in the ABAP programming language, with a focus on developing custom solutions, reports, and enhancements within the SAP ecosystem.
- Expertise in ABAP CDS views: proven track record of designing and implementing ABAP CDS views to create semantically rich data models for reporting and analytics.
- Worked with enterprise business intelligence and data platforms to support data-driven decision-making at an organizational level.
- Experienced in Facets implementation (legacy to Facets platform).
- Developed machine learning models using recurrent neural networks (LSTM) for time series and predictive analytics.
- Conducted hands-on data modeling, programming, and querying, handling large volumes of granular data to deliver custom reporting solutions.
- Applied data mining and machine learning algorithms to identify patterns and trends, providing valuable insights to drive business strategies.
- Extensively worked with Qlik Sense components such as charts, maps, filters, and KPIs.
- Assisted in collecting, standardizing, and summarizing data while identifying inconsistencies and suggesting data quality improvements.
- Utilized strong SQL skills for data retrieval and analysis.
- Experience with data modeling tools such as Erwin and Toad Data Modeler.
- Worked in the Azure environment on the development and deployment of custom Hadoop applications.
- Experience with Azure data processing, analytics, and storage services such as Azure Blob Storage, Azure Data Lake, Azure Synapse Analytics, Azure Databricks, and Azure Data Factory.
- Worked on GitLab to implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.
- Experience developing Scala applications for loading/streaming data into NoSQL databases (MongoDB) and HDFS.
- Developed Spark programs with Python, applying principles of functional programming to process complex structured data sets.
- Developed machine learning models using the Google TensorFlow Keras API with convolutional neural networks for classification problems, fine-tuning model performance by adjusting epochs, batch size, and the Adam optimizer.
- Experience with Cognos Framework Manager.
- TriZetto Facets Hosted Platform experience and troubleshooting of batch job issues.
- Developed internal APIs using Node.js and used MongoDB for fetching the schema.
- Worked on Node.js for developing server-side web applications.
- Implemented Python views and templates with Django's view controller and the Jinja templating language to create a user-friendly website interface.
- Working experience with the Microsoft Azure cloud platform, including Azure VMs, Azure Blob Storage, Virtual Networks, Load Balancers, Azure IAM, Cosmos DB, Azure CDN, Azure Monitor, Azure App Service, autoscaling, security groups, Azure Container Service (ACS), Azure DevOps, Azure Functions, Azure Logic Apps, Azure Synapse Analytics, Azure Data Lake, Azure SQL Database, and Azure Data Factory.
- Reduced access times by refactoring information models, optimizing queries, and implementing a Redis cache to support Snowflake.
- Used Spark and Scala to develop machine learning algorithms that analyze clickstream data.
- Strong experience working with Azure HDInsight and setting up environments on Azure VM instances.
- Prepared data mapping documents and designed the ETL jobs based on the DMD with the required tables in the dev environment.
- Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
- Filtered and cleaned data using Scala code and SQL queries.
- Experience in data processing, including collecting, aggregating, and moving data using Apache Kafka.
- Used Kafka to load data into HDFS and moved data back to Azure Blob Storage after processing.
- Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase into Azure.
- Used a test-driven approach for developing the application, implemented unit tests using the Python unittest framework, and developed isomorphic ReactJS and Redux-driven API client applications.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.
- Implemented the installation and configuration of a multi-node cluster in the cloud on Microsoft Azure VMs.
- Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, handling structured data with Spark SQL (see the sketch following this section).
- Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies.
- Extracted large data sets from Azure Synapse Analytics, Azure Blob Storage, and Azure Search using SQL queries to create reports.
- Used Talend for big data integration with Spark and Hadoop.
- Used Kafka and Kafka brokers to initiate the Spark context and process live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
- Used Kafka capabilities such as distribution, partitioning, and the replicated commit log for messaging by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
- Collected data with Spark Streaming from Azure Blob Storage in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Worked on SQL Server components: SSIS (SQL Server Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services).
- Used Informatica, SSIS, SPSS, and SAS to extract, transform, and load source data from transaction systems.
- Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team and to satisfy the business rules.
- Developed APIs in Python with SQLAlchemy for ORM along with MongoDB, documenting the APIs with Swagger and deploying applications through Jenkins.
- Developed RESTful APIs using Python Flask and SQLAlchemy data models, and ensured code quality by writing unit tests with pytest.
- Experience with Microsoft Azure cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory, Azure Cosmos DB, SQL Server, SSIS, and SSRS.
- Experience with data analytics, data reporting, ad-hoc reporting, graphs, scales, pivot tables, and OLTP reporting.
Environment: Python, Django, Flask, Hadoop, Spark, Scala, HBase, Hive, UNIX, Erwin, TOAD, MS SQL Server, XML files, Azure, Cassandra, MongoDB, Kafka, IBM InfoSphere DataStage, PL/SQL, Oracle 12c, flat files, AutoSys, MS Access.
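Below is a minimal PySpark sketch of the kind of Hive/SQL-to-DataFrame conversion and JSON-to-Hive load described in this section; the input path, column names (status, member_id, amount), and target table name are assumptions for illustration only.

```python
# Minimal PySpark sketch: load JSON, express a Hive/SQL-style aggregation as
# DataFrame transformations, and persist the result as a Hive table.
# Paths, column names, and table names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("json-to-hive-sketch")
    .enableHiveSupport()          # requires a configured Hive metastore
    .getOrCreate()
)

# Read semi-structured JSON into a DataFrame (schema is inferred).
claims = spark.read.json("/data/raw/claims/*.json")

# Equivalent of:
#   SELECT member_id, COUNT(*) AS claim_count, SUM(amount) AS total_amount
#   FROM claims WHERE status = 'APPROVED' GROUP BY member_id
summary = (
    claims
    .filter(F.col("status") == "APPROVED")
    .groupBy("member_id")
    .agg(
        F.count("*").alias("claim_count"),
        F.sum("amount").alias("total_amount"),
    )
)

# Persist the result as a managed Hive table (the target database must exist).
summary.write.mode("overwrite").saveAsTable("analytics.claim_summary")
```

The same aggregation could equally be run through spark.sql() against a registered temporary view; the DataFrame form simply keeps the transformation testable in plain Python.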
Client: Chewy, Dania Beach, FL    Mar 2020 - Jan 2022
Role: Big Data Engineer
Responsibilities:
- Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Implemented Apache Sentry to restrict access to Hive tables at the group level.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Experienced in using the Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
- Developed microservices by creating REST APIs, used to access data from different suppliers and to gather network traffic data from servers.
- Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package.
- Designed and implemented topic configurations in the new Kafka cluster across all environments.
- Created multiple dashboards in Tableau for multiple business needs.
- Experience with data modeling tools such as Erwin and Toad Data Modeler.
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Employed the Avro format for all data ingestion for faster operation and less space utilization.
- Working experience with the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES, Fargate, ECR, Glue, Athena, and Lake Formation.
- Designed SSIS packages to extract, transfer, and load (ETL) existing data into SQL Server from different environments for the SSAS (OLAP) cubes.
- Involved in the development of REST web services for sending and receiving data from external interfaces in JSON format.
- Developed visualizations and dashboards using Power BI.
- Implemented Composite server for data virtualization needs and created multiple views for restricted data access using a REST API.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
- Installed a Kerberos-secured Kafka cluster without encryption on Dev and Prod, and set up Kafka ACLs on it.
- Developed Apache Spark applications for data processing from various streaming sources.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL.
- Designed and developed RDD seeds using Scala and Cascading.
- Streamed data into Spark Streaming using Kafka (see the sketch following this section).
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, and created the DataFrames handled in Spark with Scala.
- Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Applied advanced Spark procedures such as text analytics and in-memory processing.
- Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.
- Brought data from various sources into Hadoop and Cassandra using Kafka.
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
- Worked on GitLab to implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.
- Created and formatted cross-tab, conditional, drill-down, top N, summary, form, OLAP, sub-report, ad-hoc, parameterized, interactive, and custom reports with SQL Server Reporting Services (SSRS).
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
- Designed and developed Oracle PL/SQL and shell scripts, data import/export, data conversions, and data cleansing.
- Developed RESTful APIs using Python Flask and SQLAlchemy data models, and ensured code quality by writing unit tests with pytest.
- Contributed to migrating data from an Oracle database to Apache Cassandra (NoSQL database) using SSTableLoader.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Environment: Python, pytest, Flask, SQLAlchemy, MapReduce, HDFS, Hive, Pig, Impala, Kafka, Cassandra, Spark, Scala, Solr, Azure (SQL, Databricks, Data Lake, Data Storage, HDInsight), Java, SQL, Tableau, Zookeeper, Sqoop, Teradata, Power BI, CentOS, Pentaho.
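A minimal sketch of the Kafka-to-Spark streaming ingestion described above, written here with Structured Streaming as one possible variant; the broker address, topic, record schema, and HDFS paths are assumptions, and the job would also need the spark-sql-kafka connector on the classpath.

```python
# Minimal Spark Structured Streaming sketch: consume a Kafka topic and land
# the parsed records in HDFS as Parquet. Broker, topic, schema, and paths
# are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("sku", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "order-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/orders")
    .option("checkpointLocation", "hdfs:///checkpoints/orders")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

The checkpoint location is what lets the file sink recover offsets after a restart without duplicating output files.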
Client: Qualcomm, San Diego, CA    Nov 2017 - Feb 2020
Role: Data Engineer
Responsibilities:
- Developed a Python utility to validate HDFS tables against source tables.
- Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
- Implemented code in Python to retrieve and manipulate data.
- Designed the ETL process using Informatica to load data from flat files and Excel files into the target Oracle data warehouse.
- Worked on GitLab to implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.
- Configured AWS Identity and Access Management (IAM) groups and users for improved login authentication.
- Loaded data into S3 buckets using AWS Glue and PySpark.
- Involved in web application penetration testing and web crawling to detect and exploit SQL injection vulnerabilities.
- Wrote an automated Python test script to store machine-detection alarms in the Amazon cloud when a pump experienced overloading.
- Automated all jobs for pulling data from the FTP server and loading it into Hive tables using Oozie workflows.
- Responsible for developing Python wrapper scripts that extract specific date ranges using Sqoop by passing the custom properties required for the workflow (see the sketch following this section).
- Involved in filtering data stored in S3 buckets using Elasticsearch and loaded the data into Hive external tables.
- Designed and developed UDFs to extend functionality in both Pig and Hive.
- Imported and exported data regularly between MySQL and HDFS using Sqoop.
- Developed a shell script to create staging and landing tables with the same schema as the source and to generate the properties used by Oozie jobs.
- Used Python and Django for backend development, Bootstrap and Angular for frontend connectivity, and MongoDB as the database.
- Developed a Django ORM module query that pre-loads data to reduce the number of database queries needed to retrieve the same amount of data.
- Experience with Microsoft Azure cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory, Azure Cosmos DB, SQL Server, SSIS, and SSRS.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Developed Oozie workflows for executing Sqoop and Hive actions.
- Built various graphs for business decision-making using the Python Matplotlib library.
Environment: Python, Django, HDFS, Spark, Hive, Sqoop, AWS, Oozie, ETL, Pig, Oracle 10g, MySQL, NoSQL, HBase, Windows.
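A minimal sketch of a Python wrapper that runs a date-range `sqoop import`, in the spirit of the wrapper scripts mentioned above; the JDBC URL, credentials handling, source table, column names, and HDFS paths are placeholders, not project details.

```python
# Minimal sketch of a Python wrapper around `sqoop import` for a specific
# date range. Connection details, table, and paths are assumed placeholders.
import subprocess
import sys


def sqoop_import_range(start_date, end_date):
    """Build and run a `sqoop import` for rows between two dates."""
    target_dir = f"/data/landing/orders/{start_date}_{end_date}"
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",   # assumed source
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pwd",
        "--table", "ORDERS",
        "--where", f"ORDER_DATE >= DATE '{start_date}' AND ORDER_DATE < DATE '{end_date}'",
        "--target-dir", target_dir,
        "--num-mappers", "4",
    ]
    result = subprocess.run(cmd, check=False)
    return result.returncode


if __name__ == "__main__":
    # Example: python sqoop_range_import.py 2019-01-01 2019-02-01
    sys.exit(sqoop_import_range(sys.argv[1], sys.argv[2]))
```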
Client: Accel Frontline Technologies, India    Oct 2016 - Aug 2017
Role: Hadoop Developer
Responsibilities:
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
- Queried and analyzed data from DataStax Cassandra for quick searching, sorting, and grouping.
- Experienced in working with various data sources such as Teradata and Oracle.
- Successfully loaded files into HDFS from Teradata and loaded data from HDFS into Hive and Impala.
- Used the YARN architecture and MapReduce in the development cluster for a POC.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from the UNIX file system into HDFS.
- Experienced in installing, configuring, and using Hadoop ecosystem components.
- Designed and implemented a product search service using Apache Solr/Lucene.
- Involved in the implementation and integration of various NoSQL databases such as HBase and Cassandra.
- Experienced in importing and exporting data between HDFS and Hive using Sqoop.
- Participated in the development and implementation of the Cloudera Hadoop environment.
- Monitored Python scripts running as daemons in the UNIX/Linux system background to collect trigger and feed-arrival information.
- Created a Python/MySQL backend for data entry from Flash.
- Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near-real-time analysis (see the sketch following this section).
- Worked on cluster installation, commissioning and decommissioning of data nodes, NameNode recovery, capacity planning, and slots configuration.
Environment: Python, CDH, MapReduce, Scala, Kafka, Spark, Solr, HDFS, Hive, Pig, Impala, Cassandra, Java, SQL, Tableau, Zookeeper, Pentaho, Sqoop, Teradata, CentOS.
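A minimal sketch of a Kafka consumer that loads records into Cassandra for near-real-time analysis, assuming the kafka-python and cassandra-driver clients; the topic, keyspace, table, and column names are illustrative assumptions.

```python
# Minimal sketch of a Kafka consumer writing records into Cassandra.
# Topic, keyspace, table, and hosts are assumed placeholders.
import json

from cassandra.cluster import Cluster
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",                       # assumed topic
    bootstrap_servers=["localhost:9092"],
    group_id="cassandra-loader",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("telemetry")       # assumed keyspace

insert_stmt = session.prepare(
    "INSERT INTO readings (device_id, reading_ts, value) VALUES (?, ?, ?)"
)

for message in consumer:
    record = message.value
    # Each record is expected to carry device_id, reading_ts, and value keys.
    session.execute(
        insert_stmt,
        (record["device_id"], record["reading_ts"], record["value"]),
    )
```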
Client: Pennant Technologies, India    Dec 2014 - Oct 2016
Role: Data Engineer
Responsibilities:
- Responsible for gathering requirements from business analysts and identifying the data sources required for the requests.
- Wrote SAS programs to convert Excel data into Teradata tables.
- Worked on importing and exporting large amounts of data between files and Teradata.
- Created multi-set tables and volatile tables from existing tables and collected statistics on tables to improve performance.
- Wrote Python scripts to parse files and load the data into the database, used Python to extract weekly information from the files, and developed Python scripts to clean the raw data.
- Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
- Developed Teradata SQL scripts using OLAP functions such as RANK() to improve query performance when pulling data from large tables.
- Designed and developed weekly and monthly reports for the logistics and manufacturing departments using Teradata SQL.
Environment: Teradata, BTEQ, VBA, Python, SAS, FLOAD, MLOAD, UNIX, SQL, Windows XP, Business Objects.

Qualification: Bachelor of Technology in Civil Engineering from Kakatiya Institute of Technology and Science, Warangal, 2014.