Posts in the 'IT' category (Page 2)


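ElastiCache: a service for easily creating and scaling a distributed in-memory cache. It suits environments that serve read-heavy workloads (social networks, games, recommendation engines, etc.) and environments that need to analyze data at high speed. It supports two cache engines (Memcached: a distributed memory cache system; Redis: a key-value data store supporting String, Hash, List, Set, Sorted Set, and other types). Caution: an RDS database engine can be made accessible from outside AWS (the internet), but an ElastiCache cache engine cannot be accessed from outside AWS. Even if you create a Security Group that allows access from every IP range, it can only be accessed from EC2 instances in the same VPC ..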

IT/AWS 2024. 1. 24. 14:08

Route53

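1. DNS (Domain Name System)
Translates human-readable domain names into IP addresses that computers, smartphones, and other devices can use.
On AWS, Route53 (a DNS service) translates domains into IP addresses so that computers can communicate.
2. Key features of Amazon Route 53
A highly available and scalable cloud-based DNS web service
Dynamically adjusts the DNS record types and values exposed to users
Supports a variety of load-balancing features
Can connect user requests directly to infrastructure such as EC2, ELB, and S3 buckets
Can be used to route to infrastructure outside AWS
Latency-based routing is available through Route53 traffic flow
Route53 also supports domain name registration
3. Functions of Route53 ..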
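The name-to-address translation described in item 1 can be observed from Python with the standard library. This is only a minimal sketch: it queries whatever resolver the operating system is configured with (not Route 53 specifically), and `localhost` is chosen purely as a hostname that resolves without network access.

```python
import ipaddress
import socket


def resolve(hostname):
    """Return the distinct IPv4 addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


addresses = resolve("localhost")
for addr in addresses:
    # ip_address() raises ValueError if the string is not a valid IP address.
    print(addr, "-> IPv%d" % ipaddress.ip_address(addr).version)
```

Replacing `"localhost"` with a public domain name would exercise a real DNS lookup, provided the machine has network access.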

IT/AWS 2024. 1. 24. 09:25

unzip

import gzip
import json
import shutil

tdc_gz = '/data001/jupyter-docker/jooeun.kim/mon/raw/integration-parser-v1_9_9_027fc400-f327-8e67-2fee-ac46c2111cda_6992502d-31eb-9081-c152-70de02e65ce1_gz-0.gz'

# with fs.open(tdc_gz) as s3fp:
with gzip.open(tdc_gz, 'rb') as f_in:
    with open('tdc.json', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
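For trying the same `gzip.open` plus `shutil.copyfileobj` pattern without access to the original file, a self-contained round trip might look like the following. The file names and payload are hypothetical, and the helper name `gunzip` is introduced here only for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src, dst):
    """Stream-decompress src (.gz) into dst without loading it all into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "sample.json.gz"  # hypothetical file name
    out_path = Path(tmp) / "sample.json"
    payload = b'{"key": "value"}\n'

    # Create a small .gz file so there is something to decompress.
    with gzip.open(gz_path, "wb") as f:
        f.write(payload)

    gunzip(gz_path, out_path)
    restored = out_path.read_bytes()

print(restored == payload)
```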

IT/Python 2023. 8. 26. 19:20

PySpark Preprocessing

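from pyspark.sql.functions import col, explode

# Convert each list in col_2 of the pandas DataFrame to a list of integers
df['col_2'] = df['col_2'].map(lambda x: [int(e) for e in x])
df_spark = spark.createDataFrame(df)
# explode() emits one output row per element of col_2
df_spark.select('col_1', explode(col('col_2')).alias('cols_2')).show(10)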
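To make the effect of `explode` concrete without a Spark session, the same row expansion can be sketched in plain Python. This is an illustration of the semantics only, not Spark code; the sample rows and the helper name `explode_rows` are assumptions.

```python
def explode_rows(rows, list_col, out_col):
    """Yield one output row per element of row[list_col], similar to explode()."""
    for row in rows:
        for element in row[list_col]:
            new_row = {k: v for k, v in row.items() if k != list_col}
            new_row[out_col] = int(element)  # same int() conversion as the snippet
            yield new_row


rows = [
    {"col_1": "a", "col_2": ["1", "2"]},
    {"col_1": "b", "col_2": ["3"]},
    {"col_1": "c", "col_2": []},  # empty list: produces no output rows
]
exploded = list(explode_rows(rows, "col_2", "cols_2"))
for r in exploded:
    print(r)
```

As with Spark's `explode`, a row whose list is empty disappears from the output; `explode_outer` is the Spark variant that keeps such rows.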

IT/spark 2023. 8. 25. 11:29

PySpark json flatten case

from pyspark.sql.functions import from_json

data1 = spark.read.parquet(path)
# Infer the JSON schema from the string column
json_schema = spark.read.json(data1.rdd.map(lambda row: row.json_col)).schema
data2 = data1.withColumn("data", from_json("json_col", json_schema))
# Original columns, excluding the parsed struct
col1 = data2.columns
col1.remove("data")
# Fields of the parsed struct, addressed as data.<field>
col2 = data2.select("data.*").columns
append_str = "data."
col3 = [append_str + val for val in col2]
col_list = col1 + col3
data3 = data2.select(*col_list).drop("json_col")
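The shape of the result can be illustrated in plain Python: each field of the parsed JSON string becomes a sibling of the original columns, and the JSON column itself is dropped. This is a sketch of the idea rather than Spark code; the sample rows and the helper `flatten_json_column` are hypothetical, and unlike Spark's inferred schema it does not fill missing fields with nulls.

```python
import json


def flatten_json_column(rows, json_col):
    """Replace a JSON-string column with one top-level key per JSON field."""
    flat = []
    for row in rows:
        parsed = json.loads(row[json_col])
        new_row = {k: v for k, v in row.items() if k != json_col}
        new_row.update(parsed)  # parsed fields become sibling columns
        flat.append(new_row)
    return flat


rows = [
    {"id": 1, "json_col": '{"name": "a", "score": 10}'},
    {"id": 2, "json_col": '{"name": "b", "score": 20}'},
]
flattened = flatten_json_column(rows, "json_col")
print(flattened)
```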

IT/spark 2023. 8. 25. 11:26

PySpark UDF example

from pyspark.sql.functions import lit

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()
## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  1|  a| 23.0|  0|
## |  3|  B|-23.0|  0|
## +---+---+-----+---+

from pyspark.sql.functions import exp

df_with_x5 = df_with_x4.withColumn("x5", exp(..

IT/spark 2023. 7. 31. 16:55

PySpark features and advantages

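Distributed collection APIs used in Spark
DataFrame: a DataFrame is like a partitioned table, made up of Row-type records and columns.
DataFrame partitioning defines how the DataFrame is physically laid out across the cluster (physically distributed storage).
Data type validation: checked at runtime
Support: all supported
Row access: Row types are converted into a serialized binary structure. This is Spark's optimized internal format and is generally the fastest.
DataFrame optimization
sortWithinPartitions: optimizes a DataFrame by sorting within each partition before transformations are processed
spark.read.load('file.json').sortWithinPartitions('cou..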
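What "sorting within each partition" means can be shown with a plain-Python model in which each partition is a list. This is a conceptual sketch, not Spark code; the (name, value) records and the helper name are made up for illustration.

```python
def sort_within_partitions(partitions, key):
    """Sort each partition independently; partition boundaries stay unchanged."""
    return [sorted(part, key=key) for part in partitions]


# Two hypothetical partitions of (name, value) records.
partitions = [
    [("a", 5), ("b", 1), ("c", 3)],
    [("d", 4), ("e", 2)],
]
result = sort_within_partitions(partitions, key=lambda rec: rec[1])
values_in_order = [rec[1] for part in result for rec in part]
print(result)
print(values_in_order)
```

Each partition is ordered on its own, but reading the partitions one after another does not yield a globally sorted sequence, which is the difference from a full sort.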

IT/spark 2023. 7. 31. 14:40

Deleting S3 files

Delete S3 files that match a filter condition

import boto3

sess = boto3.Session()
s3 = sess.resource('s3')
operation_bucket = s3.Bucket('MY_BUCKET')

for prd in prd_list:
    file_list_to_delete = []
    for file_key in operation_bucket.objects.filter(Prefix=f'MY_PREFIX/product={prd}'):
        # print(file_key)
        # Use the public .key attribute rather than the private ._key
        if 'some_condition' in file_key.key:
            file_list_to_delete.append(file_key)
            s3.Object('MY_BUCKET', file_key.key).delete()
    pr..

IT/Python 2023. 6. 29. 08:41

When you want to delete S3 files with a specific name

Filtering the list of S3 files with a specific name

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')

for file_key in file_list:
    print(file_key)
    # Delete from the same bucket that was opened above
    s3.Object(bucket.name, file_key).delete()

Deleting the files in the filtered S3 list

import boto3

sess = boto3.Session()
s3 = sess.resource('s3')
operation_bucket = s3.Bucket('bucket-name')

file_list = []
for file in operation_bucket.objects.filter(Prefix=f'test_prefix/')..
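When many keys match, deleting them one request at a time is slow. A possible alternative, sketched here under assumptions, is to group the keys into payloads for the S3 DeleteObjects operation, which accepts at most 1000 keys per request. The helper `build_delete_batches` and the sample keys are hypothetical; the code only builds the payloads, so it runs without AWS credentials.

```python
def build_delete_batches(keys, batch_size=1000):
    """Group keys into payloads shaped for the S3 DeleteObjects request."""
    if not 1 <= batch_size <= 1000:
        raise ValueError("S3 accepts between 1 and 1000 keys per DeleteObjects call")
    keys = list(keys)
    return [
        {"Objects": [{"Key": k} for k in keys[i:i + batch_size]], "Quiet": True}
        for i in range(0, len(keys), batch_size)
    ]


# Hypothetical keys; in practice these would come from
# bucket.objects.filter(Prefix=...) combined with your own name condition.
keys = [f"test_prefix/file_{n}.json" for n in range(2500)]
batches = build_delete_batches(keys)
batch_sizes = [len(b["Objects"]) for b in batches]
print(batch_sizes)

# With real credentials, each payload could then be sent as:
#     bucket.delete_objects(Delete=batch)
```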

IT/Python 2023. 6. 9. 15:37

Convert GroupBy Series to DataFrame

# Below are some quick examples

# Example 1: Convert groupby Series
# Using groupby() & count() on multiple columns
grouped_ser = df.groupby(['Courses', 'Duration'])['Fee'].count()

# Example 2: Convert groupby object to DataFrame
grouped_df = grouped_ser.reset_index()

# Example 3: Use the as_index parameter to get groupby DataFrame
grouped_df = df.groupby(['Courses', 'Duration'], as_index = False)['F..
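Examples 1 and 2 can be run end to end on a small frame to see the type change. The sample data below is invented for illustration; only the column names `Courses`, `Duration`, and `Fee` come from the excerpt.

```python
import pandas as pd

# Hypothetical sample data using the column names from the excerpt.
df = pd.DataFrame({
    "Courses": ["Spark", "Spark", "PySpark", "PySpark"],
    "Duration": ["30days", "30days", "40days", "50days"],
    "Fee": [20000, 22000, 25000, 26000],
})

# groupby + count on a single selected column returns a Series
# whose index is a MultiIndex of the grouping keys.
grouped_ser = df.groupby(["Courses", "Duration"])["Fee"].count()

# reset_index() moves the grouping keys back into ordinary columns,
# which turns the Series into a DataFrame.
grouped_df = grouped_ser.reset_index()

print(type(grouped_ser).__name__)
print(grouped_df)
```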

IT/Python 2023. 6. 5. 15:41

진지한 개발자

Post list: IT (42)
