Data Types: EOD, Intraday, and Tick
End-of-Day (EOD) Data
What Is It?
Data with a single point per day: Open, High, Low, Close, Volume (OHLCV).
# Example EOD data
date open high low close volume
2024-01-15 175.00 178.50 174.25 177.80 45000000
2024-01-16 177.80 179.00 176.00 178.25 42000000
When to Use
- Swing trading (holding days/weeks)
- Long-term trend analysis
- Initial idea screening
- Position strategy backtesting
Pros and Cons
✅ Pros:
- Free or cheap
- Easy to handle
- Less noise
- Fast backtests
❌ Cons:
- Not useful for day trading
- Loses intraday information
- Can’t optimize entries/exits
Example Code
import yfinance as yf
import pandas as pd
# Get EOD data (auto_adjust=False keeps the raw Close alongside Adj Close)
ticker = 'AAPL'
eod_data = yf.download(ticker, start='2023-01-01', end='2024-01-01', auto_adjust=False)
# Calculate simple metrics
eod_data['SMA20'] = eod_data['Close'].rolling(20).mean()
eod_data['Daily_Range'] = ((eod_data['High'] - eod_data['Low']) / eod_data['Low'] * 100)
eod_data['Gap'] = (eod_data['Open'] / eod_data['Close'].shift(1) - 1) * 100
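The same metrics can be sanity-checked without a network call. A minimal sketch on a hand-made frame (the prices are hypothetical, for illustration only):

```python
import pandas as pd

# Toy EOD frame with made-up prices
eod = pd.DataFrame({
    'Open':  [100.0, 102.0, 101.0],
    'High':  [103.0, 104.0, 102.5],
    'Low':   [ 99.0, 101.5, 100.0],
    'Close': [101.0, 103.0, 100.5],
}, index=pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17']))

# Daily range as a percentage of the low
eod['Daily_Range'] = (eod['High'] - eod['Low']) / eod['Low'] * 100

# Overnight gap: today's open vs. yesterday's close (NaN on the first row)
eod['Gap'] = (eod['Open'] / eod['Close'].shift(1) - 1) * 100

print(eod[['Daily_Range', 'Gap']].round(2))
```

The first Gap value is NaN because there is no prior close to compare against; drop it or fill it before screening.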
Intraday Data (Minute Bars)
What Is It?
OHLCV for specific intervals: 1min, 5min, 15min, etc.
# Example 5-min bars
datetime open high low close volume
2024-01-15 09:30:00 175.00 175.50 174.95 175.20 500000
2024-01-15 09:35:00 175.20 175.80 175.10 175.75 450000
2024-01-15 09:40:00 175.75 176.00 175.50 175.55 380000
When to Use
- Day trading
- Precise entries/exits
- Intraday patterns (VWAP, breakouts)
- Intraday risk management
Common Resolutions
RESOLUTIONS = {
    'scalping': '1min',
    'day_trading': '5min',
    'swing_entries': '15min',
    'trend_confirmation': '60min'
}
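If your vendor only serves 1-minute bars, the coarser resolutions above can be built locally with pandas. A sketch, assuming a DataFrame of 1-minute OHLCV with a DatetimeIndex (the bars here are synthetic):

```python
import pandas as pd
import numpy as np

# Synthetic 1-minute bars for the first 10 minutes of a session
idx = pd.date_range('2024-01-15 09:30', periods=10, freq='1min')
base = np.arange(10, dtype=float)
one_min = pd.DataFrame({
    'open':   base + 100.0,
    'high':   base + 100.5,
    'low':    base +  99.5,
    'close':  base + 100.2,
    'volume': [1000] * 10,
}, index=idx)

# Aggregate to 5-minute bars: first open, max high, min low,
# last close, summed volume
five_min = one_min.resample('5min').agg({
    'open': 'first', 'high': 'max', 'low': 'min',
    'close': 'last', 'volume': 'sum'
})
```

Note this only works downward (fine to coarse); you cannot reconstruct 1-minute detail from 5-minute bars.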
Data Handling
# With Polygon.io
from polygon import RESTClient
import pandas as pd

client = RESTClient("YOUR_API_KEY")

# 5-minute bars
bars = client.get_aggs(
    ticker="AAPL",
    multiplier=5,
    timespan="minute",
    from_="2024-01-15",
    to="2024-01-15"
)

# Convert to DataFrame (get_aggs returns a list of Agg objects)
df = pd.DataFrame(bars)
df['datetime'] = pd.to_datetime(df['timestamp'], unit='ms')
df.set_index('datetime', inplace=True)

# Calculate VWAP (cumulative over the session)
df['cum_vol'] = df['volume'].cumsum()
df['cum_vol_price'] = (df['close'] * df['volume']).cumsum()
df['vwap'] = df['cum_vol_price'] / df['cum_vol']
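A cumulative VWAP like this is only correct within a single session; if the frame spans several days, the sums must reset at each open or the VWAP drifts. A sketch of the daily reset using a groupby on the date (synthetic two-day data):

```python
import pandas as pd

# Synthetic bars spanning two sessions
idx = pd.to_datetime([
    '2024-01-15 09:30', '2024-01-15 09:35',
    '2024-01-16 09:30', '2024-01-16 09:35',
])
df = pd.DataFrame({
    'close':  [100.0, 102.0, 50.0, 54.0],
    'volume': [1000, 1000, 2000, 2000],
}, index=idx)

# Restart the cumulative sums at each new session date
def session_vwap(day):
    return (day['close'] * day['volume']).cumsum() / day['volume'].cumsum()

df['vwap'] = df.groupby(df.index.date, group_keys=False).apply(session_vwap)
```

Without the groupby, the second day's VWAP would still carry the first day's volume and price, which is meaningless to an intraday trader.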
Tick Data
What Is It?
Every individual transaction with exact timestamp.
# Example tick data
timestamp price size exchange conditions
2024-01-15 09:30:00.123 175.00 100 NYSE ['regular']
2024-01-15 09:30:00.125 175.01 500 NASDAQ ['regular']
2024-01-15 09:30:00.127 175.00 200 ARCA ['odd_lot']
When to Use
- High frequency trading
- Microstructure analysis
- Block/dark pool detection
- Exact slippage analysis
Considerations
- Size: 1GB+ per day for liquid stocks
- Processing: requires optimized code (vectorized operations, chunked reads)
- Cost: $100-500+/month for quality data
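At 1 GB+ per day, dtype choices matter. A sketch of shrinking a tick frame by downcasting (the column names match the examples above; the data is randomly generated):

```python
import pandas as pd
import numpy as np

# Synthetic tick frame
n = 100_000
ticks = pd.DataFrame({
    'price': np.random.uniform(174, 176, n),                      # float64 by default
    'size': np.random.randint(1, 1000, n),                        # int64 by default
    'exchange': np.random.choice(['NYSE', 'NASDAQ', 'ARCA'], n),  # object strings
})

before = ticks.memory_usage(deep=True).sum()

# Downcast: float32 for price, int32 for size,
# and category for the small set of exchange codes
ticks['price'] = ticks['price'].astype('float32')
ticks['size'] = ticks['size'].astype('int32')
ticks['exchange'] = ticks['exchange'].astype('category')

after = ticks.memory_usage(deep=True).sum()
print(f"{before / after:.1f}x smaller")
```

Caveat: float32 keeps roughly 7 significant digits, which is enough to store equity prices, but switch back to float64 before long cumulative sums (like VWAP) to avoid accumulated rounding error.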
Working with Tick Data
# Example with Polygon tick data
trades = client.list_trades(
    ticker="AAPL",
    timestamp="2024-01-15",
    limit=50000
)

# Process for analysis (list_trades returns an iterator, so materialize it first)
tick_df = pd.DataFrame(list(trades))
tick_df['timestamp'] = pd.to_datetime(tick_df['sip_timestamp'], unit='ns')
# Detect large prints
large_prints = tick_df[tick_df['size'] >= 10000]
# Analyze by exchange
exchange_volume = tick_df.groupby('exchange')['size'].sum()
# Create time bars from ticks
def create_time_bars(ticks, bar_size='5min'):
    ticks = ticks.set_index('timestamp')  # copy, don't mutate the caller's frame
    bars = ticks.resample(bar_size).agg({
        'price': ['first', 'max', 'min', 'last'],
        'size': 'sum'
    })
    bars.columns = ['open', 'high', 'low', 'close', 'volume']
    return bars
Practical Comparison
| Type | Size/Day | Cost | Use Case | Latency |
|---|---|---|---|---|
| EOD | 1 row | $0 | Swing/Position | N/A |
| 1-min | 390 rows | $20-50 | Day trading | 1 min |
| Tick | 100k-1M rows | $100+ | HFT/Analysis | Real-time |
Data Aggregation
From Tick to Minute
def aggregate_ticks_to_bars(ticks, bar_type='time', bar_size=60):
    if bar_type == 'time':
        # Time bars (every bar_size seconds)
        bars = ticks.resample(f'{bar_size}s').agg({
            'price': ['first', 'max', 'min', 'last'],
            'size': 'sum'
        })
        bars.columns = ['open', 'high', 'low', 'close', 'volume']
    elif bar_type == 'volume':
        # Volume bars (every N shares) -- follows the pattern of create_volume_bars below
        bars = aggregate_volume_bars(ticks, bar_size)
    elif bar_type == 'dollar':
        # Dollar bars (every $N traded)
        ticks['dollar_vol'] = ticks['price'] * ticks['size']
        bars = aggregate_dollar_bars(ticks, bar_size)
    return bars
Volume Bars (Advanced)
def create_volume_bars(ticks, volume_per_bar=100000):
    bars = []
    current_bar = None
    for _, tick in ticks.iterrows():
        if current_bar is None:
            # Start a new bar at this tick's price
            current_bar = {'open': tick['price'], 'high': tick['price'],
                           'low': tick['price'], 'volume': 0}
        current_bar['volume'] += tick['size']
        current_bar['high'] = max(current_bar['high'], tick['price'])
        current_bar['low'] = min(current_bar['low'], tick['price'])
        current_bar['close'] = tick['price']
        if current_bar['volume'] >= volume_per_bar:
            bars.append(current_bar)
            current_bar = None
    return pd.DataFrame(bars)
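The `aggregate_dollar_bars` helper referenced earlier isn't shown. One way to sketch it, following the same accumulation pattern as the volume bars above but thresholding on traded dollar value instead of share count (the function name and threshold are illustrative):

```python
import pandas as pd

def create_dollar_bars(ticks, dollars_per_bar=1_000_000):
    # Emit a bar each time cumulative price * size crosses the threshold
    bars = []
    current = None
    for _, tick in ticks.iterrows():
        if current is None:
            current = {'open': tick['price'], 'high': tick['price'],
                       'low': tick['price'], 'volume': 0, 'dollar_vol': 0.0}
        current['high'] = max(current['high'], tick['price'])
        current['low'] = min(current['low'], tick['price'])
        current['close'] = tick['price']
        current['volume'] += tick['size']
        current['dollar_vol'] += tick['price'] * tick['size']
        if current['dollar_vol'] >= dollars_per_bar:
            bars.append(current)
            current = None
    return pd.DataFrame(bars)

# Usage on synthetic ticks: $500k per tick, so a bar closes every 2 ticks
ticks = pd.DataFrame({'price': [100.0] * 4, 'size': [5000] * 4})
bars = create_dollar_bars(ticks, dollars_per_bar=1_000_000)
```

Dollar bars normalize activity across price levels: a stock that doubles needs half the shares to print the same bar, which keeps bar counts comparable over time.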
Data Quality
Validation Checklist
def validate_intraday_data(df):
    issues = []

    # 1. Temporal gaps
    expected_bars = pd.date_range(
        start=df.index[0].replace(hour=9, minute=30),
        end=df.index[0].replace(hour=16, minute=0),
        freq='1min'
    )
    missing = expected_bars.difference(df.index)
    if len(missing) > 0:
        issues.append(f"Missing {len(missing)} bars")

    # 2. Negative or zero prices
    if (df[['open', 'high', 'low', 'close']] <= 0).any().any():
        issues.append("Zero or negative prices found")

    # 3. High/Low consistency
    invalid_hl = df['high'] < df['low']
    if invalid_hl.any():
        issues.append(f"{invalid_hl.sum()} bars with high < low")

    # 4. Suspicious volume: more than 10% zero-volume bars
    if (df['volume'] == 0).sum() > len(df) * 0.1:
        issues.append("Too many zero volume bars")

    return issues
My Personal Approach
# I use different types depending on the strategy
DATA_CONFIG = {
    'gap_scanner': {
        'type': 'EOD',
        'source': 'yahoo',
        'reason': 'Only need overnight gap %'
    },
    'vwap_trading': {
        'type': '1min',
        'source': 'polygon',
        'reason': 'VWAP accuracy + entry timing'
    },
    'tape_reading': {
        'type': 'tick',
        'source': 'polygon_websocket',
        'reason': 'See order flow in real time'
    }
}
Practical Tips
- Start with EOD; it's free and sufficient for learning
- Upgrade to 5-minute bars when you start day trading
- Use tick data only for HFT or deep microstructure analysis
- Store frequently used data locally
- Keep timestamps in Eastern time (NYSE hours)
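Most APIs hand back epoch timestamps in UTC, so the Eastern-time convention means one explicit conversion step. A sketch with made-up epoch values:

```python
import pandas as pd

# Epoch-millisecond timestamps as many APIs return them (hypothetical values)
ts = pd.to_datetime([1705329000000, 1705329300000], unit='ms')  # naive, in UTC

# Localize to UTC, then convert to NYSE time; DST shifts are handled automatically
eastern = ts.tz_localize('UTC').tz_convert('America/New_York')
```

Storing the data tz-aware (rather than stripping the zone) avoids the classic off-by-4-or-5-hours bug when backtests cross a daylight-saving boundary.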
Efficient Storage
# Save efficiently
df.to_parquet('data/AAPL_2024_1min.parquet')  # Smaller and faster than CSV

# For large datasets, HDF5 in 'table' format allows on-disk queries
df.to_hdf('data/ticks.h5', key='AAPL', format='table', data_columns=['timestamp'])

# Read efficiently
df = pd.read_parquet('data/AAPL_2024_1min.parquet')
df = pd.read_hdf('data/ticks.h5', key='AAPL',
                 where='timestamp >= "2024-01-15" & timestamp < "2024-01-16"')
Next Step
Now that you understand data types, let’s move on to Data Cleaning to ensure your data is reliable.